Gabmap: cluster validation - demonstration

This demonstration uses data from Germany.

The problem

The result of clustering can be deceptive.

You ask for six clusters, and you get six clusters, whether the data actually contain more clusters or fewer.
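The effect is easy to reproduce. The following is a minimal sketch in Python, not Gabmap's own code: it scatters 40 points uniformly at random, so there is no cluster structure at all, and then asks a naive complete-link clustering for six clusters. The function name agglomerate, the choice of complete link, and the number of points are assumptions made for illustration.

```python
import random

def agglomerate(dist, k):
    """Naive complete-link clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: cluster distance is the largest pairwise distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

# Pure noise: points drawn uniformly from the unit square (assumed toy data).
random.seed(0)
n = 40
points = [(random.random(), random.random()) for _ in range(n)]
dist = [[((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 for q in points]
        for p in points]

clusters = agglomerate(dist, 6)
print(len(clusters))               # number of clusters returned
print([len(c) for c in clusters])  # their sizes
```

The procedure returns six groups that together cover all 40 points, even though the input is pure noise.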

Sometimes it is clear that the clusters you get are useless, that you are only clustering noise. You can see this in the map below:

But usually, the result looks quite acceptable, as in the following map:

What you get depends not only on how many clusters you ask for. Different clustering methods give different results. And clustering is highly sensitive to small amounts of noise. Furthermore, you can't see the difference between major and minor cluster divisions. (Even dendrograms, which suggest a hierarchy, are not reliable in this respect.)

Validation with the help of multi-dimensional scaling

An MDS plot in two dimensions, using the same colours as the clustering, shows you what the major divisions in the data are.
(Numbers indicate the most exceptional points in each cluster.)

You can see a clear division between north and south. This plot also shows that the dark blue cluster seems to belong partly to the north group and partly to the south group. So this colour is not actually a single cluster.
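For readers who want to see how a two-dimensional MDS layout can be computed from a distance matrix, here is a minimal sketch of classical (Torgerson) MDS using NumPy. This is an assumption for illustration: Gabmap may use a different MDS variant, and the function name classical_mds and the toy points are hypothetical.

```python
import numpy as np

def classical_mds(dist, dims=2):
    """Classical (Torgerson) MDS: coordinates whose distances approximate dist."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain an inner-product matrix.
    centre = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centre @ (d ** 2) @ centre
    # eigh returns eigenvalues in ascending order; keep the largest `dims`.
    eigvals, eigvecs = np.linalg.eigh(b)
    top = np.argsort(eigvals)[::-1][:dims]
    # Scale each eigenvector by the square root of its (non-negative) eigenvalue.
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

# Toy check: four points that really lie in a plane (hypothetical data).
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(recovered, dist))
```

Because the toy points lie in a plane, the distances between the recovered coordinates match the input distances. With real dialect distances, two dimensions capture only part of the variation, so the plot is an approximation.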

Let's increase the number of clusters from seven to eleven:

Here are the map and the MDS plot of the division into eleven clusters:

The previous dark blue cluster is now split in two: a yellow cluster that belongs to the north and a dark blue cluster that belongs to the south.

We have identified a north group and a south group. What about the clusters within these groups? Are they valid?

First, let's have a closer look at the north. We remove all southern clusters from the plot:

This lets MDS lay out a smaller part of the data, which gives the clusters in it more room to separate. Indeed, colours that looked jumbled in the previous plot now appear spatially ordered:

However, are these really four distinct clusters?

Here is what the above plot looks like without colours:

The cluster that was yellow seems alright as a real cluster, but it is not possible to point to a distinct cluster in the other parts of the plot.

So, what is happening here?

It seems (based on the data) that the north is not a region of separate dialects, but a continuum, with language changing gradually over distance. Pink clearly lies at the western end of the continuum, and yellow at the eastern end. But borders within this continuum are arbitrary, especially between pink and light blue, and between light blue and light purple.

Now, let's have a closer look at the south. There we have seven clusters at the moment, but in the complete MDS plot, it all looks jumbled.

So we make an MDS plot with data from the north omitted:

This shows two more major groups: one in the middle of Germany, with dark and light orange, dark purple and dark blue, and one in southern Germany, with red and dark and light green.

The location marked "28" is one of the exceptional locations in the dark green cluster, as you can see in both the map and the MDS plot. It doesn't belong to the south. The clustering algorithm got it wrong.

Let's zoom in even more. In the previous MDS plot, red and green looked like a single cluster mixed together. This is what it looks like if we make an MDS plot of only these two clusters:

Red and green are no longer mixed, but the spatial distribution suggests there is no real cluster border. (Imagine this plot without colours.) So this is probably an area with a gradual dialect transition from north to south.

You can continue this process of zooming in on smaller parts of the map, looking for and checking sub-clusters.
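Each zoom step above amounts to dropping the rows and columns of the removed clusters from the distance matrix before running MDS again on what is left. The sketch below shows only that selection step, under stated assumptions: the helper name restrict, the five-location distance matrix, and the cluster labels are all hypothetical, and this is not how Gabmap itself is implemented.

```python
import numpy as np

def restrict(dist, labels, keep):
    """Return the sub-matrix of dist for items whose cluster label is in keep."""
    labels = np.asarray(labels)
    idx = np.flatnonzero(np.isin(labels, list(keep)))
    # np.ix_ selects the same items on both rows and columns.
    return np.asarray(dist)[np.ix_(idx, idx)], idx

# Hypothetical example: five locations, cluster labels given as colour names.
dist = np.array([
    [0, 1, 5, 6, 2],
    [1, 0, 4, 5, 2],
    [5, 4, 0, 1, 4],
    [6, 5, 1, 0, 5],
    [2, 2, 4, 5, 0],
], dtype=float)
labels = ["pink", "pink", "red", "green", "yellow"]

# Keep only the clusters of interest, here the ones treated as northern.
north = {"pink", "yellow"}
sub, idx = restrict(dist, labels, north)
print(idx)  # positions of the kept locations in the original matrix
print(sub)  # distance matrix restricted to those locations
```

The returned index array makes it possible to map points in the new MDS plot back to the original locations on the map.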

A few words of caution

MDS is much more stable than clustering, but it isn't perfect either.

The results you get are only as good as your data. You may miss dialects because the data isn't detailed enough. You may also see clusters that correspond not to dialects but to some artifact in the data, for instance multiple fieldworkers, or tiny variations in the manner of data gathering, that show up only in the final analysis.