Saturday, January 14, 2012

Eurogenes' North Euro clusters - phase 1, exploring the data


I have some preliminary results from a new intra-North Euro cluster analysis, using a cutting-edge tool called ChromoPainter. More than 400 samples and 270K SNPs were run in linkage mode, and the output was then processed in fineSTRUCTURE with 200K burn-ins and iterations. As I said, the results should be treated as preliminary, but they already look better than any other cluster analysis I've seen dealing with Europe north of the Alps, Pyrenees and Balkans. The algorithm identified 21 clusters, most of them located in Eastern and Northeastern Europe (see spreadsheet for details). Below are two plots showing how the clusters relate to each other via a tree diagram and heat maps: the first shows an aggregate view, and the second the individual samples.
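One plausible way to think about the aggregate view is as the individual-level chunk-count matrix averaged over pairs of clusters. Here's a minimal Python sketch of that idea. The function name, the toy matrix and the cluster labels are all made up for illustration, and this is not fineSTRUCTURE's own code, just a rough picture of the averaging step.

```python
from collections import defaultdict

def aggregate_by_cluster(counts, labels):
    """Average an individual-by-individual matrix into cluster-by-cluster means.

    counts[i][j]: hypothetical number of chunks individual i copies from j.
    labels[i]:    hypothetical cluster name assigned to individual i.
    """
    sums = defaultdict(float)
    pairs = defaultdict(int)
    n = len(counts)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: self-copying is not counted
            key = (labels[i], labels[j])
            sums[key] += counts[i][j]
            pairs[key] += 1
    # Mean chunk count for every (recipient cluster, donor cluster) pair seen.
    return {key: sums[key] / pairs[key] for key in sums}

# Toy data only: three individuals, two of them in the same cluster.
toy_counts = [
    [0, 40, 10],
    [38, 0, 12],
    [11, 9, 0],
]
toy_labels = ["PopA", "PopA", "PopB"]

for pair, mean in sorted(aggregate_by_cluster(toy_counts, toy_labels).items()):
    print(pair, mean)
```

A cluster with a single member has no within-cluster pairs, so it simply doesn't get a diagonal entry in this sketch.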





It's interesting that the Baltic Finns seem to create clusters at the drop of a hat, but they also share more chunks, and longer chunks, than any other group. Indeed, all of the Finnish clusters are closely related, and many of the individuals, especially from East Finland, even look like distant relatives on the heat map (note the ultra-hot, blue squares). On the other hand, the large Northwestern European cluster, featuring samples from across the UK, as well as from several nearby countries, is holding firm, and might be tough to break up in this analysis.

I have some theories about the reasons for the obvious genetic homogeneity and diversity in Western Europe, and these include the effects of the Black Death. It decimated many populations in the western half of the continent, thus encouraging migrations into emptied areas, and eventually leading to more open, mobile societies. It's an interesting subject, and I might write much more on it in the future. In the meantime, here's a PCA plot based on the ChromoPainter chunk counts data. Note the large distances spanned by groups from Northern and Eastern Europe, and the tight bundle of samples from the west, mostly from the UK, Ireland, France and the Low Countries. Interestingly, and perhaps counter-intuitively, it's the closely related Finns who take up most of the space on the plot.



The first component picked up by this PCA appears to be an Atlantic one. It peaks among the Cornish samples, but shows similar levels in all the British, Irish, French, Dutch and Belgians (post-Black Death mobility?). Assuming I've identified the component correctly, it appears that the East Finns, Vologda Russians, Erzya from the Middle Volga, and Lithuanians are the least “Atlantic” samples in this analysis. These groups, especially the East Finns, also happen to act like relative genetic isolates in many of my experiments (such as ADMIXTURE and MDS analyses). Thus, it seems they've been sheltered from significant gene flow from outside in recent times, including from the west, such as the German migrations to East Central Europe, and Scandinavian influence in Western and Southwestern Finland.
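For anyone wondering what "the first component" means in practice, here's a minimal Python sketch that extracts the first principal component from a small matrix and reports how much of the total variance it captures. It uses plain power iteration from the standard library, which is a stand-in and not necessarily the routine behind the plot above. The toy matrix, the function name and the iteration count are all hypothetical, and the numbers are not real ChromoPainter output.

```python
import math

def first_component(rows, iterations=500):
    """Return (loadings, scores, share) for the first principal component.

    rows:  list of equal-length lists, one row per sample (at least two rows).
    share: fraction of the total variance captured by the first component.
    """
    n, p = len(rows), len(rows[0])
    # Centre every column on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix between columns (p x p).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Start from a non-uniform unit vector: a uniform one is wiped out by
    # cov whenever every row has the same total.
    start_norm = math.sqrt(sum((j + 1) ** 2 for j in range(p)))
    v = [(j + 1) / start_norm for j in range(p)]
    # Power iteration: multiply by cov and renormalise until v settles.
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along v, so stop rather than divide by zero
        v = [x / norm for x in w]
    # Variance along v (Rayleigh quotient) versus total variance (trace).
    cov_v = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
    eigenvalue = sum(v[a] * cov_v[a] for a in range(p))
    total = sum(cov[a][a] for a in range(p))
    share = eigenvalue / total if total > 0 else 0.0
    # Position of each sample along the component.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores, share

# Toy data only: rows are four samples, columns are four donor groups.
toy_counts = [
    [30, 28, 10, 12],
    [29, 31, 11, 9],
    [12, 10, 33, 25],
    [9, 11, 27, 35],
]

loadings, scores, share = first_component(toy_counts)
print("scores:", [round(s, 2) for s in scores])
print("share of variance on PC1:", round(share, 3))
```

The sign of a component is arbitrary, so only the relative positions of the samples along it are meaningful. Whatever share is not captured by the components kept for a 2D plot is the variance that the plot leaves out.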

The analysis also produced a lot of detailed data showing phased half-segment matches between all individuals. In theory, it should be possible to use this information to create chromosome paintings for the people involved - much like the Ancestry Painting feature at 23andMe, but obviously with 21 potential North European reference groups, instead of 3 inter-continental ones. We shall see how that works out.

I'll stop rambling at this point, and attempt to break up that large Northwestern cluster (Pop21), and perhaps also the French cluster (Pop7). If they don't budge this time, perhaps they will in future runs with more samples? Indeed, I'd like to try a Eurasian-wide analysis, but might need more powerful hardware for that sort of an undertaking.


Update: Eurogenes' North Euro clusters - phase 2, final results


1 comment:

Baldric said...

I've been reading this blog (and Dienekes, and others) for several months now, but I still can't understand the clustering algorithm used by this program, STRUCTURE. I couldn't find the source code; does anyone have a link to it? I'm particularly interested in the way it chooses the value of K, how it manages sub-clusters, etc.

Also, when performing dimensionality reduction into 2D, how much variance is lost? Is there any principal eigenvector that is present exclusively in a subset of clusters?

I'm very interested in the inner workings of the calculations, the vectorized representation of discrete-valued DNA, etc. Unfortunately, there is very little readily available information.