Saturday, December 8, 2012

23andMe’s Ancestry Composition – a preliminary review

Update 22/11/2013: 23andMe announces a fix and update to the AC. More info here.

Update 26/01/2013: The scientists at 23andMe have found a solution to the "overfitting" problem, and those affected will see their results updated within a week. See here.

Update 12/12/2012: Here's an explanation from a 23andMe scientist for some of the dodgy AC results described in my post below.

We've heard from some customers that they get suspiciously high ancestry proportions from the population that they declared themselves to be part of in the 23andMe ancestry survey. For example, a customer that told us they have four grandparents from Italy might see nearly 100% Italian, when other Italian customers rarely see such large values. This is not completely surprising, since Ancestry Composition uses a mix of customer and public data in its reference database. We certainly don't want to discourage folks from participating in the ancestry survey, because increasing the size of the reference database is the single best way for Ancestry Composition's results to improve. We're exploring modifications to our method, to ensure that customers who contribute their ancestry information have a great experience.

Source: 23andMe Community


I’m calling this a preliminary review, because despite the fact that the Ancestry Composition is officially out of beta, it’s still very much a work in progress. Let’s see what happens in a couple of months after thousands of customers digest their results and send feedback to 23andMe. I expect the scientists there will take note and make some important changes, after which I’ll be able to put together a more useful review and guide.

Indeed, one of the best things about the Ancestry Composition is that it’ll always be a work in progress to some extent, as its algorithms are perfected and new reference samples added. The quotes below come from a page at 23andMe broadly describing the technical aspects of the new tool: The Science Behind Ancestry Composition.

Ancestry Composition, as you've seen, has a modular design. This was intentional, because it makes it possible to improve the components of the system independently. We can upgrade Finch's phasing reference database, or the SVM's reference database of people of known ancestry.

Most of us have become accustomed to the idea of semi-regular software updates, and we hope to apply this same model to Ancestry Composition. When we improve some component of the system or upgrade one of the reference databases, your results will automatically be updated and you'll see a note about what has changed and why.

The biggest gripe I have with the current Ancestry Composition (which I shall call the AC from now on) is that it appears to perform very well for many white Americans, but produces dubious results for many Europeans, specifically those from ethnic groups used to define the AC biogeographic clusters.

For instance, European users are often classified as belonging entirely to one of the ten European clusters, despite the fact that a wide range of ancestry tools, including those at 23andMe - like the Relative Finder, Ancestry Finder and Global Similarity - suggest they ought to show a variety of admixtures. Indeed, it’s often the case that two genetically alike individuals, who appear very similar in a variety of ancestry tests, receive significantly different outcomes via the AC. It’s difficult for me to comment on why this is happening, because I don’t know enough about the methodology behind the AC, but it looks like a serious technical issue that needs to be corrected.

[Edit: As per the update above, we've since learned this problem is called "overfitting", and affects people who were chosen as reference samples based on their self-reported ancestry at 23andMe.]

Another issue is that many Europeans receive results with large areas of their genomes classified as “nonspecific Northern European” or even “nonspecific European”. This happens when the Ancestry Composition algorithm can’t match chromosomal segments to any one of the European reference groups with a high enough degree of confidence (see the technical guide to the AC linked to above).
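
To make the idea of a confidence cutoff concrete, here's a minimal Python sketch of how a segment might fall back to a broader "nonspecific" label. This is not 23andMe's actual algorithm: the function name, the 0.8 threshold, the per-segment probabilities and the grouping of clusters into "Northern European" are all assumptions made up for illustration.

```python
# Hypothetical grouping of clusters, for illustration only.
NORTHERN = {"Scandinavian", "British/Irish", "French/German"}
EUROPEAN = NORTHERN | {"Eastern European"}


def call_segment(probs, threshold=0.8):
    """Return the most specific label whose probability reaches the threshold.

    probs maps a cluster name to the (assumed) probability that the
    segment belongs to it. The threshold value is an assumption.
    """
    best, best_p = max(probs.items(), key=lambda kv: kv[1])
    if best_p >= threshold:
        return best
    # No single cluster is confident enough, so try broader groupings.
    if sum(p for pop, p in probs.items() if pop in NORTHERN) >= threshold:
        return "nonspecific Northern European"
    if sum(p for pop, p in probs.items() if pop in EUROPEAN) >= threshold:
        return "nonspecific European"
    return "unassigned"


# A segment that sits between several northern clusters (made-up numbers).
borderline = {
    "Scandinavian": 0.45,
    "British/Irish": 0.25,
    "French/German": 0.20,
    "Eastern European": 0.10,
}
print(call_segment(borderline))  # nonspecific Northern European
```

Under these assumptions, a segment only gets a specific label when one cluster clearly dominates. Anything spread across neighbouring clusters is pushed up to the broader category.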

It’s a problem that seems to affect people with ancestry from the edges of the AC biogeographic clusters. For instance, Swedes often get around 50% of their genomes classified as “nonspecific Northern European”. Now, if we look at one of the PCA plots used by 23andMe to delineate the aforementioned clusters, we can see that many Swedes overlap with samples from Norway, Britain, Poland and/or Germany. So they’re basically straddling the borders of up to four AC zones – Scandinavian, British/Irish, Eastern European and French/German, respectively.

No wonder the AC's confused. This “nonspecific” stuff appears to be the result of attempting to divide Northern and Western Europe into too many biogeographic zones. But the hope is that this issue can be overcome in the near future with the addition of more reference samples.

So let’s wait and see how the AC develops. Dienekes criticized the concept largely because it’s a supervised ancestry test, and he’s got a thing for unsupervised ADMIXTURE runs (see here). I don’t see a problem with the supervised approach, and I’ve got big hopes for the AC if 23andMe shows the will to improve it.

See also...

23andMe gears up for major ancestry updates


mikej2 said...

One thing that bothers me most is that the composition can change remarkably in only one or two generations. For example, my 10% nonspecific changes to full Finnish in my daughter's results. How come all this nonspecific result is suddenly assigned as Finnish? To me this looks like the most dubious systematic error. Having a high amount of nonspecific is not a systematic error but, as you mentioned, maybe a result of the reference sampling used.

mikej2 said...

I thought about this again and have a theory. Suppose you have a segment like this

70% Russian and 30% Finnish

It is classified as a Russian segment.

Now having children with a fully Finnish spouse gives

70 / (70+30+100)= 35% Russian and 65% Finnish

After this child has children with a fully Finnish spouse we have

35 / (35+65+100)= 17.5% Russian and 82.5% Finnish.

The Russian disappeared totally in one or two generations when using threshold values used by AC.

In practice, with numerous segments, the statistics don't follow this kind of simple rule, but it can happen in some cases. How often depends on the segmentation rules.
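
The arithmetic in the comment above can be sketched in a few lines of Python. The 50% majority cutoff and the function names are assumptions made for illustration, not the thresholds the AC actually uses, and the sketch treats a single segment as if its share simply halves each generation.

```python
def next_generation(russian_pct, spouse_russian_pct=0.0):
    """Average the parent's Russian share with the spouse's share."""
    return (russian_pct + spouse_russian_pct) / 2


def label_by_majority(russian_pct, threshold=50.0):
    """Label the whole segment by whichever side reaches the assumed cutoff."""
    return "Russian" if russian_pct >= threshold else "Finnish"


share = 70.0  # starting segment: 70% Russian, 30% Finnish
history = []
for generation in range(3):
    history.append((generation, share, label_by_majority(share)))
    share = next_generation(share)  # spouse is fully Finnish

for generation, pct, label in history:
    print(generation, pct, label)
# 0 70.0 Russian
# 1 35.0 Finnish
# 2 17.5 Finnish
```

With a hard cutoff like this, the label flips from Russian to Finnish after a single generation, even though 35% of the segment is still Russian.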

Richard Robbins said...

I got my results from 23andMe the other day. 100% European, with 23 percent unknown. Geez. That's pretty much mailing it in, isn't it? Eurogenes gives me more detail, but I'm still not getting the Polish that I'm looking for. Mom is full-blooded Polish, as far as I know.