Thursday, September 12, 2013

Geographic Population Structure (GPS) prediction

Update 13/09/2013: Actually, to run GPS users can input what seem to be Geno 2.0 autosomal ancestry proportions into the relevant fields on this page. Now, for various reasons I don't have a very high opinion of the Geno 2.0 autosomal test, so it's not something I'll ever pay for. However, I just quickly put together a K=9 test that roughly approximates the Geno 2.0 analysis. It's a bit noisy, but here's the GPS result it generated (using 10 reference populations), which is fairly accurate...

The K=9 test is available at GEDmatch (Ad-Mix page > Eurogenes > Eurogenes K9b), but I can't guarantee it'll produce accurate results for everyone. My outcome might have been a bit of a fluke. I'd say the best thing to do is to wait until the site starts accepting raw genotype data from users (see here).


This thing will apparently predict your biogeographical origins down to your "home village". That's unlikely to be relevant for most personal genomics customers, but it just so happens that I do come from a small Polish village, so I'll certainly be able to test the veracity of this claim. I tried uploading my 23andMe data just now, but it told me to come back later, so I guess it's not ready yet.

The search for a biogeographical method that utilizes biological information to predict one's place of origin has occupied scientists for millennia. Modern biogeographical algorithms achieve an accuracy of 700 km in Europe but are highly inaccurate elsewhere, particularly in Southeast Asia and Oceania. Here, we present the admixture-based Geographic Population Structure (GPS) method that accurately infers the biogeography of worldwide individuals down to their village of origin. GPS' accuracy is demonstrated on three datasets: worldwide populations, Southeast Asians and Oceanians, and Sardinians (Italy) using 40,000-130,000 GenoChip markers. GPS correctly placed 80% of worldwide individuals within their country of origin with an accuracy of 87% for Asians and Oceanians. Applied to over 200 Sardinian villagers of both sexes, GPS placed a quarter of them within their villages and most of the remaining within 50 km of their villages, allowing us to identify the demographic processes that shaped the Sardinian society. We further demonstrate three additional applications of GPS in tracing the biogeographical origin of the Druze population and uncovering the European origins of North Americans. The accuracy and power of GPS underscore the promise of admixture-based methods to biogeography and have important ramifications for genetic ancestry testing, forensic and medical sciences, and genetic privacy.

Ahh, OK, so you have to be a Sardinian villager to get the most out of this tool. Well, I'm still looking forward to putting it through its paces. That link again: Geographic Population Structure (GPS) prediction.


Seinundzeit said...

Hi David,

How does one run this K=9 calculator? There is no .par file. I would really appreciate any assistance. Thank you.

Davidski said...

I sent the link to John at GEDmatch but haven't heard back yet. I suppose if enough people inquire about the possibility of this K=9 test being amongst the Eurogenes tools there he might put it up.

Otherwise, try and find someone who's really handy with these sorts of files and can make use of them.

Seinundzeit said...

Thank you very much for your prompt response.

Also, I have a general question about this particular calculator. Did you exclude the HGDP samples from your dataset? I think that would lead to clusters that approximate Geno 2.0 very closely. (I think the lack of HGDP reference populations is why their components have such different modalities from the ones we've become used to, but I've just assumed this. Obviously, your knowledge of these things is vastly superior to anything I can claim, so I was wondering if you could offer any insight on this.)

Davidski said...

I included the HGDP samples, but I dropped all South Indian and Siberian samples. This prevented the appearance of South Asian and Siberian clusters, which is what the Geno 2.0 is sorely lacking.

Larry Couture said...

What are the differences in the 10 reference populations? How do I know which one to use?

Davidski said...

You fill in all the fields for the ancestral components, and then specify the number of reference populations you'd like to be compared to.

Fran said...

For the GPS predictor - should I use the k9b percentages from GEDmatch to plug into his percentage fields? Or what's the best GEDmatch tool to get the percentages from? Thanks!

Davidski said...

You should probably use Geno 2.0 ancestry proportions if you have them. If you don't, then try the K9b at GEDmatch and vary the number of GPS reference samples (1-10) to find the result that makes most sense.

StuardaAnita said...

I find these K9b proportions to correspond to known family facts and paper trails. So these clusters may strongly represent the applicable reference populations. What is the underlying population group for Southwest Asian? This admixture continues to show up in my results and I'm very curious. How far back do these population samples date?

Also, when do you consider an admixture result (e.g., Northeast Asian at 1.92 and other minute admixtures) small enough to ignore?

Thank you for a wonderful, quite precise model. I realize these are simple questions but your responses are greatly (and humbly!) appreciated.

Davidski said...

This Southwest Asian component roughly corresponds to the component seen on this map.

But the fact that it shows up in your results need not mean that you have recent ancestry from the region where it peaks today. The admixture might be as old as the Neolithic in your part of the world, and if so, you would need to show a clearly elevated amount (relative to others of similar ancestry) for it to indicate something more recent.

StuardaAnita said...

My appreciation for your prompt response.

In an admixture consisting of the following reference populations, do you believe this amount of Southwest Asian is statistically greater than average? I have not viewed a large number of samples, but have compared my results with others. It appears to be about 33% to 50% greater than that of similar Europeans.

Perhaps you could weigh in.



Davidski said...

It's difficult to say, because I don't know your origin. What you should do is run the EUtest and check the oracle results. If you see lots of Southwest Asian reference samples popping up, that might be an indication of recent ancestry from that part of the world. But keep in mind that I have a new EUtest on the way, which should in theory be even more precise.

Volodymyr Lutsyk said...

To "StuardaAnita"

Wow. This is really close to my results. Where do you come from?
Southwest_Asian 10.16%
Native_American 0.90%
Northeast_Asian 1.89%
Mediterranean 20.52%
North_European 64.60%
Southeast_Asian 0.53%
Oceanian 0.60%
South_African 0.20%
Sub-Saharan_African 0.59%

RC Caudill said...

Do you know what population reference was used for South Africa? Was it the San people or the Dutch South Africans? The reason I ask is because I have matches to some of the San on other tests. Thanks!

Davidski said...

Various Sub-Saharan groups from South Africa, including the San.

Anaxagoras said...

Hi David,

I have just uploaded my FTDNA data on GEDmatch and used your Eurogenes calculators. Great work! I have several questions, but I do not want to overwhelm you, so I will be posting them gradually in separate posts. The first one relates to the K9b calculator, whose aim is to approximate the Geno 2.0 autosomal results. In my case, the Asian admixture is spot on but the Mediterranean and N.European admixtures are quite off. Any ideas of why that is? I am posting both here (I am 100% Greek Cypriot).

FTDNA data on k9b:
Southwest_Asian 24.45%
Northeast_Asian 1.04%
Mediterranean 46.64%
North_European 27.47%

Geno 2.0 results:
Southwest_Asian 25%
Northeast_Asian 2%
Mediterranean 63%
North_European 10%

Thanks in advance for your help!

Davidski said...

They're different tests with different allele frequencies making up the clusters, so they're unlikely to produce the same results. Only the names of the clusters are the same.

The K9b should only be used to try out the GPS tool until the site starts accepting raw data. It's not meant for anything else.
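
For illustration only, the size of the mismatch in the two sets of figures quoted above can be tabulated with a few lines of Python. This is a minimal sketch: the function name `component_gaps` is made up for the example, and the numbers are simply copied from the comment. It says nothing about how either test computes its clusters.

```python
# Figures copied from the comment above (percentages).
k9b = {
    "Southwest_Asian": 24.45,
    "Northeast_Asian": 1.04,
    "Mediterranean": 46.64,
    "North_European": 27.47,
}
geno2 = {
    "Southwest_Asian": 25.0,
    "Northeast_Asian": 2.0,
    "Mediterranean": 63.0,
    "North_European": 10.0,
}


def component_gaps(a, b):
    """Return a[name] - b[name], in percentage points, for components present in both."""
    return {name: round(a[name] - b[name], 2) for name in a if name in b}


gaps = component_gaps(k9b, geno2)

# Print the largest disagreements first.
for name, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:18s}{gap:+.2f}")
```

The Mediterranean and North_European figures differ by roughly 16 to 17 points in opposite directions, while the two Asian components differ by less than one point, which is consistent with clusters that share names but not allele frequencies.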

Will Tay said...

This closely matches my FTDNA results and Dr. Doug McDonald's run:

Southwest_Asian 8.91%
Native_American 1.97%
Northeast_Asian 2.24%
Mediterranean 15.86%
North_European 68.54%
Southeast_Asian 0.49%
Oceanian 0.33%
South_African -
Sub-Saharan_African 1.66%

Kara Moore said...


This may be a very good calculator, but like the other poster, I do not know exactly what to insert in the boxes. I tried to enter the percentages as ratios instead of percents, but it kept counting the total as higher than 1. Could you please give an example of what we are to insert?

I even downloaded the files from there but don't know what to do with them. K9b is a good online calculator.

Here are my K9b

Southwest_Asian 8.82%
Native_American 0.62%
Northeast_Asian 0.68%
Mediterranean 18.21%
North_European 69.71%
Southeast_Asian 0.45%
Oceanian 0.44%
South_African 0.35%
Sub-Saharan_African 0.72%

From there, please explain what to enter for the biogeography calculator.
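
A minimal sketch of the percent-to-ratio conversion described in this comment, using the K9b values posted above. Whether the GPS form expects fractions summing to 1 or plain percentages is an assumption here, since the post doesn't say, and the helper name `to_fractions` is made up for the example.

```python
# K9b percentages as posted in the comment above.
k9b_percent = {
    "Southwest_Asian": 8.82,
    "Native_American": 0.62,
    "Northeast_Asian": 0.68,
    "Mediterranean": 18.21,
    "North_European": 69.71,
    "Southeast_Asian": 0.45,
    "Oceanian": 0.44,
    "South_African": 0.35,
    "Sub-Saharan_African": 0.72,
}


def to_fractions(percentages):
    """Rescale the values so they sum to 1, which also absorbs any rounding drift."""
    total = sum(percentages.values())
    if total <= 0:
        raise ValueError("percentages must sum to a positive number")
    return {name: value / total for name, value in percentages.items()}


fractions = to_fractions(k9b_percent)

for name, value in fractions.items():
    print(f"{name:22s}{value:.4f}")
print(f"{'Total':22s}{sum(fractions.values()):.4f}")
```

Dividing by the actual total rather than by a fixed 100 means the fractions always add up to 1, even when the published percentages are rounded and sum to slightly more or less than 100.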