Tuesday, November 6, 2012

SPatial Ancestry analysis (SPA) for 23andMe clients

A while ago I ran a few experiments with SPatial Ancestry Analysis (SPA) for selected Eurogenes members (see here). The results were very impressive, but only thanks to my large collection of samples, many of which are private. A few additions have now been made to the SPA package to make it more accessible to the average 23andMe user, including "model" files. What this means is that it's no longer necessary to have a reference dataset to analyze single samples.

SPA can predict the ancestral origins of 23andMe users from their genotype file. Assuming you have an account with 23andMe, you can follow the steps below to get your prediction.

Download SPA software.

Download the world and Europe models and decompress them.

Put your 23andMe genotype file into the same directory as the above two.

Open a terminal if you are using Mac OS X, or a command window if you are using Windows.

Go to the directory where you put the SPA software, the models and your 23andMe genotype file.

Run the following command in your terminal or command window. Make sure to replace 23andMe.txt with your 23andMe genotype file name.

spa --mfile 23andMe.txt --model-input europe.model --location-output europe.loc

Double-click the resulting world.loc.html or europe.loc.html to see where your ancestral origin is. The command above uses the Europe model and produces europe.loc.html; world.loc.html comes from the equivalent run with the world model. If you are of European ancestry, you only need to check europe.loc.html, which gives better resolution; if you are of non-European ancestry, you only need to check world.loc.html.

SPA can also predict two origins in case your mother and father are from different locations. To do that, add -n 2 to the command line, as below.

spa --mfile 23andMe.txt --model-input europe.model --location-output europe.loc -n 2
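If you want to run both models in both modes without retyping everything, the command lines can be put together with a few lines of Python. This is only a rough sketch: it prints the commands rather than running them, the helper name spa_command is made up for illustration, and it assumes the world model file is called world.model by analogy with europe.model, so check the name of the file you actually downloaded.

```python
def spa_command(genotype_file, region, n_origins=1):
    """Build the spa argument list for one model, e.g. 'europe'.

    Assumption: the model file is named '<region>.model' and the output
    is written to '<region>.loc', as in the Europe commands above.
    """
    cmd = [
        "spa",
        "--mfile", genotype_file,
        "--model-input", f"{region}.model",
        "--location-output", f"{region}.loc",
    ]
    if n_origins == 2:
        # Dual mode: ask SPA for two origins (one per parent).
        cmd += ["-n", "2"]
    return cmd


if __name__ == "__main__":
    # Print the four command lines; copy and paste the ones you need.
    for region in ("europe", "world"):  # 'world' file name is an assumption
        for n_origins in (1, 2):
            print(" ".join(spa_command("23andMe.txt", region, n_origins)))
```

As with the commands above, replace 23andMe.txt with the name of your own genotype file.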

My dual result can be seen below. I'm actually Polish rather than Finnish/French, but I guess it's kind of the same thing...or not. Anyway, for all the instructions and updates see the SPA website.

Also worth mentioning is that the SPA team will be making a presentation at the upcoming ASHG 2012 Annual Meeting. It looks like we can expect more interesting updates to the program any day now.

A model-based approach for analysis of spatial structure in genetic data.

W. Yang (1,4), J. Novembre (3,4), E. Eskin (1,2,4), E. Halperin (5,6,7)

1) Department of Computer Science, UCLA, Los Angeles, CA; 2) Department of Human Genetics, UCLA, Los Angeles, CA; 3) Department of Ecology and Evolutionary Biology, UCLA, Los Angeles, CA; 4) Bioinformatics IDP, UCLA, Los Angeles, CA; 5) International Computer Science Institute, Berkeley, California, USA; 6) Department of Molecular Microbiology and Biotechnology, Tel Aviv University, Tel Aviv, Israel; 7) School of Computer Science, Tel Aviv University, Tel Aviv, Israel.

Characterizing genetic diversity within and between populations has broad applications in studies of human disease and evolution. Two key steps towards this objective are spatially global ancestry inference, which aims at predicting geographical locations for the ancestries of an individual, and spatially local ancestry inference, which aims at predicting the geographical locations for chromosome segments, or ancestry blocks. We propose a new approach, SPALL (SPatial Ancestry analysis LocaL), for solving the two inference problems in a unified probabilistic model. This model takes linkage disequilibrium into account and can be solved efficiently by the Expectation Maximization (EM) algorithm in conjunction with the forward-backward algorithm. This new method allows us to assign geographical locations for parents, grandparents, and ancestries from more generations ago of a given individual. It also allows us to assign geographical locations for each locus-specific variant. We analyzed a European and a worldwide dataset, and showed that SPALL can predict locations with high accuracy. The proposed model is built as a generalization of our recently published work called Spatial Ancestry Analysis (SPA), which explicitly models the spatial distribution of each SNP by assigning an allele frequency as a continuous function in geographic space. The method allows us to assign an individual, or an admixed individual, to geographical locations instead of predefined categories of population.
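To make the last part of the abstract a bit more concrete, here's a toy sketch of what "an allele frequency as a continuous function in geographic space" can look like. It assumes a logistic gradient for each SNP, which is my reading of the published SPA model and isn't spelled out in the abstract. The function names, the made-up SNP parameters and the simple grid search are purely illustrative; the real program fits its own model files and finds locations in its own way.

```python
import math


def allele_frequency(location, slope, intercept):
    """Assumed logistic surface: f(x) = 1 / (1 + exp(-(a.x + b)))."""
    x, y = location
    ax, ay = slope
    return 1.0 / (1.0 + math.exp(-(ax * x + ay * y + intercept)))


def log_likelihood(location, genotypes, snp_params):
    """Log-likelihood of genotypes (0, 1 or 2 allele copies) at a location."""
    total = 0.0
    for copies, (slope, intercept) in zip(genotypes, snp_params):
        f = allele_frequency(location, slope, intercept)
        # Two independent allele draws per SNP; the binomial coefficient
        # is dropped because it does not depend on the location.
        total += copies * math.log(f) + (2 - copies) * math.log(1.0 - f)
    return total


def best_grid_location(genotypes, snp_params, grid):
    """Return the grid point with the highest log-likelihood."""
    return max(grid, key=lambda loc: log_likelihood(loc, genotypes, snp_params))


if __name__ == "__main__":
    # Two hypothetical SNPs: one gets more common to the east (x),
    # the other gets more common to the north (y).
    params = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)]
    grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
    # Two copies of the "eastern" allele, none of the "northern" one.
    print(best_grid_location([2, 0], params, grid))
```

With these toy numbers the individual lands in the south-east corner of the grid, which is the whole idea: a person is placed at a point on the map rather than in a predefined population.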

Fanty said...

Tried it.
In single mode it points me to Germany.
Don't know if the exact point inside Germany is relevant or not. It's near Frankfurt.

Well, I am German. But if it points to a specific spot in Germany, I would have expected it further east, because of my 4 grandparents, 1 is from the German/Dutch border, but the other 3 are from East Prussia (Kaliningrad) and Silesia (Wroclaw).

In dual mode it claims my parents to be Sardinian ("first anchestry") and Norwegian ("second anchestry").

pconroy said...

Interesting. In Single Mode and Dual Mode for Europe, I'm in Cornwall - actually more like Bristol.

In Single Mode and Dual Mode for World, I'm in the Mid-Atlantic off the coast of Maine, US???

Repo Man said...

Huh. I couldn't get this to work on OSX Mountain Lion. Any of you folks using a Mac?

pconroy said...

RepoMan, I'm using a PC.

David, BTW the World model seems to place the pointers in the Atlantic, off the coast of Maine, for all the people I've run it on - any idea why??

Davidski said...

I don't know why the world model doesn't work. Maybe it's just a dud? Someone should e-mail the SPA programmers and tell them.

BTW, I'm working on a new Europe model file which should be much better than the current one.

Ilya Savchenko said...

I'm Russian with possible minor Ukrainian ancestry, I get Germany as my result, placed somewhere near Magdeburg. And as a double result I get: Atlantic Ocean between UK and Iceland :) and Black Sea near Crimea.

Sicilianu101 said...

Can this work with familytreedna data? I tried to make it work but I got errors. I am IT 40 in the Eurogenes system.

Davidski said...

^ I haven't looked into that, but yeah, I think so. You probably just need to correctly change your file from csv to txt.

Failing that, you could e-mail the authors and ask them to make SPA compatible with FTDNA autosomal data.

Jeremy Knowles said...

I've tried all 3 of the different models, but none are overly accurate. I have 3 British/Irish grandparents and 1 from China.

In n2 mode for Finland, I'm off the SW Coast of England and in Kazakhstan

East Asia n2, I'm in Denmark and Pakistan.

West Eurasia is basically the same as Finland, but slightly more northern Kazakhstan and more south in the Celtic Sea.

I'm guessing this is due to my admixture, so it seems these are not meant for anyone with non-euro dna?

Davidski said...

^ Based on your mix, I'd say one of your halves should land near the UK, or at least deep in Europe, and the other somewhere in Central Asia.

Charles said...

Thanks to this SPA software and the Eurogenes West Eurasia model, I was able to ascertain with the use of 26 genomes of Canarians, Berbers, and Latin Americans of Canarian descent that my grandmother's genome is the closest to the Canaries of all the said genomes. The 26 genomes also create a very elegant scatter plot. Your research has been invaluable for my own research. Zie zie ni!