Introducing SAPDA – a powerful new admixture inference software

 

ABSTRACT

After extensive research and development, we are pleased to introduce SAPDA, a program for inference of Shared Ancestral Population Defining Alleles.

We have developed SAPDA to address some of the limitations and weaknesses of publicly available programs such as ADMIXTURE and STRUCTURE, and of PCA (including PCA-based nMonte when used for admixture inference). SAPDA also gives the customer additional information about the derived alleles, or mutations, they share with various populations. This information is presented in graphs and charts that let users visualize their ancestry.

SAPDA contains hundreds of lines of code. Unlike other programs, which output only 1 admixture percentage table or chart, SAPDA outputs 11 graphs and charts detailing various aspects of the user’s ethnogenesis. The results shown herein are SAPDA output plots for 2 case examples: a British individual and an E African individual from Sudan. SAPDA outputs the following:

  1. Three admixture percentage pie-charts for mutations shared by the user with various calculator source populations, based on the age of those mutations, as shown in fig 1 & fig 2;
  2. The user’s Single Population Sharing or GSI (see definitions below) for various source populations, as shown in fig 3 & fig 4;
  3. Bar plots showing precisely which population defining alleles (derived mutations) the user shares with the various source populations, for the 3 classes of mutations, as shown in figs 5-22;
  4. The allele frequencies for those population defining alleles the user shares with the various source populations, as shown in figs 5-22;
  5. The genotypes of those population defining alleles shared between the user and various source populations, as shown in figs 5-22.

SAPDA offers several advantages over PCA-based programs and the program ADMIXTURE. In contrast to ADMIXTURE, SAPDA allows the direction of geneflow to be inferred.

PCAs and ADMIXTURE are useful for population clustering purposes, but are not informative about the amount of admixture from geographically or genetically more distant populations. For example, if the objective is to determine the amount of E Asian admixture in a W Asian subject, then that E Asian admixture is masked by admixture from geographically and genetically more proximate W Asian populations. Thus if a W Asian individual is modeled as 70% W Asian + 20% S Asian + 9% European + 1% E Asian, then the actual amount of E Asian admixture in the W Asian subject is masked by the W Asian + S Asian + European percentages. This is further discussed below.

Fig 1 – British user admixture percentages based on population defining mutation sharing with various populations for different time scale mutations.

Fig 2 – Sudanese user showing no shared mutations with Europeans and E Asians, except a few older mutations
Fig 3 – British user population defining mutation sharing with various populations for different time scale mutations

 

Fig 4 – Sudanese user has a GSI of 100% with Africans, thus indicating no shared population defining mutations with other populations on a more recent time frame


DEFINITIONS

  • Dataset: 1000 Genomes Phase 3 dataset mapped to the GRCh37 (hg19) human reference, containing genotypes for 80 million positions and allele frequency information for populations defined as AFR, EUR, EAS, SAS, & AMR;
  • Reference Populations
    • AFR: Yoruba & Esan (Nigeria), Luhya (Kenya), Gambians (Gambia), Mende (Sierra Leone);
    • EAS: Han (China), Japanese, Dai (China), & Kinh (Vietnam);
    • EUR: N & W Europeans, Toscani (Italy), Finnish, British, & Iberians (Spain);
    • SAS: Gujarati Indians (Houston, TX), Punjabis (Lahore), Bengalis (Bangladesh), Sri Lankan Tamils, & Indian Telugu;
    • AMR: Mexicans (Los Angeles), Puerto Ricans, Colombians, & Peruvians.
  • GSI: Genotype Similarity Index. This should not be confused with admixture percentage. GSI is proportional to the number of the user’s alleles that match the population defining alleles of a given source population. It is a direct measure of shared derived ancestry between the user and that population, and thus a more accurate quantifier of shared ancestry than an admixture percentage.
  • SAPDA also outputs statistics on shared derived alleles between the user and various populations for 3 ancestral time frames:
    • Ancestral: This refers to mutations derived in various populations which are younger than Deep Ancestral and Deepest Ancestral categories. Shared alleles in this category may imply a more recent admixture event between the user’s ancestors and populations ancestral to the sources;
    • Deep Ancestral: Mutations derived in sources likely older than alleles in the “Ancestral” category;
    • Deepest Ancestral: Mutations derived in sources likely older than alleles in the “Deep Ancestral” category.
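
As a rough illustration of how a GSI-style score could be computed, the Python sketch below counts the population defining alleles for which the user carries at least one derived copy, and expresses that count as a percentage of the alleles tested. This is an assumption made for illustration, not SAPDA’s actual code: the names `DefiningAllele`, `shared_count`, and `gsi` are hypothetical, as are the position `chr1-example` and both user genotypes. Only chr2-rs1567803 and its “T” allele come from the figures in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefiningAllele:
    position: str  # e.g. "chr2-rs1567803"
    derived: str   # derived (population defining) allele, e.g. "T"


def shared_count(user_genotypes, defining_alleles):
    """Count defining alleles the user carries in at least one copy."""
    return sum(
        1
        for allele in defining_alleles
        if allele.derived in user_genotypes.get(allele.position, "")
    )


def gsi(user_genotypes, defining_alleles):
    """Percentage of a source population's defining alleles the user shares."""
    if not defining_alleles:
        return 0.0
    shared = shared_count(user_genotypes, defining_alleles)
    return 100.0 * shared / len(defining_alleles)


# Hypothetical European panel: one position from fig 10, one invented.
european_panel = [
    DefiningAllele("chr2-rs1567803", "T"),
    DefiningAllele("chr1-example", "G"),
]
# Invented genotypes, stored as two-letter strings (SNPs only).
user = {"chr2-rs1567803": "CT", "chr1-example": "AA"}

print(shared_count(user, european_panel))  # 1
print(gsi(user, european_panel))           # 50.0
```

Because each source population is scored against its own panel, the score for one population does not change when the user’s sharing with another population changes.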

 

METHODOLOGY

The dataset used to infer allele frequencies and population defining alleles is the 1000 Genomes Phase 3 dataset mapped to the GRCh37 (hg19) human reference. This dataset contains genotypes for 80 million positions and allele frequency information for populations defined as AFR, EUR, EAS, SAS, & AMR. A detailed description of the sub-populations contained within these populations is given above in the “Definitions” section.

Fig 5 – British user shares 18 out of 189 defining alleles with E Asians on a “Deeper Ancestral” level, and is heterozygous for the “C” and “T” alleles at 2 E Asian derived allele positions on chr16.
Fig 6 – British user shares 7 out of 70 African derived alleles with Africans on a “Deeper Ancestral” time scale. The lower bar plot indicates that these alleles are also present at lower frequencies and thus may be associated with a late Paleolithic or early Neolithic migration out of Africa

Fig 7 – British user shares 21 out of 24 defining alleles with Europeans on a “Deeper Ancestral” level
Fig 8 – Sudanese user shares 6 out of 100 E Asian defining alleles with E Asians on a “Deeper Ancestral” level. Although present in E Asians at a higher frequency than in Africans & Europeans, they likely reflect chance mutations in Africans
Fig 9 – Sudanese user shares 28 out of 35 African defining alleles with Africans on a “Deeper Ancestral” level
Fig 10 – Sudanese user shares a “T” European derived allele at chr2-rs1567803 with Europeans on a “Deeper Ancestral” level. This allele has a relatively high frequency of about 23% in S Asians, and is likely a pre-Bronze Age mutation spread to S Asia via a Eurasian Steppe population, and to Africans via a back migration to Africa event

 

BENEFITS OF SAPDA OVER OTHER ADMIXTURE SOFTWARE

  1. The program ADMIXTURE is not informative about the direction of geneflow. Thus if a European user tested with ADMIXTURE shows 2% African, we don’t know whether this is due to an African ancestor or to historical “back to Africa” migration events that transmitted Eurasian DNA from populations in the Near East to African populations. With SAPDA, on the other hand, one can determine the direction of geneflow via the allele frequency / allele sharing bar plots shown in figs 5 thru 22. For example, fig 4 shows that the Sudanese individual shares with Europeans 4.2% of alleles derived in Europeans, on a deeper time scale level. A glance at fig 10 confirms that the Sudanese individual shares 1 copy of the “T” European derived allele at chr2-rs1567803 with Europeans. The bottom plot in fig 10 shows this allele has a relatively high frequency of about 23% in S Asians, and is likely a pre-Bronze Age mutation spread to S Asia via a Eurasian Steppe population, and to Africans via a back migration to Africa event;
  2. Stricter guidelines over the allele frequency thresholds used to identify population defining mutations. The SAPDA algorithm does not permit the use of genomic positions for which the allele frequency differential between 2 ancestral source populations is small. This is unlike, say, ADMIXTURE, where such positions are permitted. We adopted stricter guidelines because in a Eurasian SNP panel there are many West Eurasian polymorphic positions which are ancestral in both East Eurasians and Africans.
  3. There are many positions which are polymorphic in Europeans, but predominantly homozygous ancestral in Africans and E Asians. Thus both Africans and E Asians have high frequencies of the ancestral allele at those positions. For example, if those positions were to be assigned as “African” in programs such as ADMIXTURE, then E Asians would erroneously score an increased “African” percentage due to those positions, and vice versa. SAPDA identifies and filters out those positions.
  4. PCAs and ADMIXTURE are useful for population clustering purposes, but are not informative about the total amount of admixture from geographically or genetically more distant populations. For example, if the objective is to determine the amount of E Asian admixture in a W Asian subject, then that E Asian admixture is masked by admixture from geographically and genetically more proximate W Asian, S Asian, and European populations. Thus if a W Asian individual is modeled as 70% W Asian + 20% S Asian + 9% European + 1% E Asian, then the actual amount of E Asian admixture in the W Asian subject is masked by the W Asian + S Asian + European percentages. This is partly because E Asian derived alleles are included in the genetic substructure of W Asian, S Asian, and European populations, and partly due to the nature of fractions as detailed in the following section.
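
The filtering idea in points 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `defining_population` and the thresholds `high` and `low` are hypothetical, the frequencies are invented, and the actual cut-offs used by SAPDA are not given in this post. An allele is assigned to a population only when exactly one population carries it at high frequency and every other population carries it at low frequency.

```python
def defining_population(freqs, high=0.5, low=0.05):
    """Return the population an allele 'defines', or None if it is filtered.

    freqs maps population label -> frequency of the allele in question.
    high and low are hypothetical thresholds chosen for illustration.
    """
    candidates = [pop for pop, f in freqs.items() if f >= high]
    if len(candidates) != 1:
        return None  # no population, or more than one, carries it at high frequency
    target = candidates[0]
    others_low = all(f <= low for pop, f in freqs.items() if pop != target)
    return target if others_low else None


# Ancestral allele at a position polymorphic only in Europeans:
# high in both AFR and EAS, so it is discarded rather than called "African".
print(defining_population({"AFR": 0.98, "EAS": 0.97, "EUR": 0.60}))  # None

# Allele common in EAS and rare elsewhere: kept as E Asian defining.
print(defining_population({"AFR": 0.02, "EAS": 0.70, "EUR": 0.01}))  # EAS

# Small differential between EAS and EUR: discarded.
print(defining_population({"AFR": 0.02, "EAS": 0.70, "EUR": 0.30}))  # None
```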
Fig 12 – British user does not share any defining alleles with E Asians on a relatively nearer “Ancestral” time scale
Fig 13 – British user does not share any defining alleles with Africans on an “Ancestral” level
Fig 14 – Sudanese user shares 4 out of 4 African defining G mutations with Africans on an “Ancestral” level
Fig 15 – Sudanese user shares no European defining mutations with Europeans on an “Ancestral” level
Fig 16 – Sudanese user shares no E Asian defining mutations with E Asians on an “Ancestral” level

 

DIFFERENCES BETWEEN ADMIXTURE PERCENTAGES & GSI

It’s important to understand that admixture percentages don’t accurately quantify the amount of geneflow or admixture between the test subject and the various calculator source populations. For example, suppose 2 individuals from different parts of the world are tested. Individual A shows 5% E Asian and individual B shows 10% E Asian. Based on this, most would think that B has greater E Asian geneflow or admixture than A. The truth, however, is that we can’t tell from these admixture percentages alone. Here is why. Let’s say that A and B share the following number of alleles with the calculator source populations:

Test subject   Ethnicity   Number of matching alleles with
                           E Asians   Africans   Europeans
A              S Asian     30         5          15
B              W Asian     40         10         50

Table 1 – Number of matching alleles between 2 users and calculator source populations

For simplicity, assume the total number of population defining alleles used in the calculator is the same for each population. As table 1 shows, the W Asian user shares 40 alleles with E Asians, whereas the S Asian user shares 30. Therefore we can infer that the W Asian individual has more E Asian admixture than the S Asian individual.

To calculate the calculator’s E Asian admixture percentage for A, we compute E Asian = [30 / (30+5+15)] x 100 = 60%. We do the same for all the other categories to obtain the following:

Test subject   Ethnicity   ADMIXTURE PERCENTAGE
                           E Asian   African   European
A              S Asian     60%       10%       30%
B              W Asian     40%       10%       50%

Table 2 – Admixture percentages calculated based on the results from table 1

Notice that although the W Asian individual has greater E Asian admixture than the S Asian individual, as shown in table 1, table 2 shows that the S Asian individual has a higher E Asian admixture percentage. This is the reason we can’t use admixture percentages to objectively quantify total geneflow or admixture from a population. Thus GSI, which is a one-to-one comparison of the number of matching alleles between the test individual and the calculator source populations, should be used for inferring geneflow or admixture from a source population. GSI, as shown in figs 3 & 4, is one of the metrics outputted by SAPDA.
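
The arithmetic behind tables 1 and 2 can be reproduced with a few lines of Python. This is a minimal sketch using the counts from table 1; the names `matching_alleles` and `admixture_percentages` are made up for this example and are not part of SAPDA.

```python
# Matching allele counts from table 1.
matching_alleles = {
    "A (S Asian)": {"E Asian": 30, "African": 5, "European": 15},
    "B (W Asian)": {"E Asian": 40, "African": 10, "European": 50},
}


def admixture_percentages(counts):
    """Normalize one subject's counts so that they sum to 100%."""
    total = sum(counts.values())
    return {pop: 100.0 * n / total for pop, n in counts.items()}


for subject, counts in matching_alleles.items():
    pct = admixture_percentages(counts)
    # Print the raw E Asian count next to the E Asian percentage.
    print(subject, counts["E Asian"], pct["E Asian"])
# A (S Asian) 30 60.0
# B (W Asian) 40 40.0
```

B matches more E Asian alleles than A (40 vs 30), yet B’s E Asian percentage is lower (40% vs 60%) because B’s total across all populations is larger. The percentage depends on the other populations’ counts, whereas the raw count does not.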

Fig 17 – British user does not share any defining alleles with Africans on a “Deep Ancestral” level
Fig 18 – British user shares 2 out of 6 defining alleles with E Asians on a “Deep Ancestral” time scale: 1 copy of the “G” allele at chr2-rs12477830, and 1 copy of the “C” allele at chr2-rs12476238. The lower bar plot indicates that these alleles are also present at a relatively high frequency of about 38% in AMR, thus indicating that these are relatively older mutations, likely dating to the Upper Paleolithic and predating the split between Native Americans and East Asians
Fig 19 – British user shares 2 out of 2 defining alleles with Europeans on a “Deep Ancestral” level
Fig 20 – Sudanese user shares 4 out of 4 defining alleles with Africans on a “Deep Ancestral” level
Fig 21 – Sudanese user shares no E Asian defining mutations with E Asians on a “Deep Ancestral” level
Fig 22 – Sudanese user shares no European defining mutations with Europeans on a “Deep Ancestral” level

 

SAPDA SOFTWARE AVAILABILITY

SAPDA is available at our partner site www.GenePlaza.com for individuals wishing to use it for ancestry inference.

COMMERCIAL LICENSES

Please contact Admin@EurasianDNA.com for commercial software license inquiries.

 

 

 

 

 

15 thoughts on “Introducing SAPDA – a powerful new admixture inference software”

  1. Congrats Dilawer. Your work is truly cutting edge and light years ahead of the other blogs. I like the plots showing the allele matches and the fact you are able to infer the direction of geneflow, unlike with ADMIXTURE and PCAs. Will the production SAPDA also use the 3 source populations you show, or will there be other populations?

    1. Thanks Rudy. This was a pilot to test the software. There will likely be several production versions with different source populations.

  2. A lot of so-called matching alleles between east and west could be due to the ancient origin of those alleles lying in the west and not in the east. Archaic humans migrated from west into east.

    1. Yes, in fact a lot would be an understatement. Most have no idea how genetically similar W and E Eurasians are in general. The differences are so insignificant genomewide. However, I have programmed the software to flag and discard those alleles where the likelihood of them being common ancestral Eurasian is high, or even worse, common ancestral Eurasian and African.

      The “deeper ancestral” category in SAPDA is the least stringent and can be associated with older shared alleles. Even here, though, the requirement is not merely that E Eurasians have a high allele frequency, which as you mention could simply reflect mutations shared between W and E Eurasians prior to the split, but rather that E Eurasians have a high allele frequency AND Africans have a low allele frequency. Since some of the African samples used as references are E African, this significantly lessens the probability that those alleles are common Out-of-Africa shared drift.

      This is another reason why we have also the “Ancestral” and “Deep Ancestral” categories where the requirements are even more stringent; they are high allele frequency (AF) in E Eurasians AND low AF in BOTH Africans and W Eurasians (see methodology section for details)

  3. Dilawer are you seeing many positions where both Africans and East Asians have high allele frequencies? That would indicate the public datasets contain many positions where East Asians and Africans share ancestral alleles. These could skew ADMIXTURE results.

    1. Yes in fact there are many such positions where E Asians and Africans share alleles to the exclusion of W Eurasians. I’ll try to give you details when I return from my trip. Those positions are mostly shared ancestral and are excluded from any analysis here

  4. There was not only 1 Out-of-Africa species. In the East, archaic Homo sapiens also mixed heavily with Denisovans and some other species that were already native to the East and already very different from Homo sapiens from Africa. It is possible that West and East also share other non-Homo sapiens species with each other, but not with Africans. Also, even within the last 10000 years there were many migrations from West into East. Think about the early Neolithic farmers from West Asia, or even Indo-European tribes from the Iranian Plateau that migrated deep into North-Central and Eastern Asia.

    What I’m trying to say is that the East was much more influenced by the West (early Homosapiens, neolithic farmers, ancient Iranian Plateau Indo-European tribes etc.) than vice versa.

    1. “What I’m trying to say is that the East was much more influenced by the West (early Homosapiens, neolithic farmers, ancient Iranian Plateau Indo-European tribes etc.) than vice versa.”

      Those are positions where E and W Eurasians share alleles. SAPDA filters those (see methodology). An allele is counted as “E Eurasian” in a test subject only if E Eurasians carry it at high frequencies to the exclusion of W Eurasians and Africans (Deeper Ancestral and Deep Ancestral categories).

  5. Most modern so-called Mongoloid people (Turks, Siberians, people from China, Japan etc.) are originally from the Mongolian steppes. And the thing is that the Mongolian steppes were (before the Chinese ‘HAN’ race) heavily influenced by West Eurasian people, mostly Western Asians.

    Later on Eastern Eurasians spread from the Mongolian steppes into all peripheral parts of Eastern Asia. But prior to that Eastern Asia was already populated by different species, like Homo floresiensis (Hobbits) and other species. Those exotic species were very different from Homo sapiens from Western Eurasia (Western Asia).

    So, so-called ‘Mongoloid’ or Eastern Eurasian people mostly evolved from Western Eurasian Homo sapiens, but they have additional Denisovan, Neanderthal and Homo floresiensis type auDNA in them.
    The shared alleles between Western Asians and Eastern Asians are due to the shared Homo sapiens ancestry.

    There is a study about it: https://www.biorxiv.org/content/early/2018/12/06/487983

    1. Hi Jortita,

      First create an account here, and once the “Products” page (top menu bar) is publicly open (within about a week), you can upload your file from your shopping cart while you check out. The 1st products which will be available are:

      1- SAPDA calculator
      2- IBD calculator to calculate a user’s relatedness to ancient and contemporary Eurasian populations. My tests have shown that it is very accurate in determining whether the samples are 1st degree or 2nd degree relatives, and if neither, then a normalized relatedness score will be given for the user to various individuals and populations.

      A 35% discount coupon will be mailed out to the 1st 50 registrants who open an account here and subsequently on the upcoming Eurasian Genetics forum. The coupon can be applied during checkout.

      Dilawer

  6. I had problems with registering, as I filled in my details but never received the confirmation email

    1. Thanks Jortita for bringing it to my attention. You and anyone else awaiting a confirmation email should be approved.

  7. Hi Dilawer,

    Big fan of your work, I hope you’re able to really grow this website over the next few months and stimulate lots of interesting discussions as new discoveries are made by yourself and others in population genetics in 2019.

    Piggy-backing somewhat off Goga’s comments above, in regards to autosomal diversity between West and East Eurasians, who’s more heterozygous? Goga brings up an interesting point, in that following the conventional Out of Africa model you would expect a cline of decreasing diversity from West to East, as you would infer successive serial bottlenecks to decrease heterozygosity in humans as they marched east across Southwest and Central Asia into East Asia over time. However, the mtDNA phylogeny of contemporary Eurasians seems to contradict this a bit, as East Eurasia harbors a staggering variety of primary M, N, and R branches, whereas West Eurasia has a comparatively smaller pool of N and R lineages, which would imply East rather than West Eurasia was the primary locus of human expansion after the Out of Africa exodus.

Comments are closed.
