An Admixture Calculator for Eurasians

Genealogical testing companies use highly admixed references

In other posts, I have mentioned the problems with ancestry proportion tests from genealogical testing companies such as 23andMe, AncestryDNA, and FTDNA, and how useless the results can be for describing the population history and admixture of individuals from various parts of Eurasia. For example, what does it mean if you are, say, from Pakistan, and your results from a company such as 23andMe say that you are 98% South Asian? Does it mean that all your ancestors have lived in a bubble called S Asia since time immemorial, and that you have no ancestry from other parts of the world? Or, if you are from neighboring Iran and your results say 98% Middle Eastern, does that mean that your ancestors lived in a different bubble since time immemorial, and that historically there has never been any interaction between your ancestors and those of the individual from neighboring Pakistan whose results showed 98% South Asian? The same questions apply to results from a company such as AncestryDNA showing that an individual from Iran is 80% “Caucasus” and 20% “Middle Eastern”. The answer is of course not, because the references these companies use, whether they be from India, Iran, Turkey, or England, are themselves heavily admixed, and are thus “harboring” DNA from various parts of the world.

These percentages, whether from AncestryDNA, 23andMe, or FTDNA, should not be taken too seriously, since none of these companies use ancestral references; their references are themselves very admixed. So if a European on 23andMe scores 0.5% S Asian (which would be quite unusual), all that really means is that they are slightly more S Asian shifted than the European references used, which could imply some recent S Asian admixture. It does not mean that the subject has only 0.5% TOTAL admixture from S Asia. If, for example, the European references themselves are 5% S Asian admixed (older S Asian gene flow), then the total could be 5.5% S Asian.

 

Results are not apples to apples

To complicate matters, 1% S Asian for a European is not the same as 1% S Asian for a W Asian. The 1% S Asian for a European could translate to 6% total S Asian if the European references have a 5% S Asian base. The 1% S Asian showing up in a result for, say, a Kurdish or Iranian 23andMe subject could translate to 16% TOTAL S Asian, since the Middle Eastern references, if they are Iranians, could already have a 15% S Asian base. Therefore, 1% S Asian for a northern European would not be equal to 1% S Asian for an Iranian or Kurd.
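The arithmetic in the examples above can be written out as a small sketch. The function names are made up for illustration, and the base levels (5% and 15%) are the hypothetical values used in the text, not measured quantities. The first function is the simple addition used in this post; the second is an assumed refinement in which only the part of the genome assigned to the reference panel carries the panel's hidden share, which gives slightly lower totals.

```python
def total_additive(reported, reference_base):
    """Back-of-the-envelope total used in the text: the reported share plus
    the share already hidden in the reference panel (both as fractions)."""
    return reported + reference_base


def total_weighted(reported, reference_base):
    """Assumed refinement: only the (1 - reported) part of the genome that
    was assigned to the reference panel carries the panel's hidden share."""
    return reported + (1.0 - reported) * reference_base


# Hypothetical (reported, reference base) pairs taken from the examples above.
cases = {
    "European, 0.5% reported, 5% base": (0.005, 0.05),
    "European, 1% reported, 5% base": (0.01, 0.05),
    "Iranian/Kurd, 1% reported, 15% base": (0.01, 0.15),
}

for label, (reported, base) in cases.items():
    print(f"{label}: additive {total_additive(reported, base):.2%}, "
          f"weighted {total_weighted(reported, base):.2%}")
```

Either way, the same reported 1% corresponds to very different totals depending on how admixed the reference panel already is.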

What I am seeing in my analysis, when I strip these references down to more basic streams of ancestry from ancients such as Early Neolithic Farmers (ENF), Neolithic and Chalcolithic Iranians from the Kurdistan region (Iran N), Eastern European Hunter Gatherers (EHG), or Western European Hunter Gatherers (WHG), is that W Eurasians have much more total east, north, or south Eurasian admixture than one would be led to believe from the results of genealogical testing companies.

How much total east, north, or south Eurasian admixture they have is the subject of this calculator, the Ancient Eurasia 20 admixture calculator, which I have been refining over the past few weeks.

 

Results from the program ADMIXTURE are highly volatile

Also, a word about calculators created using the program ADMIXTURE [4]. I discovered some time ago that the results obtained using this program are highly volatile, I should really say extremely volatile, depending on how many samples are included in the run. For example, suppose we have an Italian test subject in a supervised K=10 run consisting of 10 admixture components, K1, K2, K3, ..., K10. I found that the admixture percentage for, say, component K3 could vary anywhere from 0 to 20% depending on how many non-reference samples I included in the run. Yes, I did say non-reference samples. The reason this happens is that the allele frequencies are not entirely defined by the admixture component references, but rather by the combination of the component references and the non-reference samples in the run. In other words, the admixture component allele frequencies are skewed by the non-reference samples.

I have personally created numerous admixture calculators in the past based on the program ADMIXTURE, many of which are freely available at Gedmatch.com under project GedrosiaDNA. However, I intend to revise all of them based on my latest findings.

Generally, my experience has been that for the ADMIXTURE model the following are important criteria to keep in mind when designing a calculator:

  1. Independent samples. The dataset should be screened for related samples using IBD programs such as Beagle;
  2. There should be sufficient overlapping SNPs to resolve closely related source populations. A genotype rate of 100% for all samples is ideal, but sometimes has to be compromised due to insufficient overlapping markers between ancients and Illumina genotyped test subjects. There is a little give or take here;
  3. Outliers should be screened from the population sources.

One of the biggest problems with the ADMIXTURE based calculators out there, whether on Gedmatch or elsewhere, is that calculator creators use too many test samples in supervised runs. This practice seems to go back to the days when genome bloggers first started putting together calculators. The number of test samples in the run used to create the calculator ends up hugely outnumbering the reference/source populations. As previously mentioned, the test samples greatly affect the allele frequencies of the calculator component source populations, resulting essentially in a test that is not fully supervised. This is because the ADMIXTURE model estimates not only Q (the ancestry proportions of each sample) but also P (the allele frequencies of each component), as a function of both the reference samples and the test samples. While the P values should remain stable regardless of the test samples, in practice the test samples pull the P estimates away from their actual values. This manifests itself as admixture percentages fluctuating all over the place as the number of test samples is increased or decreased. Therefore, the number of test samples should be much smaller than the number of reference samples, and not the other way around.
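The effect can be illustrated with a toy calculation. The sketch below is not ADMIXTURE's actual optimizer (which uses a block relaxation scheme); it applies a single EM-style allele frequency update, of the kind used by FRAPPE for the same likelihood model, to one SNP with two components. All genotypes, Q values, and the names `update_allele_freq` and `run` are made up for illustration, and the test samples' Q is held fixed at 50/50 purely to keep the example small.

```python
def update_allele_freq(genotypes, q, p):
    """One EM-style update of per-component allele frequencies at one SNP.

    genotypes: allele counts (0, 1 or 2) per individual
    q: ancestry proportions per individual, one tuple of length K each
    p: current allele frequency for each of the K components
    """
    K = len(p)
    num = [0.0] * K
    den = [0.0] * K
    for g, qi in zip(genotypes, q):
        mix_alt = sum(qi[k] * p[k] for k in range(K))
        mix_ref = sum(qi[k] * (1.0 - p[k]) for k in range(K))
        for k in range(K):
            # Expected share of this individual's alleles owed to component k.
            a = qi[k] * p[k] / mix_alt if mix_alt > 0 else 0.0
            b = qi[k] * (1.0 - p[k]) / mix_ref if mix_ref > 0 else 0.0
            num[k] += g * a
            den[k] += g * a + (2 - g) * b
    return [num[k] / den[k] if den[k] > 0 else p[k] for k in range(K)]


# Hypothetical references: 4 individuals fixed to component A, 4 to component B.
ref_genotypes = [2, 2, 1, 2, 0, 0, 1, 0]
ref_q = [(1.0, 0.0)] * 4 + [(0.0, 1.0)] * 4
p_start = [0.875, 0.125]  # allele frequencies implied by the references alone


def run(n_test):
    """Add n_test hypothetical test samples (genotype 0, Q = 50/50) and update P."""
    genotypes = ref_genotypes + [0] * n_test
    q = ref_q + [(0.5, 0.5)] * n_test
    return update_allele_freq(genotypes, q, p_start)


p_refs_only = run(0)
p_with_4 = run(4)
p_with_20 = run(20)

for label, p in [("refs only", p_refs_only), ("4 test", p_with_4), ("20 test", p_with_20)]:
    print(label, [round(x, 3) for x in p])
```

With references only, the update leaves the component A frequency where the references put it. Each added test sample contributes to the denominator of the update, so the estimate drifts further from the reference-only value as more test samples are included, even though the references themselves have not changed.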

Another point regarding ADMIXTURE in general is that it is not informative as to the direction of gene flow. An agreement in alleles between samples A and B can simply result from A and B sharing a common ancestor, rather than from A contributing to B or B to A.

With regard to the slightly elevated sub-Saharan African (SSA) scores in some test samples here, it is important to remember that I don’t have a modern SW Asian component, which could cover some SSA. Also, in the interest of using a decent number of markers, I do not have 100% marker overlap between Levant BA and the test samples, leaving some regions in the test samples’ genomes that are not covered by Levant BA. This of course assumes that any SSA gene flow into Eurasia, or backflow from Eurasia to Africa, predates Levant BA, which may or may not be the case.

 

The Ancient Eurasia 20 admixture calculator

I have been able to mitigate the problems described above in this calculator. Here, I only used DNA sequences from ancients and moderns that had the highest number of SNPs intersecting with the Illumina V4 microarray, since I believe most testers use 23andMe to get genotyped. BTW, I believe that Asians are at a slight disadvantage because the Illumina array is biased towards Europeans, meaning that it is not as good at picking up derived alleles in Asians as it is in Europeans.

For my latest calculator, I use a mix of ancient and modern genomes. I reluctantly use moderns, mainly because currently, ancient DNA sequences from East, North East, and South East Asia are lacking.

This calculator is most useful for Eurasians, except perhaps if you happen to be from a population which I have used as a component reference. This calculator is not informative for sub-Saharan Africans and indigenous Oceanians.

It is believed [1] that most modern Europeans can be modeled almost exclusively with three primary streams of ancestry: Western European Hunter Gatherers (WHG), Early European Farmers (EEF), and Eurasian steppe herders who contributed ANE (Ancient North Eurasian) and DNA from the Caucasus region. The reference populations behind several of the calculator’s components are themselves mixtures of these streams. Consequently, the WHG admixture percentage, for example, should not be interpreted as the total WHG admixture in the test subject, because some to most WHG admixture is carried by proxy by the references behind the other components of the calculator, who are themselves WHG admixed. For example, some WHG would be included in the Steppe EMBA/MLBA components, because those populations also carried some WHG.

The following is a description of my admixture components:

  1. Altaian: Based on samples from the Altai region, this component represents the genetic contribution of the Turkic tribes as they expanded west and south into Europe and Asia over the past 1000 years.
  2. Scythian E/W: Based on the recently published sequences in “Ancestry and demography and descendants of Iron Age nomads of the Eurasian Steppe” [2]. This component is based on the combined allele frequencies of the western and eastern Scythian Iron Age samples, since there are not enough high coverage eastern or western samples to accurately source allele frequencies for separate eastern and western Scythian components. During the first millennium BCE, nomadic tribes spread over the Eurasian Steppe from the Altai Mountains over the northern Black Sea area as far as the Carpathian Basin. They also appear to have ruled over areas of present day Iran, Kurdistan, and Afghanistan.
  3. Neolithic Anatolians: This is based on the 10 highest coverage 8000-year-old ancient DNA sequences from Anatolia. These Anatolians introduced farming as they expanded into Europe during the Neolithic.
  4. Neolithic & Chalcolithic Iranians (Iran N/Chl): Based on ancient DNA sequences from the Kurdistan region and vicinity in Iran, published in “Early Neolithic genomes from the eastern Fertile Crescent” [3]. This component of ancestry peaks in Kurds, Iranians, Baloch, Brahui, Pashtuns, Punjabis, and some NW Indians. This component is based on the combined allele frequencies of the highest coverage Neolithic and Chalcolithic Iranian samples, as there are not enough samples to accurately source allele frequencies for separate Neolithic and Chalcolithic components. In Europeans, this component most likely represents ancestry shared by Iran N/Chl and Caucasus based ancient populations, above and beyond what was received from the Eurasian steppe herders, as Iran N/Chl are not believed to have directly contributed much to the genetics of Europeans.
  5. Steppe – Middle/Late Bronze Age (MLBA): Based on the allele frequencies of the 3500-year-old ancient samples from the Eurasian steppe; 3 from the Andronovo culture, and 4 from the Srubnaya culture. These types of cultures are believed to have spread Indo-European languages.
  6. Early European Farmers: Here I used high coverage 7000-year-old DNA from early European farmers; 5 from the LBK culture, 5 from Hungary, 1 from Iberia, and 1 known as Stuttgart. Individuals associated with these cultures are believed to have facilitated the spread of farming from the Near East to Europe, and encountered European Hunter Gatherers who had settled Europe much earlier.
  7. Steppe – Early to Middle Bronze Age (EMBA): Based on allele frequencies of the highest coverage 4300-5000-year-old ancient samples from the Eurasian steppe; 3 from the Yamnaya culture, and 2 from the Poltavka culture.
  8. Western European Hunter Gatherer (WHG): Based on 3 approximately 8000-year-old ancient samples from Spain, Luxembourg, and Hungary.
  9. Burmese: Based on modern individuals from Burma. Most Europeans score around zero of this, however, it reaches significant levels in Kurds, Iranians, and populations further east, and represents the non-west Eurasian admixture in south and west Asians, somewhat analogous to the hypothetical Ancestral South Indian (ASI), but likely not the same.
  10. Amerindian-South: Based on references from the Cola and Wichi tribes of Argentina. In Europeans, this could represent shared origins between Native Americans and circumpolar peoples such as Saami.
  11. The remaining components are self-explanatory, and represent genetic similarity with various north and east Asian peoples.

 

Test results

The following is a bar chart of various individuals, sorted with the highest Scythian E/W score at the bottom, and lowest at the top. It is important to remember that the Scythian E/W percentage is based on the combined allele frequencies of both eastern and western Scythians.

 

 

The following are individual results, sorted with decreasing Altaian score from left to right.

 

 

 

This is the FST matrix for the calculator:

 

 

REFERENCES:

  1. Ancient human genomes suggest three ancestral populations for present-day Europeans, Lazaridis et al., Nature 513, 409–413, 2014.
  2. Ancestry and demography and descendants of Iron Age nomads of the Eurasian Steppe, Unterländer et al., Nature Communications, 2017.
  3. Early Neolithic genomes from the eastern Fertile Crescent, Broushaki et al., Science, 2016.
  4. Fast model-based estimation of ancestry in unrelated individuals, Alexander et al., Genome Research 19:1655–1664, 2009.

20 thoughts on “An Admixture Calculator for Eurasians”

  1. Great work Kurd! So this calculator is not based on ADMIXTURE, or have you revised your methodology in developing calculators via ADMIXTURE? Is this a departure from the DIY-calculator “standard” so to speak?

    As I understand it, the calculator will not be released? So for the benefit of those reading, one would have to send you their data/payment and you will run the calculator and send results, correct?

    1. Thanks! I have not made a decision yet on which version to release, or in fact whether I will release them at all. Another version has a Kura-Araxes component, as it is likely that some steppe admixture was not conveyed directly by steppe groups to W Asians. Yet another has ancients only, except for a SE Asian component to represent S Eurasian admixture in S/W Asians.

      I am not accepting data or payments until decisions are made on which version will be released and how.

      The calculator does use the program ADMIXTURE, however, the setup is quite different from the way other ADMIXTURE based calculators are done. Sorry that I can’t be any more specific.

  2. Kurd,

    Thanks so much for all the effort you put into these analyses. Just wondering- why use Burmese as the southerly E Eurasian reference? Aren’t they themselves W Eurasian admixed?
    The total steppe ancestry inferred for the Brahmin sample looks to be in line with qpAdmix run outputs Sein posted on the Eurogenes blog. I’m guessing the individual sample is from a northern Brahmin?

    1. The Indian sample is actually a southern Brahmin one. I forgot to post the NE Indian Brahmin sample, which had a higher Scythian percentage.

      SAMPLE Brahmin 2
      Altaian 0.00%
      Anatolia-N 4.14%
      Steppe-MLBA 10.28%
      Amerindian-South 1.10%
      Burmese 28.77%
      Buryat 0.00%
      Chukchi 0.00%
      Sub-Saharan 2.72%
      Europe-EN 5.68%
      Even 0.00%
      Nenet 0.00%
      Iran-N/Chl 12.90%
      Koryaks 0.00%
      Levant-BA 3.71%
      Mongol 0.00%
      Papuan 5.69%
      Saami 6.70%
      Scythian-E/W 5.34%
      Steppe-EMBA 10.33%
      WHG 2.62%

      I agree that Burmese may not be the best choice, but I have a limited number of SE Asian samples with high marker overlaps. Did you have anything in mind?

  3. Kurd,

    You have Austronesian references from Indonesia, right? Or if not those, then perhaps Dai…

    1. I only have Dai from Human Origins, which don’t have enough overlapping SNPs for this test. I do have Indonesian Bajo and Lebbo samples, which I have used in some of the draft versions. I can revisit this. Drift becomes a factor with very old splits, because without an adequate source population to account for S/SE Eurasian admixture in S Asians, they tend to show artificially elevated levels of E Asian admixture from E Asian populations with similar allele frequencies, for example Altaians. Bajo and Lebbo could work, though.

  4. One other question: What is Saami tracking in S/ W Asians? The levels do not appear to fluctuate greatly, but are noticeably elevated. It’s also present at above noise levels in Saudis!
    In Europe, the first thing that comes to mind is that the Saami component is capturing Finno-Ugric type ancestry/ substrate; however, it’s interesting that the Belorussian has a lower percentage than the German. The N Italian individual’s score is also interesting (elevated). I’m curious what W, SW European populations would score..

    1. Generally speaking when a calculator has a mixture of modern and ancient population sources, the ancient admixture amounts are usually underestimated due to drift, pseudo-diploid genomes, less than 100% genotype rate with the test samples, etc. This can lead to an overestimation of the modern admixture amounts. Saami is probably representing Uralic or NE European alleles shared by SW Asians and Saami which were conveyed by some ancient W Asian population not represented in this calculator.

  5. Hi,

    Interesting blog. Could you perhaps run this calculator on British people, or even more fine-grained, like English, Scottish, Irish?

  6. As I look through the results in the admixture calculator above for the european references (italian, german, etc) I see small to medium admixture segments from burmese, papuan, altaic, saami and even some small amounts of native american in the germans! A lot of those populations themselves are derived from mongoloid populations and more related to east Asians.

    So with that said, my question is what is the accuracy of this calculator and how are some of those possible such as the papuan in europeans? Are these reference populations based on more modern references? Or more ancient ones and therefore the admixture is indicating a common ancient ancestor?

    Also, I see you are the creator of the K29 and K35 calculators on gene plaza which I have utilized. What are the accuracy of those?

    1. The Papuan, Burmese, Saami, Amerindian, and Altaians are modern references. The margin of error is within about 2%. Some E Eurasian references such as Papuan and Burmese should not be taken literally for Europeans. These should simply be viewed as generic E Eurasian admixture, although looking at the outputs from the article they appear to be negligible for Europeans.

      The K29 and K35 calculators are based on the ADMIXTURE software and there is bound to be a small margin of error due to agreement of alleles between the test subject and the references due to old ancestral alleles.

      I see that quite a bit when E Eurasians score a little African and vice versa. For example, some Africans score 2% E Asian. This is likely an error caused by agreement of ancestral alleles between E Asians and Africans, because the SNP panels are optimized for polymorphisms in Europeans. This is why I created the SAPDA software, which uses outgroups to mitigate this issue with ancestral alleles.

  7. So when you say, “These should simply be viewed as generic E Eurasian admixture,” do most Europeans have a generic E Eurasian admixture? And from what do they have that from?

    Are any of the reference populations used in the above K20 calculator in this article themselves admixed? Or are they all “pure” ancestral references?

    Do you have any calculators based on SAPDA software available for public use?

    Regarding the K29 and K35 admixture calculators you have. I did utilize them for myself… On paper my nationality is fully Southern and Central Italian. With the big testing companies I have mostly Southern European/Italian with some West/Northwest European thrown in AND up to five percent West Asian/Caucasus (not sure where that comes from)…. When I did the K29 calculator however, I got a whopping 14 percent Caucasian and 5 percent Central Asian (Tatar). With K35 I got 18% West Asian, 9% Uzbekistan groups, 1.9% Hun, 2.1% Eastern Steppe and 1.5% African. I guess I’m just little confused and not sure which test/ calculator to go by.

    Given what I told you about my known ancestry, the values I provided from the K29 and K35, and what you know about those tests since you are the creator, would you say those values/percentages I got are accurate? Especially the 2.1% Eastern Steppe, 1.9% Hun and 1.5% African, since those are smaller values? Is it possible those are errors and don’t even exist? One thing (among others) I’ve been trying to uncover is the possibility of minor Central/East Asian admixture and African admixture.

    1. @Johnny R

      A genetic admixture test should be thought of as an allele or SNP similarity test between the tester and the calculator’s reference populations, and will never resolve the direction of gene flow. For this one reverts to historical migrations and invasions.

      For example, let’s assume that one of the calculator’s reference populations is German. Let’s also assume that there are 2 reference populations in this calculator which are genetically somewhat similar to you; the 1st is Sicilian and the other German. Let’s say you scored 20% German with this calculator. The 20% reflects your allele/SNP similarity with the German references. With you having recent ancestry from C & S Italy, do we conclude that there was a migration from Germany into Italy that caused the 20%, or do we alternatively conclude that there was a migration from Italy into Germany that caused the 20% allele similarity, or is knowing this even important? If it is, then the admixture test on its own will not tell us this. We would need detailed historical records, but since there must have been several migrations between Germany and Italy over time, this would be very difficult to resolve. So all it will tell us is that you have a 20% genetic similarity with Germans when Sicilians are used as the other source in the calculator.

      Now if C Italians were to be used as the other source instead of Sicilians, you may show only 10% German, since the C Italian references are genetically very similar to you. So is 1 calculator wrong and the other correct? I would say no, they are both correct. They just give different information. The 1st conveys that you are quite a bit German shifted compared to Sicilians; the other conveys that you are less German shifted compared to C Italians.

      Perhaps the safe route to take when designing a calculator is to include many references from geographically close regions. Perhaps this is the approach the larger commercial companies such as 23andMe and Ancestry take, because if I did this then you may turn out 100% Italian, but then by doing so have you really learned anything about your distant ancestors? I would say you haven’t with this sort of test.

      TO BE CONTINUED.

  8. Thank you once again, you helped to clarify ADMIXTURE calculators better for me. So with what you said, are:

    1) These reference populations used in this K20 calculator (and K29 and K35 for that matter), are they themselves admixed populations or are they based off more “pure” ancient populations?

    2) Do you have any calculators based on SAPDA software available for public use?

    3) When you stated above regarding the Papuan and Burmese signals in the European references, “These should simply be viewed as generic E Eurasian admixture,” can you elaborate on that please? Do most Europeans have a generic E Eurasian admixture? And from what do they have that from?

    1. Johnny R,

      1- The references are typical for their respective populations; however, naturally all populations are mixtures of more ancient populations.

      2- I have not rolled out any SAPDA based public calculators yet since I’m still fine tuning. An announcement will be made when they are available for use.

      3- All human genomes, regardless of whether E Asian or African, contain nucleotides which have mutated at various time stages. So E and SE Asian genomes are very similar and differ at a tiny fraction of polymorphic sites. So, for example, Han and some Indonesian tribals share a ton of alleles derived prior to their split.

      Since Burmese genomes, just like all genomes, consist primarily of old mutations, most of which are shared with other E Asians, then unless the European has some relatively recent admixture from Burma or its surroundings, the percentage most likely reflects old E Asian admixture dating perhaps to the Iron Age.

  9. 1) So since references are typical for respective populations, that means that for the Caucasus, for example, it would include people such as Georgians, Turks, etc?

    2) Since K29 and K35 are based off of ADMIXTURE and not SAPDA, are the current issues/limitations with ADMIXTURE based programs that you mentioned in various posts addressed in K29 and K35?

    3) Again, when you mention old/ancient East Asian admixture in Europe from the Iron Age or possibly earlier, based on your understanding and what you know, do most Europeans have this basal/ancient East Asian component in their DNA structure?

    4) I know Europeans are in part descended from Ancient North Eurasians, but there seems to be some conflict as to their origins. Some people believe phenotypically and genotypically they are similar and related to other north/northeast Asians like Siberians, but others believe they are more similar to western hunter gatherers.

    1. @ Johnny R

      1- Yes, but check the info page for each calculator as different populations are used for each calculator.

      2- I try to address issues and limitations but can’t address all of them as some are inherent due to the way the code is written.

      3- I would agree that most Europeans have some sort of E Asian admixture dating to the Iron Age, in addition to shared ANE admixture with E Asians.

      4- I have not investigated the relative relatedness of modern Europeans to ANE vs WHG, but as you hint, Europeans and W Siberians do substantially derive from ANE, with an ANE as well as an EHG cline from E to W Europe. More recent introgression from Uralic Siberian populations into E Europeans can obviously also contribute to the cline, but there is also evidence of a shared, not too distant EHG-E Asian ancestor.
