SAPDA – A new publicly available admixture inference program

DISCUSSION

We are proud to announce the availability to the public of our latest genetic admixture inference program named SAPDA (Shared Ancestral Population Defining Alleles).

Eurasians, whether European or Asian, are genetically very similar to one another; their similarities far outweigh their differences. Most Eurasians descend from a few thousand individuals who roamed Eurasia during the Upper Paleolithic and therefore share a large amount of DNA. An admixture inference program must therefore determine which mutations are shared because of distant common origins and which are shared because of more recent introgression.

For this reason, some bioinformatic programs, such as qpDstat, qp3Pop, and qpAdm from the ADMIXTOOLS software suite available from the Reich Lab, use outgroups to filter out older common ancestral alleles.
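To make the role of the outgroup concrete, here is a minimal sketch of a D-statistic computed from per-SNP allele frequencies, in the general form used by qpDstat-style tools. It is not SAPDA's method and not the ADMIXTOOLS implementation; the function name, the argument order, and the allele frequencies are assumptions made up for illustration, and sign conventions differ between tools.

```python
def d_statistic(w, x, y, o):
    """D(W, X; Y, O) from per-SNP allele frequencies.

    w, x, y, o are equal-length lists of frequencies of the same allele
    in populations W, X, Y and in the outgroup O (hypothetical inputs).
    """
    num = 0.0
    den = 0.0
    for wi, xi, yi, oi in zip(w, x, y, o):
        # A SNP where Y carries the allele at the same frequency as the
        # outgroup adds nothing to the numerator: variation shared with
        # the outgroup is not counted as evidence of excess sharing.
        num += (wi - xi) * (yi - oi)
        den += (wi + xi - 2 * wi * xi) * (yi + oi - 2 * yi * oi)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den


# Made-up allele frequencies at four SNPs (illustration only, not real data).
test_pop = [0.9, 0.8, 0.1, 0.0]
other_pop = [0.1, 0.2, 0.1, 0.0]
reference = [1.0, 0.9, 0.2, 0.0]
outgroup = [0.0, 0.0, 0.2, 0.0]

d = d_statistic(test_pop, other_pop, reference, outgroup)
```

With this sign convention, a positive value means the first population shares more alleles with the reference than the second population does, relative to the outgroup.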

An extreme example of how ancestral alleles and SNP ascertainment bias distort admixture calculations made with the program ADMIXTURE is the Neanderthal. When a Neanderthal genome is processed with an ADMIXTURE-based calculator, regardless of calculator design, the results show it to be around 90% African. We know this inference is incorrect because the Neanderthal lineage diverged from the modern human lineage over 500,000 years ago, with some hybridization with modern humans occurring around 50,000 years ago in Eurasia and NOT in Africa.

To illustrate this problem, we compare results from ADMIXTURE-based calculators on Gedmatch.com with SAPDA calculator results for a higher coverage Altai Neanderthal genome (Gedmatch ID XV3025795).

A calculator output is expected to show that the Neanderthal is about equally related to Eurasians and Africans, with a slight shift towards Eurasians due to its limited admixture with Eurasians around 50,000 years ago, long after its split from the ancestral Neanderthal-human lineage. However, this is not what we see.

Figure 1 shows that, unlike SAPDA, the various ADMIXTURE-based calculator projects erroneously indicate that the Neanderthal sample is around 90% African.

By contrast, SAPDA indicates greater shared drift between the Altai Neanderthal sample and Eurasians (specifically Siberians, American Indians, and East Asians) than between the Altai Neanderthal sample and Africans. This is closer to the expected result than the outputs of the ADMIXTURE-based calculators, which tend to show the Altai Neanderthal sample as approximately 90% African.

**Fig 1** – Neanderthal ancestry proportions using various ADMIXTURE based calculator projects, compared with our SAPDA program, indicating inflation of African admixture by ADMIXTURE based calculators for this Altai Neanderthal sample. SAPDA mitigates ancestral allele issues via the use of multiple outgroups.

The issue of ancestral alleles confounding admixture calculations also affects more recent samples such as the Upper Paleolithic Siberian Ust-Ishim sample (Gedmatch ID LW1100923). Here too, ADMIXTURE-based calculators inflate the African percentage, as shown in figure 2.

**Fig 2** – High coverage Ust-Ishim ancestry proportions using various ADMIXTURE based calculator projects, compared with our SAPDA program, indicating inflation of African and East Asian admixture by ADMIXTURE based calculators for this Upper Paleolithic Siberian sample. SAPDA mitigates ancestral allele issues via the use of multiple outgroups.

SINGLE POPULATION SHARING vs. ADMIXTURE PERCENTAGES

Single population sharing values are a better indicator of shared drift between a test individual and a reference population than admixture percentages. Admixture percentages are relative rather than absolute, so they vary depending on which reference populations are used in the calculator. qpDstat, a program within ADMIXTOOLS available from the Reich Lab, is an example of a program that outputs single population sharing. Our SAPDA software also outputs single population sharing.

ADMIXTURE, which uses maximum likelihood estimation based on population allele frequencies, is a good program and is used extensively for ancestry inference. However, it has its limitations.

Unlike ADMIXTURE-based calculators, SAPDA outputs both single population sharing percentages (figure 3) and admixture percentages (figure 4).

SAPDA uses 2 different models with different demographic assumptions to model tested individuals. The mean of the 2 models is reported, together with the standard error of the mean.
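As a rough illustration of what a two-model mean and its standard error look like numerically, the sketch below combines two estimates of the same quantity. How SAPDA actually computes its standard errors is not described here, so the sample-standard-deviation formula, the function name, and the example values are all assumptions.

```python
import math


def combine_two_models(estimate_a, estimate_b):
    """Return (mean, standard error of the mean) for two model estimates.

    Assumption: the standard error is the sample standard deviation of the
    two estimates divided by sqrt(2). With only two values this reduces to
    half the absolute difference between them.
    """
    mean = (estimate_a + estimate_b) / 2
    squared_deviations = (estimate_a - mean) ** 2 + (estimate_b - mean) ** 2
    sample_sd = math.sqrt(squared_deviations / (2 - 1))
    sem = sample_sd / math.sqrt(2)
    return mean, sem


# Hypothetical sharing percentages from the two models for one population.
mean, sem = combine_two_models(10.0, 14.0)
```

A large standard error relative to the mean signals that the two demographic models disagree for that population, so the mean should be read with caution.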

SAPDA also uses allele frequencies to distinguish more ancient admixture from more recent admixture. For example, in figure 4 we see dilution of native West Eurasian Zagrosian Bronze Age herder/farmer ancestry in favor of East Eurasian admixture for an Iraqi Kurd sample. This dilution is likely the result of population movements from Central Asia into the Iranian plateau and the Zagros mountains after the Bronze Age.

Migrations and population expansions from Central Asia into West Asia would likely have resulted in hybridization between native populations related to Bronze Age Iranian herders and farmers and Parthians, Sakas, Saffarids, Samanids, Ghaznavids, Khwarizmians, Mongols (Ilkhanates), Timurids, and Ottomans.

Fig 3 – Signature allele sharing between an Iraqi Kurd sample and various populations indicating a decline of West Eurasian admixture over time. This likely correlates with introgression of Central Asian admixture into Bronze Age herders and farmers of the Iranian plateau and Zagros mountains.

Fig 4 – Admixture of an Iraqi Kurd sample indicating a dilution of endogenous West Eurasian farmer/herder admixture in favor of East Eurasian admixture over time. This likely correlates with population movements from Central Asia into the Iranian plateau and Zagros mountains post Bronze Age.

The following example illustrates that an individual who has double the admixture percentage of another individual does NOT in fact have double the ancestry.

Assume we have 2 individuals: Bob, who is British, and Rajiv, who is Indian. Both individuals are tested with an admixture program containing references from the following areas:

Western Europe
Eastern Europe
Western Asia
South India
East Asia

European populations are fairly homogeneous in comparison to South Asian populations due to greater mixing within Europe within the past 2000 years. This can readily be observed in the lower genetic fixation distances between European populations compared with the fixation distances between South Asian populations.
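For readers unfamiliar with fixation distances, the sketch below computes one common estimator, Hudson's FST, as a ratio of sums over SNPs. It is offered only as an illustration of the concept; the text does not state which estimator was used, and the function name, sample sizes, and allele frequencies are made up.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST between two populations, as a ratio of sums over SNPs.

    p1, p2: per-SNP sample allele frequencies in each population.
    n1, n2: number of sampled allele copies per population (assumed to be
    the same at every SNP for simplicity).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator


# Made-up frequencies: two similar populations and one divergent population.
pop_a = [0.50, 0.30]
pop_b = [0.55, 0.35]
pop_c = [0.90, 0.80]

fst_close = hudson_fst(pop_a, pop_b, 100, 100)
fst_far = hudson_fst(pop_a, pop_c, 100, 100)
```

Lower values indicate more homogeneous population pairs. Small negative values can occur for very similar populations because of the sampling correction terms.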

The following tables display the number of high-probability signature alleles Bob shares with the calculator’s references versus Rajiv.

**Table 1 – Number of alleles Bob shares with the calculator references versus Rajiv**

Table 1 shows that Rajiv has only 2 more East Asian alleles than Bob. However, the admixture percentages in table 2 indicate that Rajiv has more than double the East Asian admixture that Bob has.

**Table 2 – Admixture percentages for Bob versus Rajiv**

This example shows that reliance on admixture percentages leads to erroneous conclusions regarding total shared genetic drift, and thus single population sharing is a better indicator of shared genetic drift between 2 populations.
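The effect in the Bob and Rajiv example can be reproduced with a few lines of arithmetic. The counts below are invented for illustration and are not the values from tables 1 and 2. Simple normalization of counts is used as a stand-in for how a calculator apportions ancestry; it is not the maximum likelihood procedure that ADMIXTURE uses.

```python
def to_percentages(counts):
    """Convert per-reference allele counts to percentages that sum to 100."""
    total = sum(counts.values())
    return {name: 100.0 * value / total for name, value in counts.items()}


# Hypothetical signature-allele counts (not taken from the tables).
bob_counts = {
    "Western Europe": 40,
    "Eastern Europe": 30,
    "Western Asia": 20,
    "South India": 4,
    "East Asia": 6,
}
rajiv_counts = {
    "Western Europe": 10,
    "Eastern Europe": 8,
    "Western Asia": 14,
    "South India": 20,
    "East Asia": 8,
}

bob_pct = to_percentages(bob_counts)
rajiv_pct = to_percentages(rajiv_counts)

# Absolute difference in East Asian alleles versus the ratio of percentages.
east_asia_count_gap = rajiv_counts["East Asia"] - bob_counts["East Asia"]
east_asia_pct_ratio = rajiv_pct["East Asia"] / bob_pct["East Asia"]
```

In this made-up case Rajiv has only 2 more East Asian alleles than Bob, yet his East Asian percentage is more than double Bob's, because his total count across all references is smaller. The percentage depends on the denominator, which in turn depends on which references are in the calculator.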

Examples of one-to-one population sharing programs include IBD, IBS, D-statistics, and our proprietary SAPDA software.

Admixture clustering and PCA programs tend to decrease intra-population variation, and increase inter-population variation. We clearly see this when we compare outputs from IBS or D-stat programs to admixture software outputs.

OUTGROUP USAGE

The overwhelming majority of allele sharing between populations is due to very distant past ancestral relationships between them.

Distant past common ancestry shared among human populations can distort inferences of more recent introgression.

SAPDA uses multiple outgroups to identify and remove old common mutations shared among many world populations.

SAPDA then compares the tester’s genome with alleles ascertained with high probability in the reference populations. To decrease margins of error, many samples are used for each of the reference populations. We ensure a decent spread in allele frequencies between the reference populations for the ascertained SNPs.
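The two steps described above (dropping alleles seen in the outgroups, then keeping SNPs with a clear frequency spread between references) can be sketched as follows. This is not SAPDA's actual code: the function name, the thresholds, the definition of spread as the gap between the two highest reference frequencies, and the data are all hypothetical.

```python
def select_defining_snps(ref_freqs, outgroup_freqs,
                         max_outgroup_freq=0.0, min_spread=0.5):
    """Pick candidate population-defining SNPs.

    ref_freqs: {snp_id: {population: allele frequency}}
    outgroup_freqs: {snp_id: [allele frequency in each outgroup]}
    Returns {snp_id: population with the highest frequency}.
    """
    selected = {}
    for snp_id, freqs in ref_freqs.items():
        outgroup = outgroup_freqs.get(snp_id)
        if outgroup is None or len(freqs) < 2:
            continue  # cannot check against outgroups or compare references
        if any(f > max_outgroup_freq for f in outgroup):
            continue  # allele is present in an outgroup: treat it as old
        ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
        (top_pop, top_freq), (_, second_freq) = ranked[0], ranked[1]
        if top_freq - second_freq >= min_spread:
            selected[snp_id] = top_pop
    return selected


# Made-up frequencies for three hypothetical SNPs.
ref_freqs = {
    "snp_a": {"East Asian": 0.80, "West Eurasian": 0.10, "West African": 0.05},
    "snp_b": {"East Asian": 0.90, "West Eurasian": 0.85, "West African": 0.80},
    "snp_c": {"East Asian": 0.05, "West Eurasian": 0.90, "West African": 0.10},
}
outgroup_freqs = {
    "snp_a": [0.0, 0.0],
    "snp_b": [0.0, 0.0],
    "snp_c": [0.6, 0.0],
}

defining = select_defining_snps(ref_freqs, outgroup_freqs)
```

Here only `snp_a` survives: `snp_b` is common everywhere and so is not informative, and `snp_c` is present in an outgroup and so is treated as an old shared mutation.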

For each of the 100 highest-probability SNPs, graphical outputs such as those in figures 5 – 9 are produced for the user. They show evidence of shared ancestry between the user and the reference populations, with detailed information on each population-defining SNP. This is done for both inference models, and the outputs are sorted by probability.

For example, in figure 5 we see that the user shares 10 population-defining alleles with East/Southeast Asians. Figures 5 – 9 also show allele frequency information for these SNPs in other populations. This helps the user determine whether a mutation shared with the reference population is likely the result of older admixture between the user’s population and the reference or of more recent gene flow.
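A report of this kind amounts to ranking per-SNP records by probability and keeping the top entries. The sketch below shows one way to do that; the record fields, names, and values are hypothetical and do not describe SAPDA's internal data format.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class DefiningSnp:
    """One population-defining SNP shared by the user (hypothetical layout)."""
    snp_id: str
    population: str
    probability: float
    # Allele frequency of the shared allele in every reference population.
    frequencies: dict = field(default_factory=dict)


def top_snps(records, n=100):
    """Return the n highest-probability records, sorted from high to low."""
    return heapq.nlargest(n, records, key=lambda record: record.probability)


# Made-up records for illustration.
records = [
    DefiningSnp("snp_1", "East Asian", 0.70, {"East Asian": 0.6, "Siberian": 0.3}),
    DefiningSnp("snp_2", "Siberian", 0.99, {"East Asian": 0.1, "Siberian": 0.9}),
    DefiningSnp("snp_3", "East Asian", 0.85, {"East Asian": 0.8, "Siberian": 0.1}),
]

best_two = top_snps(records, n=2)
```

Keeping the full frequency dictionary on each record is what lets a reader judge whether the allele is confined to one reference or spread across several.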

There is large variation in phenotypes within a population, which we expect to correspond to large variation in genotypes. We see this within-population genotypic variation with IBS and with single population sharing statistics such as the ADMIXTOOLS f3 and f4 statistics. However, admixture-based programs, such as those used by DTC companies including 23andMe, AncestryDNA, FTDNA, and others, mask this variation and homogenize individuals within a population.

**Fig 5 – An Iraqi Kurd sample’s East Eurasian alleles.**

**Fig 6 – An Iraqi Kurd sample’s Siberian alleles.**

**Fig 7 – An Iraqi Kurd sample’s West Eurasian alleles.**

**Fig 8 – An Iraqi Kurd sample’s East African alleles.**

**Fig 9 – An Iraqi Kurd sample’s West African alleles.**

SAPDA RESULTS – SINGLE-POPULATION-SHARING

The following boxplots show single population allele sharing distributions, including population medians, maximums, and minimums, for various tested populations. Although SAPDA outputs admixture as well as single-population-sharing charts, single-population-sharing and NOT admixture should be relied upon when comparing 2 or more tested individuals or populations for recent shared genetic drift. The former does not depend on the chosen calculator components, whereas the latter varies depending on the components used in the calculator.

For certain West Asian populations such as Persians and Kurds, East Asian admixture likely represents accumulated admixture since the Iron Age resulting from admixture between endogenous West Asian Zagrosian pastoralists and farmers, and Parthians, Sakas, Saffarids, Samanids, Ghaznavids, Khwarizmians, Mongols (Ilkhanates), Timurids, and Ottomans.

**Fig 10 – SAPDA 2 model mean Single-Population-Sharing boxplot – West African.**

**Fig 11 – SAPDA 2 model mean Single-Population-Sharing boxplot – East African.**

**Fig 12 – SAPDA 2 model mean Single-Population-Sharing boxplot – East Asian.**

**Fig 13 – SAPDA 2 model mean Single-Population-Sharing boxplot – East Siberian / American Indian.**

**Fig 14 – SAPDA 2 model mean Single-Population-Sharing boxplot – West Eurasian.**

SAPDA RESULTS – ADMIXTURE

The following boxplots show admixture distributions such as population medians, maximums, and minimums for various tested populations. The Single-Population-Sharing plots in the previous section should be used when comparing 2 or more individuals/populations for admixture from one of the calculator’s reference populations.

For South Asian populations, some East Asian shared drift is due to gene flow from endogenous Indian hunter-gatherers, also referred to in the literature as AASI.