Here I perform a detailed analysis of East and North Eurasian gene flow to South, South-Central, and West Asians by doing a one-to-one comparison of genomes sampled at about 404,000 Single Nucleotide Polymorphism (SNP) positions, using the program BEAGLE. To do this I utilize many genomes from northern and eastern Asia publicly available from the Estonian Biocentre.
The results from my analysis are very surprising in that they indicate that E & N Asian admixture in W/SC/S Asians is considerably underestimated by allele-frequency-based programs such as ADMIXTURE, and also by genealogical testing companies such as 23andMe, AncestryDNA, and FTDNA. However, the results here do seem somewhat consistent with results from the company GPSOrigins, and with results from 23andMe’s Admixture Date Estimator utility.
Inferring minor gene flow with commonly used programs such as ADMIXTURE is a haphazard affair, because the output depends heavily on the number and types of populations used to create the calculator, as well as on the samples that the calculator creator declares ancestral. Additionally, the number of components, “K”, also affects the results.
Using the program ADMIXTURE, I have personally created my share of calculators, some of which are publicly available at Gedmatch.com, under the GedrosiaDNA project; however, I have always been cognizant of the fact that the results from such calculators can’t be used to accurately infer gene flow from various populations. Here are a couple of good reasons:
The calculator’s components references are themselves admixed
Consider the following scenario where a calculator based on ADMIXTURE is used to determine E Asian admixture in an Arab and in a Kurd. ADMIXTURE estimates ancestry proportions by maximum likelihood from the allele frequencies of each component at various loci (the related program STRUCTURE instead uses a Bayesian approach and a Markov Chain Monte Carlo algorithm). So for this example, assume that the Kurd individual scores 20% Central Asian and 40% Caucasus, whereas the Arab individual scores 5% Central Asian and 20% Caucasus.
The first problem is that some of the Central Asian and Caucasus references are themselves E Asian admixed. This results in an underestimation of the Kurd individual’s E Asian score as compared with the Arab’s, because more of the Kurd’s E Asian admixture will be “hidden” under the Central Asian or Caucasus component, which the Kurd scores more of than the Arab individual.
The component references are not representative of the specific E Asian populations that have genetically contributed to the test subject
Suppose a test subject has gene flow from Mongolians or Yakuts, but an ADMIXTURE-based calculator uses Han Chinese as E Asian references. In this situation, the test subject’s E Asian admixture will most likely be underestimated, because the Han references have frequencies of minor alleles at various loci different from those of the Mongolians or Yakuts.
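To make the "hidden" admixture argument concrete, here is a minimal arithmetic sketch. The component scores are the ones from the scenario above, but the E Asian shares assigned to the reference populations (30% and 5%) are purely hypothetical numbers chosen for illustration, not measured values.

```python
# Hypothetical share of E Asian ancestry carried by each component's
# reference samples (illustrative assumptions, not measured values).
E_ASIAN_IN_REFERENCES = {"Central Asian": 0.30, "Caucasus": 0.05}


def hidden_e_asian(scores):
    """Return the E Asian fraction absorbed by admixed components."""
    return sum(
        share * E_ASIAN_IN_REFERENCES.get(component, 0.0)
        for component, share in scores.items()
    )


# Component scores taken from the scenario in the text.
kurd = {"Central Asian": 0.20, "Caucasus": 0.40}
arab = {"Central Asian": 0.05, "Caucasus": 0.20}

print(f"Kurd hidden E Asian: {hidden_e_asian(kurd):.1%}")
print(f"Arab hidden E Asian: {hidden_e_asian(arab):.1%}")
```

Under these assumed shares, roughly 8% of the Kurd's genome but only about 2.5% of the Arab's would be E Asian ancestry reported under another label, so the gap between their true E Asian scores would be understated.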
Inadequate reference sample sizes
Assume the test subject has the “A” variant at the position rs1234, and there are 10 E Asian references in the run, 7 of which also have the “A” minor allele at rs1234. Also assume that 8 out of the 10 Central Asian references in the ADMIXTURE run coincidentally also have the “A” minor allele at that locus. In this situation, if a test subject has the “A” allele at that position, it will be assigned C Asian, and not E Asian.
By contrast, had there been more E Asian references in the run, say 100, it can turn out that 85 of them have the “A” allele at rs1234, in which case rs1234 would be assigned E Asian, and not C Asian.
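The rs1234 example can be written out as a toy calculation. This is only a sketch of the simplified "assign the allele to the component with the higher carrier frequency" reasoning used above; ADMIXTURE itself does not label single SNPs this way, and the function name `best_component` is made up for illustration.

```python
from fractions import Fraction


def best_component(carrier_freqs):
    """Return the component whose references carry the allele most often."""
    return max(carrier_freqs, key=carrier_freqs.get)


# 10 references per component: 7/10 E Asian carriers, 8/10 C Asian carriers.
small_panel = {"E Asian": Fraction(7, 10), "C Asian": Fraction(8, 10)}

# Same C Asian panel, but 100 E Asian references with 85 carriers.
large_panel = {"E Asian": Fraction(85, 100), "C Asian": Fraction(8, 10)}

print(best_component(small_panel))  # C Asian
print(best_component(large_panel))  # E Asian
```

The only thing that changes between the two panels is the number of E Asian references, yet the label for the same allele flips.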
The commercial genealogical testing companies
To the best of my knowledge testing companies such as FTDNA use ADMIXTURE or STRUCTURE based programs, and thus the results obtained from them would be susceptible to the same issues described above.
23andMe on the other hand uses haplotype segment matching, which I have more fully described here
This method produces more accurate results than methods based on allele frequencies, because allele frequencies can get skewed for the reasons mentioned above. By contrast, haplotype segment matching for IBD is more reliable, because it is a one-to-one comparison of the test subject’s genome against population references. However, the problem with companies like 23andMe is that their algorithm minimizes minor admixture. I have described this in more detail in my post http://www.eurasiandna.com/2017/02/07/23andmes/
The other problem is that the test subject may not have gene flow from their specific population references. As mentioned above, a test subject may have gene flow from Mongolians, but their E Asian references may be Yakuts and Han. This leads to an underestimation of the subject’s E Asian admixture in this situation.
IBD comparisons between South & West Asian test subjects and East & North Asian references
Here I attempt something similar to what 23andMe does, but without minimizing minor ancestry. I also utilize many more E Asian and N Asian populations in the IBD comparisons, recognizing that not all W and S Asians have E Asian ancestry from the same 2 or 3 East Asian groups. I also include Lithuanians in the analysis to look at shared genetic drift between S & W Asians and NE Europeans due to gene flow from the Eurasian steppe. In the future, I will utilize many more NE European populations to better analyze this.
I utilized references with a minimum of about 70-75% East or North Eurasian admixture for all test subjects. Later, for some test subjects, I included East and North Asian references with a minimum of 55% East or North Eurasian admixture. The following N & E Asian populations were utilized in the IBD comparisons:
I compare stretches of DNA for shared relatively RARE haplotypes between some S/SC/W Asian test subjects and various N & E Asians, and Lithuanians. However, to accomplish this, the sequenced genotypes, such as from 23andMe, or AncestryDNA, which contain unordered combinations of alleles for each position, have to be PHASED, so that we can compare haplotypes to determine which sequences of alleles were inherited together. In haplotype phasing we attempt to determine which allele belongs to which copy of the chromosome, or alternatively, which alleles appear together on the same chromosome.
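To illustrate why phasing is needed, the sketch below enumerates every pair of haplotypes that is consistent with a short run of unordered genotypes. It is only a brute-force illustration of the ambiguity; it is not how BEAGLE phases data (BEAGLE uses a statistical model over many samples), and the function name `possible_phasings` is hypothetical.

```python
from itertools import product


def possible_phasings(genotypes):
    """List haplotype pairs consistent with unordered genotypes.

    genotypes: list of 2-character strings such as "AG", one per SNP.
    Returns a set of frozensets, each holding the haplotypes of one phasing.
    """
    het_positions = [i for i, g in enumerate(genotypes) if g[0] != g[1]]
    phasings = set()
    # Each heterozygous site can be oriented two ways.
    for flips in product([False, True], repeat=len(het_positions)):
        flip_at = dict(zip(het_positions, flips))
        hap1, hap2 = [], []
        for i, g in enumerate(genotypes):
            a, b = g[0], g[1]
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        # frozenset removes mirror-image duplicates (hap1/hap2 swapped).
        phasings.add(frozenset(("".join(hap1), "".join(hap2))))
    return phasings


for pair in sorted(sorted(p) for p in possible_phasings(["AG", "CC", "CT"])):
    print(pair)
```

With two heterozygous positions there are already two distinct answers (ACC with GCT, or ACT with GCC), and the count doubles with each additional heterozygous site, which is why a phasing program is required before haplotypes can be compared.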
Using PLINK, I managed to extract about 404,000 common denominator SNPs between the S/SC/W Asian test subjects, and the N & E Asian and Lithuanian genomes which are available from the Estonian Biocentre. I then used BEAGLE, which is a software program for imputing genotypes, inferring haplotype phase, and performing genetic association analysis. BEAGLE can phase genotype data (i.e. infer haplotypes) for unrelated individuals, parent-offspring pairs, and parent-offspring trios. BEAGLE can detect genetic regions that are shared identical-by-descent (IBD).
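The "common denominator" step amounts to intersecting the SNP identifiers of the datasets. The sketch below shows that idea on the lines of PLINK .bim files, where the second column is the variant ID. It is an assumption about how the overlap could be computed, not the exact commands used for this post; the resulting ID list could then be handed to PLINK's --extract option.

```python
def bim_variant_ids(bim_lines):
    """Collect variant IDs (column 2) from the lines of a PLINK .bim file."""
    ids = set()
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def shared_variants(*bim_files):
    """Return the variant IDs present in every supplied .bim file."""
    id_sets = [bim_variant_ids(lines) for lines in bim_files]
    return set.intersection(*id_sets)


# Tiny made-up examples standing in for real .bim files.
test_subjects = [
    "1 rs1000 0 10000 A G",
    "1 rs2000 0 20000 C T",
    "2 rs3000 0 30000 G A",
]
references = [
    "1 rs2000 0 20000 C T",
    "2 rs3000 0 30000 G A",
    "2 rs4000 0 40000 T C",
]

print(sorted(shared_variants(test_subjects, references)))
```

Matching on IDs alone assumes both datasets name their SNPs the same way and use the same genome build and strand; real merges usually need those checked as well.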
Lines of the fastIBD output file report haplotypes shared by pairs of samples within the corresponding input file that have fastIBD score less than the threshold specified by the fastIBD threshold parameter. The fastIBD output file has five columns. The first two columns list the two sample identifiers for the shared haplotype described on each line. The next two columns list the starting (inclusive) and ending (exclusive) marker indices for the shared haplotype.
The first marker has index 0. The last column gives the fastIBD score for the shared haplotype. A fastIBD score < 10^-10 provides strong evidence that the shared haplotype is IBD if the length of the shared haplotype is ≥ 1 cM. I used a stricter threshold of a fastIBD score < 10^-12 in the analysis.
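The five-column layout described above is easy to filter programmatically. The following is a minimal sketch that keeps only segments passing a score threshold and a minimum SNP count; the defaults mirror the thresholds used in this post (score < 10^-12 and at least 200 SNPs), while the function name and the sample lines are made up for illustration.

```python
def parse_fastibd(lines, max_score=1e-12, min_snps=200):
    """Return fastIBD segments that pass the score and length filters.

    Each line has five columns: sample 1, sample 2, start marker index
    (inclusive), end marker index (exclusive), fastIBD score.
    """
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        id1, id2 = fields[0], fields[1]
        start, end = int(fields[2]), int(fields[3])
        score = float(fields[4])
        n_snps = end - start  # end index is exclusive
        if score < max_score and n_snps >= min_snps:
            segments.append((id1, id2, start, end, n_snps, score))
    return segments


# Hypothetical output lines, not real results.
example = [
    "subjectA refX 1000 1951 3.0e-14",  # 951 SNPs, passes
    "subjectA refY 5000 5100 1.0e-15",  # only 100 SNPs, too short
    "subjectB refX 2000 2400 5.0e-11",  # score too high
]

for seg in parse_fastibd(example):
    print(seg)
```

Because the end index is exclusive, the segment length in SNPs is simply end minus start.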
RESULTS
I first perform a sanity check using a Papuan group, recognizing that they should not have any IBD segment matches with any of my East or North Eurasian references under my self-imposed thresholds of 200 SNPs and a fastIBD score < 10^-12. The results in fact indicate that, except for a couple of segment matches with the Bajo Indonesians, they don’t have any IBD matches with my references.
An Assyrian and a Jordanian
No surprises here: the Jordanian test subject had the fewest IBD matches of all my test subjects. This is likely due to the very few SW Asian subjects in the run. He was also my only test subject that did not have any IBD matches with any East or North Asians.
The Altaian IBD match for my Assyrian test subject Zephyrous likely represents gene flow from a Turkic population, perhaps Seljuks.
Fig 4 – IBD results for Zephyrous
Comparisons with higher East & North Asians admixed samples
The following IBD comparisons are with samples which are more than 70% East & North Asian admixed. These populations were used to filter out some previous segment matches which may have been due to West Eurasian ancestry in East & North Asians.
Kurds
My Kurd test subjects included a couple of Kurmanji Kurds from North Iraq; Kurds C1 and C3, and a few Feyli Kurds from further south; Kurds F1, F4, F6, and F7.
Surprisingly, all of my Kurd subjects, especially the Feylis, had a large amount of shared IBD with East & North Asians. They generally had a larger number of total shared IBD segments with E & N Asians than Iranians, W Asians, and Indians did, except for a couple of Punjabis and Pashtuns. Also notable was that 4 of the 5 Kurd subjects had shared IBD segments with various Mongol samples, and in fact Kurd F4 had the largest shared segment of all my test subjects, a 951 SNP segment with Mongolian 3, while Kurd F7 shared a large 733 SNP segment with Mongolian 3! This suggests relatively intense historical mixing between Kurds and various Turkic tribes and descendants of Mongols.
From a historical perspective, the populations that contributed E & N Eurasian admixture to Kurds very likely include Scythians, Seljuks, Turkmens, and other Turkic groups from Central Asia and the NE Caucasus.
Also surprising is the large segment sharing between Lithuanians and some of the Kurd individuals (see below), especially in light of the fact that this sharing seemed absent between Lithuanians and some of my other W Asian samples, such as Iranians, Jordanians, Syrians, Armenians, and Georgians, but seemed present in some of my SC Asians, such as the Pakistani Pashtun, Sein. The Lithuanian-Kurd and Lithuanian-Pashtun IBD segment sharing can most likely be attributed to a gene pool from the Eurasian Steppe that contributed to Lithuanians, Kurds, and Pashtuns.
Another surprise is that the Lithuanian-Kurd & Lithuanian-Pashtun IBD segments are larger than IBD segments shared between Kurds and most other W Asians in the case of Kurds C1 and C3, and with the Sein sample, the Lithuanian-Pashtun IBD segment is larger than any of the shared Sein-SC/W/S Asian IBD segments. The large size of these segments indicates gene flow from the Eurasian Steppe to Kurds and Pashtuns much more recent than the late Bronze Age.
The fact that the IBD segment sharing between Lithuanians and Kurds is to the exclusion of segment sharing between Lithuanians and Armenians/Georgians is significant, because this indicates Eurasian Steppe gene flow to Kurds not via the Caucasus corridor, but rather via Central Asia.
Feyli Kurds (Iraq & Iran)
Kurmanji Kurds (Iraq)
Comparisons with higher East & North Asians admixed samples
The following IBD comparisons are with samples which are more than 70% East & North Asian admixed. These populations were used to filter out some previous segment matches which may have been due to West Eurasian ancestry in East & North Asians.
Pashtuns
Pashtun-Pakistan (Sein)
Overall, Pakistani Pashtun, Sein, showed more E/N/S Eurasian IBD matches than most of my Afghan Pashtun samples, although more surprising, his largest IBD segment was shared with a Lithuanian. I have discussed the implications of this under the previous section, titled “Kurds”.
Comparisons with higher East & North Asians admixed samples
The following IBD comparisons are with samples which are more than 70% East & North Asian admixed. These populations were used to filter out some previous segment matches which may have been due to West Eurasian ancestry in East & North Asians.
South Asians
Punjabis
Comparisons with higher East & North Asians admixed samples
The following IBD comparisons are with samples which are more than 70% East & North Asian admixed. These populations were used to filter out some previous segment matches which may have been due to West Eurasian ancestry in East & North Asians.
REFERENCES:
B L Browning and S R Browning (2011) A fast, powerful method for detecting identity by descent. Am J Hum Genet 88:173-182. doi:10.1016/j.ajhg.2011.01.010
Estonian Biocentre, http://www.ebc.ee/
Very interesting Dilawer! I was wondering what you had been working on. I found the comprehensive explanations at the start very helpful.
A few questions:
1. Given the somewhat sporadic nature of shared ibd within an ethnic group, ie you may share with one Punjabi but not another, are all ethnicities sufficiently represented to be able to draw an accurate representation? Ie comparing with only 2 chechen samples vs 10 kurdish ones?
2. Isn’t the pie chart of different sub groups affected by the same problem of sample representation. Ie sharing with turkics and West Asians representing ibd from a shared ancestral population rather than a turkic specific or West Asian specific one?
Would you mind running my ibd sharing as I’d be interested to see how Bengalis fare proportion wise with North and east Asians as opposed to admixture analyses?
Good questions.
1- First, as the title of this post says, the goal here was shared IBD between W/S/SC Asians and N/E Asians. So yes, the dataset I put together is biased in the sense that SW Asians and SE Asians are under-represented, and yes, if the goal was shared IBD with SE Asians, SW Asians, or for that matter with various W Asian ethnic groups, then the sample sizes for those respective populations would have been different. Also, the goal here is not to draw any inferences about whether someone may be more related to Kurds vs Chechens, because if that was the case the sample sizes for both Kurds and Chechens would have been >25 each. That is why you see that I have lumped various W Asian groups under one color, and Kurds, Iranians, and Azeris under a different color (since they are quite similar to each other for the most part), and the various S Asian pops under one color, in the hope that inferences may be made to the group as a whole, ie IBD matches with W Asians instead of with Kurds or Chechens, or matches with S Asians instead of with Punjabis or Marathis. However, again that was not the goal, so yes, the pie chart is biased against groups (and not individual pops within a group) that are under-represented (SW/SE Asians, Amerindians, W Europeans…).
2- IBD is different from IBS. On one extreme we have IBD with high thresholds (rare haplotypes, long segments, etc), and on the other we have IBS with a 1 SNP threshold. With the latter, every sample in the run will have a score, because the comparison of the 2 genomes involves only 1 SNP at a time, which can have 1 of only 4 values.
So yes, it is possible, for example, for one Punjabi to have IBD from a Mongol and a cousin not to have any (they will not have the same genealogy on the maternal as well as the paternal side). Also, we have somewhat random recombination of DNA. It is entirely possible that one individual’s segment inherited from a Mongol ancestor survives recombination, whereas the other individual’s “Mongol” segment gets replaced with a segment from an Indian ancestor.
Your E Asian shared IBD is under-represented, because there are only a few SE Asian samples in the run. This is not a problem for W and most SC Asians and NW Indians, as most of the E Asian ancestry is from samples which have been included in the run, although ideally I would like to have had 3 times as many samples.
I had actually run yours but forgot to post. I will do that by tomorrow.
Thanks Dilawer, that makes a lot of sense, and as you say, the focus is N/E Asian ancestry – look forward to seeing my run!
At least on some of the calculators that differentiate between NE Asian and SE Asian, Bengalis can score varying proportions, some with NE Asian predominant to a more common SE Asian predominant ratio – correlating I guess with the above IBD table where I share with Indonesian Lebbo and Buryat.
Going to Varun’s point, I imagine there’s been a lot of sharing in both directions between Austroasiatic tribals and more eastern South Asians. With Bengalis in particular, I wonder how much of the East Asian component comes from Austroasiatics versus more recent mixing with Tibeto-Burman groups over the past 1000 years.
The main takeaway for me was that West and South Asians have considerable North and East Asian ancestry that is not well represented by programs such as ADMIXTURE, or by results from companies such as 23andMe (except for their admixture date estimator), AncestryDNA, or FTDNA. The company GPSOrigins seems to capture that ancestry better.
The other was IBD segment sharing between some Kurds / Pashtuns and Lithuanians. Wonder what the results would have been had I included a good number of samples from Lithuania, Latvia, and other NE European regions.
That’s a lot of SNPs shared with Evenks. Given the latest paper on Scythians, perhaps it’s an indicator of Yamnaya/ Steppe ancestry?
I’m assuming that Reza shares a lot of IBD segments with Austroasiatic peoples
Indeed a possibility since Scythians had Siberian gene flow. I noticed that you had a 783 SNP shared segment with them with an IBD score < 10^-13, which is extremely likely IBD.
A good assumption for Reza
This is very interesting analysis, and unfortunately I have not tested with 23andme to offer my data for analysis. My own feeling is that my East Asian ancestry is underestimated by both FTDNA and Ancestry DNA.
Exceedingly fascinating, thank you Dilawer wrora!
I found this to be very exhaustive, and quite well-thought-out.
On a somewhat related note, it would be very interesting to (eventually) see an IBS list with modern populations (global focus), as I’ve seen very few IBS comparisons with populations from different regions.
Also, I find that IBS lists make more sense, compared to f3 scores.
Thanks wrora.
I agree IBS can also be useful, and I will hopefully do a run soon. F3s have a different purpose than IBS. With 1SNP threshold IBS we can infer total shared drift, and rank samples according to how much drift they share with other samples (percent IBS similarity)
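As a rough illustration of percent IBS similarity with a 1 SNP threshold, the sketch below counts, SNP by SNP, how many alleles two unphased genotypes have in common (0, 1, or 2) and expresses the total as a percentage of the maximum possible. This is one common way to define IBS similarity and is offered as an assumption, not as the exact formula any particular program uses; the function names are made up.

```python
from collections import Counter


def shared_alleles(g1, g2):
    """Count alleles shared by two unordered genotypes (0, 1, or 2)."""
    # Multiset intersection handles cases like "AA" vs "AG" correctly.
    return sum((Counter(g1) & Counter(g2)).values())


def percent_ibs(sample1, sample2):
    """Percent IBS similarity over equal-length, non-empty genotype lists."""
    total = sum(shared_alleles(a, b) for a, b in zip(sample1, sample2))
    return 100.0 * total / (2 * len(sample1))


# Made-up genotypes at three SNPs for two samples.
s1 = ["AA", "AG", "CC"]
s2 = ["AA", "GG", "TT"]

print(percent_ibs(s1, s2))  # 2 + 1 + 0 shared alleles out of 6 -> 50.0
```

Every pair of samples gets a score this way, which is why IBS can rank all samples against each other while thresholded IBD returns matches for only some pairs.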
With f3s we rule in or rule out whether a sample C is AB admixed, depending on whether the minor allele frequencies at various positions fall between the allele frequencies for A and B at the same position. This is how I believe the f3 program works:
Assume we have a test sample C, which we are trying to determine whether it is AB admixed. Assume that A and B consist of 1 sample each.
The range of possibilities for a locus say rs123 would look as follows:
A MAF    B MAF    C MAF    ADMIXED    F3
0%       0%       0%       Y          0
0%       0%       50%      N          “+ve”
0%       0%       100%     N          “+ve”
0%       50%      0%       Y          0
0%       50%      50%      Y          0
0%       50%      100%     N          “+ve”
0%       100%     0%       Y          0
0%       100%     50%      YY         “-ve”
0%       100%     100%     Y          0
So a MAF of 0 would denote that the sample is homozygous for the major allele at rs123, 50% would indicate that it is heterozygous, and 100% that it is homozygous for the minor allele.
In this case we see that a -ve f3 is produced only when A has MAF of 0 and B has MAF 100% and C happens to have a MAF of 50% at rs123, or when A is 100% and B is 0.
I believe the program adds the negative f3s and the positive f3s and comes up with a total f3.
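The signs in the table above agree with the usual per-locus form of the statistic, f3(C; A, B) = (c - a)(c - b), where a, b, and c are the allele frequencies in A, B, and C. The sketch below uses that form and averages over loci; it leaves out the sample-size correction and standard errors that real f3 software computes, so it should be read as an illustration rather than a description of any specific program.

```python
def f3_locus(a, b, c):
    """Per-locus f3 term for test C with sources A and B (frequencies 0..1)."""
    return (c - a) * (c - b)


def f3_mean(loci):
    """Average the per-locus terms over a list of (a, b, c) tuples."""
    return sum(f3_locus(a, b, c) for a, b, c in loci) / len(loci)


# Reproduce the sign pattern of the nine rows in the table (A fixed at 0).
for b in (0.0, 0.5, 1.0):
    for c in (0.0, 0.5, 1.0):
        print(f"A=0.0 B={b} C={c} f3={f3_locus(0.0, b, c):+.2f}")
```

The only negative row is A = 0, B = 1, C = 0.5, where C sits strictly between the two sources, and that is the configuration that pulls the overall f3 below zero.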
As with many other methods, I would expect that f3s would be more accurate if the admixture event is not too distant in the past, since allele frequencies change over time (drift, mutations, etc)
Thank you wrora!
It’ll be very interesting to see what you find.
Also, it seems that I might be confusing a method used for detecting shared drift with f3 scores?