Abstract
The recent publication of numerous DNA sequences from ancient humans spanning the Neolithic to the Iron-Age and Medieval time frames from the Anatolia, Caucasus, and Iran regions has greatly furthered our understanding of the formation of modern ethnic groups such as Armenians, Kurds, Persians, and Turks, which we define as our study populations.
Our detailed downstream analysis of whole genome sequencing (WGS) data along with published DNA genotype data from ancient and extant individuals, including genotype data from the recent 2022 extensive multi-national collaborative studies on the origin and dispersal of Indo-European languages by Lazaridis et al1 shows that the regions encompassed by Anatolia, the Caucasus, Iran, and Iraq underwent major demographic changes from the Bronze-Age to present. These changes include the introduction of hitherto absent paternal haplogroup lineages such as R1a-Z94 from Central Asia during the Parthian era, especially among the Kurds, as well as the introduction of Indo-Iranian languages from the ancient region of Ariana (present-day Turkmenistan and Afghanistan) and surrounds.
We also utilize a robust technique which we call “Differential Cladality Analysis” using the formal statistical software qpWave6, where we perform pairwise cladality tests between one of our study populations and various ancient and extant Eurasian populations. This allows us to determine which ancient populations have genetically contributed to one or more of our four study populations to the exclusion of our other study populations on a relatively more recent time scale.
INTRODUCTION
The recent 2022 extensive multi-national collaborative study and multiple publications on the origin and dispersal of Indo-European languages by Lazaridis et al 1 showed that proto Indo-European languages and Yamnaya’s paternal ancestors originated in the “Southern Arc” region of Armenia and Northern Iran. Subsequently the modified Indo-European languages dispersed from a secondary staging area on the Eurasian Steppe to Europe and Asia. Here we show that the Indo-European languages spoken in the Kurdistan region and northern Iran were further modified post Iron Age by Parthians from Central Asia (Fig 1) who hybridized with the local Iranian/proto-Kurd Iron Age populations at Hasanlu and Tepe Dinkha and introduced a new Y-DNA lineage hitherto missing in the region; R1a Haplogroup R-Z94. R-Z94 became the dominant Y-DNA lineage in Kurds from certain regions2,3,4 , however prior to the Parthian period this lineage was non-existent with a frequency of 0% in the 230 good coverage ancient Bronze and Iron Age samples from Iraq, Iran, and Turkey released by Lazaridis et al1 . Utilizing formal statistics (qpWave and qpAdm and a solid set of pright references) we identify the genetic profile of these Central Asian Iron Age populations who hybridized with the local Iranian Bronze Age and Late Iron Age to form present day Kurds and Persians.
The Lazaridis et al Southern Arc study published 230 new ancient male samples covering Armenia, Iran, Iraq, and Turkey from the Chalcolithic to around 600 years ago. Yet out of the 230 male samples only two recent 1300 year old samples from Turkey were positive for Y-DNA haplogroup R-M417, however according to Grugni et al2 Y-DNA R1a1a R-M17 is a major haplogroup with a frequency of 20% among Kurds (Fig2 ). In fact, a glance at Fig 2 shows that Kurds have the highest frequency of R1a1a amongst all West Asian populations including western Iran and this frequency rivals that of Persians on the border of Afghanistan in Khorasan province. Additional downstream tests have shown that R-Z94 is by far the major R1a1a subclade for Kurds. Since the 230 ancient samples from Kurdistan, Iran, and Turkey, published with the Lazaridis et al “Southern Arc” international study were negative for R1a, this is evidence that R-M417 and R-Z94 was introduced into the Kurdistan regions of Iran, Iraq and Turkey and Northern Iran from the east (Central Asia) post Iron-Age with the most likely vectors being Parthian and Scythian militaristic waves (Fig 1).
This is consistent with the historical, archeological and linguistic connections between certain Iranics such as Kurds and Parthians. This also means that since R1a R-Z94 is a major haplogroup among present day Kurds and was absent in the Kurdistan and Turkey regions from the Chalcolithic to the end of the Iron-Age, that the present day Kurd genetic identity was born post Iron-Age when Central Age Parthians hybridized with pre-R1a populations in the area to form present-day Kurds, and the Hasanlu-IA and Dinkha Tepe populations can be considered proto-Kurds or Persians at best. Our qpAdm models for present day Kurds as DinkhaTepe-A/Hasanlu-IA + Central Asian R-Z94 were extremely robust and could not be rejected even with increasing pright references to 15 Neolithic through Iron-Age diverse Eurasian references. This of course makes sense and is consistent with the aforementioned.
QpAdm models and qpWave 2 source cladeliness analysis also unequivocally show that East Asian admixture in Kurdistan, Iran and Turkey significantly increased with the arrival of R1a and Turkics post Iron-Age.
We identify several Iron Age good coverage samples from Central Asia which proxy this eastern Parthian genetic input into present day Kurds and are able to produce very robust qpAdm 2 source models for Kurds as IRN-DinkaTepe-BA/Hasanlu-IA + Central-Asian-IA. Surprisingly, we are even not able to reject Kurds as 100% Tajikistan-Kushan using the same set of robust pright references. We suspect that this sample is genetically Parthian.
Here we also identify a 2000 year old good coverage DNA sample recovered from Tajikistan from the Parthian Era that forms a perfect clade with Kurds and to a lesser degree Persians using qpWave rank=0 and 17 diverse East and West Eurasian Neolithic, Chalcolithic, and Bronze Age references. Although the sample is labeled as Kushan, we believe it is likely Parthian since the 2 empires overlap in space and time
Due to the established relatively lengthy historical presence of Armenians in this region, we use them as a geographical reference population to show how Iranics such as Kurds and Persians are genetically shifted to the north and east of Armenia, to wit, Central Asia and the Caucasus, and thus the homeland of a substantial proportion of their ancestors is the Parthia Central Asia region which includes Aryana. The Caucasus region includes Sarmatia.
We find that there is greater genetic continuity since the Iron Age with Armenians and Kurds, since using qpAdm they are relatively easy to model with 2 source populations whereas Persians and Turks appear to have a more complex demography and are thus more difficult to model with only 2 source populations.
Using the recently published samples from the “Southern Arc” we also dentify good Late Bronze Age and later source populations for Armenians, Kurds, Persians and Turks, which when combined with source populations from Uzbekistan, Tajikistan, Xinjiang and the Caucasus produce very robust 2 source models especially for Kurds (Fig 7).
A couple of years ago we were excited to discover a high degree of genetic continuity in Kurdistan from 2700 years ago with the release of the Mede era Hasanlu-IA sample where present day Kurds could be modeled as 80% with this sample. Amazingly, however, we now detail a 2000 year old good coverage DNA sample from Tajikistan from the Parthian period that forms a significantly tighter clade with Kurds than Hasanlu-IA with Kurds. This sample is labeled a Kushan sample, however, we believe it is likely Parthian since the 2 empires overlap in space and time. We name this Parthian-LIA-Tajikistan. Kurds can be modeled as 100% Parthian-LIA-Tajikistan using qpAdm even while increasing the number of pright reference populations to 17 diverse Neolithic, Chalcolithic, and Bronze Age West and East Eurasians.
METHODOLOGY & QUALITY CONTROL
We employ the following multi-fold strategy in our analysis here:
- We use the qpWave 2 source framework, and introduce a new method which we call normalized differential shift, to determine relatively more recent introgression into a population to the exclusion of its neighbors as shown in figures 15 – 20. Normalization of the results for each population mitigates pre-existing biases affecting the study samples from using different sequencing and genotyping techniques.
- QpWave and qpAdm are included in the Admixtools package introduced by Patterson et al6 in 2 source population (pleft) cladeliness mode, and a large number of Neolithic, Bronze-Age, and Iron-Age population references in pright to determine how much the 2 source populations form a clade with each other with regards to the diverse reference populations (Fig 3). We use the Chisq output to compare for example the cladeliness betwen DinkaTepe-B or Kazakhastan-Wusun or present day Mongols and Armenians vs Kurds vs Persians vs Turks;
- We tabulate the Chisq values obtained in (1) and perform differential cladeliness analysis where we calculate the cladeliness ratio of say Kurds/Armenians, Kurds/Persians and Kurds/Turks to determine which populations Kurds are closer to to the exclusion of Armenians, Persians, and Turks. This is informative to recent admixture events and migrations because in more distant past admixture scenarios these ratios should approach 1 for neighboring populations such as Armenians, Kurds, Persians, and Turks;
- We use qpAdm which is also a software included in the Admixtools package to determine the “Southern Arc” region present-day populations ancestry proportions using a large number of Neolithic and younger reference populations (pright) along with highest quality published samples in an effort to make all but the most robust ancestry models fail.
We define our study populations as follows:
Armenians = SA
Iranians = SI
Kurds = SK
Turks =ST
We test the strength of the clade formed by our study populations, S with various ancient and extant target populations, T, with respect to right populations. Using qpWave we set pleft to a study population, S, and one target ancient or extant population, T.
We use the following robust set of 16 pright reference populations in qpWave and qpAdm to differentiate subtle differences in shared drift between the targets and references;
Mbuti
CHG
IRN_Ganj_Dareh_N
ISR_PPNB
MAR_Taforalt_EpiP
SRB_Iron_Gates_HG
MA1
TUR_Marmara_Barcın_N
WHG
RUS_Yana_RHS
MNG_Khovsgol_LBA
RUS_DevilsCave_N
RUS_Shamanka_EN
Loschbour_WHG
UZB_Sappali_Tepe_BA
RUS_Siberia_Lena_EBA
Thus we obtain a m x 4 matrix, A, of qpWave Chisq ( χ2) outputs, where each columns is one of our 4 study populations, and each row is either an ancient or contemporary test population.
A = ( χ2ij)1≤i≤m, 1≤j≤4
Thus, each of our study populations which are represented by a column in A can be shown as the following vectors:
Where each vector, S, is composed of Chisq value entries; S = [ χ12 χ22 χ32 . . . . . χm2 ] for m target populations.
We can alternatively state that each study population, S, is an m-dimensional vector. From extensive prior experience processing DNA sequences, mapping sequences to human reference genomes, and genotyping, we have determined that the following affect the analysis ready genotyped DNA sequence:
- SNP ascertainment bias. The SNPs that are used significantly affect analysis results
- Sequencing depth, for example, 1X vs 30X
- Paired-End vs Single-End sequencing
- DNA fragment read lengths
- Software and strategy used for mapping to reference genome
- Downstream sequence processing, for example, GATK vs Samtools
- Single vs joint genome genotyping
- Various quality filter thresholds used at various stages of the pipeline
Unfortunately most scientists are not aware of the impact of the aforementioned on the downstream analysis results since most scientists use analysis ready genotyped datasets, and are not themselves involved in processing the DNA sequences. After devoting many years, considerable effort and resources to understand these issues, we have become quite familiar with them.
The Armenian, Iranian, and Kayseri Turk samples were sequenced and processed as part of the older Simons Genome Diversity Project, whereas the Kurd samples were WGS sequenced to 30X using the latest NGS sequencers. The ancient DNA data was for the most part pseudo-haploid data and genotyped on the 1240K Affymetrix microarray. We performed additional quality by filtering out samples with read depths less than 1X.
the Parthians & Scythians
Around 2300 years ago a series of momentous events occurred on the steppes of the Trans-Caspian area and Turkmenistan. A Central Asian nomadic Scythian tribe called the Parni subsequently to be known as the Parthians took over the Seleucid Empire established one of the most enduring states in the history of Asia for 500 years from 248 BCE to CE 226. At its peak the Parthian Empire stretched from the Mediterranean all the way to China.
The essential human resources at the roots of the Arsacid state were the Dahae, a community of Central Asian nomads. Nomadic clans overpowered the settled Iranian populace of Khorasan, imposing a monarchical structure on a nascent social amalgam of nomadic peoples of the steppes and settled Parthians.
With the rise of the Arsacid Parthians in Iran and in the Trans-Caspian steppes for the first time in recorded history nomads from Central Asia had subdued parts of Iran and imposed their dynasty on it. Henceforth dynasties with nomadic roots such as Seljuks, Ilkhanids and dynasties of Turkmen origin would come to power in Iran. In Eastern Iran and Afganistan waves of nomadic peoples established several states beginning with the Yuezhi and Saka through the White Hun Hephtalites. The growth of the Arsacid state is in many respects comparable with that of the Seljuks, who originally were nomads in Central Asia.
Under Mithradates I (165–132 BCE), Parthia became the most significant state power in Iran and southwestern Central Asia. He subjugated Margiana, most districts of Bactria, western Iran and Mesopotamia. In the meantime, the attacks of the Xiongnu against the Yuezhi set in motion nomadic tribes of Central Asia. As a result of these movements, various steppe peoples appeared n the region of Transoxiana and were pushed into the borders of Graeco-Bactria and Parthia.
Several steppe tribes including Yuezhi, Saka, and Asioi subdued Bactria. Sogdiana had fallen prey to nomads already in the 3rd century BCE. The Yuezhi mi-
gration forced some tribes to move into the territory held by the Arsacids.
As a result, two Parthian kings, Phraates II and Artabanos I, lost their lives in encounters against the Saka and the Tocharoi. Consequently, the Saka overran and plundered the eastern Iranian provinces of the Parthian Empire.
Just as present day Kurds are quite genetically diverse and span a large geographic areas and speak multiple Kurdi languages such as Badini, Sorani, Gorani, Zazaki and Feyli which are quite mutually unintelligible, it is also expected that Parthians would also have had a moderate genetic diversity with individuals overlapping with Scythians, Kushans, Yuezhi, and Wusun because of their overlapping territories.
Linguistically, Parthian seems to have affected the various Kurdi languages more than Persian (fig 5). This is consistent with Kurdi sharing dozens of cognates with Pashto to the exclusion of Persian (fig 6).
COMPARISONS WITH PRESENT DAY POPULATION
To improve our understanding of the population histories of Armenians, Kurds, Persians, and Turks, and to corroborate findings based on ancient DNA, it is helpful to see in which directions these populations are shifted relative to each other. To do so we tabulate QpWave cladeliness for each of these populations with present day Eurasians. To mitigate any biases that may exist due to different sequencing technologies and genotyping pipelines between our targeted study samples, to wit, 2016 Simons WGS Armenians, Turks, and Iranians vs 2022 WGS 30X Kurd-IRQ samples, we normalize the QpWave Chisq results for each of these targeted populations. The normalized results based on QpWave chisq are shown in fig 14.
Although the aforementioned results give a very accurate picture of shared genetic drift between Armenians, Kurds, Turks and Iranians and various Eurasian populations, what if just as important if not more so is what historical admixture events are set a population apart from its surrounding populations. In other words what part of its population history is significantly different from its neighbors, or alternatively, how much more genetic drift does a population share with other Eurasian populations to the exclusion of its neighbors.
For example, to determine how much more genetic drift Kurds share with Brahui to the exclusion of Armenians, we would calculate [(1/chisqN)Kurd-Brahui ] / [(1/chisqN)]Armenian-Brahui ], where N are the normalized values. When these results are tabulated and sorted we obtain information on which Eurasian populations Armenians, Kurds, Persians and Turks share the MOST genetic drift with to the exclusion of each other. In other words, which geographical direction these populations are shifted to with respect to each other. We tabulated and sorted these results for each of the 4 study populations and show the results in Fig 15 – 19.
With a quick glance at figures 15 & 16 it is immediately apparent that Europeans, Siberians, Central and South Asians share significantly more ancestry with Kurds than Armenians. This reinforces our hypothesis that Armenians have a higher genetic continuity in West Asia than Kurds and are more thus derive more of their ancestry locally. Alternatively, we can say Kurds have more ancestry from Europe and Central Asia than Armenians since figure 15 clearly shows Europeans and Central Asians share greater ancestry with Kurds than with Armenians. Figure 16 which is the reverse of figure 15 shows that some local West Asians and Southern Europeans such as Georgians, Iraqi Jews and Greeks share more ancestry with Armenians than with Kurds. Some of this is probably an artifact of greater shared Neolithic Anatolian Farmer ancestry with Armenians to the exclusion of Kurds.
EHG Admixture as a Measure of Central Asian Steppe Admixture
Eastern Hunter Gatherer (EHG) admixture was recently used in the Southern Arc paper by Lazaridis et al1 as a signal of Eurasian steppe admixture in West Asia. We agree this is sound practice since the use of Steppe EMBA and MLBA populations such as Yamnaya, Sintashta and Andronovo as Steppe proxies will yield ambigous results since any shared haplotypes between West Asians and the aforementioned does not necessarily indicate Steppe admixture, since it could simply due to shared ancestral ancestors with Steppe EMBA or MLBA.
Thus we also adopt the use of EHG admixture as a signal of Steppe admixture in our study groups. We use the qpAdm framework with the following robust set of pright references; Khomani_San, MAR_Taforalt_EpiP, JOR_Feldman_PPNB, TKM_Parkhai_EN, ANE, TUR_C_Boncuklu_PPN, WHG, SRB_Iron_Gates_HG, CHN_YR_LN, XINJ_Chemurcheck.N.BA3, Tarim_EMBA1, RUS_Shamanka_EN.
Lazaridis et al1 determined that Eastern Anatolia received negligible Eurasian Steppe admixture. We expected that this does not apply to extant Iranic populations in Eastern Anatolia such as Kurds since Y-DNA haplogroup R1a-Z94 is a major haplogroup among Kurds where they appear to have the highest frequency of this haplogroup in West Asia ( figure 2). Our analysis results reflect this by showing EHG admixture highest among Kurds from our study group, are certainly consistent with the relative high frequency of Y-DNA R1a-Z94/Z93 among Kurds. Lazaridis et al1 also showed via the hundreds of ancient samples published from the Kurdistan, Anatolia, Armenia, and Iran region that haplogroup R1a-Z94 was missing from this region through the Iron Age. This clearly shows that the R1a-Z94 ancestors of Iranics such as Kurds must have arrived in the region during the late Iron-Age to the historical time frame. Thus the Parthian, Scythian, Sarmatian, and Turkic ancestors of Iranics such as Kurds must have introduced this haplogroup to the region.
Our qpAdm results using EHG as a Eurasian Steppe Indo-Iranic signal do indeed show that present day Armenians, Kurds, Persians and Turks have EHG admixture (figure 21) with the highest EHG percentage being in Kurds which is consistent with Kurds having the highest Y-DNA R1a-Z94 frequency.
We also find that although Armenians and Persians can be modeled in qpAdm without additional East Asian sources albeit with lower passable p-values, the same can not be said of Kurds and Turks which require some East Asian source to produce passable p-values (figures 21-23).
It is important to remember that models utilizing EHG and Iron-Age East Asians as genetic sources are only informative to the relative degree of Eurasian Steppe introgression among West Asians, since these populations themselves did not actually participate in the demography of West Asians, and did not themselves hybridize with local West Asian populations such as Iran-Chl/BA or Iran-IA to produce present day Armenians, Kurds, Turks and Persians. Thus their combined EHG and Mongol percentages shown in figures 21-23 are not the actual percentages of non-local West Asian introgression into the present day West Asian populations for this study. The reader is referred to figures 9(a)(b)(c) for a more realistic representation of the actual percentage contribution of Central Asians to the demography of West Asians.
For example, the combined 15% EHG/East Asian in Kurd models, and the combined 4% combined EHG/East Asian in Armenian models would translate to a much a higher percentage in Central Asian Indo-Iranian required input since the relevant Scythians or Parthians were not 100% EHG or East Asian themselves. This is clearly evident when we model Armenians, Kurds, Turks and Persians using the relevant Iran-Chl/BA + relevant Scythians or similar Central Asian populations that potentially introduced Y-DNA R1a-Z93 or Z94 into West Asia post Early Iron Age as indicated in figure 9.
Methods & Quality Control
Our analysis used published data 1&2 genotyped on the 1240K Axiom Affymatrix array. The present-day populations consist of the Simons WGS set. All the available Iranian, Armenian and Turkish samples in the Reichlab dataset were used in this study. As an additional quality control measure we removed samples with less than 600,000 intersecting SNPs with the 1240K set. Additionally, for Kurds-IRQ we used two whole genome sequenced males from northern Iraq sequenced to a depth of 30X. We processed the fastq sequences using our own samtools pipeline which consists of the following:
- AdapterRemoval to remove remnant adapter sequences from High-Throughput Sequencing (HTS) data and trim low quality bases from the 3′ end of reads following adapter removal. Additionally, we removed sequences shorter than 24 bases which due to their short length may map to several regions of the reference genome
- bwa mem to quickly align the sequences to the GRCh37 Human Reference and piped the output to samtools view to produce BAM files. The reads were merged post-alignment instead of with AdapterRemoval
- samtools sort to coordinate sort BAM file
- Picard to soft-clip beyond-end-of-reference alignments and set MAPQ to 0 for unmapped reads
- Picard to remove duplicate sequences
- bcftools mpileup to convert BAM file into genomic positions we first use mpileup to produce a BCF file that contains all of the locations in the genome. We use this information to call genotypes and reduce our list of sites to those found to be variant by passing this file into bcftools call. Thresholds used were –min-MQ 15, –min-BQ 20
- bcftools view to remove positions supported by less than 4 reads
- This yielded VCF files with 1478 million bases
- Extraction of Affymetrix 1240K variants yielded a final VCF of 1160K variants
Data Availability
The 1240K genotype data used in this study is readily available at Reich Lab at https://reich.hms.harvard.edu/datasets
REFERENCES
- The genetic history of the Southern Arc: A bridge between West Asia and Europe, Lazaridis et al, American Association for the Advancement of Science, doi: 10.1126/science.abm4247, https://doi.org/10.1126/science.abm4247, 2022/09/28
- Grugni V, Battaglia V, Hooshiar Kashani B, Parolo S, Al-Zahery N, Achilli A, Olivieri A, Gandini F, Houshmand M, Sanati MH, Torroni A, Semino O. Ancient migratory events in the Middle East: new clues from the Y-chromosome variation of modern Iranians. PLoS One. 2012;7(7):e41252. doi: 10.1371/journal.pone.0041252. Epub 2012 Jul 18. PMID: 22815981; PMCID: PMC3399854.
- Hennerbichler, F. (2012). The Origin of Kurds. Advances in Anthropology, 2, 64-79. doi: 10.4236/aa.2012.22008.
- http://corduene.blogspot.com/2015/07/northern-kurds-y-dna-haplogroups.html
- Bronze and Iron Age population movements underlie Xinjiang population history, Kumar et al, Science 376, 62 (2022), DOI: 10.1126/science.abk1534
- Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics. 2012 Nov;192(3):1065-93. doi: 10.1534/genetics.112.145037. Epub 2012 Sep 7. PMID: 22960212; PMCID: PMC3522152.