An Ancient Harappan Genome Lacks Ancestry from Steppe Pastoralists or Iranian Farmers – A critique

DISCUSSION

Earlier this month the much-awaited Bronze Age Indus Valley Civilization (IVC) paper on the Harappan genomes was released (1). Unfortunately, only one of the sampled skeletons yielded usable DNA fragments. The main conclusions of the paper, which relied primarily on this single sequenced genome, named I6113, and on 11 previously sequenced Bronze Age (BA) individuals from Narasimhan et al. (2), sampled at sites in eastern Iran and Turkmenistan that were in cultural contact with the IVC, were:

  1. The individual I6113, recovered from Rakhigarhi in NW India, is from a population that is the largest source of ancestry for South Asians;
  2. Iranian-related ancestry in South Asia split from Iranian plateau lineages >12,000 years ago;
  3. First farmers of the Fertile Crescent contributed little to no ancestry to later South Asians.

This is depicted in their best fitting admixture graph shown in Figure 1.

Figure 1 Best-Fitting Admixture Graph Relating Populations with Iranian-Related Ancestry

The authors are to be commended for the in-depth analysis that led to these conclusions. We find their conclusions reasonable and their methodology robust; however, the genomic data used to arrive at those conclusions are the weak link in the chain. Our findings, based on our previous work with aDNA and on our present analysis of Rakhigarhi I6113, outline the reasons for this statement.

Rakhigarhi I6113 is publicly available from the Reich Lab. Based on our prior work with pseudo-haploid ancient DNA (aDNA) and our present examination of I6113, we caution that conclusions 2 and 3 above cannot be made with confidence, because of the sample’s low coverage and, mainly, because of the following:

PSEUDO-HAPLOID aDNA IS HIGHLY SUSCEPTIBLE TO REFERENCE BIAS

Our previous work, in which we diploid genotyped ancient Eurasian Steppe sequences and compared them with their published pseudo-haploid counterparts (see Diploid genotyping ancient DNA), and work by other researchers, such as Gunther et al., 2019 (3) in “The presence and impact of reference bias on population genomic studies of prehistoric human populations”, has clearly shown that pseudo-haploid ancient DNA sequences are more vulnerable to reference bias than diploid calls.

Sequenced DNA libraries are enriched for polymorphic sites in the genome, so in theory close to 50% of positions should carry an alternate allele, as either heterozygous or homozygous alternate calls. For a more relevant comparison, we restrict our analysis to the genomic tracts covered by I6113’s 31180 autosomal SNPs.

We show in Table 1 that for high-coverage diploid ancient genomes, such as the Paleolithic Ust-Ishim and Neolithic Stuttgart genomes, about 55% of positions are homozygous reference, which is close to the theoretical expectation above.

However, this is not the case for Rakhigarhi I6113, for the higher-coverage pseudo-haploid Swat Saidu-IA-I6894 sample from Narasimhan et al. (2), for Anatolia-MLBA-MA2200, or for other pseudo-haploid aDNA. Here we see marked reference bias, with about 70% of the genotyped positions being homozygous for the reference allele.

Table 1 – Reference bias is exaggerated in the pseudo-haploid samples, including Rakhigarhi-I6113, where approximately 70% of the genotyped positions are homozygous for the reference allele.
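As a minimal sketch of how percentages like those in Table 1 could be tallied, the snippet below counts the share of genotyped positions that are homozygous reference. The function name, the input encoding (number of alternate alleles per site, with None for a missing call) and the toy genotypes are assumptions made for illustration only; they are not the actual pipeline or data behind the table.

```python
from typing import Iterable, Optional


def hom_ref_fraction(alt_counts: Iterable[Optional[int]]) -> float:
    """Fraction of genotyped positions that are homozygous reference.

    Each entry is the number of alternate alleles at a site (0, 1 or 2).
    None marks a site without a call and is ignored.
    Pseudo-haploid calls can only take the values 0 or 2.
    """
    called = [g for g in alt_counts if g is not None]
    if not called:
        raise ValueError("no genotyped positions")
    return sum(1 for g in called if g == 0) / len(called)


# Toy genotypes (hypothetical, NOT real data): the same 10 sites called
# as a diploid sample and as a pseudo-haploid sample.
diploid_toy = [0, 1, 0, 2, 0, 1, 0, None, 0, 2]
pseudo_haploid_toy = [0, 0, 0, 2, 0, 0, 0, None, 0, 2]

print(f"diploid hom-ref fraction:        {hom_ref_fraction(diploid_toy):.3f}")
print(f"pseudo-haploid hom-ref fraction: {hom_ref_fraction(pseudo_haploid_toy):.3f}")
```

In this toy case the two heterozygous diploid calls both become reference calls in the pseudo-haploid version, which is the pattern a reference-biased random draw would tend to produce.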

There is a great deal of mystery surrounding the populations that make up the Human Reference Genome. Ideally, the Reference Genome should represent all world populations equally; however, investigation has shown this to be far from the case. Surprisingly, about 72% of the Reference Genome (the widely used GRCh37) comes from one, yes one, 50% European – 50% African individual, anonymously referred to as RP11! Another 23% comes from libraries for 10 individuals, and only 5% comes from libraries for about 50 individuals, as shown in Figure 2 from NCBI.

Figure 2 – Composition of the Human Reference Genome, surprisingly showing that about 72% of the genome is based on only one anonymous 50% African / 50% European individual known as RP11.

So why is reference bias a bad thing? Because genomic variation that is not well represented in the Human Reference Genome sometimes maps with a poor mapping quality score, sometimes maps to the wrong part of the genome, and sometimes does not map at all and is discarded. For example, an Asian tract that is not as well represented in the Reference Genome as, say, European tracts may be discarded. This inflates the individual’s European admixture percentage at the expense of their Asian admixture.
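The mapping effect described above can be illustrated with a toy model. Assume a truly heterozygous site covered by a single read, where a read carrying the reference allele always maps but a read carrying the alternate allele maps only with probability alt_map_rate. The model, the function names and the value 0.8 are assumptions chosen for illustration; they are not estimates from I6113 or from any real alignment.

```python
import random


def expected_ref_call_prob(alt_map_rate: float) -> float:
    """Probability that a called heterozygous site shows the reference allele.

    A read is reference with probability 0.5 and always maps; it is
    alternate with probability 0.5 and maps with probability alt_map_rate.
    Conditioning on the read having mapped gives 0.5 / (0.5 + 0.5 * m).
    """
    return 1.0 / (1.0 + alt_map_rate)


def simulate_ref_call_rate(alt_map_rate: float,
                           n_sites: int = 100_000,
                           seed: int = 0) -> float:
    """Simulate single-read pseudo-haploid calls at heterozygous sites."""
    rng = random.Random(seed)
    called = 0
    ref_calls = 0
    for _ in range(n_sites):
        is_ref = rng.random() < 0.5
        if not is_ref and rng.random() >= alt_map_rate:
            continue  # alternate-allele read failed to map: no call at this site
        called += 1
        ref_calls += is_ref
    return ref_calls / called


m = 0.8  # hypothetical mapping rate for alternate-allele reads
print(f"expected:  {expected_ref_call_prob(m):.3f}")
print(f"simulated: {simulate_ref_call_rate(m):.3f}")
```

Under these assumptions, even a modest mapping disadvantage for alternate-allele reads pushes the share of reference calls at heterozygous sites above the unbiased 50%, and with one read per site a pseudo-haploid caller has no second read with which to correct it.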

We experienced this first hand while comparing our own diploid genotyped versions of various Eurasian Steppe ancient genomes with the published pseudo-haploid versions (see Diploid genotyping ancient genomes). D-statistics of the form D[diploid, pseudo-haploid; Hg19 Reference, Chimp] showed greater reference bias for the published pseudo-haploids than for our diploids. Additionally, we noticed a European shift in the published pseudo-haploid Steppe MLBA and IA aDNA samples compared with our more accurate diploid genotyped versions, which showed a more Asian shift. We noticed this both in IBS and in ADMIXTURE. We believe this to be due to greater reference bias in the published pseudo-haploids.

Consistent with our findings, Gunther et al. (3) state: “This comparison assumes that the diploid calls are less affected by reference bias as slight deviations from a 50/50-ratio at heterozygous sites should be tolerated by a diploid genotype caller but random sampling would be biased towards the reference allele. This is supported by the D statistic D(chimp, reference; sf12_hapl, sf12_dipl) < 0 (Z = −13.5), indicating more allele sharing between the reference and the pseudo-haploid calls.” They also found that sequences mapping to the European segments of the reference show a strong reference bias, with slight differences between continental populations, and that reference bias at the East Asian segments of the reference genome seems intermediate, although the D statistics there also show large variation, which may be due to the small proportion of the reference genome that could confidently be assigned to an East Asian origin.
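The D-statistic comparisons discussed above can be sketched as follows. This assumes the common allele-frequency form, D(P1, P2; P3, P4) = sum of (p1 - p2)(p3 - p4) over sites divided by the sum of (p1 + p2 - 2*p1*p2)(p3 + p4 - 2*p3*p4); sign conventions differ between tools, so this is not necessarily the exact computation used here or by Gunther et al. The toy frequencies are hypothetical, and a real analysis would also need a block jackknife to obtain a Z-score, which is omitted.

```python
from typing import Sequence


def d_statistic(p1: Sequence[float], p2: Sequence[float],
                p3: Sequence[float], p4: Sequence[float]) -> float:
    """Frequency-based D(P1, P2; P3, P4) summed over sites.

    Each sequence holds per-site frequencies of the same allele
    (here: the non-reference allele). With this sign convention,
    D < 0 means P2 shares more alleles with P3 than P1 does.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in zip(p1, p2, p3, p4):
        num += (a - b) * (c - d)
        den += (a + b - 2 * a * b) * (c + d - 2 * c * d)
    if den == 0:
        raise ValueError("no informative sites")
    return num / den


# Toy non-reference allele frequencies at 5 sites (hypothetical data).
diploid        = [0.5, 0.5, 0.5, 1.0, 0.0]  # diploid calls: 0, 0.5 or 1
pseudo_haploid = [0.0, 0.0, 1.0, 1.0, 0.0]  # pseudo-haploid calls: 0 or 1
reference      = [0.0, 0.0, 0.0, 0.0, 0.0]  # the reference carries the reference allele
chimp          = [1.0, 1.0, 1.0, 1.0, 1.0]  # toy outgroup carrying the other allele

d = d_statistic(diploid, pseudo_haploid, reference, chimp)
print(f"D[diploid, pseudo-haploid; reference, chimp] = {d:.3f}")
```

In this toy example D is negative because two of the three heterozygous sites were sampled as the reference allele in the pseudo-haploid version, so the pseudo-haploid calls share more alleles with the reference than the diploid calls do.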

Unfortunately, we do not recommend diploid genotyping for the majority of aDNA, which has lower than 2X read depth coverage. What we do need is a reference genome with greater representation of Asian variation, both for our present-day Near Eastern and Asian populations and to genotype aDNA from the Eurasian Steppe more accurately, so that results are not as “Europeanized” as they presently are.

Additionally, the natural fragmentation associated with aDNA increases the share of short segments sequenced. Gunther et al. (3) showed that the strength of reference bias is negatively correlated with fragment length, and that reference bias can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry. Thus we caution against expressing great confidence in conclusions that require a higher degree of resolution and more accurate, higher-coverage sequences, which Rakhigarhi I6113 and most of the published pseudo-haploid ancient genomes clearly are not.

REFERENCES

1- Shinde et al., An Ancient Harappan Genome Lacks Ancestry from Steppe Pastoralists or Iranian Farmers, Cell (2019), https://doi.org/10.1016/j.cell.2019.08.048

2- V. M. Narasimhan et al., Science 365, eaat7487 (2019). DOI: 10.1126/science.aat7487

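3- Gunther et al., The presence and impact of reference bias on population genomic studies of prehistoric human populations, PLoS Genet. 2019 Jul 26;15(7):e1008302. DOI: 10.1371/journal.pgen.1008302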
