Diploid Genotyping Low to Medium Coverage Ancient DNA

OVERVIEW

The overwhelming majority of published ancient DNA (aDNA) are pseudo-haploid genomes. This study focuses on the suitability and accuracy of these pseudo-haplodized aDNA genomes for various types of downstream analysis, such as those involving admixture allele frequency studies, as well as Identity-by-Decent (IBD) analyses and analyses using formal methods. The accuracy of pseudo-haploid aDNA genomes is also compared their diploid counterparts which were processed as part of this study.

Challenges faced with sequencing and processing aDNA, include contamination with exogenous material, which in some cases can exceed 99% of the sample content. This exogenous material includes post-mortem microbial contamination, as well as intra-species contamination from modern human DNA during excavation and handling of the ancient remains. Other challenges include the fragmentation of aDNA and post-mortem deamination.

The goals of this study include:

Processing various aDNA samples from the Eurasian steppe and calling diploid genotypes using various programs and work-flows;
Comparing the quality and accuracy of the diploid call-sets versus the published pseudo-haploid call-sets;
Assessing the suitability of the published pseudo-haploid aDNA for various downstream analyses;
Assessing various types of post-mortem damage affecting aDNA, and offering solutions to mitigate their effects.

DISCUSSION

Advances in next-generation sequencing (NGS), and the use of shotgun sequencing in combination with hybridization capture, along with improved DNA library preparation methods, have increased endogenous yield in ancient DNA (aDNA) samples, as well as accuracy of the resulting genomes.

This study indicates that post-mortem C–>T and G–>A deamination, and cross-species contamination with microbes, is properly accounted for in pseudo-haploid aDNA genomes published by institutions such as Max Plank, Harvard, and others, however, removing exogenous DNA from humans who originally excavated and handled the fossils, remains a challenge. For example, significant contamination with DNA from western European excavators, will cause a sample from the Eurasian steppe to show artificially increased genetic affinity with western Europeans, at the expense of less genetic shared drift with Central or West Asian populations.

However, by far the biggest handicap with published pseudo-haploid aDNA genomes is that they simply are pseudo-haploid. The “random draw” method for genotyping ⁵, which results in pseudo-haploid genomes was introduced in Haak et al . Using this method, a sample is genotyped by sampling only one allele at each polymorphic site. It seems that this method, or slight variation thereof has been adopted by most of the institutions that publish aDNA. However, the resulting pseudo-haploid genotypes, are inherently inaccurate due to humans being hetrozygous at a large percentage of polymorphic sites. Thus pseudo-haploid aDNA is not well suited for some down stream work such as IBD analysis.

During this study fastq and Bam files for aDNA Eurasian steppe samples with average read depth coverages of 3-8X were processed using various GATK and ATLAS pipelines, and diploid genotypes were called. We show that these diploid genotypes are more accurate and better suited for downstream analyses, than their published pseudo-haploid counterparts.

Samples from two contemporary populations which are substantially Eurasian steppe derived; Estonians and Shugnan Tajiks, from the Simmons Diversity whole genome sequencing project were used as “truth/control” samples to gauge the accuracy of the final analysis ready aDNA diploid genomes, as well as the published pseudo-haploid genomes.

Comparisons between our diploid genomes, and published pseudo-haploid genomes indicate the following advantages for diploid genomes:

Inherently increased accuracy

Metrics for evaluation of the call-sets were obtained using GATK’s VariantEval tool and Picard’s CollectVariantCallingMetrics tool. Hetrozygosity rates for Tajik_Shugnan and Estonian (truth samples) were 64% and 62%, respectively, indicating that around 60% of the positions in the published haploid sequences have at least one incorrect allele. By contrast, the Srubna-I0232 sample published in Mathieson et al, 2015, which I processed for diploid calls using an ATLAS pipeline, has a hetrozygosity rate of 56%, which is much closer to that in the truth samples.

Lower reference bias for the diploid genomes

With pseudo-haploid genomes, an allele is randomly drawn out of 2 alleles to represent the individual at a given locus, whenever each of the possible alleles is supported by an equal number of sequences ⁵. Although the selection of an allele randomly should not cause additional reference bias for the pseudo-haploid genome, it seems that the process of aligning the sequences to the reference genome is introducing the bias, because reads not matching the reference allele may receive lower mapping quality scores, and eventually get filtered out.. Thus this could cause a bias of haploidization towards the reference genome.

This problem can become magnified in ancient genomes where post-mortem DNA damage compromises alignment quality. An increased proportion of reference alleles, combined with a loss of novel and rarer variants, would increase bias to populations which comprise the reference genome ⁶. We believe it is this bias which causes some ancient Eurasian steppe genomes to show a small amount of African admixture, and not post-mortem deamination as has been suggested by some. I have noticed this to be particularly problematic using GATK pipelines, and I can safely rule out post-mortem deamination, because the problem persisted even when using transversion SNPs only.

ANCIENT SAMPLES

Karasuk fastq files were downloaded from the European Nucleotide Archive (ENA) and processed several ways using various tools described in “Materials and Methods” . Pseudo-haploid versions were published in Allentoft, et al. ^7. Six ancient sample BAM files were provided by Swapan Mallick from Harvard University. Pseudo-haploid versions were published in Mathieson et al ⁸. Information and recovery locations for the aDNA samples are shown in Table 1 and Figure 1.

**Fig 1 – Recovery locations of the samples in this study.**

**Table 1 – Sample information ^{7, 8}**

MATERIALS & METHODS

I utilized various tools, including GATK, ATLAS ², DICE ¹, and PMDtools. Using ADMIXTURE, ADMIXTOOLS ⁹, and GATK variant metrics tools, the final call-sets were subjected to various comparative analyses and quality control.

An assortment of techniques and work-flows were utilized for processing the raw Fastq and Bam reads, and genotyping the samples. Some of the tools utilized included:

BWA for mapping fastq reads to the Hg19 Human reference;
Fastqc to visualize various statistics of raw reads outputted from the sequencer and for quality control;
BBTools for trimming read ends for library adapters and deaminated bases;
DICE ¹ for inference of contamination of aDNA by modern humans. Dice is based on a Bayesian Markov Chain Monte-Carlo (MCMC), and uses probabilities of finding certain derived alleles by using derived allele frequencies in contaminant populations to determine the most likely source of human contamination.
The Genome Analysis Tool Kit (GATK) from Broad Institute, using;
1. HaplotypeCaller to reassemble active regions in the genome with significant evidence of variation, by realigning haplotypes present in the data against the reference genome. So in effect the existing mapping information from the BWA aligner is discarded, and the reads are realigned to the reference, followed by diploid calling of genotypes;
2. UnifiedGenotyper, which is a locus-based variant caller;
3. BaseRecalibrator to recalibrate base quality scores (BQSR) to compensate for systematic technical errors in error estimates of base quality scores emitted by sequencers;
4. GenotypeGVCFs for joint genotyping of individual gVCF files outputted by HaplotypeCaller. This was done to increase evidence for real variants;
5. VariantRecalibrator to assign a well-calibrated probability for each called variant. During Variant Quality Score Recalibration (VQSR), a continuous covarying estimate of the relationship between SNP call annotations and the probability that a SNP is a true variant versus a sequencing artifact. Instead of traditional filtering methods which examine each variant annotation individually, in VQSR, the profile of a good variant is established by looking at multiple variant annotations (coverage, fisherstrand, strandoddsratio, etc) simultaneously, by taking the overlap of the training-truth resource sets and the call-set;
6. VariantEval to evaluate all my call-sets, whether obtained using GATK or via ATLAS tools for various metrics such as percentage in dbSNP, genotype concordance, Transition/Transversion (Ti/Tv) ratios, and percentage of homozygous-reference (HomRef) vs homozygous-alternate (HomVar) vs heterozygous (Hets) SNPs in the call-sets.
Picard tools such as CollectVariantCallingMetrics were used to obtain call-set metrics, as well as processing Bam files.
Samtools for processing BAM files;
Analysis Tools for Low-coverage and Ancient Samples (ATLAS) ², using:
1. EstimatePMD to estimate the amount of post-mortem deamination;
2. Recal to perform base quality score recalibration using regions that don’t show polymorphism, such as X chromosomes in males;
3. PMDS to filter out reads (contaminating sequences) originating from modern human DNA, from endogenous aDNA sequences, using postmortem degradation patterns, utilizing a technique introduced in Skoglund et al, 2014.
MapDamage to assess and visualize C-→T and G-→A substitutions due to deamination at read ends, and throughout the reads, and to authenticate the endogenous material as aDNA.
Rstudio for plotting some of the outputs.

The following techniques were used to mitigate the effect of deaminated bases:

Trimming the first few bases at the 3′ and 5′ ends of reads;
Using MapDamage to downscale quality scores for suspected deaminated bases;
Using the ATLAS pipeline to account for deaminated bases during processing;
Using transversion SNPs only in the analysis.

HUMAN CONTAMINATION

Quantifying and removing modern-human contamination from aDNA is one of the most challenging tasks for obvious reasons. Traditionally, aDNA samples which have been substantially contaminated with modern-human DNA have been discarded. For example, in this study, contamination with DNA from Western-European excavators, will tend to cause the Eurasian steppe samples to display increased shared genetic drift with Western-Europeans, at the expense of less shared drift with Asians . It is therefore important to estimate the amount of modern-human contamination to gauge the accuracy of the results of any downstream analysis using these samples.

The program DICE¹ was used to infer the amount of modern-human contamination of the samples. DICE uses a Bayesian Markov Chain Monte-Carlo approach, by computing the probability of finding a derived allele as contaminant by using the derived allele frequency of the suspected contaminating modern-human population. Since Yorubans were not suspected as the contaminant population, they were used as the anchor population to compute demographic patterns. CEU was used as the contaminating population. After confirming convergence of the program, the outputs were plotted using Rstudio, and are shown in figures 2 and 3. The results show around 1.5-3.5% contamination with modern Western-European DNA for most of the samples, except for the Scythian Iron Age sample which showed around 8% contamination .

**Fig 2 – Contamination of aDNA samples by modern Western-Europeans. Srubnaya-I0232 had the lowest at ~1.5%, and EHG had the highest at about 3.5%.**

**Fig 3 – Contamination of aDNA samples by modern Western-Europeans. The Iron Age Scythian sample had the highest estimate at around 8%.**

REFERENCE BIAS

A metric that is often used to assess accuracy in the variant call-sets is reference bias. Although GATK is an excellent BAM processing and variant calling workflow, it is not as well suited as the ATLAS workflow for aDNA, which tend to be lower coverage. This is primarily due to variants supported by fewer reads not being called, as GATK requires greater evidence of a variant. Evidence of this can be seen in table 3, where there appears to be a correlation between increasing reference bias and decreasing read depth A consequence of increased reference bias with the GATK workflow was the observation of as a small percentage of African admixture (2-5%) in the Eurasian steppe samples using ADMIXTURE.

Table 3 – Dstats showing less reference bias for the diploid processed genomes in this study than their published pseudo-haploid counterparts. As expected, reference bias appears to be proportional to the coverage with GATK, with bias decreasing with increasing read depths.

IBS allele by allele comparisons were performed using PLINK to compare the various genomes against the Hg19 reference. The results reinforced the aforementioned results obtained using dstats. The diploid genotyped genomes using GATK showed the highest reference bias, followed by their published aDNA counterparts, with the diploid genotyped genomes using the ATLAS pipeline displaying the lowest reference bias, as shown in table 4. Side-by-side normalized distribution comparisons using the samples shown in table 4 are shown in Fig 4.

**Fig 4 – Reference bias distribution comparisons using IBS. Values are normalized.**

Table 4 – IBS comparison with the Hg19 reference genome using PLINK. The diploid genotyped samples using the ATLAS pipeline show a lower reference bias than the published pseudo-haploid genomes, thus reinforcing the results obtained above using dstats.

This elevated reference bias for the samples processed using GATK, versus those processed using the ATLAS workflow is shown in the IBS comparisons with the Hg19 reference, shown in table 4 . PLINK was used for the IBS runs.

Although suspected post-mortem deaminated sites were filtered out, to rule out deamination as the cause for the small African admixture percentage in the samples, analysis was done using transversion sites only. However, the African admixture persisted.

HETROZYGOSITY

Hetrozygosity matters in any Identity-by-Descent (IBD) or haplotype sharing analysis using aDNA and modern humans. Figure 4 shows that the published pseudo-haploid genomes are rather ill suited for this type of work since a substantial proportion of positions in the human genome are hetrozygous, and thus pseudo-haploid genomes have an incorrect allele assignments of at least one allele for all hetrozygous positions.

**Fig 5 – IBS matrix obtained using PLINK showing limited allele sharing between the published pseudo-haploid genomes and various modern populations.**

ChromoPainter was used to generate a matrix of haplotype sharing, using the ATLAS and GATK processed genomes as donors, and individuals from various populations as recipients, as shown in figure 6.. The results are consistent with those obtained above obtained using IBS.

**Fig 6 – ChromoPainter haplotype donor-recipient heatmap using the processed diploid genomes as donors, based on values from chunkcounts.out.**

For the Iron Age Scythian sample, shown in fig 7, the allele sharing pattern obtained using IBS appears to be somewhat bi-modal, peaking in Kurds, Tajiks, Tatars, and Pashtuns, and also in Bellaroussans and Ukranians. This pattern deviates a little from the general pattern seen for early and middle Bronze age steppe populations, especially in the case of Kurds, where there is a significantly greater peak for NE Europeans. This suggests that Scythians played a larger role than other steppe populations in the demography of some Asian populations such as Kurds and Tajiks, which is consistent with the proximate physical presence of Iron Age Scythians and Sarmatians to present day Afghanistan and Kurdistan.

**Fig 7 – Allele sharing between the ATLAS pipeline diploid processed Iron Age Scythian I0247 sample and various modern populations using IBS. The distribution is normalized.**

**Fig 8 – Allele sharing between the ATLAS pipeline diploid processed Bronze Age Yamna Samara I0231 sample and various modern populations using IBS. The distribution is normalized.**

**Fig 9 – Allele sharing between the ATLAS pipeline diploid processed Middle/Late Bronze Age Srubna I0232 sample and various modern populations using IBS. The distribution is normalized.**

**Fig 10 – Allele sharing between the ATLAS pipeline diploid processed Mesolithic Eastern Hunter Gatherer (EHG I0061) sample and various modern populations using IBS. The distribution is normalized.**

The Karasuk appear to share the highest total drift with EHG. Based on the location of the Karasuk culture on the Eurasian steppe relative to MA1 and EHG, and based on ADMIXTURE results showing them to be an important source population for C/S/W Asians, is likely that the high amount of the allele sharing is due to a combination of ancestry derived from EHG and from a population ancestral to both EHG and Karasuk, to wit, MA1. This combined with a strong f3 admixture signal of the form f3( S/SC/W Asian,;Karasuk, Mbuti) as shown in figure 13 below, indicates Karasuk as an important source of extra ANE related admixture for populations ranging from eastern W Asia to C & S Asia.

DIPLOIID VS PSEUDO-HAPLOID aDNA FOR IBS/IBD ANALYSIS

IBS/IBD is an area where the published pseudo-haploid genomes fared much worse than the diploid genomes processed pursuant to this study. This was expected, however to better quantify and visualize the comparison between the pseudo-haploid and the diploid genomes IBS comparisons were done with some of the ancient genomes. Figures 11 and 12 show IBS comparisons with the diploid and published pseudo-haploid Scythian I0247, respectively. The IBS values are normalized population averages.

A cursory inspection of the rankings and IBS scores of some of the populations, such as the ones highlighted in red, shows that the published pseudo-haploid genome is ill-suited for this type of analysis.

Fig 11 – IBS comparison with Scythian I0247 which was diploid genotyped within this study showing E Europeans and Iranics as populations which share the most drift with Scythian. Values are normalized population means.

Fig 12 – IBS comparisons with the published pseudo-haploid Scythian I0247 sample. Pseudo-haploid genomes are ill-suited for IBS/IBD studies with diploids. Populations with inconsistent rankings or IBS amounts are highlighted in red.

3-POP TEST

The 3-Pop test is a quick test of admixture. Here this test was performed in out-group mode with f3 (modern; ancient, mbuti/chimp) to test for the strength of the admixture signal attributable to the ancient genome only.

**Fig 13 – Qp3pop in out-group mode, f3 ( X; MA1, Mbuti) vs f3( X; Karasuk, Mbuti) testing for evidence of ANE vs Karasuk admixture, with standard errors.**

Fig 14 – Qp3pop in out-group mode. A comparison of, f3 ( X; Scythian, Mbuti) vs f3( X; Srubnaya, Mbuti) testing for evidence of Scythian & Srubnaya admixture, with standard errors. Srubnaya clearly appears to be the preferred proxy for steppe admixture for Europeans, whereas Scythian appears to be just as good a proxy for Iranics and C Asians.

**Fig 15 – Comparison of Qp3pop in out-group mode, f3 ( X; aDNA, Mbuti) indicating that Karasuk best tracks ANE admixture in Eurasians.**

AVAILABILITY OF DIPLOID aDNA

Samples processed for diploid genotypes pursuant to this study are available upon request.

REFERENCES

Racimo F, Renaud G, Slatkin M (2016), Joint Estimation of Contamination, Error and Demography for Nuclear DNA from Ancient Humans. PLoS Genet 12(4): e1005972. doi:10.1371/journal, pgen.1005972.
Analysis Tools for Low-depth and Ancient Samples, Vivian Link, Athanasios Kousathanas, Krishna Veeramah, Christian Sell, Amelie Scheu, Daniel Wegmann, bioRxiv 105346; doi: https://doi.org/10.1101/105346.
Skoglund P, Northoff BH, Shunkov MV, et al. Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal. Proceedings of the National Academy of Sciences of the United States of America. 2014;111(6):2229-2234. doi:10.1073/pnas.1318934111.
Lazaridis I, Patterson N, Mittnik A, Renaud G, Mallick S, Kirsanow K, et al. Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.; 2014; 513: 409–413.
Haak W, Lazaridis I, Patterson N, Rohland N, Mallick S, Llamas B, et al. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature. 2015; https://doi.org/10.1038/ nature14317 PMID: 25731166.
The population genomics of archaeological transition in west Iberia: Investigation of ancient substructure using imputation and haplotype-based methods, Rui Martiniano , Lara M. Cassidy, Ros Ó’Maoldúin, et al, 2017, bioRxiv 134254; doi: https://doi.org/10.1101/134254.
Allentoft ME, Sikora M, Sjögren K-G, Rasmussen S, Rasmussen M, Stenderup J, et al. Population genomics of Bronze Age Eurasia. Nature. 2015; 522: 167–172. https://doi.org/10.1038/nature14507, PMID: 26062507.
Mathieson I, Lazaridis I, Rohland N, Mallick S, Patterson N, Roodenberg SA, et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015; 528: 499–503. https://doi.org/10.1038/ nature16152 PMID: 26595274.
Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, et al. Ancient admixture in human history. Genetics. 2012; 192: 1065–1093. https://doi.org/10.1534/genetics.112.145037 PMID: 22960212

Diploid Genotyping Low to Medium Coverage Ancient DNA – Part 1