abstract
To date there has been very limited analysis on Asian population histories using whole genome sequencing (WGS). Most population history genetics studies only utilize a few hundred thousand SNPs, which is small fraction of the genetic variation in humans. This small fraction of the genetics variation is ascertained using certain populations. An example are the datasets at the Reich Lab at Harvard or Max Plank which utilize the 1240K Affymetrix Axiom chipset SNPs. Another example is the use of the 2.3M SNP datasets ascertained with the Illumina Omni 2.5 chipset, such as various 1000 Genome Project datasets.
While using a small fraction of the human genetic variation made sense in the past due to the cost of sequencing whole genomes being prohibitively high, nowadays when sequencing a whole genome at an average depth of 30X can be done for under $500, it makes more sense to use whole genomes in population history genetics analysis since analysis results are less biased than using a small fraction of the variation in human DNA.
Genetics analysis based on a small fraction of the SNPs of humans, whether it be 500,000 or 1,000,000 SNPs, ascertained using certain populations introduces all sorts of biases and leads to false conclusions on population histories.
Through a data sharing agreement, the GenomeAsia 100K project1 provided us with WGS VCF files aligned using the GrCh37 Human Reference for 1163 primarily Asian individuals from around 170 populations . We merged this dataset with WGS genomic data sequenced to a depth of 30X from two Kurd individuals from the Kurdistan, Iraq region.
Here we present IBD and IBS analysis results using an intersecting 65 million SNPs across Asian populations.
DNA Sequence Processing & quality control
To reduce processing time we processed the various large fastq, bam and VCF files using our Intel i9-12900K 24 core server equipped with 128 GB RAM and 4 TB Solid State M2 PCIe memory. The primary dataset was provided by the GenomeAsia 100K project (GA100K) via a 300 GB .gz zipped VCF file. The broken VCF headers and “Format” fields were reformatted using Picard FixVcfHeader and our in-house scripts.
The 718 million forward and reverse strand paired-end reads of the fastq files for the Kurd individuals were processed as follows:
- Quality checked using fastqc2 . Mean quality across all bases is good at around 36 with the usual slight drop at the 3′ and 5′ prime ends shown in fig1. Mean read lengths are good and pretty consistent at 150 bp. Sequence duplication levels and per base N contents are unremarkable.
- Adapter and quality trimming was performed using AdapterRemoval v2.3-1 which has been shown to outperform other similar programs with regards to speed, sensitivity, negative predictive value, and Mathews’ correlation coefficient3 . Parameters used were as follows; “AdapterRemoval –file1 –file2 –basename –minlength 24 –trimqualities –trimns –minquality 10 –gzip –threads 24″.
- Alignment to CrGh37 human reference genome was performed using BWA mem and the result piped to Samtools from standout using the following parameters; bwa mem -k 19 -r 2.5 -t 24 -R Homo_sapiens.GRCh37.dna.primary_assembly.fa file.pair1 file.pair2 | samtools view -O BAM –threads 24 -o file.bam –
- The 122 GB bam files were coordinate sorted and indexed using samtools as follows: samtools sort -@ 24 -T Temp/temp file.bam -O BAM -o file.sorted.bam. Indexed as follows: samtools index -b -@ 24 file.sorted.bam
- Variants were called using Bcftools as follows; bcftools mpileup -O u –min-MQ 15 –min-BQ 20 –annotate INFO/AD –threads 24 -f Homo_Homo_sapiens.GRCh37.dna.primary_assembly.fa file.sorted.bam | bcftools call -c -O z –threads 24 –format-fields GQ -o file.Hg37.vcf.gz
- The resulting VCF files containing 1478 million SNPs were further filtered using Vcftools dropping sites with total allelic depths of high-quality bases (AD) less than 3.
- A total of 65M SNPs covered by the GA100K WGS project were extracted from the VCF files and then merged with the GA100K VCF files using Bcftools.
To conduct IBD and IBS analysis we used Plink as follows ./plink –bfile GA100K –geno 0 –genome. The –geno flag ensures that all sites used in the analysis have variant calls for all the samples used. This resulted in 65M intersecting sites across all the samples.
We then used our own script to create IBS/IBD population averages and tabulate the output for use in our R PCA script using the R packages ggplot2, gplots, ggfortify, ggrepel, and factoextra.
PCA ANALYSIS
The IBD (Pi-Hat) results obtained using 65M intersecting SNPs across all samples using the Plink –genome flag were processed as follows:
- For each population we averaged the IBD results such that each population was represented by only one population average observation in the PCA. The main reason this was done is to reduce bias in the PCA resulting from populations having varying numbers of representative samples.
- We constructed a 19×19 square matrix from the IBD population averages resulting in a PCA with 19 principal components shown in fig 2.
We used the matrix shown in fig 2 to build a PCA using the ggplot2 package in R with input scaling. Approximately 85% of the variance in the PCA can be explained by the first 2 principal components; PC1 and PC2 (fig 5). A plot using PC1 and PC2 is shown in fig 3.
Using PC1 and PC2 we note 3 general clusters:
- The North Indian Indo-Aryan/Indo-Iranian cluster which also includes the Indo-Iranic Kurds. North Indian populations such as Sindhis and Rajputs appear to be genetically closer to Indo-Iranic Kurds than to primarily Ancestral South Indian (ASI) populations such as Paniya. This is due to shared Eurasian Steppe and Neolithic Iranian ancestry between Kurds and Northern Indians. From the populations studied, genetically, Parsis, Pakistani Pashtuns and Brahui appear to be the 3 closest South Asian populations to Kurds.
- The East Asian cluster which includes Mongols, Chinese and Koreans.
- The Ancient Ancestral South Indian (AASI) and Papuan cluster.
- Nicobarese are located on a cline between Papuans and SE Asians nearer to SE Asians. This suggests a relatively more recent migration from SE Asia into the Nicobar islands with an admixture event with a population related to Papuans.
Using the PC2-PC3 plot we see that Indo-Iranians and Indo-Aryans are differentially related to East Asians suggesting more recent admixture between some South Asians and East Asians to the exclusion of others.
heatmap & GENETIC DISTANCE
One-to-one shared genetic drift between populations used in this study is visualized using the heat map shown in fig 6
Euclidean genetic distances were calculated based on the matrix shown in fig 2 and are shown in fig 7 to 12 based on whole genome data with 65 million overlapping SNPs among the populations studied. Results are for the most part unsurprising except for Papuans and Onge. Like the Andamanes Onge Papuans are genetically quite distant from all the populations studied. Surprisingly, they are closer to Northern Indian and Pakistani populations and even Iraqi Kurds than Andamanese Onge. Papuan results would not be consistent with Papuans and Andamanese Onge being an early Eurasian split sharing a branch to the exclusion of other Eurasians as some have suggested. Rather, these results would be more consistent with Papuans being a relic population who split away from other Eurasian populations early on or were part of a separate out-of-Africa event. Alternatively, the possibility exists that the original Papuans admixed with a South or West Eurasian migrant population to Oceania a few thousand years ago to create present day Papuans.
References
- GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019). https://doi.org/10.1038/s41586-019-1793-z
- Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
- Schubert, M., Lindgreen, S. & Orlando, L. AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res Notes 9, 88 (2016). https://doi.org/10.1186/s13104-016-1900-2