Genetic Relationships of Asian Populations using Whole Genome Sequencing

abstract

To date there has been very limited analysis on Asian population histories using whole genome sequencing (WGS). Most population history genetics studies only utilize a few hundred thousand SNPs, which is small fraction of the genetic variation in humans. This small fraction of the genetics variation is ascertained using certain populations. An example are the datasets at the Reich Lab at Harvard or Max Plank which utilize the 1240K Affymetrix Axiom chipset SNPs. Another example is the use of the 2.3M SNP datasets ascertained with the Illumina Omni 2.5 chipset, such as various 1000 Genome Project datasets.

While using a small fraction of the human genetic variation made sense in the past due to the cost of sequencing whole genomes being prohibitively high, nowadays when sequencing a whole genome at an average depth of 30X can be done for under $500, it makes more sense to use whole genomes in population history genetics analysis since analysis results are less biased than using a small fraction of the variation in human DNA.

Genetics analysis based on a small fraction of the SNPs of humans, whether it be 500,000 or 1,000,000 SNPs, ascertained using certain populations introduces all sorts of biases and leads to false conclusions on population histories.

Through a data sharing agreement, the GenomeAsia 100K project1 provided us with WGS VCF files aligned using the GrCh37 Human Reference for 1163 primarily Asian individuals from around 170 populations . We merged this dataset with WGS genomic data sequenced to a depth of 30X from two Kurd individuals from the Kurdistan, Iraq region.

Here we present IBD and IBS analysis results using an intersecting 65 million SNPs across Asian populations.

DNA Sequence Processing & quality control

To reduce processing time we processed the various large fastq, bam and VCF files using our Intel i9-12900K 24 core server equipped with 128 GB RAM and 4 TB Solid State M2 PCIe memory. The primary dataset was provided by the GenomeAsia 100K project (GA100K) via a 300 GB .gz zipped VCF file. The broken VCF headers and “Format” fields were reformatted using Picard FixVcfHeader and our in-house scripts.

The 718 million forward and reverse strand paired-end reads of the fastq files for the Kurd individuals were processed as follows:

  1. Quality checked using fastqc2 . Mean quality across all bases is good at around 36 with the usual slight drop at the 3′ and 5′ prime ends shown in fig1. Mean read lengths are good and pretty consistent at 150 bp. Sequence duplication levels and per base N contents are unremarkable.
  2. Adapter and quality trimming was performed using AdapterRemoval v2.3-1 which has been shown to outperform other similar programs with regards to speed, sensitivity, negative predictive value, and Mathews’ correlation coefficient3 . Parameters used were as follows; “AdapterRemoval –file1 –file2 –basename –minlength 24 –trimqualities –trimns –minquality 10 –gzip –threads 24″.
  3. Alignment to CrGh37 human reference genome was performed using BWA mem and the result piped to Samtools from standout using the following parameters; bwa mem -k 19 -r 2.5 -t 24 -R Homo_sapiens.GRCh37.dna.primary_assembly.fa file.pair1 file.pair2 | samtools view -O BAM –threads 24 -o file.bam –
  4. The 122 GB bam files were coordinate sorted and indexed using samtools as follows: samtools sort -@ 24 -T Temp/temp file.bam -O BAM -o file.sorted.bam. Indexed as follows: samtools index -b -@ 24 file.sorted.bam
  5. Variants were called using Bcftools as follows; bcftools mpileup -O u –min-MQ 15 –min-BQ 20 –annotate INFO/AD –threads 24 -f Homo_Homo_sapiens.GRCh37.dna.primary_assembly.fa file.sorted.bam | bcftools call -c -O z –threads 24 –format-fields GQ -o file.Hg37.vcf.gz
  6. The resulting VCF files containing 1478 million SNPs were further filtered using Vcftools dropping sites with total allelic depths of high-quality bases (AD) less than 3.
  7. A total of 65M SNPs covered by the GA100K WGS project were extracted from the VCF files and then merged with the GA100K VCF files using Bcftools.
Fig 1 – Kurd 30X WGS base quality scores. Average read lengths are 150 bp. Total sequences approximately 720,000,000.

To conduct IBD and IBS analysis we used Plink as follows ./plink –bfile GA100K –geno 0 –genome. The –geno flag ensures that all sites used in the analysis have variant calls for all the samples used. This resulted in 65M intersecting sites across all the samples.

We then used our own script to create IBS/IBD population averages and tabulate the output for use in our R PCA script using the R packages ggplot2, gplots, ggfortify, ggrepel, and factoextra.

PCA ANALYSIS

The IBD (Pi-Hat) results obtained using 65M intersecting SNPs across all samples using the Plink –genome flag were processed as follows:

  1. For each population we averaged the IBD results such that each population was represented by only one population average observation in the PCA. The main reason this was done is to reduce bias in the PCA resulting from populations having varying numbers of representative samples.
  2. We constructed a 19×19 square matrix from the IBD population averages resulting in a PCA with 19 principal components shown in fig 2.
Fig 2 – Plink IBD population averages based on 65M intersecting SNPs used to build 19 principal components

We used the matrix shown in fig 2 to build a PCA using the ggplot2 package in R with input scaling. Approximately 85% of the variance in the PCA can be explained by the first 2 principal components; PC1 and PC2 (fig 5). A plot using PC1 and PC2 is shown in fig 3.

Using PC1 and PC2 we note 3 general clusters:

  1. The North Indian Indo-Aryan/Indo-Iranian cluster which also includes the Indo-Iranic Kurds. North Indian populations such as Sindhis and Rajputs appear to be genetically closer to Indo-Iranic Kurds than to primarily Ancestral South Indian (ASI) populations such as Paniya. This is due to shared Eurasian Steppe and Neolithic Iranian ancestry between Kurds and Northern Indians. From the populations studied, genetically, Parsis, Pakistani Pashtuns and Brahui appear to be the 3 closest South Asian populations to Kurds.
  2. The East Asian cluster which includes Mongols, Chinese and Koreans.
  3. The Ancient Ancestral South Indian (AASI) and Papuan cluster.
  4. Nicobarese are located on a cline between Papuans and SE Asians nearer to SE Asians. This suggests a relatively more recent migration from SE Asia into the Nicobar islands with an admixture event with a population related to Papuans.
Fig 3 – Observations are population averages for IBD sharing. This plot explains approximately 85% of the variances between the populations. Interestingly, Nicobarese appear to be an offshoot of a SE Asian population that admixed with a population related to Papuans.

Using the PC2-PC3 plot we see that Indo-Iranians and Indo-Aryans are differentially related to East Asians suggesting more recent admixture between some South Asians and East Asians to the exclusion of others.

Fig 4 – PCA plot using PC2 and PC3 shows about 40% of the genetic variation . This plot splits 3 deeply diverged populations; East Asians, Ancient Ancestral South Indian (AASI) and Papuans.
Fig 5 – Scree plot showing about 90% of the variances can be explained by the first 3 principal components

heatmap & GENETIC DISTANCE

One-to-one shared genetic drift between populations used in this study is visualized using the heat map shown in fig 6

Fig 6 – Shared drift between populations in this study using 65M overlapping SNPs. Rows are scaled values and columns unscaled IBD values.

Euclidean genetic distances were calculated based on the matrix shown in fig 2 and are shown in fig 7 to 12 based on whole genome data with 65 million overlapping SNPs among the populations studied. Results are for the most part unsurprising except for Papuans and Onge. Like the Andamanes Onge Papuans are genetically quite distant from all the populations studied. Surprisingly, they are closer to Northern Indian and Pakistani populations and even Iraqi Kurds than Andamanese Onge. Papuan results would not be consistent with Papuans and Andamanese Onge being an early Eurasian split sharing a branch to the exclusion of other Eurasians as some have suggested. Rather, these results would be more consistent with Papuans being a relic population who split away from other Eurasian populations early on or were part of a separate out-of-Africa event. Alternatively, the possibility exists that the original Papuans admixed with a South or West Eurasian migrant population to Oceania a few thousand years ago to create present day Papuans.

Fig 7 – Genetic distances to Pakistani Pashtuns and Iraqi Kurds. Pakistani Pashtuns from Peshawar, Pakistan and surrounds are on a genetic cline between Afghan Pashtuns and Indic populations to the east. Unsurprisingly, Parsis who are Zoroastrians who migrated from Iran to India approximately 1000 years ago are closest to Kurds followed by Pakistan Pashtuns and Brahui. We hypothesize that Afghan Pashtuns would be genetically closer to Kurds than Parsis.
Fig 8 – Genetic distances to Brahui and Parsis.
Fig 9 – Genetic distances to Burusho and Hazara of Pakistan. From East Asian populations Hazara are genetically significantly more proximate to Mongols and Buryat than Chinese Han. With regards to Burusho it\s a little surprising to see Iraqi Kurds genetically signifcantly closer to them than other Iranic or Indic populations.
Fig 10 – Genetic distances to Khalka and Nicobarese. Results for Nicobarese are consistent with a SE Asian population migrating to the Nicobar islands and admixing with a local Andamanese population with Papuan type admixture.
Fig 11 – Genetic distances to South Indian Paniya and Andaman island Onge. Results for Onge are consistent with a deeply diverged South Asian population who have been isolated for a very long time. Paniya results do not show high genetic proximity to any of the populations studied.
Fig 12 – Like the Andamanes Onge, Papuans are genetically quite distant from all the populations studied. Surprisingly, they are closer to Northern Indian and Pakistani populations and even Iraqi Kurds than Andamanese Onge.

References

  1. GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019). https://doi.org/10.1038/s41586-019-1793-z
  2. Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
  3. Schubert, M., Lindgreen, S. & Orlando, L. AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res Notes 9, 88 (2016). https://doi.org/10.1186/s13104-016-1900-2
Scroll to Top
Scroll to Top