The Human Pangenome Reference: A New Frontier for Human Genomics

May 24, 2023

Assembling the Human Genome

The Human Genome Project (HGP), a global collaboration involving 20 groups, released the first draft of the human reference genome in 2001. Since then, the HGP reference genome has formed the backbone of human genomics research and led to numerous discoveries in human health. However, the initial reference genome was incomplete.

Over the years, the reference genome has undergone several updates. The most recent include the Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14) and the Telomere-to-Telomere (T2T) Consortium's assembly of the CHM13 cell line (T2T-CHM13). T2T-CHM13 was assembled mainly from PacBio and Oxford Nanopore data. These platforms generate long reads that can span repeats and help resolve complex, highly repetitive regions such as centromeres and telomeres.

Towards Capturing the Full Diversity of Human Genome Variation

The T2T-CHM13 assembly provides a more complete and contiguous sequence, with fewer gaps and improved accuracy in challenging genomic regions. However, it has become clear that a reference genome derived from one individual, or from a small number of individuals, cannot capture the full extent of genetic diversity within the human population. For instance, more than two-thirds of structural variants, which include insertions, deletions, duplications, and translocations, are overlooked when sequencing data are aligned to a single reference genome. This problem is critical to address because structural variants often have a greater impact on gene function than single nucleotide polymorphisms (SNPs) or small insertions and deletions (indels).

In a recent publication in Nature, Liao et al. report the first release of a human pangenome reference from the Human Pangenome Reference Consortium (HPRC). The HPRC pangenome reference consists of high-quality genome assemblies from a diverse set of individuals and aims to better capture global genomic diversity.

Assembling the First Human Pangenome Reference

The HPRC pangenome reference consists of 47 genome assemblies: 29 samples sequenced by the HPRC using long-read and linked-read sequencing, and 18 samples sequenced by other efforts. The 29 HPRC samples, selected from 1000 Genomes Project (1KG) lymphoblastoid cell lines with normal karyotypes and low passage numbers, were sequenced on PacBio High Fidelity (HiFi) and Oxford Nanopore long-read sequencers, as well as Illumina short-read sequencers, to obtain reads with varying lengths and error profiles.

The samples were sequenced to an average depth of 39.7X with a read N50 of 19.6 kb. N50 is a length-weighted measure of contiguity, not a simple average: half of all bases lie in sequences at least this long. The quality value (QV) was 53.57 on the Phred scale, which corresponds to 1 error per 227,509 bp. The individual haplotypes were assembled using the Trio-Hifiasm software, followed by annotation using a custom Ensembl mapping pipeline to label GENCODE genes and transcripts.
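The two summary statistics in this paragraph are simple to compute. The following is a minimal Python sketch, not the HPRC's own tooling: the function names and the toy read lengths are hypothetical and chosen only for illustration. It shows how N50 differs from a plain mean and how a Phred-scaled QV relates to an error rate through QV = -10 * log10(error rate).

```python
import math


def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def phred_qv(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def bases_per_error(qv):
    """Convert a Phred-scaled quality value to the expected bases per error."""
    return 10 ** (qv / 10)


# Hypothetical read lengths in bp (illustration only, not HPRC data).
toy_lengths = [2, 3, 4, 5, 6, 10]
print("N50:", n50(toy_lengths))                        # 6
print("mean:", sum(toy_lengths) / len(toy_lengths))    # 5.0

# One error per 227,509 bp expressed as a Phred-scaled QV.
print("QV:", round(phred_qv(1 / 227_509), 2))          # 53.57
```

In the toy data, the single long read pulls the N50 above the mean, which is why N50 is usually preferred over the average when describing read or contig lengths.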

New Human Pangenome Reference Outperforms Existing Reference Genomes

The assembled genomes were aligned to T2T-CHM13 to assess completeness and copy number polymorphisms, and they showed high concordance. Additionally, over 99% of protein-coding genes and transcripts were identified in the HPRC genomes. Together, these results indicate that the HPRC assemblies are of high quality, structurally sound, and complete, and that they encompass the known human copy number variation in the latest genome release.

The HPRC draft pangenome was built from the 47 genome assemblies using three tools: Minigraph, Minigraph-Cactus (MC), and PanGenome Graph Builder (PGGB). The average length of the pangenome was over 3 Gb, with the MC graph reporting the most accurate alignment. The MC graph also showed the highest recall and precision for small variants when variants decoded from the pangenome were compared with the GRCh38 variant sets. The authors showed that alignment to the HPRC pangenome outperforms the current reference genomes in capturing genomic variation, such as SNPs, indels, and structural variants (SVs), and that most errors reported by conventional mapping techniques are real variants.
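Recall and precision here have their usual meaning: recall is the fraction of known variants that were recovered, and precision is the fraction of reported variants that are correct. The Python sketch below illustrates the calculation under a strong simplifying assumption, namely that two variants match only when their (chromosome, position, reference allele, alternate allele) tuples are identical. Dedicated benchmarking tools also reconcile different representations of the same variant, which this sketch does not attempt. The function name and all variants shown are hypothetical.

```python
def precision_recall(called, truth):
    """Compare two collections of variants keyed by (chrom, pos, ref, alt).

    Returns (precision, recall). An empty denominator yields 0.0.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical variants, for illustration only.
truth_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "CT"),
    ("chr2", 75, "GTT", "G"),
    ("chr2", 900, "T", "C"),   # missed by the call set: lowers recall
}
call_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "CT"),
    ("chr2", 75, "GTT", "G"),
    ("chr3", 10, "G", "A"),    # absent from the truth set: lowers precision
}

precision, recall = precision_recall(call_set, truth_set)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.75 recall=0.75
```

A variant set that scores high on both measures recovers most true variants while reporting few false ones, which is the sense in which the MC graph performed best in the comparison above.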

Future Implications for Life Science Research

Overall, the human pangenome is a major step towards a more comprehensive human reference genome that represents a broader range of populations. Because the pangenome reference better captures the genomic diversity of the global population, it paves the way for more accurate detection of structural variants, SNPs, and indels than current standards allow. This will make it possible to detect novel variants that would otherwise have been dismissed as unaligned reads or errors, and to evaluate them as biomarkers of human disease.

Outsourcing Bioinformatics Analysis: How Bridge Informatics Can Help

Groundbreaking studies like these are made possible by technological advances making biological data generation, storage and analysis faster and more accessible than ever before. From pipeline development and software engineering to deploying existing bioinformatics tools, Bridge Informatics can help you on every step of your research journey.
As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Bridge Informatics’ bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis. Click here to schedule a free introductory call with a member of our team.

Haider M. Hassan, Data Scientist, Bridge Informatics

Haider is one of our premier data scientists. He provides bioinformatic services to clients, including high throughput sequencing, data pre-processing, analysis, and custom pipeline development. Drawing on his rich experience with a variety of high-throughput sequencing technologies, Haider analyzes transcriptional (spatial and single-cell), epigenetic, and genetic landscapes.

Before joining Bridge Informatics, Haider was a Postdoctoral Associate at the London Regional Cancer Centre in Ontario, Canada. During his postdoc, he investigated the epigenetics of late-onset liver cancer using murine and human models. Haider holds a Ph.D. in biochemistry from Western University, where he studied the molecular mechanisms behind oncogenesis. Haider still lives in Ontario and enjoys spending his spare time visiting local parks. If you’re interested in reaching out, please email jennifer.martinez@old.bridgeinformatics.com or dan.ryder@old.bridgeinformatics.com.
