A New Tool for Better Assembly of HiFi Genomic Sequencing

September 9, 2022

What is HiFi Sequencing?

With all of the recent advances in genomic data analysis, it is worth examining where the raw sequence data actually comes from. Researchers have several sequencing platforms to choose from, ranging from traditional, low-throughput Sanger sequencing to next- and third-generation sequencing technologies. The right platform depends on the scale of the project, the cost of sequencing and the research question that the downstream analysis is meant to answer.

Pacific Biosciences (often called PacBio) is one of the newer players in the genomics space, and its technology is usually placed in the “third-generation sequencing” category. PacBio uses a distinctive sequencing-by-synthesis method to produce what it calls HiFi reads. Each DNA fragment is turned into a large, circular molecule that can be sequenced continuously, pass after pass, unlike the short fragments read by other methods such as Illumina sequencing. PacBio HiFi reads are more than 99% accurate, and HiFi was a core sequencing technology in the recently completed Telomere-to-Telomere project to finish the human genome sequence.
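To put the "over 99%" figure in context, read accuracy is often quoted on the Phred quality scale, where Q = -10 * log10(error rate). The short Python sketch below converts between the two. The function names are made up for this post and are not part of any PacBio tool.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (e.g. 0.99) to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in the range [0, 1)")
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


def phred_to_accuracy(q):
    """Convert a Phred quality score (e.g. 30) back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


# 99% accuracy corresponds to roughly Q20; Q30 means about 1 error in 1,000 bases.
print(f"{accuracy_to_phred(0.99):.1f}")
print(f"{phred_to_accuracy(30):.4f}")
```

On this scale, every 10 points means ten times fewer errors, which is why small-looking differences in percentage accuracy can matter a great deal downstream.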

Advantages of Long-Read Sequencing

PacBio HiFi sequencing is one of a few options for long-read sequencing (10,000+ base pairs per read), in contrast to older methods, which were primarily short-read (50-300 base pairs). Assembling short reads correctly is very time and labor intensive. Assembly becomes even more challenging and less accurate when the organism lacks a high-quality reference genome, or when its genome contains many repeat sequences or rare variants.

Long-read technology, on the other hand, produces reads that each span a much larger stretch of the genome. This makes genome assembly considerably faster and easier, and it improves accuracy when identifying rare variants and telling repeated sequences apart.

New Machine Learning Tool Improves HiFi Read Assembly

Although long-read sequencing makes genome assembly more accurate overall, producing accurate sequences that are tens of thousands of base pairs long is still a challenge. HiFi sequencing handles this by building a “consensus” sequence from multiple observations of the same circular DNA molecule. A recent paper in Nature Biotechnology by Baid et al. describes a deep learning tool, called DeepConsensus, that improves the accuracy of this initial HiFi sequence and, as a result, significantly improves the assembly of HiFi reads.
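The idea of a consensus sequence can be illustrated with a toy example. The sketch below takes a simple majority vote at each position across several passes over the same molecule. It is only an illustration: it assumes the passes are already aligned and the same length, the sequences are invented, and it is neither the standard PacBio consensus algorithm nor the DeepConsensus model.

```python
from collections import Counter


def majority_consensus(passes):
    """Column-wise majority vote over equal-length, pre-aligned passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be pre-aligned to the same length")

    consensus = []
    for column in zip(*passes):
        # Pick the most frequently observed base at this position.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Four hypothetical passes over the same molecule, each with at most one error.
passes = ["ACGTTAGC", "ACGTAAGC", "ACCTTAGC", "ACGTTAGC"]
print(majority_consensus(passes))  # prints "ACGTTAGC"
```

Because the errors in individual passes fall at different positions, they are outvoted by the other passes. Real reads also contain insertions and deletions, which is part of why the real problem is much harder than this sketch suggests.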

Compared to the current standard tool, the DeepConsensus model reduced read errors by 42%, which in turn increased the yield of HiFi reads. When DeepConsensus reads were fed into a common HiFi assembly tool, assembly contiguity increased from 4.9 Mb to 17.2 Mb and the false gene duplication rate fell from 1.1% to 0.5%. More accurate initial sequence data improves the accuracy and reproducibility of all downstream analysis, and the authors hope that similar tools can be developed to improve other types of genomic sequence data.
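Contiguity figures like the ones above are usually reported as an N50-style statistic: the contig length at which contigs of that size or longer contain at least half of the assembled bases. The sketch below computes a plain N50 from a list of contig lengths. The contig lengths are hypothetical and chosen only to show how fewer, longer contigs raise the number. The paper's exact statistic may differ, for example by measuring against the genome size rather than the assembly size.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Two hypothetical assemblies of the same 32 Mb total (lengths in Mb).
fragmented = [2, 2, 2, 3, 3, 4, 8, 8]
contiguous = [5, 7, 20]
print(n50(fragmented))  # prints 8
print(n50(contiguous))  # prints 20
```

A higher value means the genome was reconstructed in fewer, longer pieces, which generally makes downstream analysis easier.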

Outsourcing Bioinformatics Analysis

The raw genomic data produced by these sequencing platforms has enormous potential to provide us with biological and health-related insights. However, the sheer size of this data means that significant downstream processing and analysis is required to extract this valuable information. Working with service providers like Bridge Informatics is a great option. We support your data storage, analysis and pipeline development needs to eliminate common challenges associated with these downstream analysis tasks. Book a free discovery call with us if you’re interested in outsourcing your bioinformatics needs to Bridge Informatics.



Jane Cook, Journalist & Content Writer, Bridge Informatics

Jane is a Content Writer at Bridge Informatics, a professional services firm that helps biotech customers implement advanced techniques in management and analysis of genomic data. Bridge Informatics focuses on data mining, machine learning, and various bioinformatic techniques to discover biomarkers and companion diagnostics. If you’re interested in reaching out, please email daniel.dacey@old.bridgeinformatics.com or dan.ryder@old.bridgeinformatics.com.

Sources:

https://www.nature.com/articles/s41587-022-01435-7
