Big Data and the Future of Genomics: How Apache Spark is Revolutionizing Genomic Analyses

The cost of sequencing a human genome has fallen exponentially over the last two decades. However, cheap sequencing has produced mountains of data, making data management and analysis the rate-limiting step.

Limitations of Current Genomic Data Warehousing

Much of the time, energy, and cost of making sense of genomic data stems from the limitations of existing analysis tools. Many are single-node programs, which makes them very challenging to scale, and many are command-line tools, which makes them difficult to link together into complex workflows.

Apache Spark

Fortunately, some companies are building data warehousing platforms to address these challenges. Interestingly, they have taken existing big data tools and applied them to the life sciences, creating unified platforms for genomic analysis built on Apache Spark, a unified analytics engine at the heart of the largest open-source project in data processing.

Open-Source Code for Maximum Speed, Ease and Support

Apache Spark is a general-purpose, open-source, multi-language big data engine that can process up to petabytes of data on clusters of thousands of nodes. In other words, analyses are no longer confined to a single machine. That same distributed architecture also makes Spark well suited to large-scale machine learning. Spark is extremely fast, and its many existing APIs and standard libraries give users plenty of ease of use and support.
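The core pattern Spark distributes is simple: partition a dataset, apply the same function to every partition in parallel, then aggregate the results. Below is a toy sketch of that map/reduce pattern in plain Python standard library code (local processes stand in for cluster nodes, and the GC-counting task is invented for illustration; Spark itself would schedule this work across a real cluster):

```python
# Toy illustration of the partition -> map -> reduce pattern that
# Spark distributes across a cluster. Here we count G/C bases in
# "partitions" of sequencing reads using local worker processes.
from concurrent.futures import ProcessPoolExecutor

def gc_count(partition):
    """Map step: count G and C bases in one partition of reads."""
    return sum(read.count("G") + read.count("C") for read in partition)

def main():
    reads = ["ACGT", "GGCC", "ATAT", "CGCG"]
    # Split the dataset into partitions (Spark does this automatically).
    partitions = [reads[:2], reads[2:]]
    with ProcessPoolExecutor(max_workers=2) as pool:
        partial_counts = list(pool.map(gc_count, partitions))
    # Reduce step: aggregate the per-partition results.
    total = sum(partial_counts)
    print(total)  # prints 10

if __name__ == "__main__":
    main()
```

The appeal of the model is that the per-partition function never changes whether you run on two local processes or thousands of nodes; only the scheduler does.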

Data scientists have taken the capacity, scalability, and speed of the Apache Spark platform and used it to their advantage, building optimized genomic analysis workflows. For example, variant calling is an essential step in transforming raw sequence data into a usable format: reads are aligned to a reference genome and the variants present are identified. Bioinformatics scientists have leveraged Apache Spark to replicate best practices from single-node pipelines across whole clusters of nodes, dramatically speeding up the variant calling workflow.
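One common way such pipelines scale is scatter-gather: split the genome into regions, call variants in each region independently, then merge the results. The sketch below illustrates that structure in plain Python under stated assumptions: the naive "caller" simply flags positions where reads disagree with the reference, and the region splitting is invented for illustration. Real pipelines run statistical callers (e.g., GATK) on aligned BAM/CRAM data, with Spark handling the scatter and gather:

```python
# Toy scatter-gather sketch of parallel variant calling.
# The caller below is a naive stand-in: it reports positions where
# aligned reads disagree with the reference sequence.
from concurrent.futures import ProcessPoolExecutor

REFERENCE = "ACGTACGTAC"

def call_region(args):
    """'Call variants' in one genomic region [start, end)."""
    start, end, reads = args
    variants = []
    for pos in range(start, end):
        observed = {read[pos] for read in reads}
        if observed != {REFERENCE[pos]}:
            variants.append((pos, REFERENCE[pos], sorted(observed - {REFERENCE[pos]})))
    return variants

def scatter_gather(reads, n_regions=2):
    """Scatter the genome into regions, call each in parallel, merge."""
    size = len(REFERENCE) // n_regions
    regions = [(i * size,
                (i + 1) * size if i < n_regions - 1 else len(REFERENCE),
                reads)
               for i in range(n_regions)]
    with ProcessPoolExecutor(max_workers=n_regions) as pool:
        results = pool.map(call_region, regions)
    # Gather: concatenate the per-region variant lists.
    return [v for region_variants in results for v in region_variants]

if __name__ == "__main__":
    # Two toy "reads" covering the reference, both carrying a SNP at position 3.
    reads = ["ACGAACGTAC", "ACGAACGTAC"]
    print(scatter_gather(reads))  # prints [(3, 'T', ['A'])]
```

Because regions are independent, the wall-clock time of the calling step shrinks roughly in proportion to the number of workers, which is exactly the property Spark exploits at cluster scale.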

Save Time, Money, and Blaze New Trails in Bioinformatics

Leveraging open-source tools and cloud computing to build better tools for genomics is essential to realizing the promise that big (genomic) data holds for the life sciences. These tools save time and money by reducing expenditures on cloud compute and storage. Their speed will be vital going forward: as whole genomes and transcriptomes enter clinical settings, clinicians and researchers will need raw data to go from the sequencer to actionable information as quickly as possible. Additionally, because these general-purpose big data platforms are not specialized to a single data subtype, they can integrate different data types, such as single-cell data, gene expression data, and even genomic information combined with clinical observations (i.e., genotype-to-phenotype connections).
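The key primitive behind that genotype-to-phenotype integration is a join on a shared sample identifier. A toy sketch in plain Python follows; in practice this would be a distributed Spark DataFrame join across large tables, and the sample records and field names here are invented for illustration:

```python
# Toy genotype-to-phenotype join on a shared sample_id.
# In Spark this would be a distributed DataFrame join; the records
# and field names below are invented for illustration.

variants = [
    {"sample_id": "S1", "gene": "BRCA1", "variant": "c.68_69delAG"},
    {"sample_id": "S2", "gene": "TP53", "variant": "p.R175H"},
]
phenotypes = [
    {"sample_id": "S1", "diagnosis": "breast carcinoma"},
    {"sample_id": "S2", "diagnosis": "Li-Fraumeni syndrome"},
]

def join_on_sample(variants, phenotypes):
    """Inner-join variant and clinical records on sample_id."""
    pheno_by_id = {p["sample_id"]: p for p in phenotypes}
    joined = []
    for v in variants:
        p = pheno_by_id.get(v["sample_id"])
        if p is not None:
            joined.append({**v, "diagnosis": p["diagnosis"]})
    return joined

for row in join_on_sample(variants, phenotypes):
    print(row)
```

The same one-line idea (`variants.join(phenotypes, on="sample_id")` in DataFrame terms) scales to millions of records once the tables are distributed, which is why a general-purpose engine is so useful for multi-omic integration.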

Accessibility and version control are other important benefits of using mostly open-source tools. These platforms provide notebooks and worked examples that let users learn the platform, view the activity of other users, and apply what they learn immediately to their own datasets. Projects using Apache Spark in the life sciences will continue to help researchers realize the promise of genomic data for advancing medicine and biology.

Importance of Outsourcing Bioinformatic Tasks

Working with service providers like Bridge Informatics for your data infrastructure setup, analysis, and pipeline development needs can save time and money. Our experts can help you sidestep the learning curve by guiding you through every step of your analysis. Book a free discovery call with us to discuss your project needs.


Jane Cook, Journalist & Content Writer, Bridge Informatics

Jane is a Content Writer at Bridge Informatics, a professional services firm that helps biotech customers implement advanced techniques in management and analysis of genomic data. Bridge Informatics focuses on data mining, machine learning, and various bioinformatic techniques to discover biomarkers and companion diagnostics. If you’re interested in reaching out, please email daniel.dacey@old.bridgeinformatics.com or dan.ryder@old.bridgeinformatics.com.

Dan Ryder, Founder & CEO, Bridge Informatics

Dan is the founder of Bridge Informatics, a greater Boston-based consulting firm that focuses on bioinformatics and software development. Experts at Bridge Informatics can help you build tools for life science with a focus on data mining, machine learning, and various bioinformatic techniques to discover biomarkers, drug targets, and companion diagnostics. If you’re interested in reaching out, he can be contacted at dan.ryder@old.bridgeinformatics.com.

Sources:

https://spark.apache.org/

https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

https://databricks.com/session/scaling-genomics-on-apache-spark-by-100x

https://databricks.com/spark/about

