Making the Most of Big Data

Utah Genome Project and USTAR Center for Genetic Discovery investigator Aaron Quinlan and his team have released GQT, a software tool for exploring and querying large data sets of thousands to millions of genomes.

Since the first human genome was completed in 2003, sequenced genomes have accumulated at an exponential rate, with scientists finishing 1,000 genomes by 2013 and 60,000 by 2015, a mere two years later. This astonishing growth in genome sequencing is predicted to continue: over a million new human genome sequences are expected to arrive on the virtual desks of researchers by 2020. This massive influx presents a big data challenge for genomics researchers. Computational tools that were designed to study genetic variation in one or a few genomes fall short when applied to cohorts of hundreds, thousands, and potentially millions of individuals.

The software, Genotype Query Tools (GQT), published today in the journal Nature Methods, reorganizes and compresses genomic datasets to render them searchable, even at a very large scale. In contrast to previous tools, GQT makes it fast and easy to mine the information contained within large genomic datasets to reveal how genetic variation contributes to the health or disease states of individuals, study groups, and populations. For example, a researcher may use GQT to ask, “Which genetic variants are enriched in my group of patients who are under age 30 and have high LDL cholesterol?” or “Which genetic variants differ between individuals with different ancestry?”
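To illustrate the kind of query GQT makes fast, here is a minimal sketch (not GQT's actual code) of the underlying idea: indexing genotypes as per-variant bitmaps, one bit per sample, so that a question like "which variants are enriched in my group of patients?" reduces to a bitwise AND plus a bit count. The genotype values, sample subset, and enrichment threshold below are all hypothetical.

```python
# Genotype matrix: rows = variants, columns = samples.
# 0 = hom-ref, 1 = het, 2 = hom-alt (simplified encoding).
genotypes = [
    [0, 1, 1, 0, 2, 1],  # variant A
    [0, 0, 1, 0, 0, 0],  # variant B
    [2, 1, 2, 1, 2, 2],  # variant C
]

def carrier_bitmap(row):
    """Pack 'carries a non-reference allele' into an integer bitmap."""
    bits = 0
    for sample, gt in enumerate(row):
        if gt > 0:
            bits |= 1 << sample
    return bits

index = [carrier_bitmap(row) for row in genotypes]

# Hypothetical phenotype subset: samples 1, 2, and 4 are the cases
# (e.g. patients under 30 with high LDL cholesterol).
cases = (1 << 1) | (1 << 2) | (1 << 4)
n_cases = 3

# Variants where a majority of the cases carry the alternate allele:
# one AND and one popcount per variant, regardless of cohort size.
enriched = [
    i for i, bits in enumerate(index)
    if bin(bits & cases).count("1") / n_cases > 0.5
]
print(enriched)  # variants A and C -> [0, 2]
```

The point of the bitmap layout is that the per-variant work is a handful of word-wide bitwise operations, which is what lets this style of index scale to cohorts of thousands or millions of samples.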

Daniel MacArthur, a geneticist at the Broad Institute of MIT and Harvard, points out that GQT will be a valuable tool for teasing information from the Exome Aggregation Consortium (ExAC) data set containing over 60,000 genomes. “It has 10 million sites of genetic variation across over 60,000 people, which we study to understand the patterns of genetic variation in the general population as well as the causes of rare diseases. The sheer size of ExAC makes it extremely difficult to ask even simple questions using standard tools,” he explains. “Clever methods like GQT are absolutely essential to interrogate big data sets like this – and they will become even more so as we begin to increase our sample sizes into the hundreds of thousands or millions of people.”

When asked about his goals and vision for GQT, Aaron Quinlan replied, “The ideas behind GQT were conceived in a lab meeting two years ago in which we realized that existing algorithms were ill-suited to the millions of human genomes that were headed our way. First author Ryan Layer realized that by combining insights from computer science and population genetics, we could achieve high data compression and drastically improve analytical performance. Our vision is to develop a scalable and standardized language for exploring datasets produced by efforts such as the Utah Genome Project, UK100K, and the Precision Medicine Initiative.”
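The population-genetics insight referenced above can be sketched as follows (an illustrative toy, not GQT's actual compression format): at most sites, nearly every sample in a cohort is homozygous-reference, so a per-variant carrier bitmap is dominated by long runs of zeros, which run-length-style encodings collapse dramatically.

```python
def rle_encode(bits):
    """Run-length encode a bit string as (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]

# A hypothetical rare variant in a 1,000-sample cohort: only two carriers,
# so the 1,000-bit bitmap is almost entirely runs of zeros.
bitmap = "0" * 400 + "1" + "0" * 598 + "1"
runs = rle_encode(bitmap)
print(runs)       # [('0', 400), ('1', 1), ('0', 598), ('1', 1)]
print(len(runs))  # 4 runs in place of 1,000 raw bits
```

Production bitmap indexes typically use word-aligned variants of this idea so that logical operations can run directly on the compressed form, which is where the performance gains come from.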

This work was published in the journal Nature Methods on November 9, 2015, and is supported by the National Human Genome Research Institute (NHGRI).

Efficient genotype compression and analysis of large genetic-variation data sets

Ryan M Layer, Neil Kindlon, Konrad J Karczewski, Exome Aggregation Consortium & Aaron R Quinlan
Nature Methods (2015) doi:10.1038/nmeth.3654


About the Author:

Mary Anne Karren is Research Manager for the USTAR Center for Genetic Discovery.
