Biology’s Big Problem: There’s Too Much Data to Handle
Twenty years ago, sequencing the human genome was one of the most ambitious science projects ever attempted. Today, each human genome, which easily fits on a DVD, looks comparatively simple beside the collected genomes of the microorganisms living in our bodies, the ocean, the soil and elsewhere. Its 3 billion DNA base pairs and about 20,000 genes seem paltry next to the roughly 100 billion bases and millions of genes that make up the microbes found in the human body.
And a host of other variables accompanies that microbial DNA, including the age and health status of the microbial host, when and where the sample was collected, and how it was collected and processed. Take the mouth, populated by hundreds of species of microbes, with as many as tens of thousands of organisms living on each tooth. Beyond the challenges of analyzing all of these, scientists need to figure out how to reliably and reproducibly characterize the environment where they collect the data.
“There are the clinical measurements that periodontists use to describe the gum pocket, chemical measurements, the composition of fluid in the pocket, immunological measures,” said David Relman, a physician and microbiologist at Stanford University who studies the human microbiome. “It gets complex really fast.”
Ambitious attempts to study complex systems like the human microbiome mark biology’s arrival in the world of big data. The life sciences have long been considered a descriptive science — 10 years ago, the field was relatively data poor, and scientists could easily keep up with the data they generated. But with advances in genomics, imaging and other technologies, biologists are now generating data at crushing speeds.
One culprit is DNA sequencing, whose costs began to plunge about five years ago, falling even more quickly than the cost of computer chips. Since then, thousands of human genomes, along with those of thousands of other organisms, including plants, animals and microbes, have been deciphered. Public genome repositories, such as the one maintained by the National Center for Biotechnology Information, or NCBI, already house petabytes — millions of gigabytes — of data, and biologists around the world are churning out 15 petabases (a base is a letter of DNA) of sequence per year. If these were stored on regular DVDs, the resulting stack would be 2.2 miles tall.
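The DVD-stack comparison can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-envelope estimate: the one-byte-per-base encoding, the 4.7 GB single-layer disc capacity and the 1.2 mm disc thickness are assumptions introduced here, not figures from the article, and the function name is purely illustrative.

```python
# Back-of-envelope check of the DVD-stack comparison.
# Assumptions (not from the article): one byte per base,
# a 4.7 GB single-layer DVD, and a disc thickness of 1.2 mm.

BASES_PER_YEAR = 15e15        # 15 petabases per year, as quoted in the article
BYTES_PER_BASE = 1            # assumption: one text character per base
DVD_CAPACITY_BYTES = 4.7e9    # assumption: single-layer DVD
DVD_THICKNESS_MM = 1.2        # assumption: standard disc thickness
MM_PER_MILE = 1_609_344       # 1 mile = 1,609.344 m


def dvd_stack_miles(bases, bytes_per_base=BYTES_PER_BASE):
    """Return the height, in miles, of a stack of DVDs holding `bases` bases."""
    dvds = bases * bytes_per_base / DVD_CAPACITY_BYTES
    return dvds * DVD_THICKNESS_MM / MM_PER_MILE


if __name__ == "__main__":
    print(f"{dvd_stack_miles(BASES_PER_YEAR):.1f} miles")
```

Under these assumptions the stack comes out at a little over two miles, in the same range as the figure quoted above. The result is sensitive to the encoding: packing each base into two bits (`bytes_per_base=0.25`) would shrink the stack fourfold.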
“The life sciences are becoming a big data enterprise,” said Eric Green, director of the National Human Genome Research Institute in Bethesda, Md. In a short period of time, he said, biologists are finding themselves unable to extract full value from the large amounts of data becoming available.
Solving that bottleneck has enormous implications for human health and the environment. A deeper understanding of the microbial menagerie inhabiting our bodies and how those populations change with disease could provide new insight into Crohn’s disease, allergies, obesity and other disorders, and suggest new avenues for treatment. Soil microbes are a rich source of natural products like antibiotics and could play a role in developing crops that are hardier and more efficient.
Life scientists are embarking on countless other big data projects, including efforts to analyze the genomes of many cancers, to map the human brain, and to develop better biofuels and other crops. (The wheat genome is more than five times larger than the human genome, and it has six copies of every chromosome to our two.)
However, these efforts are encountering some of the same criticisms that surrounded the Human Genome Project. Some have questioned whether massive projects, which necessarily take some funding away from smaller, individual grants, are worth the trade-off. Big data efforts have almost invariably generated data that is more complicated than scientists had expected, leading some to question the wisdom of funding projects to create more data before the data that already exists is properly understood. “It’s easier to keep doing what we are doing on a larger and larger scale than to try and think critically and ask deeper questions,” said Kenneth Weiss, a biologist at Pennsylvania State University.
Unlike fields such as physics, astronomy and computer science, which have been dealing with the challenges of massive datasets for decades, biology has gone through its big data revolution quickly, leaving little time to adapt.
“The revolution that happened in next-generation sequencing and biotechnology is unprecedented,” said Jaroslaw Zola, a computer engineer at Rutgers University in New Jersey, who specializes in computational biology.
Biologists must overcome a number of hurdles, from storing and moving data to integrating and analyzing it, which will require a substantial cultural shift. “Most people who know the disciplines don’t necessarily know how to handle big data,” Green said. If they are to make efficient use of the avalanche of data, that will have to change.
When scientists first set out to sequence the human genome, the bulk of the work was carried out by a handful of large-scale sequencing centers. But the plummeting cost of genome sequencing helped democratize the field. Many labs can now afford to buy a genome sequencer, adding to the mountain of genomic information available for analysis. The distributed nature of genomic data has created its own challenges, including a patchwork of data that is difficult to aggregate and analyze. “In physics, a lot of effort is organized around a few big colliders,” said Michael Schatz, a computational biologist at Cold Spring Harbor Laboratory in New York. “In biology, there are something like 1,000 sequencing centers around the world. Some have one instrument, some have hundreds.”