Accelerating Search of Compressed Data

Researchers have found that the redundancy of biological data makes it possible to search the data in compressed form, greatly accelerating database search.

As biological data grow at an exponential rate, computing power and storage can no longer keep pace. For example, genomic data are increasing 10-fold per year, whereas computing speed doubles only every 18 months. “We are currently generating massive datasets—so massive that without smart algorithms, we can’t effectively analyze them,” says Bonnie Berger, PhD, professor of applied mathematics and computer science at MIT.

 

Fortunately, biological data are highly redundant, which means that the amount of novel data entering these ballooning databases is growing at a much slower rate, Berger says. Her team—including graduate student William Yu and postdoc Noah Daniels, PhD—has figured out a way to exploit this redundancy to greatly speed up database search. Their framework, called entropy-scaling search, works directly on compressed data. It is described in a 2015 paper in Cell Systems.

 

“Clearly if you wanted to store massive amounts of data, you would compress it—which many people have done,” Berger says. “But that isn’t enough because eventually we have to look at the data. So we asked ourselves if we could design methods that could search the compressed data.”

 

“Octopus-Like” Data. These data exhibit low metric entropy and low fractal dimension. Though the picture appears 2-D, the data points traverse multi-dimensional space; circles are spheres that define clusters of data. Low metric entropy means that relatively few spheres are required to cover all the data, limiting the size of the coarse search. Fractal dimension describes the number of neighbors each sphere has: A fractal dimension of 1 means that neighboring clusters lie on a straight line; a fractal dimension of 2 means that each cluster is completely surrounded by neighboring clusters in two dimensions; and so on. In real data, this number is fractional—hence the name fractal dimension. Lower fractal dimension means fewer neighbors to search during the fine search. Image courtesy of YW Yu, NM Daniels, DC Danko and B Berger.

Sequence data are easily compressed because genomic sequences are much more similar than they are different. For example, two human genomes differ in only 0.1 percent of their nucleotides. To compress the data, Berger’s team groups closely related sequences into clusters in a way that exploits the structure of biological data, and represents each cluster with just one sequence. To search the compressed data, they compare the query sequence only to the cluster representatives. This coarse search returns a few candidate clusters, which can then be searched more finely.
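
The two-step process is simple enough to sketch in a few lines of code. Below is a minimal Python toy of the idea, not the team’s implementation; the function names, the Hamming-distance stand-in, and the fixed radii are all assumptions made for the example:

    # Toy sketch of coarse-to-fine search over clustered (compressed) data.
    def hamming(a, b):
        """Stand-in distance for equal-length sequences."""
        return sum(x != y for x, y in zip(a, b))

    def coarse_fine_search(query, clusters, radius, cluster_radius, dist=hamming):
        """clusters: list of (representative, members) pairs.
        The coarse pass compares the query to representatives only; the
        fine pass scans members of clusters whose representative lies
        within radius + cluster_radius. When dist is a metric, the
        triangle inequality guarantees no true match is skipped."""
        hits = []
        for rep, members in clusters:  # coarse pass: one check per cluster
            if dist(query, rep) <= radius + cluster_radius:
                hits += [s for s in members if dist(query, s) <= radius]  # fine pass
        return hits

    clusters = [("ACGTACGT", ["ACGTACGT", "ACGTACGA", "ACGAACGT"]),
                ("TTTTCCCC", ["TTTTCCCC", "TTTTCCCA"])]
    print(coarse_fine_search("ACGTACGG", clusters, radius=1, cluster_radius=1))

The coarse pass costs one comparison per cluster, and the fine pass touches only the clusters near the query; the two properties described next bound exactly those two costs.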

 

The search time depends on (1) how many clusters have to be checked during the coarse search and (2) how many neighboring clusters have to be examined during the fine search. Berger’s team showed that biological data have two key properties that limit these two quantities, respectively: low metric entropy and low fractal dimension.

 

Data with low metric entropy are highly redundant and thus can be represented by a small number of clusters. Take sequence data: Because only a tiny fraction of all possible genomic sequences actually exist in nature, relatively few clusters are required to cover all the data.
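
Metric entropy can be made concrete with a greedy covering: repeatedly pick an uncovered point as a new representative and absorb everything within a fixed radius; the number of clusters that results is a practical proxy for the metric entropy. The sketch below is again an illustrative Python toy with invented names, not necessarily how the team builds its clusters:

    # Greedy ball cover: redundant data collapse into few clusters.
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def greedy_cover(points, radius, dist=hamming):
        """Return (representative, members) clusters covering all points."""
        clusters, remaining = [], list(points)
        while remaining:
            rep = remaining.pop(0)  # first uncovered point becomes a representative
            members = [rep] + [p for p in remaining if dist(rep, p) <= radius]
            remaining = [p for p in remaining if dist(rep, p) > radius]
            clusters.append((rep, members))
        return clusters

    seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGC",
            "TTTTCCCC", "TTTTCCCA", "TTTTCCGC"]
    print(len(seqs), "sequences collapse into",
          len(greedy_cover(seqs, radius=2)), "clusters")  # 7 -> 2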

 

During the fine search, the best-matching cluster and all neighboring clusters have to be searched. Fractal dimension describes the number of neighbors each cluster has. Lower fractal dimension means fewer neighbors to search. Fortunately, biological data have a fairly low fractal dimension because evolution tends to trace out relatively linear paths (see figure). “Clusters largely tend to extend along the branches of the tree rather than in all directions,” Berger explains.
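
That growth rate can be estimated straight from the data: if doubling the radius around a point multiplies its neighbor count by 2^d, the local fractal dimension is d. The following Python snippet illustrates that standard doubling estimate (it is not code from the paper); points strung along a line behave like clusters along a tree branch, while space-filling points do not:

    import math

    def local_fractal_dimension(point, points, r, dist):
        """Estimate d from the neighborhood growth rate n(2r)/n(r) = 2**d."""
        n_r  = sum(dist(point, p) <= r for p in points)
        n_2r = sum(dist(point, p) <= 2 * r for p in points)
        return math.log2(n_2r / n_r)

    cheby = lambda a, b: max(abs(a[0] - b[0]), abs(a[1] - b[1]))
    line  = [(x, 0) for x in range(-50, 51)]                          # branch-like
    plane = [(x, y) for x in range(-25, 26) for y in range(-25, 26)]  # space-filling
    print(local_fractal_dimension((0, 0), line,  10, cheby))   # ~1.0
    print(local_fractal_dimension((0, 0), plane, 10, cheby))   # ~1.9, near 2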

 

Berger’s team showed that their compression and search framework is effective for any data that exhibit low metric entropy and low fractal dimension. Thus, potential applications extend well beyond sequence search. In the Cell Systems paper, the team demonstrates orders-of-magnitude speedups for searching databases of chemical compounds, metagenomes, and protein structures.

 

PubChem is a comprehensive database of 60 million small molecules used for tasks such as drug repositioning. Until now, it was infeasible to perform even a one-molecule search of all of PubChem on a typical desktop computer. So Berger’s team clustered the chemical compounds in PubChem based on the geometric similarity of chemical motifs and then applied their two-step search process to these data. Compared with the commonly used search tool SMSD (Small Molecule Subgraph Detector), they achieved a 150-fold speedup with 92 percent accuracy.

 

The team’s framework can be wrapped around common search tools, such as SMSD for small-molecule search, BLAST for DNA sequence search, and PSI-BLAST for protein sequence search. “The cool thing about all our tools is that they plug right into existing pipelines,” Berger says.
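
In code, “wrapping” an existing tool simply means using it, unchanged, as the fine-search step. A minimal sketch, assuming a generic run_tool callable that stands in for whatever program is wrapped (an assumption for illustration, not a real API):

    # Wrapping an off-the-shelf search tool: run_tool is a placeholder
    # callable, e.g. a thin wrapper that shells out to BLAST or SMSD.
    def wrapped_search(query, clusters, radius, cluster_radius, dist, run_tool):
        """Coarse pass over cluster representatives, then hand only the
        shortlisted members to the existing, unmodified tool."""
        candidates = []
        for rep, members in clusters:
            if dist(query, rep) <= radius + cluster_radius:  # coarse filter
                candidates.extend(members)
        return run_tool(query, candidates)  # fine pass, done by the wrapped tool

    # Example with a trivial exact-match "tool":
    dist = lambda a, b: sum(x != y for x, y in zip(a, b))
    exact = lambda q, cands: [c for c in cands if c == q]
    clusters = [("AAAA", ["AAAA", "AAAT"]), ("GGGG", ["GGGG"])]
    print(wrapped_search("AAAT", clusters, 0, 1, dist, exact))  # ['AAAT']

Because the wrapped tool only ever sees the shortlist, existing pipelines keep their familiar inputs and outputs; only the database handed to the tool shrinks.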

 

Berger’s team has made their tools openly available at http://cast.csail.mit.edu/, and other groups have begun building on them, she says. “This whole area of compressive algorithms is really taking off because it’s absolutely necessary.”

 


