Top 10 Retrospective: Reflections on a Decade of Biomedical Computing

Experts reflect on challenges identified ten years ago.

The first issue of this magazine (June 2005) featured a story called “Top Ten Challenges of the Next Decade” written by Eric Jakobsson, PhD, who had recently left his position as Director of the Center for Bioinformatics and Computational Biology in the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH).

 

Today, as we near the end of that decade, we’ve asked ten domain experts to weigh in:  How well has each of these challenges been met? And, with the benefit of hindsight, were they the right challenges in the first place?

 

CHALLENGE 1

In Silico Screening of Drug Compounds


Status 10 years ago:  

In 2005, Jakobsson hoped the next ten years would see researchers advance our ability to “predict the efficacy and side effects of lead compounds using computer modeling and simulation,” thereby reducing the need for human testing while also saving time and money spent in the laboratory.


Update by:

Arthur Olson, PhD, professor in the Department of Integrative Structural and Computational Biology at the Scripps Research Institute

 

Progress made:

We’ve made a lot of progress in terms of how many people are doing in silico screening. There seems to be a larger and larger community of people doing virtual screens, many of whom are not computational chemists. The tools have improved because the toolmakers have had to respond to the demands of all these users. The chemical libraries have become larger, better characterized, and more focused. The amount of peer-reviewed science using virtual screening published over the past 10 years has also been staggering. I believe that structure-based drug design has informed the development of many of the new drugs that have come out in recent years. I’m guessing that this was the case with the hepatitis C antiviral drug from Gilead, which made the news recently as a cure for the disease.

 

In terms of specific advancements, we’ve improved the ability to rank the results of screening. We do broad screens using quick docking methods and then pass the top candidates along for evaluation with more computationally intensive methods, such as calculating molecular dynamics-based binding free energies. While the basic theoretical framework hasn’t changed that much in the past 10 years, calculations that were too difficult 10 years ago are now feasible because computing has become so much more powerful and available. The docking algorithms have also gotten incrementally better. For example, we’ve improved how we model water during a docking calculation; this can make a significant difference in which poses are selected. We’re also making better use of parallel computing: a molecular dynamics analysis can be broken up into multiple runs that exchange information, which improves both sampling and throughput.
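To make the two-stage funnel Olson describes concrete, here is a minimal Python sketch of the idea: a cheap docking score ranks the whole library, and only the top-ranked candidates are passed to a more expensive rescoring step. The function names (quick_dock_score, md_binding_free_energy) are hypothetical placeholders rather than any particular software package.

# A minimal sketch of the two-stage screening funnel described above.
# The scoring functions are hypothetical placeholders; in practice they would
# wrap a fast docking program and an MD-based free-energy calculation.

import random

def quick_dock_score(compound):
    """Cheap, approximate docking score (lower is better). Placeholder."""
    return random.uniform(-12.0, 0.0)

def md_binding_free_energy(compound):
    """Expensive, more accurate rescoring step. Placeholder."""
    return random.uniform(-15.0, 0.0)

def screen(library, top_fraction=0.01):
    # Stage 1: rank the whole library with the fast method.
    ranked = sorted(library, key=quick_dock_score)
    top_n = max(1, int(len(ranked) * top_fraction))
    candidates = ranked[:top_n]
    # Stage 2: rescore only the top candidates with the expensive method.
    return sorted(candidates, key=md_binding_free_energy)

if __name__ == "__main__":
    library = ["compound_%d" % i for i in range(10000)]
    print(screen(library)[:5])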


Challenges ahead:

We still face the challenge of designing synthetic drugs that modulate protein-protein interactions. Most successful small molecule drugs at this point have been targeted to individual protein active sites. While solving this problem won’t require any new physics, it will require new algorithms that can model complex interactions efficiently.


20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? Yes, this was the right challenge a decade ago, and it’s still the right challenge for the next decade. The payoff could be very, very large in terms of human health.

 


CHALLENGE 2

Predicting Function From Structure of Complex Molecules at an Engineering Level of Precision

 

Status 10 years ago:

In 2005, molecular simulation and analysis methods could “capture the essence of the mechanism of biomolecular function, but could not predict that function with quantitative accuracy,” Jakobsson wrote. He hoped the decade would lead to improved capability in this regard, enabling a precise understanding of the consequences of mutations and other biological variations, and the ability to design molecules for medical nanotechnology.

 

Update by:

Predrag Radivojac, PhD, associate professor of computer science and informatics at Indiana University

 

Progress made:

Strictly speaking, I think this particular challenge has not been met. Since 2005, we have broadened the concept of function tremendously and now understand its “breadth” a lot better. Today, we think of function in more specific terms (such as whether a residue binds to a protein or DNA) and at more levels (for example, a protein may participate in a specific reaction, in the cell cycle, or in a disease). As a result, we have not reached this goal, in part because of the many new challenges we have discovered along the way.

 

Still, we have made a lot of progress in the past 10 years. We can now predict many aspects of function surprisingly accurately, such as certain metal-binding residues, catalytic residues, ligand-binding sites, and protein-DNA binding sites. Each of these aspects of function calls for somewhat specialized methods; there’s no silver bullet that addresses all of them. But each of these little sub-fields has pushed things forward. I believe that in the next 10 years we will be able to deliver on the goal of predicting the consequences of mutations and sequence variants, and we will see some fascinating discoveries.

 

There have also been individual success stories in which researchers achieved an engineering level of precision in function prediction. For example, David Baker’s group used structural modeling to design, de novo, a protein with a particular functionality: an enzyme with increased catalytic activity. This is exactly what this challenge had in mind.

 

Challenges ahead:

The 2005 article does not talk about the fact that proteins are dynamic molecules. To predict function at an engineering level of precision, we will have to have some sort of dynamic models at both the micro and macro levels, including large, irregular movements. And this will require advances in mathematical, computational, and physical approaches. Current methods do not scale. We cannot model the motions of proteins at the appropriate granularity and length of time to extract the signatures of motion that would be predictive of function. Another important challenge is that structure data are noisy, reflecting many experimental artifacts. We have to find the right statistical and machine-learning approaches to model the uncertainty in the structure data, and then integrate them with other types of data in order to infer function.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? This challenge was the right one. I would have refined it slightly, to: predicting function from structure and dynamics of complex molecules. We discovered that we have a lot of sub-problems to solve. The stars need to be aligned for us to be able to deliver on this challenge. But I definitely think it was the right challenge.

 


CHALLENGE 3

Prediction of Protein Structure

 

Status 10 years ago:

In 2005, Jakobsson noted that there were many more known protein sequences than structures.  He hoped that through a combination of accelerated experimental structure determination and improved techniques for mining known structures to determine the rules for predicting unknown structures, researchers would gain the ability to assign a structure to every sequence. Jakobsson believed this achievement would advance the field of biomedicine in many ways.

 

Update by:

Adam Godzik, PhD, professor and program director of bioinformatics and systems biology at the Sanford Burnham Medical Research Institute

 

Progress made:

I think this is the challenge where the most progress has been made. We don’t have a tool that works every time; but, compared with 10 years ago, the progress has been amazing. Ten years ago, you would look at predicted structures and just cringe; now some of them are as good as real.

 

Much of the progress is due to David Baker’s efforts with the Rosetta algorithm for energy-based predictions. The tool doesn’t work for every case, but when it works, it works fantastically. The second big thing that happened is that people started to realize the importance of distant homology prediction (finding sequence relationships with proteins that have already been characterized experimentally). This approach is actually much more powerful, because in addition to giving clues about 3D structure, it also tells you what the protein does. With advances using hidden Markov models, we can now recognize much more distant relationships than we could 10 years ago.
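As a rough illustration of why profile-based methods detect more distant relatives than simple pairwise comparison, here is a toy Python sketch that builds a position-specific log-odds profile from a small alignment and scores query sequences against it. Real tools use full profile hidden Markov models with insert and delete states and calibrated statistics; this simplified position-specific scoring matrix only conveys the core idea, and the example sequences are invented.

# Toy profile-based homology scoring. Real tools use profile hidden Markov
# models with insert/delete states; this sketch only builds per-column
# log-odds scores from an ungapped alignment and scores a query against them.

import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)     # uniform background, a simplification

def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-odds scores from an ungapped alignment."""
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log((counts.get(aa, 0) + pseudocount) / total / BACKGROUND)
            for aa in ALPHABET
        })
    return profile

def score(profile, query):
    return sum(col[aa] for col, aa in zip(profile, query))

family = ["MKVLAT", "MKILAT", "MRVLAS", "MKVMAT"]   # invented mini-alignment
profile = build_profile(family)
print(score(profile, "MKVLAT"))   # a close relative scores high
print(score(profile, "GGGGGG"))   # an unrelated sequence scores low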

 

The CASP (Critical Assessment of Protein Structure Prediction) experiments also gave the field an enormous push, because blind tests and judges allow you to actually see what is working and what is not.

 

Challenges ahead:

Currently the energy-based methods work well for cases where one energy term dominates, but often get it wrong when there are multiple opposing forces. We’d like to advance this to the point where it works in every case. For distant homology prediction, we’re still missing a lot. Sometimes after a structure is solved experimentally, we realize that there were homologies we missed.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? I think it was perfect. And, because of CASP, the progress has been verified every year.

 

 

CHALLENGE 4

Accurate, Efficient, and Comprehensive Dynamic Models of the Spread of Infectious Disease

 

Status 10 years ago:

In 2005, Jakobsson saw an opportunity for modeling to take advantage of the extensive data that had been gathered on the spread of infectious disease and the consequences of various strategies of intervention. Such models, he hoped, would provide a basis for rational, informed, real-time decision making in combating natural epidemics and bioterrorist attacks.

 

Update by:

Stephen Eubank, PhD, professor in the Virginia Bioinformatics Institute and Population Health Sciences department at Virginia Tech.

 

Progress made:

There are three big areas where there have been some substantial changes: (1) surveillance (what feeds into the models), (2) the models themselves, and (3) the use of modeling evidence to inform decision making in government agencies. 

 

In the past decade, the scope of surveillance has broadened from simple factors, like vaccination, to more complex social behaviors such as whether or not people stay home from work when they’re sick. We are beginning to get a better handle on measuring people’s behaviors and how they change during an outbreak.

 

The models themselves have also advanced. Models 10 years ago usually assumed homogeneously mixed populations. Now we’re using high-resolution network-based models that represent every single person in a large region. People come and go, and they change their behaviors in reaction to things they hear on the news or their perceptions of what’s going on around them. So the system’s not stationary and it’s not well mixed. And the new network-based models are able to take both of those things into account.
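A minimal sketch of what a network-based model looks like in code, as opposed to a homogeneously mixed one, is a stochastic SIR process running over an explicit contact network. The sketch below assumes the networkx package is available and uses a stylized synthetic network; real platforms use far richer, data-driven populations and behavioral rules.

# A minimal stochastic SIR simulation on a contact network. The network,
# parameters, and seeding are illustrative only.

import random
import networkx as nx

def run_sir(G, beta=0.05, gamma=0.2, seeds=5, max_steps=200):
    status = {node: "S" for node in G}                 # S, I, or R per person
    for node in random.sample(list(G.nodes), seeds):
        status[node] = "I"
    history = []
    for _ in range(max_steps):
        new_status = dict(status)
        for node, state in status.items():
            if state == "I":
                # Each infectious person may transmit to susceptible contacts...
                for neighbor in G.neighbors(node):
                    if status[neighbor] == "S" and random.random() < beta:
                        new_status[neighbor] = "I"
                # ...and recovers with probability gamma per time step.
                if random.random() < gamma:
                    new_status[node] = "R"
        status = new_status
        infected = sum(1 for s in status.values() if s == "I")
        history.append(infected)
        if infected == 0:
            break
    return history

G = nx.watts_strogatz_graph(n=10000, k=10, p=0.1)      # stylized contact network
print(run_sir(G)[:20])                                  # infections per time step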

 

Finally, we’ve made a lot of progress in getting decision makers to pay attention to what the models say.  One of the things that might have made people hesitate before is that the models were so generic it didn’t seem as if they could really be applied to any specific circumstance. But by creating these highly resolved models that are representative of particular places and particular outbreaks, I think we’ve managed to convince folks that, yes indeed, these models should be taken seriously. 

 

Challenges ahead:

I think the jury is still out on a lot of the new surveillance techniques. There’s something there, and there has got to be some way to use the flood of information coming at us from sources like social media, but I don’t think we’ve perfected that art yet. We also need faster turnaround on traditional disease-reporting surveillance, which hasn’t been brought into the electronic era. There’s still a one- to two-week delay in getting really good, accurate information from emergency departments and clinics up to a scale where the modelers can get hold of it.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? I think it was right on the money.

 

 

CHALLENGE 5

Intelligent Systems for Mining Biomedical Literature

 

Status 10 years ago:

In 2005 there existed no efficient and effective way to organize data from biomedical literature into computable databases from which accurate interpretive and predictive models could be constructed. Jakobsson hoped the ensuing decade would see better access to the abundance of information about the functioning of genes, gene products, and cells that was then buried in published papers.

 

Update by:

Graciela Gonzalez, PhD, associate professor of biomedical informatics at Arizona State University

 
Progress made:

In the past decade, there has been significant progress on this challenge. We’ve advanced the furthest in our ability to recognize named entities (specific genes, diseases, chemicals, drugs, and other entities) mentioned in biomedical text. For many entities, the problem is considered pretty much solved. By retraining machine-learning-based tools such as our system, BANNER, or others like it, one can build an entity recognition system with little effort. We’ve also progressed along the next step in the pipeline: entity normalization. For example, once you find a gene, you need to know exactly which gene the text is referring to out of multiple possible mappings (such as homologues in different species). There are different systems available for different entities. For example, the NIH’s National Center for Biotechnology Information (NCBI) recently released DNorm (Disease Name Normalization), an automated tool for determining the specific diseases mentioned in biomedical texts.

Finding and normalizing entities are key steps toward enabling intelligent systems. There has also been significant progress toward integration of data from the literature with experimental results. For example, the NCBI links GenBank sequence data or genome data in the GEO database back to the literature in PubMed.
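To make the recognition and normalization steps concrete, here is a toy Python sketch that finds gene mentions with a dictionary lookup and maps each mention to an identifier. The lexicon and identifiers are illustrative only; production systems such as BANNER and DNorm rely on trained machine-learning models rather than simple dictionaries.

# Toy two-step pipeline: (1) recognize entity mentions, (2) normalize each
# mention to a database identifier. The lexicon below is illustrative only.

import re

GENE_LEXICON = {
    "brca1": "HGNC:1100",
    "tp53": "HGNC:11998",
    "p53": "HGNC:11998",     # a synonym resolves to the same identifier
}

def recognize(text):
    """Step 1: find candidate gene mentions in free text."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, GENE_LEXICON)) + r")\b", re.IGNORECASE)
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]

def normalize(mention):
    """Step 2: map a mention to a single identifier."""
    return GENE_LEXICON.get(mention.lower())

sentence = "Mutations in BRCA1 and p53 are implicated in many cancers."
for mention, offset in recognize(sentence):
    print(mention, offset, normalize(mention))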

 

Challenges ahead:

Where the challenge remains is in connecting all these pieces of knowledge into larger, complex systems and hunting out causal relationships between entities. When I moved to this field in 2005, we wanted to use text-mining tools to model and make inferences from biological pathways, such as protein-protein interaction pathways; but we couldn’t do that because there was no system available to extract them. This still remains a challenge. Nobody has solved the problem of pathway extraction. The NCBI links notwithstanding, the challenge remains how to coherently integrate all the experimental data being produced, such as whole-genome sequencing data, with knowledge from the literature so a scientist can automatically home in on support for or against a hypothesis or novel theory. In short, there is still a large gap from data to discovery.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? I think this was the right challenge because the literature remains a major source of valuable yet hidden knowledge. Every paper that’s written represents months or even years of work by a team of scientists. But it’s practically impossible to find them all without a lot of help from an automated system.

 

 

CHALLENGE 6

Complete Annotation of the Genomes

 

Status 10 years ago:

In 2005, so-called “complete” genomes were far from “complete,” Jakobsson wrote. He hoped the research community would select eukaryotic and prokaryotic model organisms for a focused attack on complete annotation, and use all experimental, bioinformatics, and data-mining tools on these organisms. As a sequel to complete annotation, he challenged researchers to elucidate the target organisms’ complete metabolic, signaling, and homeostatic pathways and networks.

 

Update by:

Terry Gaasterland, PhD, professor of computational biology and genomics and director of the Scripps Genome Center at the University of California, San Diego


Progress made:

In the past ten years, the genomes of many different species have been sequenced, and the 10,000 genomes project headed by David Haussler at the University of California, Santa Cruz, is making progress toward sequencing the genomes of many more. But perhaps the biggest thing that we have done in the past decade is to become capable of dealing with incomplete genomes. We’ve started to understand that no genome is ever truly complete. Every individual has its own genome and all these genomes are inter-related, so we have local variation, and we have large-scale variation. The best we can do are snapshots and draft genomes. As a community, we’re becoming comfortable with this and even building tools to leverage this knowledge.

 

The single most important contribution to genomics over the last 10 years, beyond the data, is the one-stop shopping that has emerged through the UCSC genome browser project. The community needed a common way to view, manipulate, and manage genome data; and David Haussler’s team built that. They’re providing production-quality comparisons and calculations across many prokaryotic and eukaryotic genomes. Also, in the past decade, the development of short-read high-throughput sequencing has been of utmost importance. The community has exhibited such creativity in using high-throughput sequencing to sequence anything from the genomes of new organisms to RNA to DNA binding sites to nascent transcripts.

 

There has also been enormous progress toward elucidating target organisms’ complete metabolic networks. For example, in 2007, Bernard Palsson’s lab at the University of California, San Diego, published a detailed in silico model of human metabolism. Using that model and others, researchers can simulate the effect of virtual knockouts as a prelude to laboratory testing.
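The virtual-knockout experiments described above rest on flux balance analysis, which at its core is a small linear program: maximize an objective flux subject to steady-state mass balance and flux bounds. The toy three-reaction network below is invented purely for illustration; genome-scale reconstructions contain thousands of reactions and are usually handled with dedicated packages rather than a hand-built matrix.

# Minimal flux balance analysis sketch. A "virtual knockout" is simulated by
# forcing a reaction's flux bounds to zero. The network is a toy example.

import numpy as np
from scipy.optimize import linprog

# Reactions: R1 (uptake -> A), R2 (A -> B), R3 (B -> biomass).
# Rows are metabolites A and B; steady state requires S @ v = 0.
S = np.array([
    [1.0, -1.0,  0.0],   # metabolite A
    [0.0,  1.0, -1.0],   # metabolite B
])

def max_biomass(knockouts=()):
    bounds = [(0, 10), (0, 1000), (0, 1000)]     # flux bounds per reaction
    for i in knockouts:
        bounds[i] = (0, 0)                       # knockout: no flux allowed
    # linprog minimizes, so negate the biomass flux (v3) to maximize it.
    result = linprog(c=[0, 0, -1], A_eq=S, b_eq=[0, 0], bounds=bounds)
    return -result.fun

print("wild-type biomass flux:", max_biomass())        # expect ~10
print("R2-knockout biomass flux:", max_biomass((1,)))  # expect ~0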

 

Challenges ahead:

Ever bigger datasets need ever faster and more efficient data storage arrays. The San Diego Supercomputer Center at the University of California, San Diego, presents a prototype of how to provide this kind of computing power to a local community. What we’ve done here is that we’ve all bought pieces of the larger system. For example, I spent $20,000 to buy nodes, and because of that I have access to a two-million-dollar computer. I’d love to see this happen over and over again at many universities. Another challenge is clinical phenotyping. For annotating the human genome in a disease-aware way, the computational biologists have to be in lockstep with the physicians.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? Because we’ve realized that no genome is ever complete, its annotation can never be complete either. In that sense, Jakobsson’s goal was inherently unattainable. Nevertheless, I think it was exactly the right challenge. You lay out the ideal, you shoot for Mars, and you might get to the moon.

 

 

CHALLENGE 7

Improved Computerization of the Healthcare Delivery System

 

Status 10 years ago:

In 2005, Jakobsson wrote about his concern that “the relatively primitive information technology environment supporting the delivery of health care” resulted in extra expense and avoidable error. He called for “a nationally interoperable system of medical records to support transferable patient records, diagnosis and treatment based on integrating the patient record with relevant basic and clinical knowledge, and efficient patient monitoring.” The deployment of personalized medicine would be, he believed, a logical consequence and extension of this computerization.

 

 

Update by:

Lucila Ohno-Machado, MD, PhD, professor of medicine, University of California, San Diego

 

Progress made:

Of the top-ten challenges Jakobsson listed in 2005, this is one of the areas in which the most progress has been made. In the past decade, electronic health records (EHRs) have been widely adopted, thanks to large investments from the government. In 2005, we were talking about institutions not even having electronic health records; now we’re talking about what to do with them. There are still challenges, but we are at the next level now. So, it’s a very exciting time in our field.

 

The next major breakthrough is also on the horizon. In late December, the Patient-Centered Outcomes Research Institute (PCORI) awarded $93.5 million for the creation of PCORnet, the National Patient-Centered Clinical Research Network. The network will securely link EHR data for millions of patients, which will enable large-scale comparative effectiveness research—figuring out which types of medical care work best. Someday, EHR data may even be linked to bio-samples, such as DNA sequencing data or proteomics, with an eye toward personalized medicine. With huge numbers of patients, we will be able to correlate responses to particular therapies with very specific biomarker profiles.  

 

Challenges ahead:

The technology for enabling preservation of privacy has evolved a lot, and that’s removed many barriers. But we still need to improve data quality and standardization. We also need to promote a broad understanding among patients, clinicians, administrators, and researchers of what it takes to make these data useful.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? Yes, this was the right challenge for the past decade. But it will also remain a key challenge for the next decade. The starting point will not be as primitive, but the challenge will still be there. It’s not going to be solved overnight.

 


CHALLENGE 8

Integrating Computational Tools to Make Systems Biology a Reality

 

Status 10 years ago:

In 2005, Jakobsson observed that “many useful tools for systems biology have been created, but they are not integrated into computational environments that provide for automatic interaction of multiple programs and functionalities to address generally useful issues in biomedicine.” The tools themselves also needed improvement in their scope of applicability, computational efficiency, and ease of use, he wrote. The aim: a much-needed computational environment for information-based modeling of pathways, networks, cells, and tissues.

 

Update by:

Markus Covert, PhD, associate professor of bioengineering at Stanford University

 

Progress made:

There has been a lot of motion in this space, particularly in going from pathways to cells and from cells to tissues. I wouldn’t say this challenge has been accomplished, but it’s going well. I remember that during the first funding initiative on multiscale modeling, that term was still being defined, even at the programmatic level. But I don’t think people would have that same confusion now.

 

In terms of tool integration, we still don’t have a unified, integrated, seamless situation. Centers for systems biology are bringing different professors together, but there isn’t a one-stop shop where you can find all the tools you need for your modeling interests. People have tried to start a biomodels database, but it’s challenging because you don’t always know in advance what you will want to store. So it’s still largely up to individual teams to make their models widely accessible.

 

Along these lines, we developed a comprehensive whole-cell model that predicts phenotype from genotype (Cell, July 2012). For this model, we’ve been trying to give people access at a variety of levels. We’ve made a knowledge base that is structured to hold all the information that you would need to run a model.  

 

Challenges ahead:

Many problems could be solved if systems biology would reach even further outside of itself. It’s already a highly interdisciplinary field, but we need to take another major step forward that would literally involve talking to people who you think you have nothing in common with. For example, systems biology tools could be greatly improved by an influx of industry talent. The best coding in the world is not happening in systems biology; it’s happening at companies like Google and Facebook. For our whole-cell model, we hired a software engineer from Google for six months; and I was very impressed by how much we needed that software help. I have also realized that we have a lot of visualization tools that can be used for education, but few that can be used for exploration and discovery. To develop these more sophisticated visualization tools, we’re going to need artists and graphic designers, as well as coders.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? This challenge was not the best specified one, probably out of necessity, but Jakobsson definitely did show some foresight. We’re in that space now; we’re progressing along the vector he outlined.

 


CHALLENGE 9

Tuning Biomedical Computing Software to Computer Hardware

 

Status 10 years ago:

In 2005, biomedicine used substantial computing resources at all levels, from the desktop to high-end supercomputing centers, but “a large fraction of these resources are not efficiently used, as the hardware and software are not tuned to each other,” Jakobsson wrote. He believed that addressing this problem would allow research to advance more rapidly.

 

Update by:

Vijay Pande, PhD, professor of chemistry, structural biology and computer science at Stanford University

 

Progress made:

This challenge is an ongoing issue. People have gotten much better at tuning software to hardware, but the challenge has gotten even harder. As time goes on, the hardware is getting more heterogeneous, and getting the best performance out of it requires more effort. So, as people have advanced on this challenge, the goalposts have been pushed back as well.

 

On the hardware side, the key breakthroughs are advances in graphics processing units (GPUs) and in how people handle large amounts of memory. But GPUs are very specialized, and it is a challenge to take an algorithm and get it running on them quickly. We and others have been trying to push the area of domain-specific languages, which are intentionally not general purpose and can easily be ported to GPUs. This approach has been quite powerful, allowing us and others to rapidly create code that still executes quickly. So, these languages have been a major breakthrough on the software end.
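As a rough illustration of the domain-specific-language idea, the Python sketch below lets a user supply an algebraic pairwise energy expression as a string and evaluates it over all particle pairs, with NumPy standing in for the GPU backend. This is not any particular DSL; real systems compile such expressions into optimized GPU kernels rather than evaluating them with NumPy.

# Toy "DSL" sketch: a declarative pairwise energy expression is mapped onto a
# vectorized backend. Real frameworks generate GPU kernels from such strings.

import numpy as np

def pairwise_energy(expression, positions, **params):
    """Sum a pairwise energy expression (a function of distance r)
    over all unique particle pairs."""
    n = len(positions)
    i, j = np.triu_indices(n, k=1)                    # all unique pairs
    r = np.linalg.norm(positions[i] - positions[j], axis=1)
    # "Compile" the expression by evaluating it over the whole distance array.
    energy_per_pair = eval(expression, {"np": np, "r": r, **params})
    return energy_per_pair.sum()

positions = np.random.rand(100, 3)
# A Lennard-Jones-style expression written by the user, not hard-coded here.
lj = "4*epsilon*((sigma/r)**12 - (sigma/r)**6)"
print(pairwise_energy(lj, positions, epsilon=1.0, sigma=0.1))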

 

Challenges ahead:

With each generation of new GPUs, we have to re-tune our domain-specific languages. So the constant maintenance of doing this is an ongoing challenge. Sustainable funding is also a challenge. People think software is written and then it’s done. But software is like your lawn—it needs constant maintenance and upkeep to make sure it remains in good shape. With our current funding system, it’s very difficult to sustain codes over long periods of time. Many people have chosen to commercialize their software. But this leads to a closed-off system that slows the community down. Compared with a decade ago, the NIH is doing much better on this issue, but I would like to see even more progress.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? It certainly should be one of them. Whether it should be in the top 5, 10, or 20 could be debated. But it’s certainly a significant challenge.

 

 

CHALLENGE 10

Promoting the Use of Computational Biology Tools in Education

 

Status 10 years ago:

To help forestall a likely shortage of quantitatively competent researchers, Jakobsson called for the adaptation of biomedical computing tools to education at all levels in order to capture their power to motivate youngsters to pursue biomedical research careers. He believed that the same developments that were making biomedical computing tools useful to experimental researchers could also make them the basis of compelling problem-solving educational environments for students.

 

Update by:

Brian Athey, PhD, professor and chair of computational medicine and bioinformatics at the University of Michigan

 

Progress made:

I think we made good progress at getting computational tools out there, thanks in large part to the National Centers for Biomedical Computing (NCBCs). The imaging pipeline that came out of the Center for Computational Biology NCBC, LONI, was key to the Alzheimer’s Disease Neuroimaging Initiative. Andrea Califano’s network biology tools in cancer have made a dramatic impact on our understanding of systems biology and cancer. The National Center for Biomedical Ontology put together collections of ontologies that are being used worldwide.

 

But it is a fundamentally different world that we’re living in now compared with 2005 because of the proliferation of data. A decade ago, we were focused on computing tools and software; that focus has now been eclipsed by big data analysis. The computer is more in the background; the data and information are in the foreground.

 

Challenges ahead:

There’s more of a need for training than there was even 10 years ago. Most biomedical researchers, from the basic to the clinical sciences, are dealing with heterogeneous digital data. They need to learn how to access and analyze these data. We need to bring forward basic exposure and instruction in data science at all levels, from undergraduates through to the faculty.

 

20/20 Hindsight:

Given the advances of the last decade, was this challenge the right one? It was the perfect challenge, very important to put on the list. And it’s important to keep it on the list for the next decade, with a new focus on data and information.

 


