Next: Drug Design Up: Exemplar Applications Previous: Computational Quantum Chemistry-HIV

PetaFLOPS or PetaOPS Requirements from Genome Projects

The Human Genome Project should produce the entire sequence of all bases of human DNA in the next decades. It is reasonable to expect that in that same time frame it may be possible to sequence genomes of similar size, such as those of individual humans, experimental and domestic animals, crops, and others, at the rate of one per year. Rates of 100 bases per second may be routine. Full, element-by-element comparisons of new sequences with existing databases would require more than to elementary comparisons per second, compounded as the databases grow. Clever heuristics can reduce the number of essential operations, but a newly discovered method could require complete re-analysis of all existing data. Full comparison of human and mouse genomes would need - operations. This kind of comparative sequence analysis is already one of the most powerful sources of knowledge about genes, and shows promise of becoming more important as the data and knowledge bases grow. These discoveries will drive the need for even more sophisticated analyses.
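To give a feel for the scale of a full element-by-element comparison, the sketch below counts the cells in an all-against-all comparison of two sequences and converts that count to run time at a sustained rate. The genome lengths (rounded to 3 billion bases each), the cost of one operation per cell, and the function names are assumptions made for illustration; they are not figures taken from this report.

```python
# Rough scale estimate for a full pairwise comparison of two genomes.
# All constants are illustrative assumptions, not values from the report.

HUMAN_BASES = 3_000_000_000   # assumed, rounded genome length
MOUSE_BASES = 3_000_000_000   # assumed, rounded genome length
PETA_OPS = 10**15             # one PetaOPS: 10^15 operations per second
GIGA_OPS = 10**9              # a workstation-class rate, for contrast


def full_comparison_cells(len_a: int, len_b: int) -> int:
    """Number of element pairs in an all-against-all comparison."""
    return len_a * len_b


def seconds_at_rate(operations: int, ops_per_second: int) -> float:
    """Run time if each element pair costs one operation (an assumption)."""
    return operations / ops_per_second


cells = full_comparison_cells(HUMAN_BASES, MOUSE_BASES)
print(f"element pairs: {cells:.1e}")
print(f"hours at 1 PetaOPS: {seconds_at_rate(cells, PETA_OPS) / 3600:.1f}")
print(f"years at 1 GigaOPS: {seconds_at_rate(cells, GIGA_OPS) / (3600 * 24 * 365):.0f}")
```

Real alignment methods spend several operations per cell, so this count is best read as a lower bound on the work involved.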

Genomic information completely determines the characteristics of the protein and nucleic acid molecules that express a living organism's form and function. One of the greatest challenges, in which computation is playing a major role, is the prediction of higher-order structure from the one-dimensional sequence of genes. Rules for prediction of macromolecule folding are beginning to emerge. In the case of RNA, there are some simple rules that partially predict the secondary interactions of distant parts of the polymer chain. A deterministic, dynamic programming code is in wide use. It scales as $N^3$ in operations and $N^2$ in memory, where $N$ is the sequence length. Other, non-deterministic methods are also available. More complex methods for predicting three-dimensional structure are appearing. Preliminary secondary structure predictions for sequences of HIV RNA (9,218 nucleotides) can be done on a SIMD MasPar machine or an 8-processor CRAY Y-MP in about six hours. There are hundreds of thousands of sequences of potential interest, ranging in size from 50 to 10,000 nucleotides.
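The cubic-time, quadratic-memory behavior of dynamic programming for RNA secondary structure can be illustrated with a minimal sketch. The code below is a Nussinov-style base-pair maximization, a simplified relative of the energy-minimization codes in wide use, and not the specific code referred to above. The pairing rules, the minimum loop length of 3, and the function name are assumptions chosen for illustration.

```python
# Nussinov-style base-pair maximization: a simplified stand-in for
# energy-based RNA folding codes. It fills an N x N table (quadratic
# memory) and tries every split point per cell (cubic operations).

# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, with at least `min_loop`
    unpaired bases enclosed by any pair (an assumed constraint)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = max pairs in seq[i..j]; short spans stay at zero.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]          # case: base j is unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:   # case: k pairs with j
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

For a sequence the length of the HIV RNA quoted above, the table alone holds about 85 million entries, which is why memory as well as operation count matters for this method.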

Protein structure prediction is of even greater interest, since proteins are the principal agents of expression for genetic information. Rules for prediction are more complex than for RNA, and are a research area of major concern. In its extreme, the problem could be viewed as of $m^N$ complexity, where $N$ is the number of amino acids in the chain and reasonable values for $m$, the number of conformations each amino acid may take, could be dozens. Exhaustive conformational search would not be feasible for many proteins, even with PetaFLOPS computers. However, even now there are a number of strategies to explore the problem. Exhaustive search on highly simplified lattice models, using simplified potential functions, is partially successful. Statistics-based assignment of structure from sequence similarities is another. In all cases, refinement of final structures requires molecular mechanical and computational chemistry tools.
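As an illustration of exhaustive search on a highly simplified lattice model, the sketch below enumerates every self-avoiding fold of a short chain on a 2D square lattice and scores each fold by the number of non-bonded hydrophobic contacts. This hydrophobic-polar (HP) model, the contact-count scoring, and all function names are assumptions chosen for the example; the lattice models and potential functions used in practice may differ.

```python
# Exhaustive conformational search on a 2D square lattice using the
# HP (hydrophobic-polar) model. Illustrative only: feasible just for
# short chains, because the number of folds grows exponentially.

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def enumerate_folds(n_residues: int):
    """Yield every self-avoiding walk of n_residues sites from the origin."""
    def extend(path, occupied):
        if len(path) == n_residues:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(nxt)

    if n_residues > 0:
        yield from extend([(0, 0)], {(0, 0)})


def hh_contacts(seq: str, path) -> int:
    """Count lattice-adjacent H-H pairs that are not chain neighbors."""
    index = {site: i for i, site in enumerate(path)}
    count = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):     # right and up: each pair once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and seq[j] == "H":
                count += 1
    return count


def best_fold(seq: str) -> int:
    """Best contact count over all folds of a non-empty HP sequence."""
    return max(hh_contacts(seq, path) for path in enumerate_folds(len(seq)))


print(sum(1 for _ in enumerate_folds(4)), best_fold("HPPH"))
```

Even in this two-dimensional toy, each added residue multiplies the number of folds by a factor between 2 and 3, which conveys why exhaustive search is confined to simplified models and short chains.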

Much experimental work, involving heavy computation, is still needed in algorithm development for both the RNA-folding and protein-folding problems.

Problem match to three categories of PetaFLOPS computer: Almost every sort of high-performance architecture is easily adaptable to sequence comparison. Workstations are very competitive at the present time.

Class I machines are seldom used for sequence matching. High-precision word length is not needed. Lattice models are explored with supercomputers for their raw scalar speed. Three-dimensional modeling is most frequently done on scalar-plus-vector machines and on shared-memory multiprocessors of limited size.

Class II machines will do well on sequence comparison. Some HPCC efforts are adapting the molecular mechanical and electronic methods.

Class III machines are beginning to appear for sequence matching in commercial products. Some experimental adaptations of molecular mechanics calculations on heterogeneous architectures have been done.





gcf@npac.syr.edu