The Human Genome Project should produce the entire sequence of all
bases of human DNA in the coming decades. It is
reasonable to expect that in the same time frame it may be possible to
sequence genomes of similar size, such as those of individual humans, experimental and
domestic animals, crops, and others, at the rate of one per year. Sequencing rates
of 100 bases per second, roughly what one genome of about 3 billion bases per year requires, may be routine. Full, element-by-element comparisons
of new sequences with existing databases would require more than
to
elementary comparisons per second, compounded as
the database grows. Clever heuristics can reduce the essential
operations, but a newly discovered method could require complete
re-analysis of all existing data. Full comparison of human and mouse
genomes would need
-
operations. This kind of
comparative sequence analysis is already one of the most powerful
sources of knowledge about genes, and shows promise of becoming more
important as the data and knowledge bases grow. These discoveries will
drive the need for even more sophisticated analyses.
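
To make the cost of a full, element-by-element comparison concrete, the following is a minimal sketch of one common formulation: a dynamic-programming edit distance that fills one table cell per pair of bases. The function names `edit_distance` and `comparison_cells` are illustrative, and the unit costs are an assumption; the text does not say which comparison method or scoring scheme is intended.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by full dynamic programming.

    Every (i, j) pair of positions is visited once, so the work is
    len(a) * len(b) elementary comparisons. Only two rows are kept in memory.
    """
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, base_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, base_b in enumerate(b, start=1):
            substitution = 0 if base_a == base_b else 1
            curr.append(min(
                prev[j] + 1,                 # delete base_a
                curr[j - 1] + 1,             # insert base_b
                prev[j - 1] + substitution,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def comparison_cells(len_a: int, len_b: int) -> int:
    """Number of table cells a full pairwise comparison must fill."""
    return len_a * len_b
```

Under the assumption of two genomes of about 3×10^9 bases each (a round figure, not taken from the text), `comparison_cells(3 * 10**9, 3 * 10**9)` gives 9×10^18 cells for a single exhaustive pairwise comparison, which is why heuristics that prune the table matter so much in practice.
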
Genomic information completely determines the characteristics of the
protein and nucleic acid molecules that express a living organism's
form and function. One of the greatest challenges, in which
computation is playing a major role, is the prediction of higher-order
structure from the one-dimensional sequence of genes. Rules for
prediction of macromolecule folding are beginning to emerge. In the
case of RNA, there are some simple rules that partially predict the
secondary interactions of distant parts of the polymer chain. A
deterministic, dynamic programming code is in wide use. It scales as
$O(N^3)$ in operations and
$O(N^2)$ in memory, where $N$ is the sequence length. Other
non-deterministic methods are also available. More complex methods for
predicting three-dimensional structure are appearing. Preliminary
secondary structure predictions for sequences of HIV RNA (9,218
nucleotides) can be done on a
processor SIMD MasPar or an
8-processor CRAY Y-MP in about six hours. There are hundreds of thousands of
sequences of potential interest ranging in size from 50 to 10,000
nucleotides.
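
As an illustration of how a dynamic-programming approach to RNA secondary structure works, here is a minimal sketch of Nussinov-style base-pair maximization. It is a deliberately simplified stand-in, not the widely used code the text refers to: it counts pairs instead of minimizing free energy, and the pairing set `PAIRS` and the `min_loop` parameter are assumptions made for the example. The triple loop and the square table show where cubic time and quadratic memory come from in this family of methods.

```python
# Watson-Crick pairs plus the G-U wobble pair (an assumed, simplified rule set).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by any pair.
    best[i][j] holds the optimum for the subsequence seq[i..j], so memory is
    quadratic in the sequence length and the three nested loops are cubic.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is left unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_base_pairs("GGGAAAUCC")` finds the three nested pairs of a small hairpin. A production folding code replaces the pair count with a thermodynamic energy model, which changes the recurrences but not the basic table-filling structure.
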
Protein structure prediction is of even greater interest, since
proteins are the principal agents of expression for genetic
information. Rules for prediction are more complex than for RNA, and
are a research area of major concern. In its extreme, the problem
could be viewed as of $m^N$ complexity for a chain of $N$ amino acids, where reasonable values for $m$,
the number of conformations each amino acid may take, could be dozens.
Exhaustive conformational search would not be feasible for many
proteins, even with PetaFLOPS computers. However, even now there
are a number of strategies to explore the problem. Exhaustive search
on highly simplified lattice models, using simplified potential
functions, is partially successful. Statistics-based assignment of
structure from sequence similarity is another. In all cases,
refinement of final structures requires molecular mechanical and
computational chemistry tools.
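
The two ideas in this paragraph, the exponential size of the conformational space and exhaustive search on a simplified lattice, can be sketched as follows. This is a toy under stated assumptions, not a method taken from the text: the chain lives on a 2-D square lattice, residues are labeled only `H` (hydrophobic) or `P` (polar), and the "potential" simply counts non-bonded H-H contacts. The names `conformation_count` and `best_hp_contacts` are hypothetical.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def conformation_count(states_per_residue: int, residues: int) -> int:
    """Size of the search space if each residue independently takes m states."""
    return states_per_residue ** residues


def best_hp_contacts(sequence: str) -> int:
    """Exhaustively enumerate self-avoiding walks on a 2-D square lattice and
    return the largest number of non-bonded H-H contacts found."""
    n = len(sequence)
    if n < 2:
        return 0
    best = 0
    path = [(0, 0)]           # lattice site of each residue placed so far
    occupied = {(0, 0): 0}    # lattice site -> residue index

    def contacts() -> int:
        total = 0
        for (x, y), i in occupied.items():
            if sequence[i] != "H":
                continue
            for dx, dy in MOVES:
                j = occupied.get((x + dx, y + dy))
                # j > i + 1 skips chain neighbors and counts each pair once
                if j is not None and j > i + 1 and sequence[j] == "H":
                    total += 1
        return total

    def extend(i: int) -> None:
        nonlocal best
        if i == n:
            best = max(best, contacts())
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            site = (x + dx, y + dy)
            if site not in occupied:  # self-avoidance
                occupied[site] = i
                path.append(site)
                extend(i + 1)
                path.pop()
                del occupied[site]

    extend(1)
    return best
```

Even this crude model grows exponentially with chain length, so exhaustive enumeration is only practical for short chains. `conformation_count(12, 100)` already has more than a hundred decimal digits, which illustrates why brute-force search over realistic conformations is out of reach at any foreseeable machine speed.
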
Much experimental work, involving heavy computation, is still needed in algorithm development for both the RNA and protein-folding problems.
Problem match to three categories of PetaFLOPS computer: Almost every sort of high-performance architecture is easily adaptable to sequence comparison. Workstations are very competitive at the present time.
Class I machines are seldom used for sequence matching. High-precision word length is not needed. Lattice models are explored with supercomputers for their raw scalar speed. Three-dimensional modeling is most frequently done on scalar-plus-vector machines and on shared-memory machines with a limited number of processors.
Class II machines will do well on sequence comparison. Some HPCC efforts are adapting the molecular mechanical and electronic structure methods to this class.
Class III machines are beginning to appear for sequence matching in commercial products. Some experimental adaptations of molecular mechanics calculations on heterogeneous architectures have been done.