Data Intensive Computing in High Energy and Nuclear Physics
Investigators: L. Dennis and G. Fox
The FSU High Energy Physics Group is an active participant in two major experiments: D0 at Fermilab and CMS at CERN. We plan to be a center for large-scale Monte Carlo simulation of high-energy reactions as well as a center for the development and application of advanced neural network and probabilistic methods of data analysis. We are convinced both of the opportunity for landmark discoveries in fundamental physics in the next decade, resulting from the search for the Higgs, and of the need for advances in core technology to make them possible.
The FSU Nuclear Physics Group is an active participant in major experimental programs at Jefferson Lab: the CEBAF Large Acceptance Spectrometer Collaboration (CLAS) and the planned Hall D project. We currently serve as a simulation resource for CLAS and could serve in a larger capacity for Hall D. These facilities will provide the detailed experimental information on baryon resonances required for a fundamental examination of Quantum Chromodynamics in nuclei.
Currently, these experimental programs generate hundreds of terabytes of data yearly. This production rate will grow to many petabytes annually. Our ability to work effectively with enormous data sets, and to mine them for the required information, will determine the rate of progress in these fields.
There are several problems we wish to solve, of which the following are the most pressing: 1) to create a software system capable of running simulations in parallel on hundreds of computer nodes, for possibly dozens of scientists at a time, who may be scattered across the globe; 2) to create a software system that provides an easy interface to both real and simulated data, not only for FSU scientists but also for our colleagues worldwide; and 3) to create an analysis environment that makes it possible to manipulate and visualize multiple data sets as if they were a single entity.
We have considerable experience from our current experiments in dealing with large data volumes and widely distributed computations on computer clusters. For example, the Jefferson Lab experiments generate nearly a terabyte of acquired and simulated data on a good day. The simulations essential for using this data take place on computer clusters at several universities, including the 32-processor Linux cluster at FSU. As the data volume from experiments and computations explodes, researchers will require integrated problem-solving environments to effectively conduct, track, combine and compare the results of thousands to hundreds of thousands of related, but independently conducted, computations. Without such tools, scientists will spend an inordinate amount of time simply keeping track of which computations they have done and where their data is stored.
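The bookkeeping problem described above can be illustrated with a minimal sketch in Python. It is not a design for the proposed environment: the names RunRecord and RunCatalog, the record fields, and the run identifiers and paths are all hypothetical, and a real system would persist the catalog in a database rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunRecord:
    """One independently conducted computation (all fields are illustrative)."""
    run_id: str
    experiment: str   # e.g. "CLAS"
    kind: str         # "simulation" or "acquired"
    site: str         # where the job ran
    output_path: str  # where the resulting data is stored


@dataclass
class RunCatalog:
    """In-memory index of runs; a stand-in for a persistent catalog."""
    _runs: dict = field(default_factory=dict)

    def register(self, record: RunRecord) -> None:
        # Refuse duplicates so that each run_id maps to exactly one record.
        if record.run_id in self._runs:
            raise ValueError(f"duplicate run_id: {record.run_id}")
        self._runs[record.run_id] = record

    def find(self, **criteria) -> list:
        """Return records whose fields equal every given criterion."""
        return [
            r for r in self._runs.values()
            if all(getattr(r, k) == v for k, v in criteria.items())
        ]

    def locate(self, run_id: str) -> str:
        """Answer 'where is the data for this run stored?'."""
        return self._runs[run_id].output_path


# Hypothetical runs and paths, for illustration only.
catalog = RunCatalog()
catalog.register(RunRecord("sim-0001", "CLAS", "simulation", "FSU", "/data/clas/sim-0001"))
catalog.register(RunRecord("sim-0002", "CLAS", "simulation", "JLab", "/data/clas/sim-0002"))
catalog.register(RunRecord("acq-0001", "CLAS", "acquired", "JLab", "/data/clas/acq-0001"))

clas_simulations = catalog.find(experiment="CLAS", kind="simulation")
```

Even this small index answers the two questions raised above, namely which computations have been done and where their output is stored, without the scientist keeping that record by hand.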
There have been some projects in this area, including the NILE project and, more recently, the National Scalable Cluster Project, but technologies are changing rapidly, and a modern infrastructure built around XML databases and interactive (Java) data-mining tools appears very promising. Note that even though this field has large data volumes, the analysis is compute-bound, not I/O-bound. It is therefore attractive to build more flexible and powerful information management schemes. Note also that although Fox's current emphasis is computer science, while on the faculty at Caltech he built and used the data analysis system for several successful physics experiments at Fermilab.