Data Intensive Computing in High Energy and Nuclear Physics

Investigators: L. Dennis and G. Fox

 

The FSU High Energy Physics Group is an active participant in two major experiments: D0 at Fermilab and CMS at CERN. We plan to be a center for large-scale Monte Carlo simulation of high-energy reactions as well as a center for the development and application of advanced neural network and probabilistic methods of data analysis. We are convinced both that the search for the Higgs offers an opportunity for landmark discoveries in fundamental physics in the next decade and that advances in core technology are needed to make those discoveries possible.

 

The FSU Nuclear Physics Group is an active participant in major experimental programs at Jefferson Lab: the CEBAF Large Acceptance Spectrometer Collaboration (CLAS) and the planned Hall D project. We currently serve as a simulation resource for CLAS and could serve in a larger capacity for Hall D. These facilities will provide the detailed experimental information on baryon resonances required for a fundamental examination of Quantum Chromodynamics in nuclei.

 

Currently, the above experimental programs are generating hundreds of terabytes of data yearly. This production rate will grow to many petabytes annually. Our ability to work effectively with enormous data sets and to mine the data for the required information will determine the rate of progress in these fields.

 

There are several problems we wish to solve, of which the following are the most pressing: 1) to create a software system capable of running simulations in parallel on hundreds of computer nodes, used by possibly dozens of scientists at a time who may be scattered across the globe; 2) to create a software system that provides an easy interface to both real and simulated data, not only for FSU scientists but also for our colleagues worldwide; and 3) to create an analysis environment that makes it possible to manipulate and visualize multiple data sets as if they were a single entity.
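As an illustration of the third goal only, the following is a minimal sketch of how several data sets might be presented to analysis code as a single entity. It is not part of any existing FSU, D0, CMS, or CLAS software; the class name CombinedDataset and the sample values are hypothetical, and each data set is assumed to be an in-memory sequence of events rather than files spread across remote sites.

```python
from bisect import bisect_right
from itertools import accumulate


class CombinedDataset:
    """Read-only view that presents several event sequences as one.

    Hypothetical helper for illustration; supports integer indexing only.
    """

    def __init__(self, *datasets):
        # Empty data sets contribute no events, so drop them up front.
        self._datasets = [d for d in datasets if len(d) > 0]
        # Running totals: _offsets[i] is the event count of data sets 0..i.
        self._offsets = list(accumulate(len(d) for d in self._datasets))

    def __len__(self):
        return self._offsets[-1] if self._offsets else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("event index out of range")
        # Locate the data set that holds this global index.
        which = bisect_right(self._offsets, index)
        start = self._offsets[which - 1] if which else 0
        return self._datasets[which][index - start]

    def __iter__(self):
        for dataset in self._datasets:
            yield from dataset


# Made-up values standing in for measured and simulated events.
measured = [1.0, 2.0, 3.0]
simulated = [4.0, 5.0]
combined = CombinedDataset(measured, simulated)
```

Analysis code can then loop over or index into combined without knowing which underlying data set an event came from. A production system would also have to handle remote storage, metadata, and access control, none of which this sketch attempts.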

 

We have considerable experience from our current experiments in dealing with large data volumes and widely distributed computations on computer clusters. For example, the Jefferson Lab experiments generate nearly a terabyte of acquired and simulated data on a good day. The simulations essential for using this data take place on computer clusters at several universities, including the 32-processor Linux cluster at FSU. As the data volume from experiments and computations explodes, we will require integrated problem-solving environments to effectively conduct, track, combine and compare the results of thousands to hundreds of thousands of related but independently conducted computations. Without such tools, scientists will spend an inordinate amount of time simply keeping track of what computations they have done and where their data is stored.

 

There have been some projects in this area, including the NILE project and, more recently, the National Scalable Cluster Project, but technologies are changing rapidly, and a modern infrastructure built around XML databases and interactive (Java) data-mining tools appears very promising. Note that even though this field has large data volumes, the analysis is compute-bound rather than I/O-bound. It is therefore attractive to build more flexible and powerful information management schemes. Note also that although Fox’s current emphasis is computer science, while on the faculty at Caltech he built and used the data analysis system for several successful physics experiments at Fermilab.