GIOD Project Description
(1997-2000)

The GIOD
(Globally Interconnected Object Databases) joint project between Caltech, CERN
and HP addressed the data storage and access problems
posed by the next generation of particle collider experiments, which
are due to start at CERN in 2008.
The following information is archived material.
The data rates from the experiments' online systems will be of
order 150 to 1500 MBytes/sec (each event's data is ~1 MByte), giving rise to a yearly accumulation of
several PetaBytes. The raw data from the online systems will be reconstructed to particle
tracks, energy clusters, etc. in near-real time by large processor farms based on
commodity hardware. We expect farms of ~10^7 MIPS will be required. The
reconstructed data (around 100 kBytes per event) will be stored (perhaps with the raw
data) in ODBMS.
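
As a rough check on the figures above, the short Python sketch below converts the quoted data rates into yearly event counts and volumes. The event sizes and rates are taken from the text; the figure of 10^7 seconds of data taking per year is an assumption introduced here for illustration (it is not stated in this document), and the function name is hypothetical.

```python
# Back-of-envelope estimate of yearly data volumes from the quoted rates.
MB = 10**6
PB = 10**15

RAW_EVENT_BYTES = 1 * MB                    # from the text: ~1 MByte per event
RECO_EVENT_BYTES = 100 * 10**3              # from the text: ~100 kBytes per event
RATES_BYTES_PER_S = (150 * MB, 1500 * MB)   # from the text: 150 to 1500 MBytes/sec
LIVE_SECONDS_PER_YEAR = 10**7               # ASSUMPTION: not given in the text


def yearly_totals(rate_bytes_per_s, live_seconds=LIVE_SECONDS_PER_YEAR):
    """Return (events per second, events per year, raw bytes, reconstructed bytes)."""
    events_per_s = rate_bytes_per_s // RAW_EVENT_BYTES
    events_per_year = events_per_s * live_seconds
    raw_bytes = rate_bytes_per_s * live_seconds
    reco_bytes = events_per_year * RECO_EVENT_BYTES
    return events_per_s, events_per_year, raw_bytes, reco_bytes


for rate in RATES_BYTES_PER_S:
    eps, epy, raw, reco = yearly_totals(rate)
    print(f"{rate // MB} MB/s: {eps} events/s, {epy:.1e} events/yr, "
          f"{raw / PB:.2f} PB raw, {reco / PB:.2f} PB reconstructed")
```

Under that assumption the raw data come to roughly 1.5 to 15 PetaBytes per year, in line with the "several PetaBytes" quoted above; a different live time scales the totals proportionally.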

Object data from around 10^9 particle collisions will need to
be made available each year to collaborating physicists. This will require replication of
significant fractions of the ODBMS amongst "regional centres" (which serve
outlying collaborating institutes), which are scattered across the globe.
The project ended in 2000, having
investigated the scalability of commercial ODBMS,
and models of organising the data to optimise access and analysis for the
end-user physicist. Some serious
challenges were identified in devising a system architecture
that allows sufficient flexibility while at the same time preventing inadvertent abuse!
In the project we used several
then "leading edge"
hardware and software systems, namely the Caltech HP Exemplar, a 256-CPU PA8000 SMP
machine of ~0.1 TIPS, the High Performance Storage System (HPSS) from IBM, the
Objectivity/DB Object Database Management System, the Java 3D API from Sun Microsystems,
the Versant ODBMS, and various high speed Local Area and Wide Area networks.
Overview
A data thunderstorm is gathering on the horizon with the next generation of particle
physics experiments. The amount of data is overwhelming. Even though the prime data from
the CERN CMS detector will be reduced by a factor of more than 10^7, it will still amount
to over a Petabyte (10^15 bytes) of data per year accumulated for scientific analysis. The
task of finding rare events resulting from the decays of massive new particles in a
dominating background is even more formidable. Particle physicists have been at the
vanguard of data-handling technology, beginning in the 1940s with eye scanning of
bubble-chamber photographs and emulsions, through decades of electronic data acquisition
systems employing real-time pattern recognition, filtering and formatting, and continuing
on to the PetaByte archives generated by modern experiments. In the future, CMS and other
experiments now being built to run at CERN's Large Hadron Collider expect to
accumulate of order 100 PetaBytes within the next decade.
The scientific goals and discovery potential of the experiments will only be realized
if efficient worldwide access to the data is made possible. Particle physicists are thus
engaged in large national and international projects that address this massive data
challenge, with special emphasis on distributed data access. There is an acute awareness
that the ability to analyze data has not kept up with its increased flow. The traditional
approach of extracting data subsets across the Internet, storing them locally, and
processing them with home-brewed tools has reached its limits. Something drastically
different is required. Indeed, without new modes of data access and of remote
collaboration we will not be able to effectively mine the intellectual
resources represented in our distributed collaborations. Thus the projects we are working
on explore and implement new ideas in this area that until now have only been discussed in
a theoretical context. These ground-breaking projects include:
To be as realistic as possible, the projects make use of large existing data sets from
high energy and nuclear physics experiments. They will help to answer some important
questions that include:
- How are we going to integrate the querying algorithms and other tools to speed up access
to the distributed data?
- How are we going to cluster the data optimally for fast access?
- How can we optimize the clustering and querying of data distributed across continents?
- What dynamical re-clustering strategies should be used?
- How do we compromise between fully ordered (sequential) organization, and totally
anarchic, random arrangements of the data?
The use of OO languages and Object persistency is fundamental in our current thinking:
these technologies allow us to define, implement and store the physics objects and
inter-relationships that we deal with. We can then express highly complicated queries
on the object store in order to extract the events and features of interest.
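
The idea of storing physics objects with their inter-relationships and then querying them can be illustrated with a minimal in-memory sketch. This is plain Python, not the Objectivity/DB or Versant API; the class names, attributes, thresholds and sample values are all hypothetical and chosen only to mirror the tracks and energy clusters mentioned earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Track:
    pt_gev: float   # transverse momentum in GeV (hypothetical attribute)
    charge: int


@dataclass
class EnergyCluster:
    energy_gev: float   # deposited energy in GeV (hypothetical attribute)


@dataclass
class Event:
    event_id: int
    # Relationships: each event owns its reconstructed tracks and clusters.
    tracks: List[Track] = field(default_factory=list)
    clusters: List[EnergyCluster] = field(default_factory=list)


def select_events(events, min_track_pt_gev, min_cluster_energy_gev):
    """Return events with at least one track and one cluster above the thresholds."""
    return [
        ev for ev in events
        if any(t.pt_gev >= min_track_pt_gev for t in ev.tracks)
        and any(c.energy_gev >= min_cluster_energy_gev for c in ev.clusters)
    ]


# A toy "object store": an in-memory list standing in for a persistent database.
store = [
    Event(1, [Track(45.0, +1), Track(3.2, -1)], [EnergyCluster(60.0)]),
    Event(2, [Track(2.1, -1)], [EnergyCluster(5.0)]),
    Event(3, [Track(80.5, -1)], [EnergyCluster(12.0)]),
]

selected = select_events(store, min_track_pt_gev=40.0, min_cluster_energy_gev=50.0)
print([ev.event_id for ev in selected])
```

In a real ODBMS the objects would be persistent and the query would run against the database rather than a Python list; the sketch only shows how a selection can follow the relationships between an event and its constituent objects.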
These research directions will very likely be taken up in other branches of science,
and in large corporations: the ability to rapidly mine scientific data, and the use of
smart query engines will be a fundamental part of daily research and education in the 21st
century.