The Caltech/CERN/HP Joint Project

INTRODUCTION

The Computing Technical Proposals of CERN's CMS and ATLAS Experiments [1,2] describe the unprecedented challenge of computing at the LHC. Its scale is well illustrated by the billions of highly complex interactions that must be recorded for physics analysis by the detectors each year. This amounts to an expected accumulation of several PetaBytes of event data each year, starting in 2005 at LHC turn-on and continuing for many years thereafter. During this time, the recorded data will be analysed by thousands of physicists at their home institutes around the world.

The computing models proposed by the LHC experiments favour storage of and access to the event data through large Object Database Management Systems (ODBMS) located at CERN, with replicas at so-called Regional Centres. Regional Centres are special computing centres that serve groups of national collaborating institutes and enjoy very high bandwidth connectivity to CERN, where the raw experimental data are accumulated.

The models represent an unprecedented challenge for data storage, access and networking technology. The challenge extends to the entire Object Oriented software development task, a task that must ensure that the software correctly implements and utilises the known relationships between the HEP objects stored in the ODBMS.

Effectively meeting this challenge must be a primary goal for the LHC experiments and therefore for CERN itself. In this document, we present detailed plans for a project which addresses the data storage, access and networking problems that are implicit in the LHC computing models. The project makes use of pre-funded hardware, financial support from Hewlett Packard, and participation by experts from CERN's IT Division.

OVERVIEW

The Project's purpose is to construct a large-scale prototype of an LHC Computing Centre. The prototype envisaged will employ a latest-generation multi-node SMP server on which a large-scale ODBMS and the High Performance Storage System (HPSS) will be installed, together with connections to high-bandwidth Wide Area Networks (WANs). In time, a complete LHC software environment will also be installed and evaluated, and production simulation work carried out. To address the networking aspects of the challenge, tests of the ODBMS across the WANs will be made, providing better predictions of how central and remote computing centres might interoperate on a global scale.

COMPUTING INFRASTRUCTURE

Recent agreements between Caltech/JPL, NASA and Hewlett Packard (HP) have secured project funding for the installation of a major computing facility in the form of next-generation shared memory SMPs from HP. The initial machine (to be installed in June '97) will have 256 nodes, with a possible extension to 384 or 512 nodes by '98, and then a definite replacement of the CPUs with "Merced" chips in '99. The initial configuration will have roughly 40 times the CPU power of CERN's CSF, of which 10-20% will be available to HEP.

The Table below shows the timescales for the installation of hardware at Caltech/JPL and the San Diego Supercomputer Centre.

DATE DESCRIPTION
Apr '97 128 200 MHz PA8000 node HP SPP2000 installed at Caltech/JPL with 64 GBytes shared memory.
Jun/Jul '97 SPP2000 nodes increased to 256. Possible further expansion to 384-512 nodes depends on additional funding and consortium partners.
End '97 OC12 (622 Mbits/sec) Caltech to SDSC link installed.
Start '98 SPP2000 connected in the Internet II project?
Jun '98 Peer system installed at SDSC?
Mid '98 Two "Yosemite" servers with 8 "Merced" nodes, 8 GBytes of memory and 100 GBytes of disk, installed as general-purpose machines for debugging and development.
End '98 Peer system at SDSC upgraded?
May '99 32 node (64 GFLOPS) "Merced" based SPP3000 (Serial No. 1) installed at Caltech, with 64 Gbytes of memory.
Nov '99 Another 32 "Merced" nodes and 64 Gbytes memory added to SPP3000 at Caltech.

GOALS AND PHASES

The Project goals are subdivided into the following phases:

TIMING

The installation and the major users of the machine will be established in mid-'97, with the machine operational by June or July. It is important that the Project establishes itself promptly as one of the major users, to ensure that the share of the machine required to make the Project a success is obtained. A start in June is sufficiently long after the first results from the DEC/RD45 Joint Project (expected end March '97) have been obtained and fully digested, allowing us to repeat those tests with confidence. The networking aspects of the Project make it a candidate Internet II project: these projects will be chosen this year, and early visibility is obviously an advantage.

Clearly, the goals and phases of the Project, and their component work items, should be reviewed at each stage in the light of progress made.

COMPLEMENTARITY WITH CERN/RD45

The RD45 project is currently focussing on the use of distributed Object Databases. These databases would be distributed across multiple servers in both the local and wide area, but would appear to the user as a single logical database. Critical resources - both system and user - would be replicated using database services. Each replica of a given database is called an image. Important issues, which can be tested in this project, include the scalability of the current Objectivity/DB architecture to large numbers of autonomous partitions and database images. The SPP architecture can be partitioned into multiple logical partitions, allowing the above scenarios to be tested in a well-controlled environment. In addition, the use of wide-area replication, using high-bandwidth networks, which are highly unlikely to be available at CERN on the time-scales mentioned, could be tested. Finally, the project will provide a test-bed that is much more conveniently located for Objectivity engineers, should special developments be required for an efficient Objectivity/DB-HPSS interface.

SUBSIDIARY BENEFITS

In addition to the above areas, the Project has other features of direct importance to the LHC Experiments and CERN:

Firstly, it provides the LHC Experiments with access to a significant amount of computing capacity that would not otherwise be available. Since CERN is between experiment generations, this is an unusual opportunity. Mature experiments (such as those at LEP) have already installed the computing capacity they needed, and are using it. However, the LHC Experiments already have massive needs for capacity for simulation work. Their total demand for '97 approaches what is currently installed and allocated for the LEP experiments.

Secondly, the project is pre-funded, and the equipment is due for installation in the near term. There is a complementary installation at the nearby San Diego Supercomputer Centre, which will be built up to a similar total power. This will allow head-to-head comparisons to be made between the two systems. The SDSC already has experience with using a large database system (DB2 from IBM) with HPSS, and has plans to work in the area of distributed OO database systems in the future. Once the Caltech and SDSC machines are networked together, we can investigate the heterogeneous aspects of the LHC Computing models, and perhaps experimentally model the diverse interactions between the principal (CERN) Centre, and the remote centres.

FUNDING

The Project makes use of pre-funded computing hardware. The other costs include funding for Project coordination, travel for experts from CERN visiting Caltech, travel to attend review meetings at CERN and to present the Project to the LHC Experiments, displacement costs, software licensing costs, and logistics support at Caltech. These costs are detailed as follows:

PARTICIPATION

The Project participants are CERN, Caltech and HP. The Project will be led jointly by Julian Bunn, representing CERN as a Visiting Associate in the Physics, Mathematics and Astronomy Division of Caltech, and Harvey Newman, both as the Chair of the CMS Software and Computing Board (SCB), and as the Caltech faculty member in charge of the Caltech HEP Group in CMS.

CERN's participation includes time from experts in the RD45 project and in the PDP Group on short visits to Caltech, to work on the Objectivity and HPSS parts of the Project in collaboration with the HEP and CACR teams. In addition, the strong interest shown by CMS in the Project leads us to expect programming contributions from individuals in the collaboration, in particular in the area of the future GEANT4-based CMSIM. We expect assistance with the Project from two part-time CMS physicists, probably located geographically close to Caltech (e.g. at San Diego or UC Davis).

The Caltech HEP Group will cover the following aspects of the Project. Two members of staff will be provided: as well as assisting with the Objectivity and HPSS work, one will undertake the porting of CMSIM, and the other will manage the installation and production running of CMSIM. The Caltech Centre for Advanced Computing Research (CACR), under Paul Messina, will provide the hardware and software environment for the Project, and assist in development work such as the porting of HPSS to the SPP2000. Specifically, the Caltech work will comprise:

IMPACT ON CTP REQUIREMENTS FOR THE ODBMS

The project will test the ODBMS against the general requirements as laid out in the CMS CTP, namely:

  1. All types of HEP data (raw, reconstructed, summary and tag) should be accessible in a uniform manner.
  2. It should be possible to make dynamic selections of events and associated event information. The Object Query Language (OQL) used to make the selections should be sufficiently powerful and intuitive (a minimal selection sketch follows this list).
  3. The system should optimise access bandwidth according to data access patterns, by moving frequently accessed data to higher performance media, and migrating infrequently used data lower in the storage hierarchy. This part of the evaluation will test both the ODBMS and its coupling to the HPSS. The system should also determine relations between the stored objects based on patterns of how they are accessed, and then "cluster" the objects accordingly. Additionally, it should be possible from the applications code to add information on the relations between collections of objects.
  4. The system should be capable of providing access to PetaBytes of data. Clearly, in the early stage of the project, a more modest amount of data will populate the database. The size should match or exceed that being used in the RD45/DEC Joint Project (100 GBytes initially, rising to 1 TByte later).
  5. The system must use information on the network performance, configuration and prevailing conditions to evaluate a cost function for accessing objects held remotely (a second sketch following this list illustrates such a cost function). An initial systems-simulation model to help predict and track the network-related performance issues is already under construction.
  6. The system must allow access to the data by several applications and users simultaneously. The locking system must be robust and efficient, and not impact significantly on the overall performance of the system.
  7. In order to achieve the above, the system should exhibit an efficient hierarchy of caching across several levels: single processor/node/cluster/shared memory/disk/tertiary storage, with extensions across very high speed LANs and high speed WANs.
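
As a concrete, deliberately simplified illustration of requirement 2, the sketch below performs in plain C++ the kind of tag-based selection that an OQL query would express. The EventTag class and its attributes are invented for this sketch and do not correspond to any existing CMS schema; an in-memory vector stands in for the persistent tag collection held in the ODBMS.

    #include <cstdio>
    #include <vector>

    // Hypothetical, simplified "tag": a compact summary of one event, used
    // for fast selection before the full raw/reconstructed data are touched.
    struct EventTag {
        long   eventId;   // identifier used to navigate to the full event object
        int    nJets;     // number of reconstructed jets
        double sumEt;     // total transverse energy (GeV)
    };

    int main() {
        std::vector<EventTag> tags = {
            {1001, 3, 750.0}, {1002, 1, 120.0}, {1003, 2, 540.0}
        };

        // The selection an OQL query such as
        //   select t.eventId from t in tags where t.nJets >= 2 and t.sumEt > 500.0
        // would express, written out here as an explicit loop.
        for (const EventTag& t : tags) {
            if (t.nJets >= 2 && t.sumEt > 500.0) {
                std::printf("selected event %ld\n", t.eventId);
                // ...here one would follow the association to the full event object...
            }
        }
        return 0;
    }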

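Requirement 5 can likewise be made concrete with a toy cost function. The structure, the names and the link figures below are assumptions made purely for illustration; the real cost function would be driven by measured network performance and prevailing conditions, as fed by the systems-simulation model mentioned above.

    #include <cstdio>

    // Purely illustrative cost model: not part of the CTP or of any existing ODBMS.
    struct LinkEstimate {
        double rttSeconds;      // measured round-trip time to the server holding the image
        double bandwidthMBps;   // currently available bandwidth, in MBytes/sec
    };

    // Estimated wall-clock cost (seconds) of fetching an object of the given
    // size over the described link.
    double accessCost(const LinkEstimate& link, double objectSizeMB) {
        return link.rttSeconds + objectSizeMB / link.bandwidthMBps;
    }

    int main() {
        const LinkEstimate lanLink = {0.001, 100.0};   // local database image
        const LinkEstimate wanLink = {0.150, 5.0};     // transatlantic replica (assumed figures)

        const double sizeMB = 1.0;                     // one ~1 MByte raw event
        std::printf("local image:  %.3f s\n", accessCost(lanLink, sizeMB));
        std::printf("remote image: %.3f s\n", accessCost(wanLink, sizeMB));
        // A client would prefer whichever image minimises this cost, falling
        // back to the remote copy only when the local one is unavailable.
        return 0;
    }
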
WORK ITEMS

The table below details the Project work items and their sequence, which is synchronised with the equipment installations in the table above.

DATE DESCRIPTION
Mar/Apr '97 Visit to Caltech by the Project Coordinator to preview the SPP2000 and discuss detailed plans with Messina's Group and the HEP CMS Group.
Jun '97 PHASE 0 BEGINS
Jun '97 Attendance at Objectivity/DB training course by Project Coordinator and member(s) of Caltech group(s).

Install Objectivity/DB on the SPP2000. The availability of Objectivity 4.0.2 for HP-UX 10 is promised before this time.

Install the RD45 program suite on the SPP2000, which includes programs for populating the database, testing associations, generating random events as objects, testing access methods and testing replication. These require the prior installation of Tk/Tcl, CVS and the C++ compiler.

Allow access by Objectivity engineer(s) to the HPSS client library at Caltech, for testing purposes.

Jun '97 Present Project status at CMS Week at CERN (23/6-27/6)
Jul '97 Begin performance studies work:

Population of DB (>100 GBytes disk space will be available for tests). Functionality tests. Comparison with RD45/DEC benchmark results (already available by end March '97) on the "impact of using an ODBMS for the physical organisation of the data". SMP scalability (machine partitions, replication between partitions, large numbers of servers and clients working concurrently), large memory caching, container placement and size, locking. This will require one or more visits by RD45 experts.

The above performance studies continue throughout subsequent phases.

Sep '97 REVIEW PHASE 0

PHASE 1 BEGINS

Sep '97 Caltech staff to install and test CMSIM and set up batch queues for CMSIM production. Acquire disk space for production runs. Start testing CMSIM production in around 10% of available capacity, and then full production in the background on the SPP2000. Production continues throughout subsequent phases.
Sep '97 Present Project status at CMS Week in Madison (15/9-19/9)
Nov '97 Install the ODBMS on Caltech's in-house AIX and SGI systems, and test it over LANs in the Gbit/sec range.
Dec '97 REVIEW PHASE 1

Present Project status to RD45 and CMS at CERN (8/12 - 12/12) (CMS Week)

PHASE 2 BEGINS

Jan/Feb '98 With an Objectivity engineer, a PDP expert and assistance from CACR, integrate Objectivity/DB as a client of HPSS. This assumes that the basic Objectivity HPSS support is available.
Jan/Feb '98 Install and test LHC++ and the CMS OO environment, including IRIS Explorer. Install GEANT4, preferably with the CMS geometry if available, and make tests using Objectivity/DB as the event store. Ensure the LHC++ environment functions correctly with the ODBMS. Make comparison tests with PAW and nTuples?
Jan/Feb '98 Report on results of ODBMS performance studies started in Jul '97 and contrast with results from RD45/DEC.
Jan/Feb '98 Install, test and re-evaluate all components of the environment after the SPP upgrade to 384-512 nodes (if available). This would require visits by PDP and RD45 experts. Initiate new performance and scaling studies.
Jan/Feb '98 Implement networked ODBMS over OC12 WAN with the peer machine at SDSC and make scalability and head-to-head comparison tests.
Mar '98 REVIEW PHASE 2
Mid '98 PHASE 3 BEGINS

Connectivity trials between Caltech and CERN and other large LHC centres. Replication tests. Caching tests.

Mid '98 Test the full model implementation, targeting CMS as the client. The detailed work plan defined in the Phase 2 review begins.
May/Jun '99 Install the ODBMS on the SPP3000. Test performance according to the criteria established earlier in the Project. Compare the performance of this smaller, high-performance, next-generation system with that of the SPP2000. Evaluate the system in-house and as a networked centre with SDSC and/or other sites with multi-TFLOPS power.
Mid '99 REVIEW PHASE 3

CONSIDER FOLLOW-ON PHASES AS APPROPRIATE

END '97 - '99 PROJECT

BACKGROUND SECTIONS

These sections provide some background information for those unfamiliar with some of the HEP terminology used in the body of the Project proposal.

BACKGROUND A: THE LHC EVENT

The fundamental object of interest to the physicist is the "event". An event occurs when the two counter-rotating particle beams of the LHC machine collide in the centre of one of the LHC detectors. The collision produces sprays of energy (jets) in the form of other particles, which travel outwards from the collision point through the detector. As these particles travel through the detector, their passage is signalled and digitised in more than ten million (multiplexed) channels. The digitised signals are filtered, discriminated, compressed and eventually stored in memory, on magnetic disk, or on magnetic tape.

[Figure: simulation of a single LHC interaction in the detector]

At the LHC, events are expected to occur at a rate of around 800 million per second. Careful filtering is applied at a very early stage of digitisation so that only events of interest are recorded (this process is called "triggering"). The expected rate of "interesting" events is around 100 per second. The goal of the triggering process is to ensure that it is these 100 events that are fully recorded by the detector. Each recorded event is unique and is described by roughly 1 MByte of digitised electronic channel information from the detector. The event data rate during operation of the LHC machine is thus 100 MBytes/sec for each of the LHC experiments.
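
A back-of-envelope check of these figures, sketched below, reproduces the data volumes quoted in the Introduction. The 10^7 seconds of effective running per year is an assumed canonical value, not a number taken from the CTPs.

    #include <cstdio>

    int main() {
        const double eventRateHz  = 100.0;   // "interesting" events per second after triggering
        const double eventSizeMB  = 1.0;     // ~1 MByte of digitised channel data per event
        const double secondsPerYr = 1.0e7;   // assumed effective LHC running time per year

        const double rateMBps = eventRateHz * eventSizeMB;          // 100 MBytes/sec
        const double yearlyPB = rateMBps * secondsPerYr / 1.0e9;    // MBytes -> PetaBytes

        std::printf("DAQ rate: %.0f MBytes/sec, ~%.1f PetaBytes per experiment per year\n",
                    rateMBps, yearlyPB);
        return 0;
    }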

Because of the very high rate of operation of the machine and detectors there are, at any given moment, several events propagating through the detector, like waves rippling across a pond. Thus the digitisings from the detector will contain information from around 25 different interactions. Only one of these interactions will have caused the triggering logic to read out the event, but the data from the remaining background (or "minimum bias") events will also be recorded. The picture shows a simulation of a single interaction, for clarity. To disentangle the interesting event from the background events is a difficult and challenging pattern recognition problem. To compound this difficulty, the intensity of neutron radiation inside the LHC detectors gives rise to a background "noise" that is due to a steady rate of atomic reactions. This noise is also digitised along with the interesting event data.

Not all events are of the same type. The triggering attempts to select events that fall into any one of about twenty broad categories. An event of one type will generally be followed by an event of a completely different type, and the order in which the events occur, and are recorded, has no physics significance. In traditional HEP experiments the events are stored on sequential-access devices such as magnetic tapes, and there is consequently a difficult access problem when retrieving a particular class of event.

BACKGROUND B: THE LHC DISTRIBUTED COMPUTING MODEL

The goal of the distributed LHC computing model is to realise a high degree of location-independence for physicists wishing to analyse LHC event data, within the practical bounds of network throughput at a feasible cost. The desire is that the computing resources, which are necessarily distributed amongst the collaborating institutes, are put to best use, and that new resources acquired by the institutes during the course of the construction and running phase of the experiment can be integrated and profited from by the collaborations.

The computing models described in the CMS and ATLAS CTPs favour a handful of Regional Centres that enjoy high bandwidth connectivity to CERN, and which act as the collaborations' computing centres for local institutes. The Regional Centres are expected to house substantial compute resources and support infrastructure. They will typically offer a replica of the OO event database, the master of which is populated at CERN directly from the DAQ systems in the experiments.

By using an OO database, we can store the events as objects and access them using powerful queries that operate on "collections" of objects. We can assign events carrying a particular tag to a collection, and thus avoid the costly overhead of sequentially testing each event to see whether it is of the desired type. Moreover, we can define relations between event objects and other event objects of the same type, and then navigate through all these events with great ease. In summary, the OO database approach is very promising because it gives us the ability to handle the expected complexity of LHC data.
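
A minimal sketch of this style of navigation follows, with ordinary C++ objects and pointers standing in for persistent ODBMS classes and their associations; the class names are invented for the illustration.

    #include <cstdio>
    #include <memory>
    #include <vector>

    // Toy stand-ins for persistent event classes and their associations.
    struct RawEvent  { long id; /* ~1 MByte of digitisings in reality */ };
    struct RecoEvent { long id; std::shared_ptr<RawEvent> raw; };  // reco -> raw association

    int main() {
        // A named "collection", e.g. of reconstructed dimuon candidates.
        std::vector<std::shared_ptr<RecoEvent>> dimuonCollection;

        auto raw = std::make_shared<RawEvent>(RawEvent{1001});
        dimuonCollection.push_back(std::make_shared<RecoEvent>(RecoEvent{1001, raw}));

        // Navigate from the collection straight to the raw data of each member,
        // with no sequential scan over events of other types.
        for (const auto& e : dimuonCollection) {
            std::printf("reco event %ld -> raw event %ld\n", e->id, e->raw->id);
        }
        return 0;
    }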

The federated database is a collection of databases whose members may be located in different places on the network, and which helps address the scaling issues involved in storing the massive amounts of LHC data. The federated database features a distributed lock manager that locks database pages while they are updated. Each object in the federated database has a data member indicating the database page to which it belongs. When a page of the database is locked, all objects on that page are locked against update.
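
The page granularity of the locking can be pictured with the following toy model. It is not the Objectivity/DB lock manager, merely an illustration of the consequence that objects sharing a page share a lock.

    #include <cstdio>
    #include <set>
    #include <string>

    // Toy model of page-level locking: the lock is keyed on the page, not the object.
    struct StoredObject {
        std::string name;
        int pageId;   // the database page that holds this object
    };

    class PageLockTable {
    public:
        bool lockForUpdate(int pageId) { return locked_.insert(pageId).second; }
        void unlock(int pageId)        { locked_.erase(pageId); }
        bool isLocked(int pageId) const { return locked_.count(pageId) != 0; }
    private:
        std::set<int> locked_;
    };

    int main() {
        PageLockTable locks;
        StoredObject a{"eventHeader42", 7};
        StoredObject b{"eventTag42", 7};   // happens to share page 7 with 'a'

        locks.lockForUpdate(a.pageId);     // updating 'a' locks the whole page...
        std::printf("'b' locked as a side effect? %s\n",
                    locks.isLocked(b.pageId) ? "yes" : "no");
        locks.unlock(a.pageId);
        return 0;
    }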

One can envisage keeping the whole of a collection of event objects online, given enough memory. With TeraByte memories, this becomes a possibility for LHC event samples.

BACKGROUND C: THE HIGH PERFORMANCE STORAGE SYSTEM (HPSS)

The role of HPSS (High Performance Storage System) in the project is as a manager for the physical storage used by the ODBMS. In an optimised and correctly sized system, one would hope to avoid access to tertiary storage as much as possible. However, the PetaBytes of event data need to be permanently stored on tertiary devices, and HPSS is the enabling technology for this task. It must sit beneath the ODBMS and be integrated with it seamlessly. The database, which is active in memory and on disk, will nevertheless sometimes require access to a page that has been migrated to tertiary storage. How HPSS interoperates with the ODBMS to retrieve such a page (by recalling the data from offline media) requires investigation.
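
The interaction can be pictured with the toy sketch below: before the database reads a page, the storage manager is asked to stage it back from tape if it has been migrated. The class and method names are invented for this illustration and are not part of the HPSS client API.

    #include <cstdio>
    #include <set>

    // Toy stand-in for the hierarchical storage manager beneath the ODBMS.
    class MassStorageManager {
    public:
        bool isOnDisk(int pageId) const { return onDisk_.count(pageId) != 0; }
        void stageFromTape(int pageId) {
            // In HPSS this would imply a (possibly slow) tape mount and recall;
            // here we simply mark the page as disk-resident again.
            std::printf("staging page %d from tertiary storage...\n", pageId);
            onDisk_.insert(pageId);
        }
        void migrateToTape(int pageId) { onDisk_.erase(pageId); }
    private:
        std::set<int> onDisk_;
    };

    // What the database read path must ensure when its working set no longer
    // fits on disk: a migrated page is recalled before the read proceeds.
    void readPage(MassStorageManager& hsm, int pageId) {
        if (!hsm.isOnDisk(pageId)) {
            hsm.stageFromTape(pageId);   // blocks until the page is back on disk
        }
        std::printf("page %d read by the ODBMS\n", pageId);
    }

    int main() {
        MassStorageManager hsm;
        hsm.stageFromTape(3);    // page 3 written and resident...
        hsm.migrateToTape(3);    // ...then migrated down the hierarchy
        readPage(hsm, 3);        // a query touching page 3 forces a recall
        return 0;
    }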

REFERENCES

  1. CMS Computing Technical Proposal, The CMS Collaboration, CERN/LHCC/96-45, 19 December 1996, ISBN 92-9083-096-4.
  2. ATLAS Computing Technical Proposal, The ATLAS Collaboration, CERN/LHCC/96-43, 15 December 1996, ISBN 92-9083-092-1.