Preliminary results on the scalability of the Caltech HP Exemplar
under a HEP track reconstruction workload
Koen Holtman, 20 Apr 1998
INTRODUCTION
Track reconstruction in high energy physics is highly CPU-intensive with modest I/O
requirements. Track reconstruction needs to be done for large sets of events. For each
event, track reconstruction can proceed independently. This makes the process highly
parallelisable, and allows I/O to be done in a `streamed' way. According to the CMS CTP
the parameters for CMS track reconstruction are
- CPU: 2 * 10^4 mips per event (this is about 5 seconds on a Caltech HP Exemplar CPU)
- I/O: 1 MB reading, 100 KB writing per event.
EXPERIMENT
We put a workload of N (simulated) reconstruction processes on the Exemplar, for N in
the range 15 - 210.
The parameters of the reconstruction processes were as follows.
- All data resides in the Objectivity/DB object database.
- CPU / I/O ratio: about equal to that of CMS track reconstruction, but only reading, no
writing.
- The simulated reconstruction process consists of a loop as follows:
for(;;)
{
read next 10 KB object;
compute for 0.045 CPU seconds;
}
The 10 KB objects are clustered in the order of reading. Each process has its own
database (130 MB) of objects, and reads through them cyclicly.
- The object are read through a simple read-ahead optimisation layer which ensures that
database I/O is done in sequential bursts of 2 MB.
- Note that in the reconstruction code, there is no attempt to do computation while
waiting for I/O.
The processes were reading independent data sets located in database files on two node
/tmp filesystems. The two node filesystems have a maximum combined throughput (for reading
only) of about 44 MB/s. Reconstruction processes were divided evenly over all 15 batch
complex nodes.
RESULTS
We measured the system throughput (expressed in MB/s aggregate data rate) versus the
number of running reconstruction processes N. As N increased, the system went from a
CPU-bound one (not enough CPUs used to saturate available disk bandwidth) to a disk-bound
one (not enough available disk bandwidth to fully load CPUs used).
A plot of the results is in the upper part of the figure below. The dotted line is the
theoretical maximum throughput (which is equal to the I/O demand of all processes if they
would have 0-latency I/O). We have an almost perfect speedup curve until the disk
subsystems become saturated at N=150 processes and 30 MB/s.
The lower part of the figure shows the `CPU efficiency' which is the fraction of time
in which the CPUs allocated to the processes work on executing the (simulated) track
reconstruction code.
EFFECT OF THE READ-AHEAD LAYER
We noted above that the reconstruction code accesses the database through a read-ahead
layer. This layer (which consists of about hundred lines of C++ code) was developed by us
in the past to achieve good I/O performance in some scenarios on smaller systems.
In a test with the layer disabled, performance degraded with a factor 2. Below 60
processes, the throughput was close to the theoretical maximum, above 60 processes it
levelled out at 15 MB/s.
PRELIMINARY CONCLUSIONS
The Exemplar seems to be very well suited for handling a HEP reconstruction workload.
With 2 node filesystems it was possible to use up to 120 processors in parallel with an
extremely efficient utilisation of allocated CPU power and I/O resources.
It is likely that efficient use of more CPUs is possible when more than 2 node
filesystems are used in parallel -- this is a subject for future tests.
It is important to note that this high efficiency was achieved using a standard
commercial object database with a semi-standard read-ahead layer on top as the I/O engine.
We did not have to invest in hardware-dependent or application-dependent I/O optimisations
in the (simulated) track reconstruction code.
The test with disabling the read-ahead layer showed the importance of this
optimisation. A case can be made for incorporating the functionality of the read-ahead
layer into future commercial object database products.
These conclusions are preliminary because no writing of result data was done -- this is
a subject of future tests. It is possible that scalability will be less good if writing is
introduced. Writing will produce a higher load on the transaction and locking mechanisms
of the object database and the scalability of these mechanisms is an area of concern.
|