Large Scale Simulation on NICE PCs

Memorandum to the CERN Divisional Planning Group

Date: 12th December 1994

To: DPG

CC:

From: Julian J. Bunn and David G. Foster

Subject: Large Scale Simulation on NICE PCs

Introduction

At the current time, there are around 1000(?) PCs at CERN that are connected to the central NICE servers. A large fraction of these (around 600) remain powered up and centrally connected overnight, when the owner is away from the office. All of them run the same operating system (Microsoft Windows with the NetWare client), offer identical user environments, and are configured in the same way. Taken together as a computing resource, these machines represent a combined power of several thousand CERN Units.

In this Memorandum, we describe a method by which this computing resource could be tapped for large scale simulation or other low-I/O tasks.

Homogeneity

Other proposals for utilizing spare cycles on unattended desktop workstations (such as Condor [ref]) suffer from several drawbacks that make successful implementation very difficult:

In a heterogeneous workstation environment, it is hard to manage the support matrix of different operating system types and versions for a given application.
Predicting the achievable processing speed on each participating workstation is difficult, both because workstations are multi-tasking (other tasks may be running on them) and because local hardware varies widely (memory size, page/swap file size, local disk speed, etc.).
Ensuring read/write access to central file systems using, for example, NFS raises problems such as central authentication.

In contrast, the NICE environment offers the following features:

A single operating system. An executable built once on one Intel x86 PC will run on all other Intel x86 PCs, since the operating system and environment are maintained on the central servers.
Windows is a single-tasking OS. An application executing on a PC does not have to compete with other applications, so the execution speed of a given application on a given PC is highly predictable.
The central Novell Servers maintain read/write, highly reliable file systems for all the PCs.

Thus, the NICE PCs are a homogeneous ensemble of centrally accessible compute engines, albeit each with relatively modest processor power (we estimate around 4 CERN Units per 486DX 33MHz).

Operation

To make use of the available PC processing power, we suggest targeting an application with the following characteristics:

CPU-intensive
Light I/O (typically reading job parameters at startup and writing a small amount of result data at job end)
Small (<8 Mbytes) image size, in order to minimize local swapping
No I/O (other than swapping) to local devices during job execution (we do not want to create files on the user's disk that subsequently have to be cleaned up)
others ...

Each PC will execute a copy of the application, which is hosted on the Novell Server. The scheduling is achieved by a cron-like job that lies dormant in the PC during normal working hours, then wakes up at a predetermined hour, looks for a suitable task to execute by examining a list of unallocated jobs on the Server, marks the job as in progress, and begins execution. At job end, results are written to a directory on the Server that identifies the task name, and an accounting file is written giving details of the job execution time, elapsed time, host PC configuration, and so on.
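
As an illustration only, the "look for an unallocated job and mark it as in progress" step might be sketched as follows. The layout is an assumption, not part of the proposal: each job is taken to be a directory on the Server, and a job is marked by creating a marker file inside it. The names claim_job and IN_PROGRESS are hypothetical. Exclusive file creation is used so that two PCs waking at the same moment cannot both claim the same job; whether exclusive creation is reliable on a particular network file system is also an assumption.

```python
import os
import tempfile

IN_PROGRESS = "IN_PROGRESS"   # hypothetical marker file name


def claim_job(jobs_root, host):
    """Claim the first unallocated job for `host`; return its name, or None."""
    for name in sorted(os.listdir(jobs_root)):
        job_dir = os.path.join(jobs_root, name)
        if not os.path.isdir(job_dir):
            continue
        marker = os.path.join(job_dir, IN_PROGRESS)
        try:
            # O_EXCL makes creation fail if another PC already holds the job.
            fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as f:
            f.write(host + "\n")   # record which PC took the job
        return name
    return None


def demo():
    """Create two jobs in a temporary directory and let three PCs claim them."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("job001", "job002"):
            os.mkdir(os.path.join(root, name))
        return [claim_job(root, host) for host in ("pc-a", "pc-b", "pc-c")]
```

In this sketch the third PC finds nothing left to claim, so demo() returns ['job001', 'job002', None].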

Once the PC has begun executing the job, a Windows check box appears on the screen of the PC, indicating that the PC is being used for this purpose. The check box will allow the PC owner to abort the job if necessary. Indeed, the PC owner will be able to configure the PC either to allow or to disallow its use for offline processing in this way.

To keep things simple, we propose not to implement any error checking or failover for individual jobs. If for any reason the PC fails to complete its assigned job (the job fails, the network becomes unreachable, the PC owner aborts the job), then the job remains marked as "in progress" on the server. Later, at the end of the night, a master application runs through the directories corresponding to each of the jobs and either collects the results from completed jobs or resets assigned but incomplete jobs. Thus, a job that failed for any reason will be re-offered the following night.
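
The end-of-night pass by the master application could look something like the sketch below. It assumes, purely for illustration, one directory per job, a marker file for "in progress", and a single result file whose presence means the job completed; the names sweep, IN_PROGRESS and result.dat are hypothetical. A real implementation would also need some way to tell a complete result file from a partially written one.

```python
import os

IN_PROGRESS = "IN_PROGRESS"   # hypothetical marker: job has been assigned
RESULT = "result.dat"         # hypothetical name of the result file


def sweep(jobs_root):
    """Return (completed, reset): jobs with results, and jobs re-offered."""
    completed, reset = [], []
    for name in sorted(os.listdir(jobs_root)):
        job_dir = os.path.join(jobs_root, name)
        marker = os.path.join(job_dir, IN_PROGRESS)
        if os.path.exists(os.path.join(job_dir, RESULT)):
            # Job finished: its results can be collected.
            completed.append(name)
        elif os.path.exists(marker):
            # Assigned but incomplete: remove the marker so that the job
            # is offered again the following night.
            os.remove(marker)
            reset.append(name)
        # Jobs with neither file were never assigned; leave them untouched.
    return completed, reset
```

No per-job error handling is needed on the PC side: whatever the cause of a failure, the only trace is a marker without a result, which this pass clears.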

Applications

Monte Carlo simulation is clearly the most suitable candidate for running in the environment described. The minimum amount of information required to start a particular job is a single seed for a random number generator. A typical result might be a set of shower data for a shower library. Since the returns are so great, some time could be devoted to optimizing a particular application for execution on the x86 architecture. The CERN Program Library is already available on MS-DOS, although this is not necessarily required for some applications. We would choose a 32-bit Fortran compiler, such as Microsoft's Fortran PowerStation, to build the application. This compiler can generate executables that run under Windows. Some other compilers, such as Microsoft Fortran 5.1, can also generate Windows executables, but these run in 16-bit precision only.
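
To illustrate why a single seed is enough to define a job, here is a toy Monte Carlo task. It is a stand-in, not a shower simulation, and it is written in Python rather than Fortran only to keep the sketch short; the function name run_job and the sample count are hypothetical. The point is that the result depends on the seed alone, so a job that fails and is re-offered on a later night reproduces exactly what the original would have produced.

```python
import random


def run_job(seed, n_samples=100_000):
    """Toy Monte Carlo job: estimate pi from points thrown in the unit square."""
    rng = random.Random(seed)   # private generator, fully determined by the seed
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:   # point falls inside the quarter circle
            hits += 1
    return 4.0 * hits / n_samples
```

The job's input is one integer and its output is one number, which matches the light-I/O profile described earlier.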

Windows'95

We anticipate that the introduction of Windows'95 on the CERN PC desktops will not pose any problems for the proposed simulation environment. Unlike Windows/NT, which requires substantial system resources (most notably in the form of real memory), Windows'95 is tuned specifically for the Intel architecture and for machines with modest amounts of real memory. We also note that Windows'95 will be a multitasking OS, which may well offer the possibility of scheduling simulation tasks during the daytime as well, when the PC is not being used for its normal tasks. Another advantage of Windows'95 will be full protected-mode addressing, meaning that an application can crash only itself, and not the whole OS.

Job Scheduling

The cron-like task is easily implemented. Currently, all NICE PCs run an application at Windows start-up that simply arranges the icons into predefined positions; the executable that does this sits on the Novell Servers. This application usually does its job and terminates. It could instead lie dormant until late at night, then wake up, check that there is no activity on the PC, look on the Server for the list of unassigned tasks, select one, and begin executing it.
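
The timing logic of such a dormant task is small. The sketch below assumes a wake-up hour of 22:00 and a 30-minute idle threshold; both values, and the names seconds_until_wake and may_start, are illustrative choices rather than part of the proposal. How keyboard and mouse inactivity is actually measured on the PC is left out.

```python
from datetime import datetime, timedelta


def seconds_until_wake(now, wake_hour=22):
    """Seconds from `now` until the next occurrence of wake_hour:00."""
    wake = now.replace(hour=wake_hour, minute=0, second=0, microsecond=0)
    if wake < now:
        # Today's wake-up time has already passed; wait for tomorrow's.
        wake += timedelta(days=1)
    return (wake - now).total_seconds()


def may_start(idle_seconds, owner_allows, min_idle=1800):
    """Start only if the owner has opted in and the PC has been idle long enough."""
    return owner_allows and idle_seconds >= min_idle
```

For example, at 18:00 the task would sleep for four hours, and on waking it would start a job only if may_start(...) holds.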

Benefits

An estimated (600 machines x 4 CERN Units x 8 hours =) 19,200 CERN Unit hours per day are available as unused processing power heating CERN offices at night. Averaged over 24 hours, this is equivalent to 800 CERN Units running continuously, which roughly corresponds to the existing capacity available in CSF and to an investment of around 400,000 USD if equivalent capacity were to be purchased at 500 USD per CERN Unit. The capacity will of course increase as PCs based on more powerful processors such as the Pentium (no FDIV jokes, please) are installed at CERN. We believe that a large fraction of the available power can be lassoed for low-I/O Monte Carlo simulation in the way described above; the physics requirements for such simulation know no bounds. A test confined to those machines attached to the SRV1_HOME Novell Server would demonstrate whether or not this is but a dream.
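
The estimate can be reproduced, and varied, with a few lines of arithmetic. The inputs are the figures quoted in this memorandum; spreading the nightly total over 24 hours to obtain an equivalent dedicated capacity is our reading of how the 400,000 USD figure arises. The function name overnight_capacity is hypothetical.

```python
def overnight_capacity(n_pcs=600, units_per_pc=4, hours_per_night=8,
                       usd_per_unit=500):
    """Return (CERN Unit hours per day, equivalent dedicated units, USD value)."""
    unit_hours = n_pcs * units_per_pc * hours_per_night
    # A dedicated machine runs 24 hours a day, so divide by 24 to compare.
    equivalent_units = unit_hours / 24
    return unit_hours, equivalent_units, equivalent_units * usd_per_unit
```

With the default figures this gives 19,200 CERN Unit hours per day, 800 equivalent CERN Units, and 400,000 USD. Changing units_per_pc shows how the total scales as faster PCs are installed.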

JJB/DGF