Date: 12th December 1994
To: DPG
CC:
From: Julian J. Bunn and David G. Foster
Subject: Large Scale Simulation on NICE PCs
At the current time, there are around 1000(?) PCs at CERN that are connected to the central NICE servers. A large fraction of this number (around 600) remain powered up and centrally connected overnight, when the owner is away from the office. All of them run one and the same operating system (Microsoft Windows with the Netware client), offer identical user environments, and are configured in the same way. When taken as a computing resource, these machines represent a combined power of several thousand CERN Units.
In this Memorandum, we describe a method by which this computing resource could be tapped for large scale simulation or other low-I/O tasks.
Other proposals for utilizing spare cycles on unattended desktop workstations (such as Condor [ref]) suffer from several drawbacks that make successful implementation very difficult:
In contrast, the NICE environment offers the following features:
Thus, the NICE PCs are a homogeneous ensemble of centrally accessible compute engines, albeit each with relatively modest processor power (we estimate around 4 CERN Units per 486DX 33MHz).
To make use of the available PC processing power we suggest targeting an application with the following characteristics:
Each PC will execute a copy of the application, which is hosted on the Novell Server. The scheduling is achieved by a cron-like job that lies dormant in the PC during normal working hours, then wakes up at a predetermined hour, looks for a suitable task to execute by examining a list of unallocated jobs on the Server, marks the job as in progress, and begins execution. At job end, results are written to a directory on the Server that identifies the task name, and an accounting file is written giving details of the job execution time, elapsed time, host PC configuration, and so on.
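The "look for an unallocated job and mark it as in progress" step could be sketched as follows. This is only an illustration under stated assumptions: each job is taken to be a sub-directory on the Server, the marker file name IN_PROGRESS and the function claim_job are hypothetical, and we assume the Server's file system honours exclusive file creation so that two PCs cannot claim the same job.

```python
import os
from pathlib import Path
from typing import Optional


def claim_job(jobs_dir: Path, host: str) -> Optional[str]:
    """Claim the first unallocated job; return its name, or None if none is left."""
    for job in sorted(p for p in jobs_dir.iterdir() if p.is_dir()):
        marker = job / "IN_PROGRESS"  # hypothetical marker file name
        try:
            # O_EXCL makes creation fail if another PC has already claimed this job.
            fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as f:
            f.write(host)  # record which PC took the job
        return job.name
    return None
```

Using the creation of the marker itself as the claim avoids a separate "check, then mark" step, in which two PCs waking at the same hour could both see the same job as free.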
Once the PC has begun executing the job, a Windows check box appears on the screen of the PC, indicating that the PC is being used for this purpose. The check box will allow the PC owner to abort the job if necessary. Indeed, the PC owner will be able to configure his or her PC either to allow its use for offline processing in this way or not, as he or she wishes.
To keep things simple, we propose not to implement any error checking or failover for individual jobs. If for any reason the PC fails to complete its assigned job (the job fails, the network becomes unreachable, the PC owner aborts the job), then the job remains marked as "in progress" on the server. Later, at the end of the night, a master application runs through the directories corresponding to each of the jobs and either collects the results from completed jobs or resets assigned but incomplete jobs. Thus, a job that failed for any reason will be re-offered the following night.
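The end-of-night pass by the master application might look something like the sketch below. The directory layout and the file names RESULT, IN_PROGRESS and COLLECTED are assumptions made for illustration, as is the function name nightly_sweep; the sketch also assumes that it runs only after all PCs have stopped working, so that any job still marked "in progress" without a result can safely be treated as failed.

```python
from pathlib import Path
from typing import List, Tuple


def nightly_sweep(jobs_dir: Path) -> Tuple[List[str], List[str]]:
    """Return (collected, reset) job names after one pass over the job directories."""
    collected: List[str] = []
    reset: List[str] = []
    for job in sorted(p for p in jobs_dir.iterdir() if p.is_dir()):
        marker = job / "IN_PROGRESS"          # hypothetical "in progress" marker
        if (job / "COLLECTED").exists():
            continue                          # already harvested on an earlier night
        if (job / "RESULT").exists():
            (job / "COLLECTED").touch()       # completed: note that it was harvested
            collected.append(job.name)
        elif marker.exists():
            marker.unlink()                   # assigned but incomplete: re-offer it
            reset.append(job.name)
    return collected, reset
```

Because a failed job is simply returned to the pool, no per-job error handling is needed on the PC side, which matches the "keep things simple" intent.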
Monte Carlo simulation is clearly the most suitable candidate for running in the environment described. The minimum amount of information required to start a particular job is a single seed for a random number generator. A typical result might be a set of shower data for a shower library. Since the returns are so great, some time could be devoted to optimising a particular application for execution on the x86 architecture. The CERN Program Library is already available on MS-DOS, although this is not necessarily required for some applications. We would choose a 32-bit Fortran compiler, such as Microsoft's Fortran PowerStation, to build the application. This compiler can generate executables that run under Windows. Some other compilers, such as Microsoft Fortran 5.1, can also generate Windows executables, but these run in 16-bit mode only.
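To illustrate why a single seed is enough to define a job, here is a toy sketch in which the "simulation" is merely a Monte Carlo estimate of pi. It stands in for a real shower simulation and is not the proposed application; the function name run_job and the sample count are hypothetical.

```python
import random


def run_job(seed: int, n_samples: int = 10_000) -> float:
    """Toy Monte Carlo job: estimate pi from points thrown into the unit square."""
    rng = random.Random(seed)  # the job's entire input is this one seed
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    return 4.0 * hits / n_samples
```

Handing a different seed to each PC yields statistically independent samples, while re-running a job with the same seed reproduces its result exactly, which is convenient when a failed job is re-offered.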
We anticipate that the introduction of Windows'95 on the CERN PC desktops will not pose any problems for the simulation environment proposed, since, unlike Windows/NT, which requires substantial system resources (most notably in the form of real memory), Windows'95 is tuned specifically for the Intel architecture and machines having modest amounts of real memory. We also note that Windows'95 will be a multitasking OS, which may well offer the possibility of scheduling simulation tasks during the daytime as well, when the PC is not being used for its normal tasks. Another advantage brought by Windows'95 will be full protected-mode addressing, meaning that the application can only crash itself, and not the whole OS.
The cron-like task is easily implemented. Currently, all NICE PCs run an application at Windows start up that simply arranges the icons into predefined positions; the executable that does this sits on the Novell Servers. This application usually does its job and terminates. It could, of course, lie dormant until late at night, then wake up, check for no activity on the PC, look on the Server for the list of unassigned tasks, select one and begin executing it.
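The "lie dormant, then wake up and check for no activity" logic could be sketched as below. The wake-up hour, the idle threshold and both function names are assumptions for illustration; how the idle time would actually be measured on a Windows PC is deliberately left out.

```python
from datetime import datetime, timedelta


def seconds_until_wake(now: datetime, wake_hour: int) -> float:
    """Seconds to sleep from `now` until the next occurrence of wake_hour:00."""
    wake = now.replace(hour=wake_hour, minute=0, second=0, microsecond=0)
    if wake <= now:
        wake += timedelta(days=1)  # today's wake-up hour has passed; use tomorrow's
    return (wake - now).total_seconds()


def ready_to_start(idle_seconds: float, min_idle_seconds: float = 1800.0) -> bool:
    """Start a job only if the PC has seen no user activity for long enough."""
    return idle_seconds >= min_idle_seconds
```

For example, a PC started at 18:00 with a wake-up hour of 22 would sleep for four hours, and would then take a job only if ready_to_start reports that the owner is not at the keyboard.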
An estimated (600 machines x 4 CERN Units x 8 hours =) 19,200 CERN Unit hours per day are available as unused processing power heating CERN offices at night. Averaged over 24 hours, this is equivalent to 800 CERN Units running continuously. It roughly corresponds to the existing capacity available in CSF, and to an investment of around 400,000 USD if equivalent capacity were to be purchased at 500 USD per CERN Unit. The capacity will of course increase as PCs based on more powerful processors such as the Pentium (no FDIV jokes, please) are installed at CERN. We believe that a large fraction of the available power can be lassoed, in the way described above, for low-I/O Monte Carlo simulation, the physics requirements for which know no bounds. A test confined to those machines attached to the SRV1_HOME Novell Server would demonstrate whether or not this is but a dream.
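The arithmetic behind the estimate can be checked with a few lines. The sketch assumes that the 400,000 USD figure is obtained by averaging the nightly CERN Unit hours over a full 24-hour day before applying the price per CERN Unit; the function name capacity_estimate is hypothetical.

```python
def capacity_estimate(machines=600, units_per_pc=4, hours_per_night=8,
                      usd_per_unit=500):
    """Return (CERN Unit hours per day, equivalent continuous units, cost in USD)."""
    unit_hours = machines * units_per_pc * hours_per_night  # harvested per night
    continuous_units = unit_hours / 24   # same work done by a round-the-clock farm
    cost_usd = continuous_units * usd_per_unit
    return unit_hours, continuous_units, cost_usd
```

With the default values this gives 19,200 CERN Unit hours per day, 800 continuously running CERN Units, and 400,000 USD.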
JJB/DGF