On each node, users have access to a large /scratch
partition on one of the local disks.
Communication between the nodes is over switched gigabit Ethernet. Channel
bonding is used so that both ports on each NIC are used for increased bandwidth.
The operating system is Red Hat
Enterprise Linux 6 (kernel 2.6.32-504.12.2.el6.centos.plus.x86_64).
The Torque/PBS batch system is used
with the Maui scheduler.
The latest cluster setup was done in September 2014 using
ROCKS.
Several RAID array partitions are available:
experimental group: /raid1 to /raid11
lattice theory group: /latticeQCD/raid1 to /latticeQCD/raid8
Log onto the server emilio.phys.cmu.edu using secure shell (ssh).
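For example, if your account name were jsmith (a placeholder; use your own
username), you would connect with
        ssh jsmith@emilio.phys.cmu.edu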
To change your password, log onto albert.phys.cmu.edu and use
the passwd command. The change may take
a few hours to propagate throughout the system.
Compile and test programs on emilio. Short jobs (under 15 minutes) may
be run on emilio, ernest, max, or albert as long as the nice
command is used. Remember to cross-compile,
since the compute nodes have different architectures than emilio.
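For example, a short test might be run at reduced priority with the following
(my_test and its input file are placeholders for your own program and data):
        nice ./my_test < test_input.dat > test_output.dat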
The compute nodes should only
be used for final production runs and are not for interactive use.
Please do not log onto the compute nodes (unless file clean up on local disks
is needed due to an error return).
If your executable needs non-standard libraries, you may need to statically
link these into your binaries (the -static
compiler option). Commonly-used shared library files
are available on the compute nodes.
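For example, with the GNU C compiler a statically linked build might look like
this (my_program and the libraries are placeholders; link whatever your code
actually needs):
        gcc -static -o my_program my_program.c -lm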
Submit jobs to the compute nodes using the qsub
command on emilio. The procedure to do this is described in detail below,
but it involves writing a shell script, say run_job, then issuing the command
        qsub run_job
One can also use
xpbs to submit jobs.
The CMU cluster currently has five queues:
magenta: jobs run in cpu-dedicated mode (8 nodes of 8 cores each)
cyan: jobs run in cpu-dedicated mode (8 nodes of 8 cores each)
blue: jobs run in cpu-dedicated mode (8 nodes of 8 cores each)
red: jobs run in cpu-dedicated mode (12 nodes of 32 cores each)
green: jobs run in cpu-dedicated mode (8 nodes of 24 cores each)
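To direct a job to a particular queue, give the queue name to qsub, e.g.
        qsub -q green run_job
or place the equivalent directive near the top of the job script:
        #PBS -q green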
Jobs can be monitored using the text-based qstat
command, the GUI xpbs and
xpbsmon, or the web-based portal above.
To kill a job, use the qdel command.
When writing your job scripts, ensure that termination by
qdel is handled gracefully.
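For example (with 12345 standing in for an actual job id reported by qsub or qstat):
        qstat -a          # list all jobs and their states
        qdel 12345        # kill job 12345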
IMPORTANT:
NFS bottlenecks can cause serious problems with the PBS queue.
Please do NOT use cp to transfer files between the server and the
compute nodes. Instead, use scp and prefix the server file name
with qcderw: for emilio or with qcdsvr: for ernest.
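For example, to pull an input file from a home directory on emilio into the
job's scratch directory, and to push a result back (the username jsmith and
the file names are placeholders):
        scp qcderw:/home/jsmith/project/input.dat .
        scp output.dat qcderw:/home/jsmith/project/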
See the qsub man page for details on
composing the job script.
For serial jobs, the script file should perform the following sequence of
tasks (an example script is sketched after this list):
Issue the necessary PBS directives. These must come first in the
script file. Any PBS directives which come after an executable
script statement are ignored.
Stage in: copy all needed files, including the executable, into the directory
/scratch/PBS_<job_id>
which PBS automatically creates on
the local disk on the compute node.
Run the program executable (in the foreground, not in the background).
Stage out: after program execution is done, copy appropriate files from the local disk
back into the user's permanent directory; the /scratch/PBS_<job_id>
directory on the local disk is automatically deleted by PBS upon
job completion.
Provide some clean up instructions in case the job must be killed
using qdel.
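Putting these steps together, a serial job script might look like the sketch
below. The program name, file names, queue, and walltime are placeholders,
and it is assumed that the scratch directory created by PBS matches the
PBS_JOBID environment variable; adapt everything to your own job.
        #!/bin/sh
        #PBS -N my_run
        #PBS -q green
        #PBS -l nodes=1,walltime=24:00:00
        #PBS -j oe

        # Work in the directory PBS created on the compute node's local disk
        # (assumed here to match the PBS_JOBID environment variable).
        WORKDIR=/scratch/PBS_${PBS_JOBID}
        cd $WORKDIR

        # Clean up gracefully if the job is killed with qdel: try to save
        # whatever partial output exists, then exit.
        trap 'scp output.dat qcderw:/home/jsmith/project/ ; exit 1' TERM

        # Stage in: copy the executable and input from the server using scp (not cp).
        scp qcderw:/home/jsmith/project/my_program .
        scp qcderw:/home/jsmith/project/input.dat .

        # Run the executable in the foreground.
        ./my_program < input.dat > output.dat

        # Stage out: copy results back to the permanent directory on the server.
        scp output.dat qcderw:/home/jsmith/project/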
MPI programs which read data from standard input are problematic.
Usually, only the boss process can easily read from standard input
and the remote processes have their standard inputs mapped into the
null device. If all processes need to read information at run time,
use file I/O from a specified input file (which must be
copied onto all of the local disks).
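For example, assuming the cluster's MPI is launched with mpirun and that
my_mpi_program and input.dat are placeholders, pass the input file name as a
command-line argument and have every process open it from its local disk,
rather than redirecting standard input:
        mpirun -np 8 ./my_mpi_program < input.dat   # only the boss process sees the input
        mpirun -np 8 ./my_mpi_program input.dat     # every process can open the file itself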