Modified on 18 June 2015 at 09:20

Best Practices

From Predictive Chemistry


This page collects best-use practices and expected timings for codes we use regularly. If you have additional timing information, please share the wealth!


On CIRCE, USF's Linux x86_64 SLURM cluster, we have several GPU systems available. Using them requires passing special request flags to SLURM.

Here's a submit script:

#SBATCH -J run
#SBATCH -o run_job.log
#SBATCH --nodes=1 -p cuda --exclusive --cpus-per-task=16 -t 24:00:00 --gres=gpu:2 --constraint gpu_K20 --constraint avx

The -p option is not required, but selects a special partition that gives higher priority to GPU-using jobs. The --gres option requests nodes with 2 GPUs, and the gpu_K20 constraint requests Kepler K20 GPUs (based on the GK110 chipset) with CUDA compute capability 3.5. Note that although this card has many double-precision floating-point units, Gromacs won't use them, since it prefers single precision for memory throughput anyway.

Also, note that this command is specialized to our dual-GPU nodes, which have exactly 16 cores. --exclusive requests the whole machine; it should be the default, but for some reason it is not.

The correct launch command is almost as complicated.

mpirun -bysocket -bind-to-socket -report-bindings --npernode 2 mdrun_mpi -ntomp 8 -deffnm run

The first four options are all passed to mpirun, asking for two processes to be started per node, with each process bound to a single socket. Each socket is a physical processor containing 8 cores. Without the binding options, mpirun sets up Gromacs (mdrun_mpi) to run on a single core per rank, and will only use 2 cores out of the total 16!
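As a sanity check, the launch geometry should multiply out to the full node: ranks per node times OpenMP threads per rank should equal the core count requested with --cpus-per-task. A trivial shell check with the numbers from the command above:

```shell
ranks_per_node=2     # mpirun --npernode 2
threads_per_rank=8   # mdrun_mpi -ntomp 8
cores_requested=16   # SBATCH --cpus-per-task=16

# All cores are used only when ranks x threads covers the request.
if [ $((ranks_per_node * threads_per_rank)) -eq "$cores_requested" ]; then
    echo "geometry OK: using all $cores_requested cores"
else
    echo "geometry mismatch: only $((ranks_per_node * threads_per_rank)) of $cores_requested cores used"
fi
```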

To continue a run that was terminated before finishing, use

mpirun -bysocket -bind-to-socket --npernode 2 mdrun_mpi -ntomp 8 -deffnm run -cpi run -append

-append is required because our NFS filesystem doesn't support locking, so you have to override the default.
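Putting the pieces together, a complete submit script might look like the sketch below (the run file prefix and the module versions are taken from this page; adjust paths and versions to your site):

```shell
#!/bin/bash
#SBATCH -J run
#SBATCH -o run_job.log
#SBATCH --nodes=1 -p cuda --exclusive --cpus-per-task=16 -t 24:00:00 --gres=gpu:2 --constraint gpu_K20 --constraint avx

# Load the same toolchain used to build Gromacs (versions are site-specific).
module load compilers/intel/14.0.1 mpi/openmpi/1.6.1 apps/cuda/6.5.14

# 2 MPI ranks per node, one bound to each socket, 8 OpenMP threads each.
mpirun -bysocket -bind-to-socket -report-bindings --npernode 2 \
    mdrun_mpi -ntomp 8 -deffnm run
```

Submit with sbatch; for a continuation run, add -cpi run -append to the mdrun_mpi line as shown above.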


Using this setup, with 1 node I get 2.1 hours per nanosecond on a system with 72,270 atoms (using the TIP4P rigid water model and 1.2 nm cutoffs). This is the best timing I've seen on CIRCE so far, and translates to 2.9 hours per nanosecond per 100,000 atoms.
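The per-100,000-atom figure is just a linear rescale of the measured rate; the same arithmetic in shell (using awk for the floating-point math):

```shell
hours_per_ns=2.1   # measured on 1 CIRCE node
atoms=72270        # system size

# Rescale the rate linearly to a nominal 100,000-atom system.
awk -v h="$hours_per_ns" -v n="$atoms" \
    'BEGIN { printf "%.1f hr/ns per 100,000 atoms\n", h * 100000 / n }'
# prints "2.9 hr/ns per 100,000 atoms"
```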

Using a similar setup on a 16-core Intel E5-2640 system clocked at 2.6 GHz (versus the 2.0 GHz of the hardware above), with a GeForce GTX 980 at 1380.0 MHz (many single-precision floating-point units, fewer double-precision ones), gives 1.413 hr/ns.

Compilation Specifics

Gromacs 5.0.5 was compiled for the Intel(R) Xeon(R) E5-2650 CPU with GPU acceleration using the following script:

module load compilers/intel/14.0.1 mpi/openmpi/1.6.1 apps/cmake/ apps/cuda/6.5.14
cmake   -DCMAKE_INSTALL_PREFIX=/shares/rogers \
        -DFFTW_INCLUDE_DIR=/shares/rogers/include \
        -DFFTW_LIBRARY=/shares/rogers/lib \
        -DGMX_SIMD=AVX_256 \
        -DGMX_FFT_LIBRARY=fftw3 \
        -DGMX_GPU=on \
        -DGMX_MPI=ON \
        <path-to-gromacs-source>
make -j8
make install

AVX2_256 fails to run on this machine, terminating with the error "Program received signal 4, Illegal instruction". This generally happens when a program tries to execute an instruction the CPU running it doesn't understand. Here, the binary compiled for AVX2_256 contained an AVX2 instruction, which the older processor choked on. You'll have to watch out for issues like this on CIRCE, which contains a mix of old and new Intel and AMD processors.
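One way to avoid the mismatch is to check the CPU's feature flags before choosing GMX_SIMD. The helper below is a hypothetical sketch (the function name pick_simd and the flag-to-setting mapping are ours, and it only distinguishes the AVX levels discussed here):

```shell
# Map a space-separated CPU-flags string to a conservative GMX_SIMD value.
# Covers only the AVX levels discussed above; extend as needed.
pick_simd() {
    case " $1 " in
        *" avx2 "*) echo "AVX2_256" ;;
        *" avx "*)  echo "AVX_256"  ;;
        *)          echo "SSE4.1"   ;;
    esac
}

# On the build machine you might feed it the real flags, e.g.:
#   pick_simd "$(grep -m1 '^flags' /proc/cpuinfo)"
pick_simd "fpu sse sse2 avx"        # prints "AVX_256"
pick_simd "fpu sse sse2 avx avx2"   # prints "AVX2_256"
```

Run it on the oldest node you expect the binary to land on, not just the build host.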

Also, note that this version of Gromacs, compiled with CUDA support, is flexible. It can run efficiently on machines with or without GPU accelerators.