Code:SLURM
User environment
Zeroth, set up your own (Linux) system by adding a host definition for circe to your $HOME/.ssh/config file. Then you don't have to keep typing circe's full hostname and your username every time you run ssh.
Host rc
    User <username on circe>
    Hostname rcslurm.rc.usf.edu
    ServerAliveInterval 30
    ServerAliveCountMax 120
    ForwardX11 yes
You may need:
mkdir -p $HOME/.ssh && chmod 700 $HOME/.ssh
vi $HOME/.ssh/config    # ':q' to exit
chmod 600 $HOME/.ssh/config
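Once that entry is in place, the short host alias works anywhere ssh-style addressing is accepted; for example (results.tar.gz is just a placeholder file name):
ssh rc                    # log in to circe via the "rc" alias
scp results.tar.gz rc:~/  # copy a file (placeholder name) to your circe home directory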
The Slurm job status command is squeue. A helpful alias for monitoring your own jobs is:
alias myq="squeue -u $USER"
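For a continuously refreshing view, one option is to wrap the underlying squeue call in watch; shell aliases such as myq aren't visible inside watch, so spell the command out:
watch -n 60 "squeue -u $USER"   # refresh every 60 seconds; Ctrl-C to quit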
Queue
Here's a basic template for queuing an MPI job (in this case, 2 whole nodes and a 6-hour maximum run-time).
<source lang="bash">
#!/bin/bash
#SBATCH -J test
#SBATCH -N 2 -t 6:00:00

module load mpi/openmpi/1.4.5 compilers/intel/11.1.064

start=`date +%s`
mpirun parallel-executable
end=`date +%s`
echo "Job completed in $((end-start)) seconds."
</source>

By default, Slurm jobs start in the same directory from which sbatch was invoked.
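The parallel-executable above is whatever MPI program you want to run; it should be built with the same modules the job script loads. A minimal sketch, assuming a hypothetical source file hello.c and the module versions shown above:
<source lang="bash">
module load mpi/openmpi/1.4.5 compilers/intel/11.1.064   # match the modules in the job script
mpicc -o parallel-executable hello.c                     # hello.c is a placeholder MPI source file
sbatch job.sh                                            # submit the job script shown above
</source>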
A few more useful options are:
#SBATCH -o output_log_name.log
#SBATCH --mem=2000
These specify the name of the output log file instead of the default (slurm-<jobid>.out) and request a per-node memory limit in MB (here, 2000 MB).
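Putting those options together with the earlier directives, a job header might look like this (the log name and memory value are illustrative):
<source lang="bash">
#!/bin/bash
#SBATCH -J test
#SBATCH -N 2 -t 6:00:00
#SBATCH -o test_%j.log   # %j expands to the job ID
#SBATCH --mem=2000       # per-node memory limit in MB
</source>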
Submit with:
<source lang="bash">
sbatch -p saturn job.sh
</source>
The -p saturn option is the default and can be left out. Other values select different node partitions (see the sinfo sketch after this list):
- "jupiter": 444 cores; preemptable by "deadline" QOS (currently inactive).
- "saturn": 280 cores; default; preemptable by "deadline" QOS
(currently inactive).
- "neptune": 168 cores; preemptable by "deadline" QOS (currently inactive).
- "hii_broad": 80 cores; testing "contributor" hardware pool;
preemptable by "hii_broad" QOS (active).
- "titan": 16 cores; no preemption; 128 GB RAM for large memory jobs.
- "pluto": 8 cores; no preemption.
To check execution and status, use squeue, or the myq alias defined above (put it in your .bashrc so it's always available).
For more info, see LLNL's Slurm Quickstart Guide.
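For details on, or cancellation of, a specific job, the standard Slurm tools apply (replace <jobid> with an ID from the squeue output):
<source lang="bash">
scontrol show job <jobid>   # full details of a pending or running job
scancel <jobid>             # cancel the job
</source>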
Here's a run-down of some of the environment variables available while the job runs, useful for scripting; see sbatch's manpage for the full list. A short example script using a few of them follows this list:
- SLURM_JOB_NAME - Name of the job.
- SLURM_JOB_ID - The ID of the job allocation.
- SLURM_CPUS_ON_NODE - Number of CPUS on the allocated node.
- SLURM_JOB_NODELIST - List of nodes allocated to the job in a compressed format.
- SLURM_JOB_NUM_NODES - Total number of nodes in the job’s resource allocation.
- SLURM_JOB_CPUS_PER_NODE - Count of processors available to the job on this node.
- SLURM_SUBMIT_DIR - The directory from which sbatch was invoked.
- SLURM_JOB_PARTITION - Name of the partition in which the job is running.
- SLURM_LOCALID - Node local task ID for the process within a job.
- SLURM_GTIDS - Global task IDs running on this node. Zero origin and comma separated.
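As a sketch of how these can be used, the script below (job name and limits are arbitrary) logs a few of them and changes to the submission directory, which is redundant given the default behavior noted above but makes the intent explicit:
<source lang="bash">
#!/bin/bash
#SBATCH -J envdemo
#SBATCH -N 1 -t 0:05:00

cd "$SLURM_SUBMIT_DIR"   # already the default working directory; shown for clarity
echo "Job $SLURM_JOB_NAME ($SLURM_JOB_ID) in partition $SLURM_JOB_PARTITION"
echo "Nodes ($SLURM_JOB_NUM_NODES): $SLURM_JOB_NODELIST"
echo "CPUs on this node: $SLURM_CPUS_ON_NODE"
</source>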