Grid Engine
To guarantee good utilisation and fair sharing of the computing resources, all long-running or large-memory jobs have to be run through management software. This software, called "Son of GridEngine", distributes the submitted jobs to the best-suited computer in the cluster, where they will run.
For GridEngine to do its task, you have to tell it how long you think your program will run and how much RAM it will need. You should ask for enough resources to finish the job, but try to avoid asking for too much, as this might result in longer waiting times for suitable resources to become available. If the program exceeds either of these limits, it will be terminated by the system.
Grid Engine commands
These are the most common tasks when interacting with Grid Engine.
Job submission
Jobs are submitted with the command `qsub`. The basic resources that have to be requested are the amount of memory required and the runtime to allocate to the job. The easiest way to submit jobs is to create a batch file, which contains all commands that should be executed. This file also contains information about job resources and other aspects of how the job should behave. It is good practice to save these job scripts for future reference and to keep track of how you ran your commands.
A batch file is basically a script run by the normal command shell. All lines will be executed as if they were typed at the prompt. However, lines beginning with a # will be ignored by the shell and are used as comments. Furthermore, lines beginning with #$ will be read by qsub to set options for the job, e.g. resource requests. An example could be:
# This is a job script to do ... [insert a short description here...]
#
# Request 2 hours, 0 minutes, 0 seconds run time
#$ -l h_rt=2:0:0
#
# Request 4GB memory (check the need for your specific program!)
#$ -l h_vmem=4G
#
# Request mails to be sent when the job finishes (e) or if it fails (a)
#$ -m ae
#
# Run in the current directory (otherwise it will run from your home directory)
#$ -cwd
#
# Keep all output in a single file
#$ -j y
#
# Temporary files created during the job should be placed in $TMPDIR
# This directory is local to the server where the job is executed and
# file access here is MUCH faster than the project folder.
#
# Place your commands below this line
my-command -a my_file
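As a rough illustration of the $TMPDIR advice in the template, the following sketch copies an input file to the node-local scratch directory, works on it there, and copies the result back. The function name run_staged, the file names, and the use of sort as a stand-in for a real program are assumptions made for this example; only $TMPDIR itself comes from the template.

```shell
#!/bin/bash
# Sketch only: run_staged and the file names are hypothetical,
# and "sort" stands in for whatever program you really want to run.

# run_staged INPUT OUTPUT
# Copy INPUT to fast local scratch, process it there, copy the result to OUTPUT.
run_staged() {
    local input="$1" output="$2"
    # Inside a job, Grid Engine sets TMPDIR; fall back to /tmp elsewhere.
    local scratch="${TMPDIR:-/tmp}"
    local work
    work="$(mktemp -d "$scratch/job.XXXXXX")" || return 1

    cp "$input" "$work/input" || return 1
    # Placeholder for the real work: all reads and writes happen in scratch.
    sort "$work/input" > "$work/output" || return 1
    cp "$work/output" "$output" || return 1

    # Clean up our own subdirectory once the result is safely copied back.
    rm -rf "$work"
}

# Example call, to be placed below the "#$" lines of a job file:
#   run_staged my_file my_file.sorted
```

The function definition and the call would go into the same job file as the template, below the line "Place your commands below this line".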
This file should be created with a text editor, e.g. nano, when you are logged in to the grid. You can cut and paste the template into a file opened with nano (nano my_job_file.job). Once you have created the job file, submit it with qsub:
$ qsub my_job_file.job
Then record the job ID as reported. Any output from the script will be saved in a file named after your job file, with the extension .oNNNN, where NNNN is your job ID. Once your job has run, check the file to ensure that the job succeeded. This is best done with the command less (e.g. $ less my_job_file.job.oNNNN). See below for how to find out whether your job is running.
You should check the mail you got from GridEngine regarding how the job was run (or use qacct -j JOB_ID to get the same information). First check that the exit_status is zero. If not, your job failed. Check how long the job ran ("wallclock"). If it is equal to the requested time, the job most likely failed due to insufficient run time (increase the runtime requested and try to rerun your job). Then check how much memory was used (max_vmem). If it is equal to the requested amount, your job most likely failed due to lack of memory (update your job file and request a few more GB). If neither of these two factors seems to be the culprit, check the output of your job for error messages.
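The checks described above can be partly automated. The sketch below reads the text printed by qacct -j JOB_ID and pulls out the three values of interest. The function name summarise_qacct is made up for this example, and the field names it looks for (exit_status, ru_wallclock or wallclock, maxvmem) are assumptions about the qacct output that may differ between Grid Engine versions, so compare them with what your installation prints.

```shell
#!/bin/bash
# Sketch only: summarise_qacct is a hypothetical helper, and the field
# names matched below are assumptions about the qacct output format.

# summarise_qacct
# Read "qacct -j JOB_ID" output on stdin, print one summary line, and
# return 0 only if an exit_status of zero was found.
summarise_qacct() {
    awk '
        $1 == "exit_status"                        { status = $2; found = 1 }
        $1 == "ru_wallclock" || $1 == "wallclock"  { wall = $2 }
        $1 == "maxvmem"                            { mem = $2 }
        END {
            printf "exit_status=%s wallclock=%s maxvmem=%s\n", status, wall, mem
            exit ((found && status + 0 == 0) ? 0 : 1)
        }'
}

# Example call on the cluster (replace JOB_ID with the number from qsub):
#   qacct -j JOB_ID | summarise_qacct || echo "job failed, check the output file"
```

The summary still has to be compared by hand with the run time and memory you requested in the job file.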
Multithreading option
If you need to run your job with multiple threads, you can do so by requesting a parallel environment with one of the following qsub options:
SMP (symmetric multiprocessing) is the processing of programs by multiple processors that share a common operating system and memory.
- -pe smp N (where N is the number of cores you want)
The Message Passing Interface (MPI) is a library specification for message-passing. It is a standard API (Application Programming Interface) that can be used to create parallel applications.
- -pe openmpi N (where N is the number of cores you want)
For more information on what you can do with qsub, check the man page ($ man qsub).
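A job file requesting several cores might look like the sketch below. It relies on the NSLOTS variable, which Grid Engine sets inside a job to the number of slots granted, so the core count only has to be written once. The value 4 is an arbitrary example, and the --threads option shown for my-command is hypothetical; use whatever option your own program provides.

```shell
#!/bin/bash
# Sketch only: the core count of 4 is an example value.
#
# Request 4 cores on a single machine
#$ -pe smp 4
#
# Run in the current directory
#$ -cwd

# NSLOTS is set by Grid Engine inside a job. Fall back to 1 so the
# script can also be tried outside the grid.
threads="${NSLOTS:-1}"

# Many OpenMP-based programs read their thread count from this variable.
export OMP_NUM_THREADS="$threads"

echo "Running with $threads thread(s)"

# Hypothetical call: pass the thread count to your program explicitly.
#   my-command --threads "$threads" -a my_file
```

Keeping the program's thread count tied to NSLOTS avoids using more cores than were requested when the -pe line is edited later.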
Job status
Once you have submitted your job, you can use the command qstat to get a list of your jobs. If you don't find your job in the list, it has most likely finished (or failed). The status column indicates what is happening:
- qw: The job is waiting in the queue.
- r: The job is running.
- E or e: There was some system error while trying to run the job. Please contact an administrator to check what happened.
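The status codes above can also be looked up from a script. The following sketch extracts the state of one job from qstat output and translates it into words. Both function names are invented for this example, and the code assumes the default qstat layout in which the job ID is the first column and the state is the fifth; check this against the output on your system.

```shell
#!/bin/bash
# Sketch only: job_state and describe_state are hypothetical helpers, and
# the column positions are an assumption about the default qstat layout.

# job_state JOB_ID
# Read qstat output on stdin and print the state of JOB_ID.
# Returns non-zero if the job is not listed (finished or failed).
job_state() {
    awk -v id="$1" '
        $1 == id { print $5; found = 1; exit }
        END      { exit !found }'
}

# describe_state STATE
# Translate a state code into a short description.
describe_state() {
    case "$1" in
        qw)      echo "waiting in the queue" ;;
        r)       echo "running" ;;
        *E*|*e*) echo "error, contact an administrator" ;;
        *)       echo "other state: $1" ;;
    esac
}

# Example call on the cluster (replace JOB_ID with your job number):
#   state="$(qstat | job_state JOB_ID)" && describe_state "$state"
```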
Deleting running or queuing jobs
Running or queued jobs can be cancelled with the command qdel JOB_ID.
Job history
You can get information about resource usage and other metadata regarding finished jobs with the qacct command, more specifically $ qacct -j JOB_ID.
Useful Grid Engine Commands
qhost -q
- This will show you the details of each grid node (number of threads available, RAM available, etc.)
qstat -u username
- This will show you the job status for a particular user
qstat -j JOB_ID
- This will show the status of a particular job
qstat -u \*
- This will show you the status of the jobs of all users on the grid
qsub -pe smp N
- This will submit a job and ask for multiple threads (N is the number of threads wanted)
qstat -f
- This will give you the status of all the queue types per server (number of threads used, load avg)