Wiki: SGE

Sun Grid Engine (SGE) Reference
by Oliver; 2014


The Sun (or Oracle) Grid Engine, abbreviated as SGE, is a suite of commands for scheduling programs on a computer cluster. Picture the following the scenario: You have 100 people sharing a computing cluster. Each wants to run programs on this resource. Sometimes user Alice wants to run a batch of very time- and memory-intensive programs; sometimes user Bob does; and sometimes they both do at the same time. The SGE framework manages all of this and tries to fairly allocate computing resources. The programs that Alice and Bob want to run go into a queue and the scheduler decides when they actually do.

Simple-mindedly, we can think of a computer cluster as a bunch of computers chained together and can call each such computer a node. Instead of running a program—or a thousand programs—on your local computer, you can request to run these programs on nodes of the cluster. In SGE parlance, requesting a node on which to run your program is called "submitting a job" and the fundamental command to do this is called qsub. Running a program is called "running a job".

What's the advantage of qsub-ing vis-à-vis just running a program on your local computer? Taking advantage of a computing cluster is useful when you want to parallelize heavily or perhaps run a memory intensive job that is not suitable for your local machine. A good example from bioinformatics is parallelizing a blast job.

Sun Grid Engine for Dummies gives an eloquent introduction to this subject:
Servers tend to be used for one of two purposes: running services or processing workloads. Services tend to be long-running and don't tend to move around much. Workloads, however, such as running calculations, are usually done in a more "on demand" fashion. When a user needs something, he tells the server, and the server does it. When it's done, it's done. For the most part it doesn't matter on which particular machine the calculations are run. All that matters is that the user can get the results. This kind of work is often called batch, offline, or interactive work. Sometimes batch work is called a job. Typical jobs include processing of accounting files, rendering images or movies, running simulations, processing input data, modeling chemical or mechanical interactions, and data mining. Many organizations have hundreds, thousands, or even tens of thousands of machines devoted to running jobs.

Now, the interesting thing about jobs is that (for the most part) if you can run one job on one machine, you can run 10 jobs on 10 machines or 100 jobs on 100 machines. In fact, with today's multi-core chips, it's often the case that you can run 4, 8, or even 16 jobs on a single machine. Obviously, the more jobs you can run in parallel, the faster you can get your work done. If one job takes 10 minutes on one machine, 100 jobs still only take ten minutes when run on 100 machines. That's much better than 1000 minutes to run those 100 jobs on a single machine. But there's a problem. It's easy for one person to run one job on one machine. It's still pretty easy to run 10 jobs on 10 machines. Running 1600 jobs on 100 machines is a tremendous amount of work. Now imagine that you have 1000 machines and 100 users all trying to running 1600 jobs each. Chaos and unhappiness would ensue.

To solve the problem of organizing a large number of jobs on a set of machines, distributed resource managers (DRMs) were created. (A DRM is also sometimes called a workload manager. I will stick with the term, DRM.) The role of a DRM is to take a list of jobs to be executed and distributed them across the available machines. The DRM makes life easier for the users because they don't have to track all their jobs themselves, and it makes life easier for the administrators because they don't have to manage users' use of the machines directly. It's also better for the organization in general because a DRM will usually do a much better job of keeping the machines busy than users would on their own, resulting in much higher utilization of the machines. Higher utilization effectively means more compute power from the same set of machines, which makes everyone happy.

Here's a bit more terminology, just to make sure we're all on the same page. A cluster is a group of machines cooperating to do some work. A DRM and the machines it manages compose a cluster. A cluster is also often called a grid. There has historically been some debate about what exactly a grid is, but for most purposes grid can be used interchangeably with cluster. Cloud computing is a hot topic that builds on concepts from grid/cluster computing. One of the defining characteristics of a cloud is the ability to "pay as you go." Sun Grid Engine offers an accounting module that can track and report on fine grained usage of the system. Beyond that, Sun Grid Engine now offers deep integration to other technologies commonly being used in the cloud, such as Apache Hadoop.

The Basics

SGE commands usually begin with the letter q for queue.

Get an interactive node with 4 gigabytes for 8 hours:
$ qrsh -l mem=4G,time=8::
Submit a job to the cluster:
$ qsub
Monitor the status of your jobs:
$ qstat
It looks like this:


Delete a particular job:
$ qdel [jobid]

Some Useful SGE Commands and One-Liners

See the SGE commands at your disposal:
$ ls -hl $( dirname $( which qstat ) )
See all the nodes:
$ qhost
A picture:


Count and tabulate all jobs on the cluster:
$ qsum
On our system, this command is only available on the head (or login) nodes. A picture:


See every job on the cluster submitted by any user:
$ qstat -u "*" | less
Delete any job whose job name begins with prefix:
$ qdel prefix* 
Delete all your jobs (except interactive nodes):
$ qstat | sed '1,2d' | grep -v LOGIN | cut -f1 -d" " | xargs -i qdel {}
Monitor your jobs every 30 seconds:
$ while true; do qstat; sleep 30; done
(stop with Cntrl-C)
A more elegant way to do this is with the watch command, as we'll see in a second.

See extended options (which will show how much memory your job's consuming, etc.)
$ qstat -ext
The same, but update every 2 seconds:
$ watch qstat -ext
Tips via my co-worker, Albert—to add arguments to jobs that are still in the queue:
$ qalter [additional flag] [jobid]
For example, to set the email notification:
$ qalter -m be -M [jobid]
To suspend a running job:
$ qmod -sj [jobid]
This is useful, for example, to freeze a job when space is almost full.

qsub-ing Flag Style

Let's take the following script,, as an example:

echo "[start]"
echo "[date] "`date`
echo "[cwd] "`pwd`

# my script
sleep 5

echo "[end]"
Let's make a logs directory to store our job's output and error information:
$ mkdir logs # make a dir to store log files
Whatever our script would normally echo to std:out and std:error will go here instead.

We can qsub this as follows:
$ qsub -N myscript -e logs -o logs -l mem=1G,time=1:: -S /bin/sh -cwd ./ 
Your job 3212900 ("myscript") has been submitted
where we used the flags:
  • -N myscript Name is “myscript”
  • -e logs -o logs Error and Output go into logs/ directory
  • -l mem=1G,time=1:: Time and Mem request: 1 Gigabyte for 1 hr
  • -S /bin/sh Interpret script with sh (this could just as well be the path to perl, python, etc)
  • -cwd Run from current working directory
We can examine our log files after our job is finished. First let's check the error:
$ cat logs/myscript.e3212900 
Whew! It's empty—that's good. Now the output:
$ cat logs/myscript.o3212900 
[date] Tue May 6 13:23:22 EDT 2014
[cwd] /path/sge_test

qsub-ing Header Style

If you like, you can also put the flag information directly into the script's header as follows:
#$ -N myscript 
#$ -e logs 
#$ -o logs 
#$ -l mem=1G,time=1:: 
#$ -S /bin/sh 
#$ -cwd 

echo "[start]"
echo "[date] "`date`
echo "[cwd] "`pwd`

# my script
sleep 5

echo "[end]"
Then to qsub, it's simply:
$ qsub ./ 
However, I would stick to the flag style instead of hard-coding lines into your header. Using flags is a lot more flexible and keeps clutter out of your scripts.

Job States

When we qsub this script and query its status with qstat, we might see the following common job states:
  • qw - queued
  • Eqw - error
  • hqw - holding (waiting on another job)
  • r - running

Useful qsub Syntax

Example Syntax

Example job submission syntax:
$ qsub -N myjob -e ./logs -o ./logs -l mem=8G,time=2:: -S /bin/sh -cwd ./

Getting the Job ID

Grab the job id in bash:
$ jobid=$( qsub -S /bin/sh -cwd ./ | cut -f3 -d" " )
or save it in a file:
$ qsub -N myjob -l mem=1G,time=1:: -S /bin/sh -cwd ./ | cut -f3 -d" " > job_id.txt
This depends on the fact that qsub returns a sentence like: "Your job 3212900 ("myscript") has been submitted". We expect the third word to be the job id.

Submitting Binary Commands Rather Than Scripts

Submit a binary command, such as sleep, echo, or cat (rather than an uncompiled script):
$ qsub -b y -cwd echo joe
$ qsub -b y -N myjob -l mem=1G,time=1:: -S /bin/sh -cwd sleep 60
A real example - find the size of every directory in /my/dir:
$ for i in /my/dir/*; do 
  	name=$( basename $i ); echo $name; qsub -N space_${name} -e logs -o logs -V -cwd -b y du -sh $i; 

Importing Shell Variables into Your Jobs

To import shell environmental variables into your job use the flag -V:
$ qsub -V …
$ qsub -V -N myjob -l mem=1G,time=1:: -S /bin/sh -cwd ./
In practice, it's a good idea to always use this flag.

To pass specific shell variables to your job, you can use the -v flag. For instance, to pass the DISPLAY variable to your job:
$ qsub -v DISPLAY ...

Parallelizing over Multiple Cores

Run a job parallelizing over 4 cores:
$ qsub -pe smp 4 -R y ...

Piping into qsub

Pipe into qsub:
$ echo "./" | qsub -N myscript -e logs -o logs -l mem=1G,time=1:: -S /bin/sh –cwd

qsub-ing Scripts in Perl, Python, R, etc

Submit a script that is not bash (in this case we'll use R):
$ qsub -V -N myscript -e logs -o logs -l mem=1G,time=1:: -S /nfs/apps/R/2.14.0/bin/Rscript -cwd ./test.r
(replace the path to Rscript, of course, with your own)

Array Jobs

For some reason—I don't know why!—if you submit an array job of 1000 jobs it's supposedly gentler on the scheduler than if you submit 1000 regular jobs. So how do you submit an array job? Here's an example script called

echo "example array job"
echo "iterator = "${SGE_TASK_ID}

case $SGE_TASK_ID in
        1) var="hello";;
        2) var="goodbye";;
        3) var="world";;

echo $var
What's going on here? This job just echoes some things, but note the special shell variable SGE_TASK_ID. This variable iterates over the number of elements in your array job, which is submitted with the -t flag, as in:
qsub -t 1-3:1 -V -N job -e logs -o logs -l mem=1G,time=1:: -cwd ./
The synax:
-t 1-3:1
means the array will range from 1 to 3 in steps of 1.

So, what's the result? The output logs are as follows:
==> logs/job.o342892.1 <==
example array job
iterator = 1

==> logs/job.o342892.2 <==
example array job
iterator = 2

==> logs/job.o342892.3 <==
example array job
iterator = 3
Note that the case statement is a good way to get your script to do different things each iteration of the array. If you're using perl, for example, you'll find yourself referring to $ENV{'SGE_TASK_ID'}, and so on.

How is Job Priority Assigned?

How is Job Priority Assigned? This depends on the whims of the IT sysadmin gods. For C2B2, you can read about it here.

Common Problems

If you get an Eqw, there are some simple blunders you might have made. The most common is, on our system jobs are forbidden to write in the home directory—an exclusive quirk of our system set by the system administrators. Another reason you might get an Eqw is if you are trying to write your logs into a folder that doesn't exist. Finally, one other SGE-specific problem is that if you use a construction like:
# get the directory in which your script itself resides
d=$( dirname $( readlink -m $0 ) ) 
where you look for "sister scripts" in the directory where your script itself resides, you won't find them. The reason is that when you run a job, you script is actually copied to and run from a different directory altogether (something like /opt/gridengine/default/spool).

How Much Time & Memory Should I Assign my Job?

An often-asked question is, how much time and memory should I assign my job? Let's review how to query how much time and memory a program takes in ordinary unix.

How much time did your program take? Use time:
$ time ./ 
[date] Tue May 6 13:21:11 EDT 2014
[cwd] /path/sge_test

real    0m5.022s
user    0m0.004s
sys     0m0.015s
How much memory did your program take? Again, use time but with the -v flag. For example, to get information about zipping a file test.txt:
$ /usr/bin/time -v gzip test.txt

        Command being timed: "gzip test.txt"
        User time (seconds): 19.77
        System time (seconds): 0.74
        Percent of CPU this job got: 24%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:23.40
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2416
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 185
        Voluntary context switches: 94
        Involuntary context switches: 551
        Swaps: 0
        File system inputs: 743616
        File system outputs: 140216
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
In SGE land, you can see information about a job, including how much time and memory it consumed and its exist code with:
$ qacct -j [jobid] 
On our system, this command is only available on the head (or login) nodes. Knowing how much memory and time a job consumed will give you a good idea of how much to allot for similar future jobs.

You can get information about a given job, including reasons for error with:
$ qstat -j [jobid]

Dependencies in SGE and Jobs that Submit Jobs

The simplest dependency in SGE is making one job wait on another job. Suppose job A should wait for job B, whose job id is 2. Then to make job A wait on job B, use the -hold_jid flag:
$ qsub -N job_B -cwd ./
Your job 2 ("job_B") has been submitted
$ qsub -N job_A -cwd -hold_jid 2 ./
Now job A won't run until job B is finished. If you want job A to wait on multiple jobs, use a comma-delimited list of job ids after the hold flag. The following figure shows how you can run Job-1, Job-2, ..., Job-n in parallel and make Job-final wait on all of them:


You can also hold a job directly (although I rarely have occasion to do this):
$ qhold [jobid] 
and release it:
$ qrls [jobid] 
Now let's imagine a multiple-step process involving a pipeline you want to parallelize—meaning submit multiple jobs in parallel—at certain points, but not at others. Here's a picture:


Suppose Job-2 should wait on Job-1, which submits J1_1, J1_2, .., J1_n and J1_final. This is a common scenario: in the SGE framework, you may have occasion to want a script with jobs that submit jobs. But this leads to a new problem: how do you arrange dependencies such that jobs can depend on jobs submitted by jobs?

Up until now, it seems that we can only make a job depend on another job which has already been submitted to the queue, since this is the point when we get its job id. If we want to make job A depend not on job B but, rather, on a job which job B submits, we have a problem because at the time we release job A and job B into the queue together, B's "child" job doesn't even exist yet. It's like an unborn child and, as such, we have no way to get its id—which is what we need to feed the -hold_jid flag to make our dependencies work out. What to do? The answer comes from one of the best qsub flags of all time: -sync y. The sync flag acts as a brake and halts the script's gears until the jobs finish. In so doing, it makes an SGE script behave a lot like a regular unix script: commands are executed in order.

To illustrate, let's first consider the example of a script with a hold but without a sync:
qsub 1 			# into queue
qsub 2 			# into queue
qsub 3 –hold_jid 1,2 	# into queue, waiting on 1 and 2
command n
command n+1
Here, jobs 1 and 2 execute in parallel and job 3 waits for them, but command n will execute as soon as all these jobs are in the queue, irrespective of whether they're finished.

Now let's consider the script with both a hold and a sync:
qsub 1 				# into queue
qsub 2 				# into queue
qsub 3 –hold_jid 1,2 -sync y 	# into queue, waiting on 1 and 2
command n
command n+1
In this case, jobs 1 and 2 execute in parallel and job 3 waits for them, but the script is paused waiting for job 3 to finish, so command n won't execute until all the jobs above it have finished. Do you see the power of this technique? With it we can, say, heavily parallelize our script up until one point, then collect all the output and proceed linearly again, then parallelize again, and so on. Think about about running a blast job in parallel then concatenating all the results and doing something with them.