Scientific and High-Performance Computing News
Since we’re now in the new fiscal year, the DSCR has converted over to the new business model:
Groups that have provided fund-codes, or are grandfathered due to ownership of newer machines, should all have “regular” access to the cluster as before. The only major change is discussed below.
If you find that your group cannot launch any jobs, please contact scsc at duke edu and we can help get you set back up (hint: we probably need a fund-code from you).
The only major change that users should notice is that low-priority parallel environments are being disabled — low-core, low-all, smmpi, and single. The ‘threaded’ (single-machine) parallel environment is unchanged.
If you have previously used these to launch multi-machine jobs, you will have to convert your jobs to use high-priority instead — “-pe high” instead of “-pe low-core”. Any running jobs will run to completion, but any jobs in the queues will not execute — you will need to try ‘qalter’ to re-direct them to high-priority, or else ‘qdel’ and then re-submit them as high-priority.
This change will help ensure that everyone who gains access to a high-priority resource gets the full use of that resource, with no interference from low-priority jobs. The unfortunate side-effect of that guarantee is that low-priority parallel jobs would crash whenever a high-priority job was scheduled on top of them, which is why those parallel environments are being disabled.
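As a rough sketch of the two options for queued jobs, the script below only prints the commands rather than running them. The job ID (12345), slot count (8), and script name (myjob.sh) are placeholders, not values from the DSCR; check the qalter man page on the cluster before relying on the exact form.

```shell
#!/bin/sh
# Sketch only: print (do not run) the commands for moving a queued job
# to the high-priority parallel environment.
# The job ID, slot count, and script name used below are placeholders.

# Option 1: re-direct a queued job to "-pe high" in place.
redirect_cmd() {
    job_id=$1
    slots=$2
    echo "qalter -pe high $slots $job_id"
}

# Option 2: delete the queued job, then re-submit it as high-priority.
resubmit_cmds() {
    job_id=$1
    slots=$2
    script=$3
    echo "qdel $job_id"
    echo "qsub -pe high $slots $script"
}

redirect_cmd 12345 8
resubmit_cmds 12345 8 myjob.sh
```

Once the printed commands look right for your job, run them by hand at the prompt.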
Here’s a quick hack in case you have long-running jobs, or long sequences of jobs that you’d like to track the status of …
You may know that you can have Grid Engine email you when a job begins or ends. In the prologue of your submission script (or on the command-line for qsub), you can include the ‘-m’ option:
#!/bin/tcsh
#$ -m b,e
The ‘b,e’ argument means to send email at the beginning and end of the job (or just use one or the other). If you don’t specify where to send the message, SGE will send it to whatever email address we have on file for you (probably netid@duke.edu or firstname.lastname@duke.edu).
However, many wireless/cell-phone providers have email gateways set up for text messages — e.g. send an email to 1234567890@vtext.com and it will be sent as a text-message to phone number 123-456-7890 — and you can use this email address in your SGE script with the ‘-M’ option:
#!/bin/tcsh
#$ -M 1234567890@vtext.com -m b,e
That’s it! You’ll now get the job-begin and job-end messages as text-messages on your phone. For a list of wireless provider gateways, see
(of course, regular text-message rates apply so don’t go overboard)
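If you keep your phone number in a dashed or spaced form, a small helper can build the gateway address for you. This is a minimal sketch: the function name to_gateway and the script name myjob.sh are made up for illustration, vtext.com is only the example gateway from above, and the qsub command is printed rather than run.

```shell
#!/bin/sh
# Sketch only: build a text-message gateway address from a phone number.
# to_gateway is a hypothetical helper; the default domain is the example
# gateway used in this post, so pass your own provider's as argument 2.
to_gateway() {
    # Keep only the digits, e.g. "123-456-7890" becomes "1234567890".
    digits=$(printf '%s' "$1" | tr -cd '0-9')
    domain=${2:-vtext.com}
    printf '%s@%s\n' "$digits" "$domain"
}

addr=$(to_gateway "123-456-7890")

# Print the submission command instead of running it.
# myjob.sh is a placeholder for your own submission script.
echo "qsub -M $addr -m b,e myjob.sh"
```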
For users who want to launch thousands of similar jobs (using an SGE Array Task), but don’t want to take over the entire cluster and inhibit other users from getting their work done … there is an option to your array task submission that will limit the total number of simultaneous tasks being run:
% qsub -t 1-2000 -tc 100 test.sh
The ‘-t’ option is the usual one to submit an array task (in this case 2000 tasks); the ‘-tc’ option sets the max number of simultaneous tasks (100). So the 2000 tasks will still be scheduled as efficiently as possible — if some tasks exit quickly, others will quickly take their place — but since only 100 simultaneous tasks will be running, other users will have ample access to the cluster’s computational resources.
Thanks to the gridengine.info blog: http://gridengine.info/2009/12/02/throttling-execution-of-array-job-tasks
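For reference, here is one way the test.sh in the qsub example above might look. It is only a sketch: the input_NNNN.dat naming scheme is invented for illustration, and the script falls back to task 1 when SGE_TASK_ID is not set so that it can be tried outside the scheduler.

```shell
#!/bin/sh
# test.sh (sketch): each array task selects its own input file.
# The input_NNNN.dat naming scheme is a hypothetical example.

# Map a task number to an input file name, zero-padded to four digits.
input_for_task() {
    printf 'input_%04d.dat\n' "$1"
}

# Grid Engine sets SGE_TASK_ID for each task of an array job;
# default to 1 so the script can be tested by hand.
task=${SGE_TASK_ID:-1}

echo "task $task would process $(input_for_task "$task")"
```

The ‘-tc’ limit needs no support inside the script itself; it only controls how many of these tasks the scheduler runs at once.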
We have released several of our SGE-related tools and scripts under an MIT-style open-source license. The ‘qb’, ‘ql’, and ‘qh’ tools as well as the ‘jobpar’ (parallel performance analyzer) script are now available at:
Note that these tools are already installed on the DSCR in /usr/bin — you do not need to download and install them yourself.