Scientific and High-Performance Computing News
DSCR Maintenance
Next Tuesday, October 30th, 8am to 5pm
Force-suspend of all jobs
We need to perform quarterly maintenance on the DSCR again, and we are scheduling it for NEXT TUESDAY, October 30th. This will primarily be for A/C maintenance.
All jobs on the DSCR will be force-suspended on Tuesday morning in order to reduce the heat-load in the machine room – an idle machine generates less than half the heat of an active one, so this is an effective mechanism for us to reduce heat-load without requiring user intervention. All queues will be temporarily disabled so new jobs will not be issued onto the compute machines until after the A/C unit is fully functioning.
Jobs that are running will be suspended and resumed with no loss of data or memory. Jobs in the queues will remain in the queues and do NOT need to be resubmitted. You should not have to do anything to prepare for this maintenance event.
We have done the force-suspend/resume operation several times in the past and have not seen any problems. If you find that your job has been killed or corrupted due to this maintenance event, please let us know as soon as possible.
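For those curious how a running job can be paused and later continued without losing its in-memory state, here is a minimal Python sketch. It assumes a POSIX system and uses the standard SIGSTOP and SIGCONT signals; it only illustrates the general idea and is not necessarily how the DSCR scheduler performs the force-suspend. The function name `suspend_and_resume` and the toy child process are hypothetical.

```python
import os
import signal
import subprocess
import sys
import time


def suspend_and_resume(pause_seconds=0.5):
    """Start a toy 'job', freeze it, thaw it, and return what it printed.

    Illustration only: the child keeps its loop counter purely in memory,
    so a complete 0..4 sequence shows nothing was lost while it was stopped.
    """
    child_code = (
        "import time\n"
        "for i in range(5):\n"
        "    print(i, flush=True)\n"
        "    time.sleep(0.1)\n"
    )
    child = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        text=True,
    )

    time.sleep(0.25)                     # let the job get partway through
    os.kill(child.pid, signal.SIGSTOP)   # suspend: the process stops being scheduled
    time.sleep(pause_seconds)            # stand-in for the maintenance window
    os.kill(child.pid, signal.SIGCONT)   # resume: the process picks up where it left off

    output, _ = child.communicate()
    return [int(line) for line in output.split()], child.returncode


if __name__ == "__main__":
    values, code = suspend_and_resume()
    print("job output:", values, "exit code:", code)
```

A stopped process uses essentially no CPU, which is why suspending jobs lowers the heat-load while leaving their state intact.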
For more information, see:
If you have any questions, please contact scsc at duke.edu
UPDATE — as of 6:30pm, the maintenance is complete and the DSCR is back on-line. Any jobs that were running prior to the maintenance should now be resumed; any jobs in the queues will now be started on the compute nodes as usual.
Just a reminder that we have a planned maintenance for the DSCR going on right now (Thursday, July 12th).
All jobs are currently suspended and will be resumed when the A/C maintenance is over. You should not need to re-start or alter any existing jobs on the DSCR. If you find that a suspended job did not resume, or produced some other kind of error, please let us know as soon as possible so that we can investigate what happened.
DSCR Scheduled Maintenance
8am to 5pm on Thursday, July 12th
Force-Suspend of All Jobs
We need to perform quarterly maintenance on the DSCR again, and we are targeting July 12th. This will primarily be for A/C maintenance.
All jobs on the DSCR will be force-suspended on Thursday morning in order to reduce the heat-load in the machine room – an idle machine generates less than half the heat of an active one, so this is an effective mechanism for us to reduce heat-load without requiring user intervention. All queues will be temporarily disabled so new jobs will not be issued onto the compute machines until after the A/C unit is fully functioning.
Jobs that are running will be suspended and resumed with no loss of data or memory. Jobs in the queues will remain in the queues and do NOT need to be resubmitted. You should not have to do anything to prepare for this maintenance event.
We have done the force-suspend/resume operation several times in the past and have not seen any problems. If you find that your job has been killed or corrupted due to this maintenance event, please let us know as soon as possible.
For more information, see:
If you have any questions, please contact scsc at duke.edu
The DSCR maintenance was completed this afternoon and most nodes are back on-line now. We are opening up the queues and resuming normal cluster operation. The new file server system offers a significant increase in performance, as well as redundancy and fail-over capabilities.
As sometimes happens when we power-cycle the machines, some of them come up in odd states or with odd errors — these will be triaged/repaired over the next day or so.
The new, faster /netscratch storage area is still initializing and so is currently unavailable. This will replace the former /nfs/scsc/dscrtmp area for temporary storage of job-related files. If your job requires use of /netscratch, please wait for a second email message about the full availability of that file system.
——
Just a reminder of the file-systems that are available:
We will leave files on /netscratch and /scratch for 30 days, space permitting. Other users may have a need for that storage space, and we may need to delete older/unused files. For any job-relevant data, make sure you have a copy somewhere other than /netscratch or /scratch !!
And as always, we do not guarantee the long-term stability of any DSCR storage (nightly snapshot only, no backups!) — for important data, you should always keep a copy in your lab or department.
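If you want to check which of your files are approaching the 30-day retention mentioned above, something like the following Python sketch may help. It is only an illustration: the function name `files_older_than` is hypothetical, it judges age by modification time (an assumption; the actual clean-up criteria may differ), and you would point it at your own directory.

```python
import os
import time


def files_older_than(root, days=30, now=None):
    """Return a sorted list of files under `root` not modified in `days` days.

    Assumption: 'age' means time since last modification (st_mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400          # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue                 # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    import sys
    # Usage: python stale_files.py <directory>
    for path in files_older_than(sys.argv[1]):
        print(path)
```

Anything this lists is worth copying to storage in your lab or department before it is removed.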
(1) DSCR Maintenance — Feb 21-23
(2) DSCR Storage Lease Options — from $500/TB/year
We have two major announcements in this posting — the first is a standard maintenance window (outage) for February 21st through 23rd. This will be a “Full” outage, not just a force-suspend of the jobs — all jobs on the cluster as of 2/21 at 6am will be killed. We will begin shutting down the queues the night before, so that jobs do not get queued up when there is no time for them to complete.
For more information on the full cluster outage, see:
As many of you are aware, the current file server often has performance issues, and its capacity limits what you are able to store there. The reason for this outage, and for the short lead time, is that we will be installing a new set of file servers for the DSCR. We will have a new pair of NetApp FAS-3270 servers, each with 512GB of cache memory and 10Gbps connections, so we expect to see a significant increase in performance.
With these new file servers, we will also be able to offer several storage-leasing options:
We should have enough disk to satisfy any initial purchasers, and we will continually expand the system based on demand. If you have a need for more than 5TB of space on the DSCR, please contact us so that we can plan accordingly.
We expect that the slower SATA disk will probably NOT be the best option for running jobs. You may see much better performance if you stage data from SATA to the faster SAS disk before running large batches of jobs. We will also be refreshing the “DSCRtemp” file-space, so that may be another option to consider. We will follow up with another email once we have details on the performance of the various storage tiers.
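As a rough illustration of the staging step described above, the Python sketch below copies an input directory from slower storage to faster storage before a batch of jobs starts. The function name `stage_inputs` and any paths you pass to it are hypothetical; the actual mount points for the SATA and SAS tiers have not been announced here, so substitute your own. It requires Python 3.8 or later for `dirs_exist_ok`.

```python
import os
import shutil


def stage_inputs(slow_dir, fast_dir):
    """Copy everything under `slow_dir` into `fast_dir`; return the file count.

    Existing files in `fast_dir` with the same names are overwritten, so the
    function can be re-run safely before each batch of jobs.
    """
    shutil.copytree(slow_dir, fast_dir, dirs_exist_ok=True)
    return sum(len(files) for _root, _dirs, files in os.walk(fast_dir))


if __name__ == "__main__":
    import sys
    # Usage: python stage_inputs.py <slow_source_dir> <fast_destination_dir>
    count = stage_inputs(sys.argv[1], sys.argv[2])
    print(f"{count} files now present in {sys.argv[2]}")
```

Once the batch finishes, remember to copy any results you need back to longer-lived storage.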
This maintenance is now resolved
DSCR Scheduled Maintenance
Starting 8am on Tuesday, November 1st through 8am on Thursday, November 3rd
Force-Suspend of All Jobs
Sorry for the late notice, but we need to perform quarterly maintenance on the DSCR again, and we are targeting November 1st to 3rd. This will include A/C maintenance as well as some SGE accounting updates.
All jobs on the DSCR will be force-suspended on Tuesday morning in order to reduce the heat-load in the machine room – an idle machine generates less than half the heat of an active one, so this is an effective mechanism for us to reduce heat-load without requiring user intervention. All queues will be temporarily disabled so new jobs will not be issued onto the compute machines until after the A/C unit is fully functioning.
Jobs that are running will be suspended and resumed with no loss of data or memory. Jobs in the queues will remain in the queues and do NOT need to be resubmitted. You should not have to do anything to prepare for this maintenance event.
We have done the force-suspend/resume operation several times in the past and have not seen any problems. If you find that your job has been killed or corrupted due to this maintenance event, please let us know as soon as possible.
For more information, see:
If you have any questions, please contact scsc at duke.edu
** Outage is now complete
We are currently in a scheduled maintenance period for the DSCR. All login machines and all compute nodes are currently inaccessible. We expect the system to be back in operation on Thurs, June 2nd. For more information, see this blog post.
DSCR Scheduled Maintenance
Starting 8am on Tuesday, May 31st, through 8am on Thursday, June 2nd
Full Cluster Outage
All machines are expected to be moved to the Centos-5 operating system at this time
We need to perform quarterly maintenance on the DSCR again, and we are targeting May 31st to June 2nd. This will include A/C maintenance as well as some blade/chassis updates.
We will also be upgrading ALL machines to the Centos-5 operating system unless you make a specific request for us not to do so. The majority of the cluster now runs on Centos-4 — an older operating system that many users are starting to have compatibility problems with (old versions of Java, Python, etc.).
If you have not had a chance to test your applications on Centos-5, please do so immediately. We have a number of “core” machines set up for Centos-5 testing. To access one of them, you can use either:
If you find that your programs do not work on Centos-5, please let us know and we can work with you to triage or debug the problems. If needed, we will keep your machines on Centos-4 until we can find a fix.
When the cluster resumes operation on June 2nd, the default will be for all jobs to flow onto Centos-5 machines.
NOTE: all running jobs will be lost and users will have to re-submit them. Any jobs in the queues will be saved and will be started when the cluster is back on-line. We will disable queued jobs from starting around 5pm on May 30th as a precautionary measure so that jobs don’t start and get killed several hours later.
We are targeting 2 days due to the time needed for the Centos-5 upgrade process. As always, if we finish early, we will release the queues as soon as possible and post another email.
For more information, see:
If you have any questions, please contact scsc at duke.edu
- – - – - – -
I forgot to mention that there is also a new Centos-5 login machine:
You can ssh to that machine and compile programs, run short tests, etc. to get ready for the Centos-5 transition on May 31st.
UPDATED: The DSCR maintenance window is now over; all queues are back to normal; all jobs should now be running again.
As we had previously announced, we’re about to start our quarterly maintenance for the DSCR. At 5pm this evening (Jan 11th), we will disable all queues on the cluster so that new jobs do not begin any calculations, only to be killed when the outage starts tomorrow morning.
As of 8am tomorrow (Jan 12th), any jobs in the queues will be saved and will be started when the cluster is back on-line; any running jobs will be lost and users will have to re-submit them.
As always, if we finish early, we will release the queues as soon as possible and post another email.
For more information, see:
If you have any questions, please contact scsc at duke.edu
DSCR Scheduled Maintenance – Jan 12th, 8am to Jan 13th, 5pm
Full Cluster Outage
Sorry for the late notice, but it is about time for another scheduled maintenance window and several of the tasks we need to do will require a full cluster outage. We will be doing blade-chassis BIOS/firmware updates and reconfiguration as well as preventative maintenance on the A/C units — these will require us to power down the entire cluster.
NOTE: all running jobs will be lost and users will have to re-submit them. Any jobs in the queues will be saved and will be started when the cluster is back on-line. We will disable queued jobs from starting around 5pm on Jan 11th as a precautionary measure so that jobs don’t start and get killed several hours later.
This is targeted for the first few days of classes — when most people are busy with other work, and the DSCR often sees low usage. We are targeting 2 days due to the number of items we need to complete. As always, if we finish early, we will release the queues as soon as possible and post another email.
For more information, see:
If you have any questions, please contact scsc at duke.edu