Scientific and High-Performance Computing News
Starting May 1st, Coursera will be offering a High-Performance Scientific Computing class from Randall J. LeVeque of the University of Washington:
From the course description:
Programming-oriented course on effectively using modern computers to solve scientific computing problems arising in the physical/engineering sciences and other fields. Provides an introduction to efficient serial and parallel computing using Fortran 90, OpenMP, MPI, and Python, and software development tools such as version control, Makefiles, and debugging.
About the Course: Computation and simulation are increasingly important in all aspects of science and engineering. At the same time writing efficient computer programs to take full advantage of current computers is becoming increasingly difficult. Even laptops now have 4 or more processors, but using them all to solve a single problem faster often requires rethinking the algorithm to introduce parallelism, and then programming in a language that can express this parallelism. Writing efficient programs also requires some knowledge of machine arithmetic, computer architecture, and memory hierarchies.
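As a small illustration of the "rethinking the algorithm" point in the course description, the sketch below sums an array two ways. It is not taken from the course materials; the function names `sum_serial` and `sum_parallel` are hypothetical, and OpenMP is used only because the description lists it. The serial loop carries a dependency on the running total from one iteration to the next. The second version asks OpenMP for a reduction, so each thread accumulates its own partial sum and the partial sums are combined at the end.

```c
#include <assert.h>
#include <stddef.h>

/* Serial sum: every iteration needs the total left by the previous one. */
double sum_serial(const double *a, long n)
{
    assert(a != NULL);
    double total = 0.0;
    for (long i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

/* Parallel sum (hypothetical helper): reduction(+:total) gives each thread
 * a private copy of total and adds the copies together after the loop.
 * If the compiler is not run with OpenMP enabled, the pragma is ignored
 * and this behaves exactly like sum_serial. */
double sum_parallel(const double *a, long n)
{
    assert(a != NULL);
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}
```

With GCC, OpenMP is enabled by adding `-fopenmp` to the compile line. Because floating-point addition is not associative, the parallel result can differ from the serial one in the last few bits for general inputs.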
We’ve posted another web-based video on the wiki/Training page:
“Intro to OpenMP” covers the basic use of OpenMP — a programming environment for parallel application development. OpenMP is a set of compiler “directives” for C/C++ and Fortran, so rather than having to learn a whole new programming language, you can continue to code in a familiar language and just add directives (instructions/hints to the compiler) on where you think parallelism could be extracted. A simple example:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    y[i] = alpha * x[i] + y[i];
}
The addition of that one “pragma” statement converts this single-CPU loop into a multi-CPU/parallel loop. Take a look at the video and see how to add multi-CPU/multi-core parallelism to your existing applications.
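For readers who want something they can compile, here is a minimal sketch that wraps the loop in a function. The function name `saxpy` and its signature are assumptions made for illustration; they do not come from the video. In C the loop directive is spelled `parallel for` (`parallel do` is the Fortran form).

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = alpha * x[i] + y[i] for i in [0, n).
 * Each iteration touches only its own element of y, so the iterations
 * are independent and can safely be divided among threads. */
void saxpy(long n, float alpha, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    /* With OpenMP enabled, the iterations are split across the available
     * threads; otherwise the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

With GCC, build with `-fopenmp` to enable the directive. The number of threads can be set at run time through the `OMP_NUM_THREADS` environment variable.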
We will be offering the following seminars and workshops on scalable computing techniques in Fall 2011:
We have intentionally scheduled just a few seminars this semester to leave time open for any “on-demand” training that you might request. Our Training wiki-page has a list of seminars that we currently teach, or have taught recently, and if any of them look applicable to your research lab, or to a small group of students, we’d be happy to make arrangements to teach them. Email us at scsc at duke edu if you’d like to set something up.
Victor Eijkhout from the Texas Advanced Computing Center (TACC) (at University of Texas at Austin), has released a new HPC book at Lulu.com:
Also, James Leigh, John McHugh and Sanjay Goil have put out an eBook on HPC development:
The first book is better suited to programming-savvy users: it delves into CPU registers, cache hierarchy, memory bus architectures, etc. While all of those topics are essential for fully optimizing your application, applying them may require significant rewriting of your code.
The second book is higher-level. It does include (Java) examples, but focuses more on parallel programming concepts.
We will be hosting an Intel field engineer — J.D. Patel — for a discussion on some of Intel’s new software tools and ways that they might be applied to research computing needs at Duke.
As many of you know, the multi-core trend in CPUs is now in full force. The majority of the cluster now has 8 CPUs per machine, the newest machines have 12 CPUs per machine, and we expect to see 16 or more CPUs per machine very soon. However, many applications are not yet ready to take advantage of this increase in computational power.
One quick way to get an extra boost in your application is to leverage the multi-core-aware libraries that Intel provides. By calling a few different functions and linking with a new library or two, you may see significant improvements in performance on the new machines in the DSCR:
Please join us on Monday, November 8th — for directions to our building, please see:
I found this by way of InfoWorld, but the original announcement is directly from AMD:
AMD will soon release 4-socket, 12-CPU-core systems — a total of 48 cores in a single system! To get people excited about what they could do with all that compute-power, they are offering a contest. Write an essay, post a YouTube video, or add an entry to your blog about how you would use 48 cores (500 words/3-min limits on submissions). Then submit your entry on their entry form.
You could win a “starter” 48-core system of your own! (est. value $8,189, but you’ll need to add your own memory)
The deadline is Wednesday, March 24th at 11:59pm (EDT).
ScaleMP is another option for those researchers who need very large memory configurations. Our current blade-based servers can accept up to 12 memory sticks for a total of 96GB. ScaleMP allows you to combine multiple machines into a single system image — instead of 16 separate machines, you have 1 machine with 16 times as many CPUs and 16 times as much memory. One drawback is that ScaleMP requires an Infiniband network — a high-performance, low-latency interconnect — which is used to quickly access “remote” memory.
The current version of ScaleMP allows for up to 128 CPU-cores and 4TB of memory! The “system” appears to the user as one very large server, so there is nothing new to learn and no changes need to be made to your applications.
We’ve seen a growing trend towards higher memory capacity in the machines that we order for the DSCR. Unfortunately, we’re also starting to see a trend towards higher prices for the raw memory chips. For most servers — given their compact size — you end up using the densest (most expensive) memory chips and so the price can really climb.
There are now at least two possible options to consider for very large memory systems — RNA Networks and ScaleMP.
RNA Networks is a kind of network-based virtual memory. You allocate pools of memory on several machines, and when one of those machines needs extra memory, it will reach out over the network and access the pooled memory on another server. Much like the way virtual memory systems (on a single machine) will store older, less frequently used memory pages to disk, this method pushes the page out to the network — with 10Gbps Ethernet or Infiniband, RNA Networks claims 100x faster results. One installation uses 300 machines and provides an 11TB shared memory pool.
Intel recently announced a “concept chip” that includes 48 x86 CPU-cores on a single piece of silicon. This is really just another step in the recent trend from dual-core to quad-core and even 6- and 8-core CPUs. Intel had previously announced a “research prototype” with 80 cores.
The industry continues to add more computational power to chips through increased levels of parallelism. The buzzword used to be “multi-core” (now used for dual- and quad-core CPUs); it is now moving to “many-core” (16, 32, or even 100 cores per chip).
Intel’s Larrabee architecture looks to be the first candidate (available sometime in 2010) and should provide up to 32 cores per chip. Its initial versions will be targeted at discrete graphics, but it is expected to migrate into the HPC market.
A relatively new entrant, Tilera, already offers 36- and 64-core chips, and is working on a 100-core chip — that’s 100 processing units on a single piece of silicon, a cluster on a chip!