Starting May 1st, Coursera will be offering a High-Performance Scientific Computing class from Randall J. LeVeque of the University of Washington:

From the course description:

Programming-oriented course on effectively using modern computers to solve scientific computing problems arising in the physical/engineering sciences and other fields. Provides an introduction to efficient serial and parallel computing using Fortran 90, OpenMP, MPI, and Python, and software development tools such as version control, Makefiles, and debugging.

About the Course:  Computation and simulation are increasingly important in all aspects of science and engineering. At the same time writing efficient computer programs to take full advantage of current computers is becoming increasingly difficult. Even laptops now have 4 or more processors, but using them all to solve a single problem faster often requires rethinking the algorithm to introduce parallelism, and then programming in a language that can express this parallelism.  Writing efficient programs also requires some knowledge of machine arithmetic, computer architecture, and memory hierarchies.

 

We’ve posted another web-based video on the wiki/Training page:

“Intro to OpenMP” covers the basic use of OpenMP — a programming environment for parallel application development.  OpenMP is a set of compiler “directives” for C/C++ and Fortran, so rather than having to learn a whole new programming language, you can continue to code in a familiar language and just add directives (instructions/hints to the compiler) on where you think parallelism could be extracted.  A simple example:

#pragma omp parallel for
for(i=0;i<N;i++) {
    y[i] = alpha * x[i] + y[i];
}

The addition of that one “pragma” statement converts this single-CPU loop into a multi-CPU/parallel loop.  Take a look at the video and see how to add multi-CPU/multi-core parallelism to your existing applications.
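For readers who want something they can compile, here is a minimal sketch of the same loop wrapped in a function. The function name `axpy` and its signature are our own choices for illustration, not something taken from the video. Note that in C the combined directive is spelled `parallel for` (`parallel do` is the Fortran spelling).

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): computes y = alpha*x + y
 * over n elements, sharing the loop iterations across threads. */
void axpy(size_t n, double alpha, const double *x, double *y)
{
    /* A signed loop index is accepted by every OpenMP version. */
    long i;
    long count = (long)n;

    /* Each iteration reads and writes only x[i] and y[i], so the
     * iterations are independent and may run in any order. */
    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

With GCC, OpenMP is enabled by the `-fopenmp` flag. Without it the pragma is ignored and the loop simply runs serially, which is a handy way to check that both versions give the same answer.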

 

Aug
11
Filed Under (Multicore, Optimization, Parallel Computing, Software Development, Training) by John Pormann, Ph.D. on 11-08-2011

We will be offering the following seminars and workshops on scalable computing techniques in Fall 2011:

We have intentionally scheduled just a few seminars this semester to leave time open for any “on-demand” training that you might request. Our Training wiki-page has a list of seminars that we currently teach, or have taught recently, and if any of them look applicable to your research lab, or to a small group of students, we’d be happy to make arrangements to teach it. Email us at scsc at duke edu if you’d like to set something up.

Dec
30
Filed Under (Multicore, Optimization, Parallel Computing, Programming, Training) by John Pormann, Ph.D. on 30-12-2010

Victor Eijkhout from the Texas Advanced Computing Center (TACC) at the University of Texas at Austin has released a new HPC book at Lulu.com:

Also, James Leigh, John McHugh and Sanjay Goil have put out an eBook on HPC development:

The first book is more appropriate for more programming-savvy users — it delves into CPU registers, cache hierarchy, memory bus architectures, etc.  While all of those topics are essential for fully optimizing your application, they may require significant rewriting of your code.

The second book is higher-level.  It does include (Java) examples, but talks more about parallel programming concepts.

Oct
29
Filed Under (Multicore, Parallel Computing, Programming, Training) by John Pormann, Ph.D. on 29-10-2010

We will be hosting an Intel field engineer — J.D. Patel — for a discussion on some of Intel’s new software tools and ways that they might be applied to research computing needs at Duke.

  • Intel Software Tools for Research Computing
  • Monday, November 8th, 2-4pm, RENCI Conference Room

As many of you know, the multi-core trend in CPUs is now in full force.  Most machines in the cluster now have 8 CPUs, the newest machines have 12, and we expect to see 16 or more CPUs per machine very soon.  However, many applications are not yet ready to take advantage of this increase in computational power.

One quick way to get an extra boost in your application is to leverage the multi-core-aware libraries that Intel provides.  By calling a few different functions and linking with a new library or two, you may see significant improvements in performance on the new machines in the DSCR:

  • MKL — Math Kernel Library provides highly optimized routines for linear algebra, FFTs, random number generation, sparse solvers; multi-core enabled
  • MKL/Summary Statistics Library — a new offering from Intel, provides multi-core-enabled statistical routines, including functions for very large data-sets
  • IPP – the Intel Performance Primitives — similar to MKL, provides highly optimized routines for image/video coding, audio and signal processing, speech coding, data compression, and more; multi-core enabled
  • Cilk Plus/Cilk Threads — a simplified, easier-to-use view of parallel programming that can leverage modern multi-core CPUs
  • Array Notation — for applications that do large matrix-vector operations, array notation will give you access to SSE/AVX capabilities (special math units inside the CPU)
  • AVX support in the Intel compilers — mechanisms to make use of the SSE/AVX capabilities of Intel’s CPUs
  • Array Building Blocks — derived from the former RapidMind software, ArBB allows you to specify vector/list operations that can be run in parallel; multi-core enabled

Please join us on Monday, November 8th — for directions to our building, please see:

Mar
12
Filed Under (Multicore, Shared Memory) by jpormann on 12-03-2010

I found this by way of InfoWorld, but the original announcement is directly from AMD:

AMD will soon release 4-socket, 12-CPU-core systems — a total of 48 cores in a single system!  To get people excited about what they could do with all that compute-power, they are offering a contest.  Write an essay, post a YouTube video, or add an entry to your blog about how you would use 48 cores (500 words/3-min limits on submissions).  Then submit your entry on their entry form.

You could win a “starter” 48-core system of your own!   (est. value $8,189, but you’ll need to add your own memory)

The deadline is Wednesday, March 24th at 11:59pm (EDT).


Jan
19
Filed Under (Multicore, Shared Memory) by jpormann on 19-01-2010

ScaleMP is another option for those researchers who need very large memory configurations.  Our current blade-based servers can accept up to 12 memory sticks for a total of 96GB.  ScaleMP allows you to combine multiple machines into a single system image — instead of 16 separate machines, you have 1 machine with 16 times as many CPUs and 16 times as much memory.  One drawback is that ScaleMP requires an Infiniband network — a high-performance, low-latency interconnect — which is used to quickly access “remote” memory.

The current version of ScaleMP allows for up to 128 CPU-cores and 4TB of memory!  The “system” appears to the user as one very large server; there is nothing new to learn and no changes need to be made to your applications.

Jan
14
Filed Under (Multicore, Shared Memory) by jpormann on 14-01-2010

We’ve seen a growing trend towards higher memory capacity in the machines that we order for the DSCR. Unfortunately, we’re also starting to see a trend towards higher prices for the raw memory chips. For most servers — given their compact size — you end up using the densest (most expensive) memory chips and so the price can really climb.

There are now at least two possible options to consider for very large memory systems — RNA Networks and ScaleMP.

RNA Networks is a kind of network-based virtual memory.  You allocate pools of memory on several machines, and when one of those machines needs extra memory, it will reach out over the network and access the pooled memory on another server.  Much like the way virtual memory systems (on a single machine) will store older, less frequently used memory pages to disk, this method pushes the page out to the network — with 10Gbps Ethernet or Infiniband, RNA Networks claims 100x faster results.  One installation uses 300 machines and provides an 11TB shared memory pool.

Dec
04
Filed Under (Multicore, Parallel Computing) by jpormann on 04-12-2009

Intel recently announced a “concept chip” that includes 48 x86 CPU-cores on a single piece of silicon.  This is really just another step in the recent trend from dual-core to quad-core and even 6- and 8-core CPUs.  Intel had previously announced a “research prototype” with 80 cores.

Oct
30
Filed Under (Multicore, Parallel Computing) by jpormann on 30-10-2009

The industry continues to add more computational power to chips through increased levels of parallelism.  The buzzword used to be “multi-core” (now used for dual- and quad-core CPUs); now it is moving to “many-core” (16, 32, or even 100 cores per chip).

Intel’s Larrabee architecture looks to be the first candidate (available sometime in 2010) and should provide up to 32 cores per chip.  Its initial versions will be targeted at discrete graphics, but it is expected to migrate into the HPC market.

A relatively new entrant, Tilera, already offers 36- and 64-core chips, and is working on a 100-core chip — that’s 100 processing units on a single piece of silicon, a cluster on a chip!