Scientific computation has become an integral part of most fields of modern research. Apart from the derivation and analysis of the models underlying the computation – as has been done in Chapters 3 and 4 for the WEMC method – the feasibility and optimization of the numerical calculations is paramount to obtain simulation results in a reasonable amount of time. Parallel computation is now the primary method to speed up, or even enable, computations on systems ranging from single workstations with multi-core CPUs to large supercomputers consisting of thousands of nodes. The efficient utilization of high-performance computing resources requires consideration of the system architecture on both a physical (hardware) and a logical (software) level. The most common system architectures for high-performance computing are outlined in the following.
A shared-memory system refers to an architecture in which multiple CPUs share a common memory space, i.e. every CPU can access the entire memory. The simplest scenario is a single workstation with a multi-core CPU, where each core has access to the same memory space. In terms of hardware, a shared-memory architecture is usually limited to a single computation node, which limits the total number of CPUs and the amount of memory that can be used. Although supercomputers exist that provide a (logical) shared memory space distributed over many computation nodes, they are not widespread.
The advantage of a shared memory space is that no replication of data is required and data can be accessed/exchanged between different CPUs with little to no additional communication latency. The disadvantage is that special care must be taken to avoid race conditions and to ensure that the computations of one CPU do not inadvertently affect the data/variables another CPU is using – the computations are not isolated.
The parallelization of code in a shared-memory architecture usually makes use of computation threads, which can be implemented with OpenMP [146], for instance. This allows a simple parallelization of for-loops and other control structures commonly used in programming, as sketched in the following example.
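As a minimal sketch (not taken from the WEMC code), the following C program distributes the iterations of a for-loop over the available threads with an OpenMP pragma; the array name, loop bounds and the computed quantity are chosen purely for illustration, and the code assumes a compiler with OpenMP support (e.g. compiled with -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    static double x[1000000];
    double sum = 0.0;

    /* The loop iterations are distributed among the threads; the shared
       array x resides in the common memory space. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] = 1.0 / (i + 1.0);

    /* The reduction clause gives each thread a private copy of sum and
       combines the partial results at the end, avoiding a race condition. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i];

    printf("sum = %f (computed with up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}
```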
A distributed-memory system refers to an architecture in which each CPU has its own dedicated memory space, which cannot be directly accessed by other CPUs. Distributed-memory systems are most common in large-scale supercomputers, which consist of multiple computation nodes connected by network interfaces. A distributed-memory environment can also be emulated on a single workstation, where the same physical memory is logically separated.
The advantage of a distributed-memory system is that vast computational resources can be utilized to solve problems which would otherwise be computationally intractable on a single node/workstation. The disadvantage is that the network interfaces are very slow compared to the CPU-RAM bus, and accessing data on other nodes incurs a communication latency which can be a significant factor in the performance of parallel code.
The parallelization of code for distributed-memory systems requires special consideration of the data structures and the timing of data communication. The communication of data between processes is commonly handled using the message passing interface (MPI) [147]. The programmer is forced to explicitly consider these aspects in the design of the algorithms, which makes the code more robust when scaling up to many CPUs. A minimal example is sketched below.
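The following C sketch illustrates this programming model (the identifiers and the toy problem are again chosen purely for illustration): each MPI process computes a partial sum over its own portion of an index range, and the partial results are combined on one process with a collective reduction. The data exchange has to be requested explicitly, since no process can read another process's memory directly.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process handles only its own part of the (virtual) global
       index range [0, N); its data is invisible to the other processes. */
    const long N = 100000000;
    double local_sum = 0.0;
    for (long i = rank; i < N; i += size)
        local_sum += 1.0 / (i + 1.0);

    /* Explicit communication: the partial sums are combined on rank 0. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (computed by %d processes)\n",
               global_sum, size);

    MPI_Finalize();
    return 0;
}
```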
The shared- and distributed-memory approaches can be combined into what is referred to as a hybrid system. One can, for instance, use parallelization by threads on a single node (shared memory) and MPI communication between the nodes (distributed memory), as sketched below. The optimal combination depends on the computational problem and the system on which it is run.
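A sketch of such a hybrid scheme, combining the two previous examples, is given below; it assumes an MPI library that supports the MPI_THREAD_FUNNELED threading level, and the problem is again purely illustrative.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    /* Request an MPI threading level that allows OpenMP threads inside
       each MPI process (the threads themselves do not call MPI here). */
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 100000000;
    double local_sum = 0.0;

    /* Shared memory within a node: OpenMP threads accumulate local_sum
       via a reduction. Distributed memory between nodes: MPI combines
       the per-process results. */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = rank; i < N; i += size)
        local_sum += 1.0 / (i + 1.0);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (%d processes, up to %d threads each)\n",
               global_sum, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```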
Accelerator cards are enjoying increased popularity in scientific computing as they offer significant raw computation power at low cost and energy consumption. Examples of accelerator cards are dedicated graphics adapters, like the NVIDIA Tesla, or co-processors, like the Intel Xeon Phi. Accelerators are especially useful for problems where many calculations have to be performed on relatively little data, since communication to obtain data is slow due to the limited bandwidth of the bus connecting the card to the host memory.
Various frameworks, such as OpenCL, are available for programming with accelerator cards in mind. A simple example of an OpenCL kernel is given below.
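To give a flavour of this programming model, the following sketch shows a simple OpenCL kernel written in OpenCL C (a dialect of C); the kernel name and its arguments are chosen purely for illustration. Each work-item processes a single element of the arrays, and the host program (not shown) is responsible for selecting a device, compiling the kernel, transferring the input buffers to the accelerator's memory and copying the result back over the comparatively slow bus.

```c
/* OpenCL C kernel: each work-item (a lightweight thread on the accelerator)
   computes one element of the output array. The buffers a, b and c must be
   transferred to the device memory by the host program beforehand. */
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *c,
                         const int n)
{
    int i = get_global_id(0); /* global index of this work-item */
    if (i < n)                /* guard against excess work-items  */
        c[i] = a[i] + b[i];
}
```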