Since solving linear equation systems is a very common computational task, a short overview of the available third party solutions shall be given.
The Basic Linear Algebra Subprograms BLAS provide high quality building block routines for performing basic vector and matrix operations: Level 1 BLAS for vector-vector operations, Level 2 BLAS for matrix-vector operations, and Level 3 BLAS for matrix-matrix operations [18]. Since BLAS libraries are efficient, portable, and widely available for many different platforms, many mathematical developments are based on them.
The original objective of the Linear Algebra PACKage LAPACK, which is based on the BLAS library, was to provide a version of the widely used EISPACK [202] and LINPACK [53] libraries that runs efficiently on shared memory vector and parallel processors. The bottle necks of these modules are the memory access patterns disregarding the multi-layered memory hierarchies of the machines. Thus, too much time is spent for data manipulation rather than for doing useful floating-point operations. LAPACK addresses this problem by making use of block matrix operations, such as matrix multiplication, in the innermost loops. Since these block operations can be optimized for various architectures to account for the memory hierarchy, a way to achieve high efficiency on diverse modern machines was found [13]. The Scalable LAPACK (SCALAPACK) library provides a subset of LAPACK routines for distributed memory parallel computers [21]. Whereas SCALAPACK can be applied for dense systems, a library for parallel linear algebra computation on sparse matrices is presented in [64]. That PSBLAS package addresses the parallel implementation of iterative solvers and is designed for distributed memory computers.
The NAG libraries of the Numerical Algorithms Group contain a wide range of robust numerical and statistical routines, for example for linear algebra, eigenvalue analysis, and differential equations [155]. The libraries are offered in FORTRAN 77 and 90, C, as well as an SMP Library for shared memory and a Parallel Library for distributed memory parallel computing.
The IMSL libraries of Visual Numerics provide accurate and reliable FORTRAN algorithms with full coverage of mathematics and statistics. According to [226], the libraries are claimed to be a cornerstone of high-performance and deep computing as well as predictive analytics applications in science, technical and business environments for well over three decades. IMSL stands for International Mathematical and Statistical Library.
The HSL Software Library, formerly known as the Harwell Subroutine Library and commercially distributed by Hyprotech UK Ltd, offers a collection of portable, fully documented FORTRAN packages. Besides the commercial distribution, there are special arrangements for licensing to academic users. The library can be particularly applied for sparse matrix computations and large-scale optimization. HSL routines can be found in advanced software applications such as for chemical engineering and finite element modeling.
Intel offers the Math Kernel Library MKL to provide highly optimized thread-safe mathematical routines for High-Performance Computing (HPC), science, engineering and financial applications, which are able to take advantage of the maximum performance on Intel processors [107]. The key features of the MKL are its optimization for recent Intel platforms, an automatic run-time detection of the CPU actually used, scaling on multi-processor environments, thread-safety, and royalty-free distribution of the run-time library. The Linux non-commercial development license is also free. The functionality embraces not only linear algebra (BLAS and LAPACK), but also discrete Fourier transforms, vector mathematics and a vector statistical library with random number generators.
The AMD Core Math Library ACML for Linux and Windows offers BLAS, LAPACK and FFT routines, which can be applied by a wide range of software developers to obtain excellent performance on AMD platforms. The highly optimized library provides numerical functions for mathematical, engineering, scientific and financial applications. ACML is available both as a 32-bit and 64-bit library which is able to fully exploit the large memory space and improved performance offered by new AMD 64-bit architectures [3].
Applications running on IBM pSeries computers can take advantage of optimized ESSL (Engineering and Scientific Subroutine Library) routines. This allows both to reduce programming requirements and performance improvements. The library provides optimized versions of dense matrix kernels contained in the BLAS and LAPACK libraries. Another approach to improve the performance of applications is the IBM Mathematical Acceleration SubSystem Library MASS. It provides high-performance versions of a subset of intrinsic functions while sacrificing a small amount of accuracy [2].
In [190], an overview of parallel frontal solvers for large sparse linear systems can be found. These codes were implemented in the HSL package. As a variant of Gaussian elimination, the frontal method was originally developed [109] to solve large linear equation systems. Because the computational work is limited to a relatively small frontal matrix, large problems could be solved even on computers with small memories. The assembly and elimination procedure were combined, and the additional data was written to secondary storage systems. Today the motivation for frontal solvers arise from the fact, that the frontal matrices can be stored in full format [191]. This allows one to employ high-level BLAS routines. Furthermore, multiple front methods were developed, which can simultaneously process subproblems, and allow to parallelize the factorization.
A new parallel direct solver for large sparse linear systems is presented in [56], which is also incorporated in the HSL package as the HSL_MP48 algorithm. This algorithm is designed to solve highly unsymmetric systems by employing several processors, typically up to 16. It is based on the serial code MA48, which implements a sparse variant of Gaussian elimination with sparse data structures and threshold pivoting. Since MA48 stores the matrix and its factorization in main memory, HSL_MP48 can parallelize the problem in order to reduce the memory demand as well as to speed up the factorization process. The parallel approach is based on partitioning the system matrix into a small number of loosely connected submatrices (singly bordered block diagonal form [56]). A modified version of MA48 is then used to factorize all matrices which can be done in parallel. The interface problem can be solved by any sparse solver. Three advantages in comparison to a general parallelization of a sparse solver are given: all processors can be preassigned all data in advance, the factorizations of the submatrices can be done in parallel, and interprocessor communication as required for the interface problem is limited and structured. For the majority of test cases arising from chemical process engineering problems HSL_MP48 outperforms MA48 if two processors are employed. For some problems and eight processors speed-ups in excess of four are achieved. Since the code does not require a single file system and is based on MPI [159], it can be applied both on shared memory and distributed memory machines (see Section 5.3.1).
The comparison in [93] involves different strategies mainly regarding the numerical factorization and the target architecture. However, it is concluded that improvements in all phases of solving sparse linear equation systems have been achieved, which are the main reasons for the significant speed-up already observed.
The following solvers are compared [93]:
Two additional modules are PARDISO [182] and SAMG [37]. The former is also provided by DESSIS [111] and the MKL [107]. The latter is industrially employed for computational fluid dynamics as used in the car industry, oil reservoir simulations, electromagnetics, groundwater flows, semiconductor physics, and structural mechanics [213]. Both packages are discussed later in this chapter in a more detailed way, see Section 5.3.2 and Section 5.3.3, respectively.
The Boeing Math Group commercially provides the Intelligent Iterative Solver Service IISS, which is based on the performance considerations regarding the advantage of iterative solvers for large sparse linear equation systems. The optimal selection of an iterative solver varies and these methods are also known to suffer from irregular performance in different applications. Furthermore, it can be difficult for users to choose the correct parameters required for performance. For that reason the Boeing Math Group provides a collection of state-of-the-art iterative solvers [24].
This collection contains BCSLIB-EXT [23], a preconditioned iterative solver, which has been also used for semiconductor device simulation. Hypre [63], a scalable software for solving large, sparse linear equation systems on massively parallel computers, and AZTEC [222], a massively parallel iterative solver library with advanced partitioning techniques and dense matrix algorithms, are additionally available [24].
Trilinos is a project aiming to provide parallel solver algorithms and libraries for the solution of large-scale, complex multi-physics engineering and scientific applications. Since an object-oriented software framework is developed, Trilinos is based on packages [180]. Each package is focused on important, state-of-the-art algorithms in its problem regime, developed by a small expert group, self-contained, as well as configurable, buildable and documented on its own [61]. For example, Trilinos supports the object-oriented version of AZTEC, SuperLU, Hypre, HSL and many more. In addition, three specific package collections are provided: one for linear algebra classes, one for abstract solver classes, and one for basic tools like BLAS or MPI interfaces.
The Automatically Tuned Linear Algebra Software ATLAS provides portable linear algebra software, including a complete C or FORTRAN 77 BLAS API and a very small subset of the LAPACK API. For all supported operations, ATLAS achieves performance comparable with machine-specific tuned libraries [224,238]. Thus, ATLAS can be used as an alternative to the BLAS library.
Blitz++ is a C++ class library for scientific computing showing a comparable performance with FORTRAN 77/90 [157]. The library incorporates dense arrays and vectors, random number generators and small vectors and matrices. It is available under an open source license. The basic idea is to use the advanced features of the C++ programming language while avoiding the performance drawback. The speed-up is not achieved by better compiler optimization and related measures, but by using templates, which allow to perform optimizations such as loop fusion, unrolling, tiling, and algorithm specialization automatically at compile time. Thus, instead of compiler optimizations, the Blitz++ library parses and analyzes array expressions at compile time and performs respective loop transformations in order to speed-up the code. In addition, Blitz++ extends the conventional dense array model to incorporate new and useful features, such as flexible storage formats, tensor notation and index placeholders [157].
Finally, there is Boost++, a set of free portable C++ source libraries [27] which fully comply to the C++ standard and work well with the C++ standard library. It is based on C++-templates and the system is said to be included into the next C++ standard. One template class library regards linear algebra: UBLAS provides BLAS functionalities for dense, packed, and sparse matrices.
The design and implementation is based on operator overloading and efficient code generation via templates. Classes are provided for dense, unit, and sparse vectors, dense, identity, triangular, banded, symmetric, hermitian, and sparse matrices. Ranges, slices, and adaptor classes can be defined to view into vectors and matrices. The library provides reductions like different norms, addition and subtraction of vectors and matrices and multiplication with a scalar, inner and outer products of vectors, matrix-vector and matrix-matrix products and triangular solver. The implementation is mostly STL conform including the iterator interface [26]. The known limitations are that the implementation assumes a linear memory address model and that dense matrices have been the tuning focus.
As can be easily seen, there is a large variety of solver systems and packages available and many of them are still developed and improved. As they are highly optimized, particularly for specific platforms, the two-strategy approach for the in-house solver module presented here seems to be promising. On the one hand side own implementations are provided, on the other hand side the application can benefit from external codes which are directly coupled to the solver module.
However, the integration of external modules is restricted by purpose and does not embrace the full spectrum of available systems presented above. By now, only two external modules with specific properties have been made available. The choice of the external modules depends on the following considerations: