In the course of this work, the performance of these solvers has been evaluated. Rather than using a set of single matrices, the approach is based on complete simulations with consistent settings, as typically encountered in daily work.
In mathematical publications, different solver systems are normally compared by applying them to single linear equation systems. For example, the Matrix Market [139,25] provides such a data repository for use in comparative studies of linear algebra algorithms. The repository provides not only matrices, but also matrix generation software from a wide variety of scientific and engineering disciplines. For each matrix and matrix set, details regarding the matrix properties and a visualization of the matrix structure are provided. The matrices can be downloaded in several different text file formats. A matrix generator is either static software for download, a Java applet, or a form-based web service that processes requests for generating matrices. In 2000, 482 individual matrices and 25 matrix generators were available. The database includes the Harwell-Boeing Sparse Matrix Collection (Release I) [55], the SPARSKIT collection [178], and the Non-Hermitian Eigenvalue Problem collection [14].
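To make the text-file exchange concrete, the following is a minimal sketch in C of a reader for the Matrix Market coordinate format (real, general): a banner line and comment lines start with '%', followed by a size line and one one-based triplet per line. Symmetric and pattern matrices would need additional handling; error handling is reduced to a single return code. The function name is purely illustrative.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

/* Minimal reader for the Matrix Market "coordinate real general"
 * text format. On success the caller owns the three arrays. */
int read_mm_coordinate(const char *path, int *rows, int *cols, int *nnz,
                       int **ri, int **ci, double **val)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    do {                                   /* skip banner and comments */
        if (!fgets(line, sizeof line, f)) { fclose(f); return -1; }
    } while (line[0] == '%');
    if (sscanf(line, "%d %d %d", rows, cols, nnz) != 3) {
        fclose(f); return -1;
    }
    *ri  = malloc(*nnz * sizeof(int));
    *ci  = malloc(*nnz * sizeof(int));
    *val = malloc(*nnz * sizeof(double));
    for (int k = 0; k < *nnz; ++k) {
        if (fscanf(f, "%d %d %lf", *ri + k, *ci + k, *val + k) != 3) {
            fclose(f); return -1;
        }
        (*ri)[k]--; (*ci)[k]--;            /* one-based -> zero-based  */
    }
    fclose(f);
    return 0;
}
\end{verbatim}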
Such an approach is particularly useful for the evaluation of new mathematical algorithms: the conclusions drawn from the results are used to improve the respective implementations. From the perspective of implementing a general-purpose simulator, however, the objective is quite different. Here, a set of solvers is available and the one best suited to the respective problem has to be chosen. In contrast to solver development, research on this topic is restricted to practical evaluations with the respective simulators.
The 16 examples summarized in Table 5.1 were taken from current scientific projects at the institute. They comprise field-effect, bipolar, and silicon-on-insulator transistors. Structures such as FinFETs are analyzed by means of three-dimensional simulations.
All examples were simulated on the same set of computer systems; an IBM cluster and an Intel-based machine are referenced in the following discussion.
The extensible benchmark is processed by a single program, which has been implemented as a SEILIB application (see Section C.3). The real (wall-clock) and user times are measured with the GNU time command, and the fastest of three consecutive runs is taken.
Table 5.2 contains information on how the simulation time is actually spent. The simulator was started on the IBM cluster and the in-house ILU-BICGSTAB solver was selected. The first columns show the number of the example, the dimensions of the inner and complete systems, and the ratio of the inner to the complete dimension. A lower ratio indicates that more equations are pre-eliminated and thus more time is spent on pre-elimination. The CPU column shows the absolute user time required for the respective example. The remaining columns show the shares of the simulation time spent in selected modules or on selected tasks: initialization, pre-processing, assembling the linear equation systems (including the calculation of the model contributions), the linear modules including all steps shown in Table 5.3, the update of the quantities, and finally the post-processing.
The most important data in the context of this chapter is how much time is spent on the linear system after the assembly has been completed (the Lin column). For the smaller two-dimensional examples that share is below 50%, which means that more time is spent on calculating and assembling the model contributions as well as on pre- and post-processing. However, for all other simulations, including the larger two-dimensional ones, that share rises significantly, in some cases to more than 90%. Thus, for increasing the efficiency of advanced device simulations and optimizations, it is essential to reduce the relative effort spent on preparing and solving the linear equation systems.
In addition, the effort of the conversion to alternative sparse matrix formats was evaluated. Since the time required for a single conversion is too short to be resolved by comparing complete run times, no significant information can be obtained this way. However, in order to give quantitative information on the negligibility of the matrix conversions, the input and output system of the assembly module was used.
The assembly module has been equipped with a comprehensive input and output system which enables the user to have the assembled complete and inner linear equation systems read from or written to files. These features can be used for debugging and quality assessment purposes. Furthermore, different sets of parameters or alternative solvers can be efficiently tested. Whereas the simulator is normally used to calculate the entries of the linear equation system, a test program is provided which reads the files and starts the solver.
In order to quantify the effort of the matrix conversions, the first linear equation system assembled for the three-input NAND gate (dimension 219,920, 1,925,553 non-zeros) was written to a file. Afterwards, it was read by a modified version of this test program; the modification concerns only the activation of one type of matrix conversion. The conversion time is measured with the gettimeofday function, and the minimum time of three consecutive runs is taken. One conversion to CSR variant 1 takes 38.212 ms, 266 conversions take 9,184.06 ms. For CSR variant 2, the times are 38.955 ms and 8,798.3 ms, respectively. Since the complete simulation takes 8,468.19 s, the share of all conversions is 0.108% or 0.104%, respectively. These very small shares show that the effort of the conversions can indeed be neglected.
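Since the details of the in-house formats and of CSR variants 1 and 2 are not reproduced here, the following is a minimal sketch, assuming a plain coordinate (triplet) input and the standard count/prefix-sum/scatter construction of CSR. The timing follows the measurement procedure described above (gettimeofday, minimum of three runs); the tiny matrix in main is purely illustrative.

\begin{verbatim}
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/* Standard triplet-to-CSR conversion: count the entries per row,
 * build the row pointers by a prefix sum, then scatter the entries. */
static void coo_to_csr(int n, int nnz,
                       const int *ri, const int *ci, const double *v,
                       int *row_ptr, int *col_idx, double *val, int *next)
{
    memset(row_ptr, 0, (n + 1) * sizeof(int));
    for (int k = 0; k < nnz; ++k)          /* entries per row         */
        row_ptr[ri[k] + 1]++;
    for (int i = 0; i < n; ++i)            /* prefix sum -> offsets   */
        row_ptr[i + 1] += row_ptr[i];
    memcpy(next, row_ptr, n * sizeof(int));
    for (int k = 0; k < nnz; ++k) {        /* scatter into CSR arrays */
        int p = next[ri[k]]++;
        col_idx[p] = ci[k];
        val[p] = v[k];
    }
}

static double wall_ms(void)                /* wall-clock time in ms   */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e3 + tv.tv_usec * 1e-3;
}

int main(void)
{
    /* Illustrative 3x3 system; the actual measurement used the
     * 219,920-dimensional nand-gate system read from a file. */
    int    ri[] = {0, 0, 1, 2, 2}, ci[] = {0, 2, 1, 0, 2};
    double v[]  = {4.0, -1.0, 5.0, -2.0, 3.0};
    int n = 3, nnz = 5, row_ptr[4], col_idx[5], next[3];
    double val[5], best = 1e30;

    for (int run = 0; run < 3; ++run) {    /* minimum of three runs   */
        double t0 = wall_ms();
        coo_to_csr(n, nnz, ri, ci, v, row_ptr, col_idx, val, next);
        double t1 = wall_ms();
        if (t1 - t0 < best) best = t1 - t0;
    }
    printf("conversion time: %.6f ms\n", best);
    return 0;
}
\end{verbatim}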
The complex-valued variant of this test program was also modified to evaluate another interesting issue. As outlined in Section 2.6.2, one can split a complex-valued equation system into a double-sized real-valued one in order to employ a real-valued assembly and solver system also for complex-valued linear equation systems. Regarding the memory consumption, this approach has disadvantages, which shall be quantified in the following discussion. Three examples were taken and the small-signal systems for 1 GHz were written to files. These files were read by the test program, once by the original version employing the complex-valued solver system and once by the modified version. Thus, this evaluation approach also emulates the different assembling processes of the respective matrices. The resulting memory consumption of the two versions was then compared.
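For reference, one common arrangement of this splitting (Section 2.6.2 may use an equivalent variant) reads:
\[
(A_r + \mathrm{i}A_i)(x_r + \mathrm{i}x_i) = b_r + \mathrm{i}b_i
\quad\Longleftrightarrow\quad
\begin{pmatrix} A_r & -A_i \\ A_i & A_r \end{pmatrix}
\begin{pmatrix} x_r \\ x_i \end{pmatrix} =
\begin{pmatrix} b_r \\ b_i \end{pmatrix}.
\]
In a CSR-like storage, each complex non-zero (two real values, one column index) gives rise to four real entries with four column indices, so the value memory roughly doubles and the index memory roughly quadruples, which illustrates the disadvantage addressed by this comparison.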
Several test runs of the in-house GMRES(M) solver were performed and the cumulated number of mathematical operations as a function of the restart parameter M was recorded.
Since values of M smaller than 16 cause errors for some three-dimensional and mixed-mode simulations, the commonly used default has to be set to a value higher than 16. In Figure 5.5, all values are scaled to those of M = 16. The following ratios are shown: minimum, maximum, and average of the number of mathematical operations, the user time, and the memory consumption. Based on the solid black line with symbols for the average user time, a default value shall be selected which can be commonly set for all kinds of simulations. The first part of that curve decreases between M = 16 and M = 70 down to a level of approximately 55%; afterwards, it slightly increases again. One reason for this might be the higher memory consumption. The curve for the mixed-mode device/circuit simulations has a similar shape, but its increasing part starts somewhat earlier. Finally, the curve for the two-dimensional simulations shows a counterproductive effect of increasing M between 70 and 75.
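The memory argument can be made plausible: a textbook GMRES(M) implementation stores M+1 Krylov basis vectors of length n (n being the system dimension) in addition to a small Hessenberg matrix, so that approximately
\[
(M+1)\,n + \mathcal{O}(M^2)
\]
floating-point values are required, i.e., the memory consumption grows roughly linearly with M.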
One can thus conclude that a value between 50 and 70 is advantageous for all kinds of simulations, provided that enough memory is available. For that reason, the default value of M was chosen from this range. As this value can easily be controlled by the user, different requirements depending on the simulation and the memory resources can be met. In addition, it should be noted that the simulator itself can automatically adjust this value if the respective information is available.
The objective of the performance evaluation is to compare the simulation times required by MINIMOS-NT depending on the solver systems selected. Note that the solver hierarchy (see Section 5.4) was deactivated. All three in-house solvers as well as the two PARDISO and four SAMG solvers are evaluated, although not all examples could be solved by all solver systems.
In Figure 5.6, a comparison of different solvers on the Intel computer is given. Due to the large differences in simulation time, all times are scaled to those of the in-house ILU-BICGSTAB in the center of the graph. Interesting results are the superiority of the advanced implementations of the LU factorization for circuits and of the iterative solvers for three-dimensional devices. The in-house GMRES(M) solver also has advantages for circuits, whereas the in-house direct solver on the left-hand side can in fact only be used for quality assessment of two-dimensional simulations. Whereas the iterative SAMG solvers show significant performance advantages of up to 35% in single cases, the multi-level algorithms still require some effort to become generally applicable.
To show the relative impact of multi-threading, the PARDISO-LU solving time ratios (relative to the single-threaded version) are plotted against the number of processors/threads in Figure 5.7. For the three-dimensional examples, the PARDISO-LU-CGS ratios are also given. Besides the real (wall-clock) time required for solving the examples, Figure 5.7 also shows the cumulated user (CPU) times, which increase due to the parallelization overhead. Whereas for two-dimensional device and circuit simulations too many processors can even be counterproductive, for three-dimensional simulations the marginal utility of additional processors diminishes drastically. Thus, for the average simulation, four processors should be sufficient. Especially under scarce conditions, for example during optimizations, assigning two tasks per node of eight processors appears to minimize the real-time effort.
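The diminishing returns are consistent with a simple Amdahl-type estimate (an illustrative model, not a measurement from this benchmark): if the fraction $p$ of the solving time is parallelizable, the speedup on $N$ processors is bounded by
\[
S(N) = \frac{1}{(1-p) + p/N},
\]
so that, for example, $p = 0.9$ yields $S(4) \approx 3.1$ but only $S(8) \approx 4.7$; each additional processor contributes less than the previous one.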
The iterative methods still show a significant performance advantage over the direct solvers. However, the 1983 quotation ``In 3D sparse direct solution methods can safely be labeled a disaster'' [16] describes the experiences (with regard to both time and memory consumption) with the in-house LU solver, but does not cover the recent developments. Especially for mixed-mode device/circuit simulations, the advanced direct methods show a significant performance advantage, even up to the highest dimensions.
The same concept as discussed in the last section is used to evaluate solvers for higher-order transport models. However, instead of the devices presented above, double-gate MOSFETs with different gate lengths were used, see Table 5.5.
The transport models as derived in Section 2.1.3 were used and compared on the Intel computer: the six moments transport model (SM), the energy-transport model (ET), and the drift-diffusion model (DD). Three different simulation tasks were performed.
As concluded in the evaluation of the last section, several solvers are better suited for ill-conditioned problems. In contrast to the drift-diffusion model, simulations based on the energy-transport and six moments models tend to be more ill-conditioned, since the underlying simulation setup is more sensitive to the mesh. As adaptive meshes are required, ongoing research outside the scope of this thesis aims at improving this situation. However, to some extent, solver systems better equipped for ill-conditioned problems can significantly reduce the solving effort. The objective of this evaluation is to confirm the respective conclusions of the last section and to give quantitative information on this.
The results for the two in-house iterative solvers BICGSTAB and GMRES(M) as well as those of the PARDISO solver are presented in Figure 5.8. Due to the more ill-conditioned problems and the small dimensions of the linear equation systems, the direct solver significantly outperforms the two in-house solvers for the higher-order transport models. For the same reasons, the GMRES(M) solver also shows advantages over BICGSTAB. However, for the steady-state drift-diffusion simulations, the advantage of alternative solvers shrinks, whereas for small-signal simulations PARDISO is still up to 50% faster than the in-house ILU-BICGSTAB.