The paradigm shift from single- to multi- and many-core computing architectures places additional emphasis on the development of parallel algorithms. In particular, the use of graphics processing units (GPUs) for general-purpose computations poses a considerable challenge due to the high number of threads executed in parallel. Even the implementation of standard algorithms such as dense matrix-matrix multiplication requires considerable sophistication in order to utilize the vast computing resources of GPUs efficiently [70].
In the context of the discretized SHE equations (5.9) and (5.16), the assembly can be carried out in parallel for each box and each adjoint box, provided that a suitable storage scheme for the sparse system matrix is chosen. Similarly, the elimination of odd-order unknowns from the system matrix can be carried out in parallel, since the procedure is applied separately to each row of the system matrix associated with a box. The iterative solvers essentially rely on sparse matrix-vector products, inner products, and vector updates, all of which can be run in parallel by employing parallel reduction schemes. However, the additional need for preconditioners is a hindrance to full parallelization, because the design of good parallel preconditioners is very challenging and typically problem-specific [104].
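As an illustration of the solver kernels mentioned above, the following sketch shows a sparse matrix-vector product in compressed sparse row (CSR) format with the rows distributed over a fixed number of CPU threads. It is a minimal example; the data layout, thread count, and names are assumptions and do not reflect the actual simulator code:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal CSR storage: row_start has size rows+1, columns/values hold the nonzeros.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_start;
    std::vector<std::size_t> columns;
    std::vector<double>      values;
};

// y = A * x, with the rows of A distributed over 'num_threads' CPU threads.
// The caller must size y to A.rows entries.
void parallel_spmv(const CsrMatrix& A, const std::vector<double>& x,
                   std::vector<double>& y, unsigned num_threads = 4)
{
    auto worker = [&](std::size_t row_begin, std::size_t row_end) {
        for (std::size_t i = row_begin; i < row_end; ++i) {
            double sum = 0.0;
            for (std::size_t k = A.row_start[i]; k < A.row_start[i + 1]; ++k)
                sum += A.values[k] * x[A.columns[k]];
            y[i] = sum;   // each thread writes disjoint rows, hence no synchronization is needed
        }
    };

    std::vector<std::thread> threads;
    std::size_t chunk = (A.rows + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = std::min(A.rows, begin + chunk);
        if (begin < end)
            threads.emplace_back(worker, begin, end);
    }
    for (auto& th : threads)
        th.join();
}
```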
In this chapter a parallel preconditioning scheme for the SHE equations is proposed and evaluated. The physical principles on which the preconditioner is based are discussed in Sec. 7.1, and additional numerical improvements for the system matrix are given in Sec. 7.2. The preconditioner scheme is then proposed in Sec. 7.3 and evaluated for a simple diode structure in Sec. 7.4. Even though parallel preconditioners suitable for GPUs have been implemented recently, cf. e.g. [33, 40], their black-box nature prevents full use of all available information. The preconditioner scheme presented in the following incorporates this additional information into otherwise black-box preconditioners.
The scheme is derived for a given electrostatic potential, as is, for instance, the case within a Gummel iteration, cf. Sec. 2.4. A block-preconditioner for the Newton scheme is obtained by concatenating a preconditioner for the Poisson equation, the SHE preconditioner presented in the following, and a preconditioner for the continuity equation of the other carrier type.
As discussed in Chap. 3, carriers within the device can change their total energy only by inelastic scattering events; the scattering operator is thus responsible for coupling different energy levels. If only elastic scattering processes are considered, the total energy of the particles remains unchanged and the different energy levels do not couple, cf. Fig. 7.1. Therefore, in a SHE simulation using only elastic scattering, the resulting system of linear equations decouples into as many independent problems as there are discrete energy levels. Such a decomposition has been observed already in early publications on SHE [21, 105] in a slightly different setting: If the grid spacing with respect to energy is a fraction of the optical phonon energy, say one k-th of it for an integer k, then the system decomposes into k decoupled systems of equations. This observation, however, is of rather limited relevance in practice, since different phonon energies for inelastic scattering may be employed simultaneously, cf. Sec. 3.3, hence the system no longer decouples into a number of independent systems large enough to scale well to a higher number of cores.
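The decoupling for such a grid spacing can be seen with a short argument; the symbols below (energy grid points ε_j, phonon energy ℏω_op, divisor k) are generic placeholders rather than the notation of the original text:

```latex
\varepsilon_j = \varepsilon_0 + j\,\Delta\varepsilon, \qquad
\Delta\varepsilon = \frac{\hbar\omega_{\mathrm{op}}}{k}, \qquad
\varepsilon_j \pm \hbar\omega_{\mathrm{op}} = \varepsilon_{j \pm k}.
```

Inelastic scattering therefore couples the unknowns at energy index j only to those at j ± k, so the k residue classes of j modulo k never mix and yield k mutually independent systems of equations.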
Fig. 7.1: (a) Electron trajectories without inelastic scattering. (b) Electron trajectories with inelastic scattering.
It is assumed throughout the following investigations that all unknowns of the discrete linear system of equations referring to a certain energy are enumerated consecutively. A simple interpretation of the system matrix structure is obtained if first all unknowns associated with the lowest energy are enumerated, then all unknowns associated with the second-lowest energy, and so forth. The unknowns for a certain energy can be enumerated arbitrarily, even though an enumeration as in Sec. 6.1 eases the understanding of the system matrix structure.
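A minimal sketch of such an energy-major enumeration is given below; the helper names are hypothetical and merely illustrate how a global unknown index is composed from an energy index and a local index within that energy level:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Offsets of the first unknown of each energy level in the global enumeration.
// unknowns_per_energy[i] holds the number of unknowns at the i-th discrete energy.
std::vector<std::size_t> energy_offsets(const std::vector<std::size_t>& unknowns_per_energy)
{
    std::vector<std::size_t> offsets(unknowns_per_energy.size() + 1, 0);
    std::partial_sum(unknowns_per_energy.begin(), unknowns_per_energy.end(),
                     offsets.begin() + 1);
    return offsets;
}

// Global index of the 'local'-th unknown at the 'energy'-th discrete energy level.
std::size_t global_index(const std::vector<std::size_t>& offsets,
                         std::size_t energy, std::size_t local)
{
    return offsets[energy] + local;
}
```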
The scattering of carriers is a random process in the sense that the time between two collisions of a particle is random. Equivalently, the mean free path denotes the average distance a carrier travels before it scatters. As devices are scaled down, the average number of scattering events of a carrier while moving through the device decreases. On the algebraic level of the system matrix, a down-scaling of the device thus leads to a weaker coupling between different energy levels. This can be reasoned as follows: Consider a one-dimensional device using spherical energy bands, consisting of two boxes and the adjoint box in between, and consider the discretization (5.9) and (5.16). In the following, only the proportionality with respect to the box interface area and the box volumes is of interest; since only the asymptotic behavior matters, all signs are taken positive. According to (5.9) and (5.16), the entries of the discrete system matrix then consist of terms proportional to the interface area, stemming from the free-streaming fluxes across box interfaces, and of terms proportional to the box volumes, stemming from the scattering operator. A sketch of the resulting scaling behavior is given below.
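The down-scaling argument can be made explicit with a short estimate. The symbols below (interface area A, box volume V, characteristic device length L, spatial dimension d, area-type terms a, volume-type terms v) are generic placeholders rather than the notation of the original text:

```latex
A \propto L^{d-1}, \qquad V \propto L^{d}
\quad\Longrightarrow\quad
\frac{v}{a} \;\propto\; \frac{V}{A} \;\propto\; L .
```

Hence the volume-type scattering contributions, and with them the coupling between different energy levels, lose weight relative to the area-type free-streaming contributions as the device length L is reduced.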
With the enumeration of unknowns as suggested in the previous section and using only a single inelastic phonon energy, the structure of the system matrix consists of three block-diagonals. The location of the off-diagonal blocks depends on the spacing of the discrete energies relative to the phonon energy. If the energy spacing equals the phonon energy, the matrix is block-tridiagonal, cf. Fig. 7.2.
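Schematically, and in generic block notation not taken from the original, the block-tridiagonal case reads:

```latex
S \;=\;
\begin{pmatrix}
D_1 & U_1 &        &        \\
L_2 & D_2 & U_2    &        \\
    & L_3 & D_3    & \ddots \\
    &     & \ddots & \ddots
\end{pmatrix},
```

where D_i collects the couplings among the unknowns at the i-th energy level, and L_i, U_i contain the inelastic couplings to the next-lower and next-higher energy level, respectively.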
The asymptotic exponential decay of the distribution function with energy induces a considerable asymmetry of the system matrix when inelastic scattering is considered. This asymmetry, however, is required in order to ensure a Maxwell distribution in equilibrium. Consider the scattering rate (3.25) and the SHE equations for the scattering operator only. The resulting equations for the zeroth-order expansion coefficients using spherical energy bands, cf. (7.4), consist of four terms after the symmetric scattering rate has been cancelled: the first two describe scattering from lower to higher energy, the last two scattering from higher to lower energy. Substituting the equilibrium relation between phonon emission and absorption, by which emission exceeds absorption by the factor exp(ℏω / k_B T), leads to the form (7.6).
It can readily be seen that a Maxwell distribution fulfills the resulting equation (7.6). For the system matrix one consequently finds that the off-diagonal block coupling to higher energy is larger than the block coupling to lower energy by a factor of about exp(ℏω / k_B T). At room temperature this factor is considerably larger than unity for phonon energies of several tens of meV, while it is significantly smaller for lower phonon energies. The induced asymmetry of the system matrix is in particular a concern for the convergence of iterative solvers if the off-diagonal block coupling to higher energies dominates the diagonal block, which is typically the case for devices in the micrometer regime.
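For orientation, a purely illustrative evaluation of this factor at room temperature follows; the phonon energies used here are example values and not those of the original text:

```latex
k_{\mathrm{B}}T \approx 25.9\,\mathrm{meV}\ (T = 300\,\mathrm{K}):\qquad
\exp\!\left(\frac{60\,\mathrm{meV}}{25.9\,\mathrm{meV}}\right) \approx 10.1, \qquad
\exp\!\left(\frac{10\,\mathrm{meV}}{25.9\,\mathrm{meV}}\right) \approx 1.5 .
```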
The asymmetry can be substantially reduced in two ways. The first possibility is to expand the distribution function as the product of a known, exponentially decaying reference distribution and a slowly varying remainder, so that the exponential decay with energy is treated analytically rather than numerically.
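A common concrete choice for such a factorization, given here as an assumption in generic symbols rather than as the specific form used in the text, is a Maxwellian weight:

```latex
f_{0}(\varepsilon) \;=\; \exp\!\left(-\frac{\varepsilon}{k_{\mathrm{B}}T}\right) g(\varepsilon),
```

so that the new unknown g(ε) varies only moderately with energy and the exponential decay no longer enters the linear system directly.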
The second way of reducing the asymmetry is based on modifications of the linear system. Denoting again by a diagonal matrix a reasonable estimate for the exponential decay of the solution with energy, the system of linear equations with eliminated odd expansion orders can be rescaled from the right by this matrix, so that the remaining unknowns are of comparable magnitude at all energies. A second look at the rescaled system matrix shows that the norms of its columns are now exponentially decreasing with energy, while the column norms of the original matrix are all of about the same order of magnitude. Hence, the rescaling has shifted the exponential decay from the entries of the solution vector to the columns of the system matrix. However, due to the block structure of the system matrix, a normalization of the row norms by left-multiplication of the linear system with a diagonal matrix also rescales the column norms suitably, as sketched below.
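The two modifications can be summarized in generic notation (system matrix S, solution f, right-hand side b, diagonal scaling matrices D_c and D_r); the symbols are placeholders, not the original notation:

```latex
S f = b, \qquad f = D_{\mathrm{c}}\, g
\;\;\Longrightarrow\;\; (S D_{\mathrm{c}})\, g = b,
\qquad\text{and}\qquad
D_{\mathrm{r}} S f = D_{\mathrm{r}} b, \quad
(D_{\mathrm{r}})_{ii} = \frac{1}{\lVert e_i^{\mathrm{T}} S \rVert},
```

where D_c carries the estimated exponential decay of the solution and D_r normalizes the rows. Due to the block structure discussed above, the row normalization D_r alone already balances the column norms sufficiently well.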
Due to the large number of unknowns for the discretized SHE equations, the solution of the resulting systems of linear equations is typically achieved by iterative methods. The rate of convergence of these methods can be substantially improved by the use of preconditioners. As already observed by Jungemann et al. [53], good preconditioners are actually required in most cases to obtain convergence of iterative solvers for the SHE equations. One of the most commonly employed preconditioners is the incomplete LU factorization (ILU), which has been used in recent works on the SHE method [53, 42]. For this reason, a short description of ILU is given in the following.
ILU relies on an approximate factorization of the sparse system matrix into a product LU, where L is a sparse lower triangular matrix and U is a sparse upper triangular matrix. During the iterative solver run, the current residual r is then updated by the application of (LU)^{-1}. The inverse of LU is never computed explicitly. Instead, a forward substitution Ly = r followed by a backward substitution Uz = y is carried out. Since these substitutions are serial operations, the application of the preconditioner to the residual is identified as a bottleneck for parallelism.
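A minimal sketch of this preconditioner application for matrices in compressed sparse row (CSR) format is shown below. The storage conventions (unit diagonal stored implicitly in L, diagonal entries stored in U) and all names are assumptions for illustration, not the data structures of the actual solver:

```cpp
#include <cstddef>
#include <vector>

// Compressed sparse row storage; entries within each row are assumed sorted by column.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_start;  // size rows + 1
    std::vector<std::size_t> columns;
    std::vector<double>      values;
};

// Forward substitution L y = r, with L lower triangular and an implicit unit diagonal.
std::vector<double> forward_substitute(const CsrMatrix& L, const std::vector<double>& r)
{
    std::vector<double> y(r);
    for (std::size_t i = 0; i < L.rows; ++i)
        for (std::size_t k = L.row_start[i]; k < L.row_start[i + 1]; ++k)
            if (L.columns[k] < i)
                y[i] -= L.values[k] * y[L.columns[k]];
    return y;
}

// Backward substitution U z = y, with U upper triangular including its diagonal.
std::vector<double> backward_substitute(const CsrMatrix& U, const std::vector<double>& y)
{
    std::vector<double> z(y);
    for (std::size_t i = U.rows; i-- > 0; ) {
        double diag = 1.0;
        for (std::size_t k = U.row_start[i]; k < U.row_start[i + 1]; ++k) {
            if (U.columns[k] > i)
                z[i] -= U.values[k] * z[U.columns[k]];
            else if (U.columns[k] == i)
                diag = U.values[k];
        }
        z[i] /= diag;
    }
    return z;
}

// Application of the ILU preconditioner to a residual: z = U^{-1} L^{-1} r.
std::vector<double> apply_ilu(const CsrMatrix& L, const CsrMatrix& U,
                              const std::vector<double>& r)
{
    return backward_substitute(U, forward_substitute(L, r));
}
```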
In light of the discussion of the block structure of the system matrix for SHE, consider a block-diagonal matrix of the form (7.20), with one diagonal block per discrete energy level.
Algorithm 5 (Parallel Preconditioner Scheme for the SHE Method). Let the number of unknowns at the i-th discrete energy be denoted by N_i, and denote the corresponding diagonal blocks of the system matrix S of size N_i × N_i by S_i. Then, a parallel preconditioner for S is given by a block-preconditioner of the form (7.20), where the individual preconditioners P_i are computed from the diagonal blocks S_i of S and applied in parallel.
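A minimal sketch of the scheme follows, with one CPU thread per energy block and the per-block preconditioner represented abstractly by a callable; all names and the threading granularity are illustrative assumptions, not the actual implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// One preconditioner per energy level: it maps the residual entries of its block
// to the preconditioned entries of the same block (e.g. an ILU solve on that block).
using BlockPrecond = std::function<void(const std::vector<double>& r_block,
                                        std::vector<double>& z_block)>;

// Apply the block-diagonal preconditioner: the residual r is split according to the
// number of unknowns per energy level, each block is preconditioned independently
// in its own thread, and the results are written back into z (pre-sized like r).
void apply_block_preconditioner(const std::vector<BlockPrecond>& blocks,
                                const std::vector<std::size_t>& unknowns_per_energy,
                                const std::vector<double>& r,
                                std::vector<double>& z)
{
    std::vector<std::thread> workers;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        std::size_t n = unknowns_per_energy[i];
        workers.emplace_back([&, i, offset, n]() {
            std::vector<double> r_block(r.begin() + offset, r.begin() + offset + n);
            std::vector<double> z_block(n, 0.0);
            blocks[i](r_block, z_block);   // independent per-energy solve
            std::copy(z_block.begin(), z_block.end(), z.begin() + offset);
        });
        offset += n;
    }
    for (auto& w : workers)
        w.join();
}
```

In practice one would reuse a thread pool instead of spawning one thread per block, but the essential point remains: each block solve touches only its own contiguous index range and can therefore proceed independently.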
A physical interpretation of the proposed preconditioner is as follows: Consider the discretization of the SHE equations without inelastic scattering mechanisms. In this case the system matrix after row-normalization is block-diagonal if the unknowns are enumerated as suggested in Sec. 7.1, and a preconditioner for it is given by a block-preconditioner of the form (7.20). Since this block-diagonal matrix is an approximation to the system matrix of the SHE equations with inelastic scattering, a block-preconditioner of the form (7.20) computed from the diagonal blocks also acts as a preconditioner for the full system.
It should be noted that the use of block preconditioners of the form (7.20) for the parallelization of ILU preconditioners is not new [88]. However, without additional information about the system matrix, the block sizes are usually chosen uniformly and may not be aligned to the block sizes induced by the SHE equations. Consequently, the application of block-diagonal preconditioners to the SHE equations in a black-box manner will show lower computational efficiency.
The proposed scheme in Algorithm 5 allows for the use of arbitrary preconditioners for each of the diagonal blocks. Consequently, a preconditioner scheme rather than a single preconditioner is proposed, enabling the use of established serial preconditioners in a parallel setting. Since the number of discrete energies is typically chosen large enough, the proposed scheme provides a sufficiently high degree of parallelism even for multi-CPU clusters. The situation is slightly different for GPUs, where typically one work group is used for the preconditioner at each total energy. Due to the massively parallel architecture of GPUs, an even higher degree of parallelism is desired in order to scale the SHE method well to multi-GPU environments. In such a case, a parallel preconditioner should be employed for each block.
Execution times of the iterative BiCGStab [102] solver are compared for a single CPU core using an incomplete LU factorization with threshold (ILUT) for the full system matrix, and for the proposed parallel scheme using multiple CPU cores of a quad-core Intel Core i7 960 CPU with eight logical cores. In addition, comparisons for an NVIDIA GeForce GTX 580 GPU are given in Fig. 7.3. The parallelization on the CPU is achieved using the Boost.Thread library [6], and the same development time was allotted for developing the OpenCL [57] kernel for ViennaCL [111] on the GPU. This allows for a comparison of the results not only in terms of execution speed, but also in terms of productivity [7].
As can be seen in Fig. 7.3, the performance increase for each linear solver step is more than one order of magnitude compared to the single-core implementation. This super-linear scaling with respect to the number of cores on the CPU is due to the better caching possibilities obtained by the higher data locality within the block-preconditioner.
The required number of iterations using the block-preconditioner decreases with shrinking device size. For a short intrinsic region, the number of iterations is only about twice that obtained with an ILUT preconditioner for the full system, while for a longer intrinsic region about four times as many iterations are required. This is a very small price to pay for the excellent parallelization possibilities.
Overall, the multi-core implementation on the test machine is faster than the single-core implementation by a factor of three to ten, even though a slightly larger number of solver iterations is required. It is reasonable to expect even higher performance gains on multi-socket machines equipped with a higher number of CPU cores. The purely GPU-based solver with hundreds of simultaneous lightweight threads is up to one order of magnitude faster than the single-core CPU implementation, where it has to be noted that a single GPU thread provides less processing power than a CPU thread.
The comparison in Fig. 7.3 further shows that the SHE order does not have a notable influence on the block-preconditioner efficiency compared to the full preconditioner. The slightly larger number of solver iterations for third-order expansions is due to the higher number of unknowns in the linear system. The performance gain is almost uniform over the length of the intrinsic region and slightly favors shorter devices, thus making the scheme an ideal candidate for current and future scaled-down devices.