4.6.3.3 Simulation Flow


To demonstrate the behavior of the parallelization method, the complete simulation flow of the master process and of the slave processes is shown in Fig. 4.25.

**Figure 4.25:** Schematic description of the simulation flow of the master process and of a slave process. The thick arrows denote communication events between the master and the slave.

Master process:

Initialization as in the single processor version.
Sending the initialization data to the slaves:
The amount of transferred data depends mainly on the complexity of the input geometry and increases if damage information is reused from a previous implantation or if a backup file is used to start the simulation (0.1 MB - 200 MB).
Creating the subdomains, evaluating a distribution scheme for the subdomains and sending the distribution scheme to all slaves ($\approx$ 30 kB).
Starting the simulation of the first time step:
The initial conditions for all ions are calculated, and the properties of the ions are stored in stacks assigned to the slaves according to each ion's entrance point into the simulation domain. Only about 0.2 kB are necessary to describe a single ion. The completed stacks are sent to the slaves, one stack for each slave.
Evaluating the performance of the slaves:
The performance of a slave and the end of a time step are determined by analyzing responses of the slaves. Two types of messages are sent to the master during the simulation:
- Whenever a slave has finished all tasks, a Ready Message together with the number of received and processed ions (including ions that have been forwarded to other slaves) is sent to the master.
- Before sending an ion to another slave, the slave informs the master about this activity by sending a Ready Message together with the value -1.
While the master is waiting for the completion of a time step, it collects all Ready Messages and decreases an internal counter by the number it received together with each Ready Message. At the beginning of the time step this counter is set to the number of ions that have been sent to the slaves. When this counter reaches zero, the master knows that there is nothing left to do for the slaves.
This slightly complicated protocol is necessary to correctly handle particles that are generated during the simulation, either by the trajectory split method or by the Follow-Each-Recoil method, and to avoid errors due to communication delays, because the master always knows how many ions have been sent into the network. The Ready Message with the value -1 informs the master that an additional particle has been sent into the network, so the counter increases by one. When all particles within the network are processed, the time step is finished. The performance of a slave is derived from the first Ready Message the master receives from that slave, by measuring the time interval $\Delta t$ between sending the ion package of the first time step and receiving the Ready Message. This interval is meaningful because no slave is idle during it.

$\displaystyle CPU_i = \frac{\Delta t}{\text{number of processed ions (sent with the {\it Ready Message})}}$ (4.15)

Sending a Reset Message together with the optimized distribution scheme to the slaves:
The simulation results of the initial time step are cleared, because redistributing the subdomains would otherwise also require transferring a large amount of simulation results. This communication would take significantly more time than recalculating the first time step with the new distribution scheme.
Restarting the simulation of the first time step:
The initial conditions of all ions of the first time step are calculated again (according to the new distribution scheme) and prepared for distribution.
Processing the main control loop until the simulation is finished:
- Distributing the prepared ion packages among the slaves.
- Calculating the initial conditions for all ions of the next time step and preparing them for distribution.
- Waiting until all slaves have finished their calculations by collecting the Ready Messages.
Sending an End Message to all slaves to terminate their main simulation loop and collecting all simulation results from the slaves (up to several hundred MB). To reduce network collisions caused by the large amount of data sent simultaneously from all slaves, the simulation results are collected piecewise.
Performing statistical analysis of the resulting doping and point-defect distributions, preparing the generation of the output and writing of output files.
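
The Ready Message bookkeeping of the master and the performance measure of Eq. (4.15) can be sketched as follows. This is a minimal illustration in Python, not the implementation described in this work: the names `TimeStepTracker`, `on_ready` and `cpu_time_per_ion` are hypothetical, and the actual message transport via MPI is left out.

```python
from dataclasses import dataclass


@dataclass
class TimeStepTracker:
    """Master-side counter for one time step (hypothetical helper)."""

    outstanding: int = 0

    def start(self, ions_sent: int) -> None:
        # At the beginning of a time step the counter is set to the
        # number of ions sent to the slaves.
        self.outstanding = ions_sent

    def on_ready(self, count: int) -> None:
        # count > 0: number of ions a slave received and processed,
        #            including ions it forwarded to other slaves.
        # count == -1: a slave is about to forward one ion, i.e. one
        #              additional particle is sent into the network.
        # In both cases the counter is decreased by the received number.
        self.outstanding -= count

    def finished(self) -> bool:
        # The time step is complete when nothing is left in the network.
        return self.outstanding == 0


def cpu_time_per_ion(delta_t: float, processed_ions: int) -> float:
    """Eq. (4.15): time interval divided by the number of processed ions."""
    if processed_ions <= 0:
        raise ValueError("processed_ions must be positive")
    return delta_t / processed_ions
```

As an example, assume ten ions are sent to two slaves, six to slave A and four to slave B, and A forwards one ion to B. The counter then moves from 10 to 11 (Ready Message with -1 from A), to 5 (A reports six ions), and to 0 (B reports five ions), at which point the master regards the time step as finished.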

Slave process:

Receiving the description of the simulation domain and of the implantation conditions for initialization.
Receiving the initial distribution scheme. Each slave thereby knows the scopes of responsibility of all slaves, which allows direct communication between the slaves.
Processing the main control loop until it is terminated by the master:
Waiting for a request and processing the request.
- Reset Memory request:
  The histogram where the simulation results are stored is cleared and a new distribution scheme is received. This request is used by the master to reset the simulation after it has evaluated the performance of the slaves.
- Next Time Step request:
  The slave has to be informed about the beginning of a new time step because the trajectory stack used by the Trajectory-Reuse method has to be reinitialized after each time step.
- Simulation Finished request:
  The slave leaves the main simulation loop, sends the simulation results to the master and terminates operation.
- Store Data request:
  The slave receives simulation results and the coordinates where to store them and writes the data to the local histogram.
- Deliver Data request:
  The slave receives the coordinates of the required data and sends the requested data to the slave that asked for them. This request is also processed during the calculation of an ion trajectory, because the slave that sent the request is blocked until it receives the response.
- New Ion Package request:
  The slave receives a package of several ions, which are processed successively until they come to rest or leave the scope of responsibility of the slave. If an ion leaves the scope of responsibility, the master is informed by a Ready Message and the ion is sent to the slave whose scope of responsibility it enters. When all ions of a package are processed and no other request is pending, the slave sends a Ready Message together with the number of processed ions to the master. Besides the transfer of complete ions, two other types of communication events can occur, because certain models generate or require simulation results located outside the scope of responsibility of the slave. A method for non-local memory access is therefore implemented. If simulation results have to be stored outside, a Store Data request together with the simulation data and the coordinates where to store them is sent to the appropriate slave. If simulation results from outside have to be accessed, a Deliver Data request is sent together with the coordinates of the required data, and the requesting slave has to wait for the answer before continuing the simulation. Fig. 4.26 summarizes all slave-to-slave communications.

**Figure 4.26:** Schematic presentation of the slave-to-slave communication events: transfer of an ion (a), storing simulation results outside the local memory (b), accessing simulation results from outside (c).
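
The request handling in the main control loop of a slave might be sketched as a dispatch over the six request types listed above. The following Python fragment is only an illustration under stated assumptions: the class and member names are hypothetical, the histogram is modeled as a dictionary keyed by coordinates, a Store Data request is assumed to accumulate values, and the trajectory calculation of an ion package (including the forwarding of ions) is replaced by simply counting the ions.

```python
from enum import Enum, auto


class Request(Enum):
    RESET_MEMORY = auto()
    NEXT_TIME_STEP = auto()
    SIMULATION_FINISHED = auto()
    STORE_DATA = auto()
    DELIVER_DATA = auto()
    NEW_ION_PACKAGE = auto()


class Slave:
    """Hypothetical model of the slave-side request handling."""

    def __init__(self, scheme):
        self.scheme = scheme          # current distribution scheme
        self.histogram = {}           # coordinates -> stored result
        self.trajectory_stack = []    # used by the Trajectory-Reuse method
        self.running = True

    def handle(self, request, payload=None):
        if request is Request.RESET_MEMORY:
            # Clear the results and take over the new distribution scheme.
            self.histogram.clear()
            self.scheme = payload
        elif request is Request.NEXT_TIME_STEP:
            # The trajectory stack is reinitialized after each time step.
            self.trajectory_stack.clear()
        elif request is Request.SIMULATION_FINISHED:
            # Leave the main loop and hand the results to the master.
            self.running = False
            return dict(self.histogram)
        elif request is Request.STORE_DATA:
            # Assumption: incoming results are added to the local cell.
            coords, value = payload
            self.histogram[coords] = self.histogram.get(coords, 0.0) + value
        elif request is Request.DELIVER_DATA:
            # Return the locally stored data for the requested coordinates.
            return self.histogram.get(payload, 0.0)
        elif request is Request.NEW_ION_PACKAGE:
            # Placeholder: the trajectory calculation is omitted; only the
            # number reported with the Ready Message is returned.
            return len(payload)
        return None
```

In a real implementation the Deliver Data branch would also have to be served while an ion trajectory is being calculated, since the requesting slave is blocked until it receives the answer.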

The speedup due to parallelization increases almost linearly with the number of slaves, as was demonstrated by a three-dimensional simulation on a cluster of identical workstations using one to six slave processes. Fig. 4.27 shows the speedup as a function of the number of slaves. The speedup is the ratio of the simulation time with a single slave to the simulation time of the parallelized simulation. The only restriction of the parallelization method is that the processor loads may vary only slightly if a good performance gain is to be achieved.

**Figure 4.27:** Speedup as a function of the number of slaves compared to an ideal speedup.
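
In symbols, writing $T_n$ for the simulation time with $n$ slaves (a notation assumed here for illustration, not taken from the original text), the speedup plotted in Fig. 4.27 and the ideal speedup it is compared with, taken to mean perfectly linear scaling, read:

$\displaystyle S(n) = \frac{T_1}{T_n}, \qquad S_{\text{ideal}}(n) = n$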

It is worth mentioning that the parallelization method is not designed to be failsafe. Whenever one of the slaves terminates due to a hardware failure, the whole simulation ends up in an endless loop: the master process is not able to determine the end of the time step, because the ions that should have been processed by the terminated slave are lost. This could be avoided by keeping backup information at the master process and by regularly checking the operating condition of the slaves. The implementation of such a mechanism is not advisable for several reasons.

First, the major advantage of the parallelization method, namely that the communication overhead due to parallelization is almost negligible, would be lost. If a backup mechanism were implemented, the master process would have to store not only the initial conditions of the ions of one time step but also the status of the simulation results at the beginning of the time step, in order to be able to restart that time step in case of a failure of a slave. Since the simulation results are stored locally at the slaves, all slaves would have to send these data to the master process at the end of each time step. In the current implementation of the parallelization method this is done only at the end of the simulation, and even so it is the most communication-intensive task of the simulation.

Furthermore, a method would have to be implemented to replace the failing slave. The most convenient method would be to look for a workstation in the cluster that is not currently participating in the simulation and to start a new slave process on it. The problem is that the version of MPI used for parallelization does not support the spawning of processes during a simulation run. Such a feature has only been announced for future versions of MPI.

An alternative is to redistribute the simulation domain among the remaining slaves, but this requires a huge amount of communication, because all slaves would have to be updated with simulation results.

Even if a rigorous implementation of a failsafe mechanism is not advisable, it is probably worthwhile to store the status of the slaves at certain backup intervals and to restart the simulation from such a backup point in case of a failure of one of the slaves. Such a mechanism could be implemented together with a load balancing mechanism, which causes additional communication overhead anyway, because load balancing also requires a redistribution of the simulation domain. The advantage of load balancing would be that strong variations in the loads of the workstations could be compensated, increasing the performance gain in that case. The biggest challenge for such a parallelization strategy would be to find a good compromise between the performance gain due to an improved distribution of the simulation domain and the additional communication overhead.


A. Hoessiger: Simulation of Ion Implantation for ULSI Technology