The particle transport calculation in either bottom-up or top-down schemes relies on a large number of ray-surface intersection tests (cf. Section 2.2). The result of an intersection test is the distance from the origin of the ray to the closest intersection with the surface in the direction of the ray. Figure 5.1 illustrates two choices for the representation of a surface: An implicit representation (e.g., a signed-distance field) and an explicit representation (triangle mesh) of a sphere.
Figure 5.1: Illustration of two surface representations of a sphere: Left: Implicit representation using a signed-distance field. The blue and red surfaces are the isosurfaces delimiting the narrow-band of 3 grid cells around the interface to the inside (blue) and outside (red). The gray surface is the zero-level-set. Right: Triangle mesh extracted from the zero-level-set.
If an explicit representation of the surface is used, the intersection test identifies the closest intersection point and the corresponding primitive, e.g., a specific triangle out of a triangle mesh. A straightforward implementation is to test each primitive of a mesh for an intersection with the ray and store the resulting distance to the intersection point. Finally, the closest intersection point and the corresponding primitive is reported. The primitives need not necessarily form a closed polygonal surface, e.g., in [10] tangential disks are used and in [13] spheres are used to represent the surface.
For an implicit representation, e.g., a signed-distance field, the intersection point is a coordinate (on the isosurface) between the vertices of the grid holding the signed-distance field. A straightforward implementation is to advance a test point in steps along the ray direction. The step length is thereby equal to the absolute value of the signed-distance field at the location of the test points. The signed-distance value is then re-evaluated at the new position of the test point. The procedure is repeated until the signed-distance value changes its sign or the absolute value falls below a threshold value. The last position of the test point is then reported as the intersection point. This procedure is referred to as sphere tracing [81].
In applications requiring a high throughput of ray-surface intersections, e.g., rendering or non-imaging optics, typically a spatial data structure is constructed from the surface representation to accelerate the intersection tests. The most common data structure is a bounding volume hierarchy (BVH), which reduces the number of necessary intersection tests by grouping primitives hierarchically into bounding boxes.The intersection tests are then performed with the bounding boxes. Only if a bounding box is intersected by a ray, the hierarchically lower bounding boxes (or contained primitives) are tested for intersection. Hardware-tailored ray tracing libraries (cf. Section 2.2) originally designed for computer graphics applications provide highly-optimized ray tracing kernels for common modern computing platforms. As target applications commonly include real-time rendering of dynamic geometries, the libraries also focus on optimizing the performance of the BVH construction. Most libraries solely support single-precision arithmetics; this allows to fully utilize SIMD instructions on modern CPUs and to utilize the high single-precision processing power of GPUs.
In a feature-scale process simulator, the surface is typically represented as a narrow-band level-set (cf. Section 2.1). The most widely used approach, and thus the frame of reference for the here presented evaluations, is to perform the ray casting on the implicit representation, i.e., the level-set function. An advantage is that no explicit representation of the surface has to be extracted in each time step. However, the consequential lack of a closed polygonal explicit surface representation is disadvantageous, as the surface topology in the neighborhood of a surface point is not readily accessible, which is demanded by spatially adaptive approaches for the calculation of the particle transport (cf. Section 6.1).
In the following, an in the course of this work developed approach is presented which extracts an explicit polygonal mesh in each time step of a feature-scale simulation [2]. This mesh is then used for the ray casting during the particle transport. The underlying motivation is to utilize the higher throughput of ray-surface intersections during the particle transport; however, the overhead introduced by mesh extraction from an implicit surface representation must be considered. The ray casting is performed using single-precision arithmetic.
First, the accuracy of ray casting using single- and double-precision floating point representations is compared using a generic test scenario. Then, the performance difference for single-precision ray casting is compared using two highly optimized open-source libraries for ray casting on implicit and explicit surfaces. Finally, the newly developed approach is presented in more detail and an evaluation of the performance is presented for a simple deposition test case.
The sufficiency of single-precision arithmetics depends on the numerical method and the further use of the result. The methods to calculate the particle transport (cf. Section 2.2) use ray casting to detect ray-surface intersections along a certain direction. The resulting information is binary and therefore the influence of the arithmetic precision of the ray casting is isolated.
To evaluate the applicability of single-precision ray casting for the ray-surface intersections during the particle transport, the angular resolutions which are achieved for single- and double-precision ray casting are compared. A spherical mesh
(double-precision vertex coordinates), generated by subdividing an icosahedron 6 times (cf. Section 4.1, i.e., about vertices), serves as a reference mesh. The radius of the reference mesh is scaled from
to
and rays are traced from the origin towards each vertex of the mesh. Figure 5.2 conceptually illustrates the spherical reference mesh at different scales and the corresponding rays towards the vertices of the mesh. The distance between the
intersection point (found by ray casting) and the vertex coordinate on the reference mesh is computed for all vertices. The ratio of this distance and the radius
provides information about the maximum angular resolution of the intersection test.
Figure 5.2: Conceptual visualization of the geometric configuration of the ray casting precision evaluation: The cross section of the spherical mesh is shown (for different scales) together with the rays which start at the origin and point into the directions of each vertex of the spherical mesh. The distance between the intersection position of each ray with the mesh (found by ray casting) and the corresponding vertex coordinate are calculated for single- and double-precision ray casting.
The NanoRT library [92] is used for single- and double-precision ray casting; the Embree library [89] is additionally used for single-precision. In Figure 5.3, the results for the maximum distance and the distance normalized to the radius
are plotted over the radius
of the reference sphere. The normalized distance is constant for single- and double-precision, respectively; the
constant values are
and
falling in the range of significant digits for single- and double-precision, respectively. For
and
no intersection is found for single-precision ray casting using
NanoRT; the same holds for Embree where the lower limit is
. The angular resolution of subdivided icosahedrons
(cf. Section 4.1) are indicated with dotted lines.
Figure 5.3: Evaluation of the angular resolution using single-precision (green, red) and double-precision (blue) ray casting. The maximum distance between the intersection point (found by ray casting) and the reference point
is shown as solid line and the scale on the left. The normalized distance
is shown as dashed line and the scale on the right. Two lines
are shown for each result which correspond to two different ways of calculating the intersection point with the mesh: (a) using a scaling of the direction of the ray with the distance of the intersection (solid for
, dashed for
) or (b) using the coordinates of the intersected primitive combined with
the local intersection coordinates (dotted). Additionally, the angular resolutions of a subdivided icosahedron are shown as horizontal dotted lines for up to
subdivisions.
The results reveals that the achieved angular resolution is more than two decades smaller than the angular resolution of a 12 times subdivided icosahedron for both, single-precision and double-precision ray casting.
In conclusion, the arithmetic precision achieved with single-precision ray casting is sufficient to calculate the ray-surface intersection tests in practical three-dimensional feature-scale process simulations: For bottom-up flux calculation
schemes, the angular resolution is sufficient even for the highest practical spatial sampling resolutions; the same holds for top-down flux calculation schemes. For instance, considering a cylindrical hole of aspect ratio , i.e., depth
and diameter
: Taking the area of the opening aperture
and approximating a ray direction starting from the bottom of the
structure with coverage
on
, the aperture can be sampled with
ray directions. The bottom of the hole can consequently be
sampled with
rays starting at a single position on the source plane.
In rendering applications, the ray tracing performance is commonly reported in achieved frames per second for a given scene and perspective. An alternative metric, and suitable for topography simulations in Process TCAD, are the traced rays per second, usually denoted in million rays per second (Mrays/s). The actual computations assigned to the tracing of a single ray are defined by the rendering algorithm and may include followup computations, e.g., to calculate shading. The following benchmark solely aims at evaluating the performance of the raw intersection test (i.e., the calculation of the closest intersection point with a surface) for spatially incoherent rays.
The surface is a unit sphere with centered at the origin for all configurations in the following. The
origins of the rays are distributed on another sphere with
; this leads to rays which start inside the surface for
and rays which start outside the surface for
. At each origin, rays are traced towards the centroids of the triangles of a
subdivided icosahedron (cf. Section 4.1). Figure 5.4 illustrates this
benchmark scheme conceptually for two-dimensions.
Figure 5.4: Conceptual two-dimensional illustration of the benchmark scheme: The red circle represents the unit sphere and the blue dotted circles represent the spheres on which the ray origins are distributed. The black arrows indicate
the scheme for the ray directions which are defined using a subdivided icosahedron. The left and right side illustrate the configurations and
, respectively.
The performance of two open-source libraries from the field of computer graphics is compared: OpenVDB [95] is used for implicit, and Embree [89] for explicit surfaces. The rays are traced against a narrow-band
level-set representation of a sphere with radius using OpenVDB’s LevelSetRayTracer and against a
triangulated mesh (extracted from the level-set) using Embree. The implicit mesh is represented with OpenVDB’s default tree configuration Tree4
. The for loop which iterates over the ray origins is
OpenMP-parallelized. All performance measurements were performed on WS1. The benchmarks were performed using Embree 2.12, OpenVDB 4.0, and compiled using gcc 6.1.1. Table 5.1 summarizes the parameter range of the performance analysis.
Parameter | Values |
Number of threads | 1, 2, 4, 5, 6, 8 |
Subdivisions for search directions |
1, 2, 3, 4, 5 |
Radius of origins |
1.5, 1.15, 0.85, 0.5 |
Voxel size |
0.05, 0.01, 0.005, 0.0025 |
Dependent Parameter | |
Number of active voxels |
30K, 0.8M, 3.0M, 12.1M |
Number of triangles |
15K, 0.4M, 1.5M, 6.0M |
Number of search directions |
80, 320, 1280, 5120, 20480 |
Table 5.1: Parameter variations used in the performance comparison (K = thousand, M = million). An active voxel is a level-set grid cell in the narrow-band level-set around the surface.
Fig 5.5 compares implicit and explicit ray casting performance for different ray origins and different surface resolutions. The limits of the achieved performance
with 8 threads for Embree are about 100 Mrays/s (
, 15K triangles, Figure 5.5c) and about 10 Mrays/s (
, 6.0M triangles, Figure 5.5b). For the implicit ray casting using OpenVDB, the limits are about 13 Mrays/s (
, 0.4M triangles, Figure 5.5c) and 2 Mrays/s (
, 6.0M triangles, Figure 5.5b).
The resulting performance gain is between 3 and 6 for all possible combinations of the parameters in Table 5.1, excluding low spatial resolutions (i.e.,
less than 0.4 million triangles). The performance gain is higher for the multi-threaded runs, the main reason being the higher speedup for Embree in the hyper-threading regime (cf. Figure 5.6). For low spatial resolutions (less than 0.4 million triangles) the performance ratio is increasing. For high spatial resolutions (i.e, more than 1 million triangles)
the multi-threaded performance ratio is for
and
for
, i.e., Embree profits more than OpenVDB if a large
portions of the rays do not intersect the geometry at all.
Figure 5.5: Performance comparison 1 and 8 threads between ray casting on the implicit surface (using OpenVDB) and on the explicit surface (using Embree). (a)-
(b): Ray origins . (c)-
(d): Ray origins
. The top axis plots the number of active voxels in the narrow-band
representation of the surface while the bottom axis labels the corresponding number of triangles in the extracted mesh. The error bars show the spread in performance when varying the number of search directions for each origin according to
Table 5.1. The filled gray area is the range of the performance ratio of the explicit approach (Embree) over the implicit approach (OpenVDB).
Fig 5.6 plots the achieved speedup for the parallelized for loop (which iterates over the ray origins) for explicit and implicit ray casting. Both
parallelizations show nearly an ideal speedup for 2 threads. The speedup spreads for 4 threads with a minimum speedup of 2.7 (Embree) and 2.0 (OpenVDB). The speedup for 8 threads is up to 6 for Embree and up to 5 for
OpenVDB. The parameter combinations which show nearly no speedup between 2 to 4 threads (cf. Figure 5.6b) were identified to have a
low hit ratio () and a large
voxel size (
). For 5 and more threads (entering the hyper-threading
regime) this influence is compensated leading to an overall speedup between 3 and 4 for 6 threads.
Figure 5.6: Speedup for the OpenMP-parallelized ray casting with Embree (a) and OpenVDB (b) on WS1. The speedup range represents limits resulting of the full parameter range (cf. Table 5.1). The dashed line indicates an ideal speedup and the dotted line marks the number of physical cores in the system. The circle in (b) marks a group of combinations with nearly no speedup between 2 and 4 threads.
Based on the results from the previous sections a scheme is introduced which aims to accelerate the particle transport calculation of a level-set-based process simulation. It bases on the extraction of a temporary explicit surface mesh (from the level-set) in each time step of the simulation. Figure 5.7a and 5.7b illustrates the sequence of computational tasks in a time step of a level-set-based process simulation.
Figure 5.7: Difference in the sequence of computational tasks in a time step between the fully implicit scheme (a) using a level-set to represent the surface and the implicit/explicit scheme (b) using a temporary explicit mesh to represent the surface during the particle transport calculation. In (c), the overhead introduced by the extraction of the explicit mesh (using OpenVDB) and the initialization of the acceleration structure (using Embree) is shown for different resolutions of the unit sphere.
The scheme introduces new computational tasks for each time step, namely
• the extraction of the explicit surface mesh from the level-set,
• the generation of an acceleration structure (for efficient intersection tests) from the extracted mesh, and
• the interpolation of the surface rates from the explicit mesh to the positions of the level-set grid (computationally negligible).
Using the libraries benchmarked in Section 5.2 the overhead introduced by the approach is estimated. Figure 5.7c plots the runtime on WS1 (8 threads) for the generation of the temporary explicit mesh and the construction of the acceleration structure, when using OpenVDB to represent the level-set and to extract the mesh, and the acceleration structure of Embree. Different resolutions of the unit sphere (analog to the benchmarks above) are tested; the maximum runtime is identified with a total of less than 1.4 seconds for a mesh with about 6 million triangles. The overhead per million triangles is about 0.2 seconds for meshes with more than 0.4 million triangles.
The simulation platform described in Section 3 is used to implement a simple deposition test case: A cube with edge length is centrally placed in a
domain with periodic boundary conditions. The direct flux
from an ideal-diffuse source is calculated and a simple linear deposition
is used as surface velocity model. The direct flux is calculated using a
bottom-up scheme with a 4 times subdivided icosahedron (cf. Section 4.1) to sample the spherical directions. The lateral level-set resolution is set to
and the resulting explicit surface meshes for the initial and final
geometry count about
K and
K triangles, respectively. Figure 5.8 illustrates the initial, intermediate, and final geometry of the deposition simulation.
Figure 5.9 shows the runtime of the main computational tasks throughout the simulation. The runtime for the calculation of the surface rates (i.e., the direct flux calculation in this case) is dominating for both simulations. Nevertheless, the speedup is about 7 to 9 when using explicit ray tracing. The speedup is in accordance with the estimates of the generic benchmark in Section 5.2. However, differently to the benchmark above, the ray origins are located near the surface, i.e., they start inside the narrow-band of the level-set. Considering the geometry visualized in Figure 5.8c, it is apparent that most rays start in the narrow-band and intersect the narrow-band. This configuration is not considered in the benchmark above, where all ray origins are located at some distance to the narrow-band.
Single-precision ray tracing is attested sufficient accuracy, for top-down and bottom-up approaches for the particle transport in practical simulation scenarios.
The ray casting performance when using OpenVDB for implicit surfaces and Embree for explicit surface representations is studied using a generic test for non-imaging application. The performance gain is at minimum a factor of 3 for a wide range of scenarios.
An approach to perform the ray-surface intersections not on the level-set-based implicit representation of the surface but on a temporary explicit mesh was presented and compared. The performance gain in a deposition test case using a bottom-up approach to calculate the direct flux is a factor between 7 and 9 using the benchmarked libraries.