Figure 6.3: Data sharing across splits.
As splitting occurs at the step level, whereas the chaining of the two executables is defined at a lower level, the same splitting would only prevent step 1 from being recomputed; it would not prevent the pre-processing tool from running twice on the same input.
To prevent these excess computations, a mechanism has been implemented that registers operations and looks up existing results. Figure 6.4 gives an overview of the events occurring during the simultaneous submission of two runs that split at a step requiring a sequence of tools to be run.
Figure 6.4: Detailed view of the events taking place during data sharing between split branches.
Before the pre-processing operation is started, existing results are looked up using a formal description of the operation. In general, this description is a string containing all information that specifies the operation completely, i.e., the name of the tool, the input files, and all parameters that affect the result. If no existing results are found, the operation is started and a work-in-progress entry is generated to indicate that the result is being computed. Upon successful completion, the entry is marked as done. Any subsequent step that intends to run the same operation receives the existing output file instead.
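The lookup-and-register cycle described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the names `operation_key`, `Registry`, and `run_once`, the JSON encoding of the description string, and the in-memory dictionary are all assumptions made for the example.

```python
import json


def operation_key(tool, input_files, params):
    """Build a canonical description string for an operation (hypothetical format).

    The string covers the tool name, the input files (order preserved),
    and all parameters that affect the result.
    """
    return json.dumps(
        {"tool": tool, "inputs": list(input_files), "params": params},
        sort_keys=True,
    )


class Registry:
    """In-memory registry of operations (assumed storage; illustration only)."""

    def __init__(self):
        self._entries = {}  # key -> {"state": ..., "output": ...}

    def lookup(self, key):
        return self._entries.get(key)

    def begin(self, key):
        # Work-in-progress entry: the result is being computed.
        self._entries[key] = {"state": "in_progress", "output": None}

    def finish(self, key, output):
        # Mark the entry as done and remember the output file.
        self._entries[key] = {"state": "done", "output": output}


def run_once(registry, key, compute):
    """Return an existing output if one is registered, else compute and register it."""
    entry = registry.lookup(key)
    if entry is not None and entry["state"] == "done":
        return entry["output"]
    registry.begin(key)
    output = compute()
    registry.finish(key, output)
    return output
```

In this sketch, two steps that describe the same tool, inputs, and parameters produce identical keys, so the second call to `run_once` returns the registered output without invoking `compute` again. The case of an entry that is still in progress is not handled here.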
During parallel operation, the second inquiry may arrive at any time after the first one, in particular while the first operation is still executing. In this case, no output is available yet, but there is no need to compute it either, as the computation is already in progress. Therefore, a callback can be registered that is invoked when the operation completes and delivers the generated output.
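A minimal sketch of such a callback registration is given below, assuming a thread-safe in-memory table. The class `SharedResults` and its methods `request` and `complete` are hypothetical names chosen for the example and do not reflect the actual interface.

```python
import threading


class SharedResults:
    """Table of operations with callbacks for in-progress entries (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> {"done": bool, "output": ..., "waiting": [...]}

    def request(self, key, callback):
        """Ask for the result of an operation.

        Returns True if the caller is the first to ask and must run the
        operation itself. Otherwise returns False; the callback is invoked
        immediately if the result exists, or later when it becomes available.
        """
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                # First inquiry: create the work-in-progress entry.
                self._entries[key] = {"done": False, "output": None,
                                      "waiting": [callback]}
                return True
            if not entry["done"]:
                # Operation is running elsewhere: wait for its completion.
                entry["waiting"].append(callback)
                return False
            output = entry["output"]
        callback(output)  # result already exists; deliver it outside the lock
        return False

    def complete(self, key, output):
        """Mark the operation as done and deliver the output to all waiters."""
        with self._lock:
            entry = self._entries[key]
            entry["done"] = True
            entry["output"] = output
            waiting, entry["waiting"] = entry["waiting"], []
        for callback in waiting:
            callback(output)
```

The callbacks are invoked after the lock has been released, so a callback that issues further requests does not deadlock. In this sketch the branch that receives `True` from `request` is responsible for running the tool and calling `complete` afterwards.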
In addition to checking for the existence of an operation's result, the result is also checked for validity in terms of file modification times. When an existing result is requested, the output file is checked to be newer than all files that were used as input to the operation; if it is not, recomputation is initiated. This strategy is similar to the one employed by the UNIX make utility.
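The validity check can be sketched with the standard `os.path.getmtime` call. The function name `result_is_valid` is hypothetical, and the strict comparison (output strictly newer than every input) is an assumption taken from the wording above; an implementation might treat equal timestamps differently.

```python
import os


def result_is_valid(output_file, input_files):
    """Return True if output_file exists and is newer than every input file."""
    if not os.path.exists(output_file):
        return False
    out_mtime = os.path.getmtime(output_file)
    # Assumption: "younger" means a strictly later modification time.
    return all(os.path.getmtime(f) < out_mtime for f in input_files)
```

If the function returns `False`, the caller would discard the registered result and recompute the operation.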