In general, the step detection algorithm has to extract steps of unknown height $d$ given an unknown mean value of the signal, $\mu$. Based on the mean shift model given by
\[ x_i = \mu + d\,\Theta(t_i - \tau) + \varepsilon_i, \]
with the step function $\Theta$ and the emission time $\tau$, a statistical treatment of this problem is possible using the bootstrapping and cumulative sum (BCSUM) algorithm [170]. The BCSUM detection algorithm is able to detect either positive or negative steps in a data set. The BCSUM procedure decides whether or not a discrete step exists in the underlying data samples based on a detection sensitivity parameter $\alpha$. In the following, the cumulative sum charts [167, 168] are discussed, the bootstrapping mechanism is introduced, and finally the advanced detection method is presented.
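The mean shift model can be illustrated with a short sketch. This is not part of the original derivation; the symbol names (`mu`, `d`, `tau`, `sigma`) and the assumption of additive Gaussian noise are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mean shift model: x_i = mu + d * Theta(t_i - tau) + noise, with step
# function Theta, unknown mean mu, step height d, and emission time tau.
# All numeric values below are illustrative assumptions.
N = 200          # number of samples
mu = 1.0         # signal mean before the step
d = -0.5         # step height (negative: downward shift)
tau = 120        # change point position in samples
sigma = 0.2      # noise standard deviation

t = np.arange(N)
step = np.where(t >= tau, 1.0, 0.0)   # Theta(t - tau)
x = mu + d * step + rng.normal(0.0, sigma, N)
```

The resulting series `x` has mean $\mu$ before the change point and mean $\mu + d$ after it, buried in noise.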
First consider a series of independent measurement points $x_i$ with $N$ the number of samples. The signal distributions before and after a change in the signal mean are given by $p_0(x)$ and $p_1(x)$, respectively, with their corresponding means $\mu_0$ and $\mu_1 = \mu_0 + d$, where $d$ is the step height. Introducing the log-likelihood ratio given by
\[ s_i = \ln\!\left(\frac{p_1(x_i)}{p_0(x_i)}\right), \]
a change in the signal mean is reflected by a change of the sign of the mean value of the log-likelihood ratio [168] of the data set, given by
\[ \mathrm{E}_{p_0}[s_i] < 0 \quad\text{and}\quad \mathrm{E}_{p_1}[s_i] > 0, \]
and is illustrated in Figure 10.1.

Figure 10.1: The log-likelihood ratio $s_i$ (bottom) is calculated for normally distributed data $x_i$ (top). When the test signal $x_i$ changes its underlying distribution function from $p_0$, the Gaussian distribution function with mean $\mu_0$, to $p_1$, the Gaussian distribution function with mean $\mu_1$, the resulting log-likelihood ratio $s_i$ shows a change in the sign at the transition time instance, called the change point [MWC26].
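The sign change of the log-likelihood ratio can be reproduced numerically. A minimal sketch, assuming two Gaussian hypotheses with equal variance (all numeric values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian hypotheses: p0 with mean mu0 (before the change),
# p1 with mean mu1 = mu0 + d (after the change). Values are assumptions.
mu0, d, sigma = 0.0, 1.0, 0.3
N, tau = 100, 60
x = np.concatenate([rng.normal(mu0, sigma, tau),
                    rng.normal(mu0 + d, sigma, N - tau)])

# For equal variances, s_i = ln(p1(x_i)/p0(x_i)) reduces to
# s_i = d/sigma^2 * (x_i - mu0 - d/2): negative in expectation before
# the change point, positive after it.
s = d / sigma**2 * (x - mu0 - d / 2)
```

Averaging `s` over the samples before and after `tau` shows the sign change that marks the change point.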
Figure 10.2: Illustration of the bootstrap and cumulative sum algorithm to detect one step in a very noisy data set. Even though the data set contains two steps, the algorithm has to be applied recursively because at most one, that is, the biggest change point is detected per pass. From the initial data set $x$, a number of $N_\mathrm{B}$ bootstraps $x^b$ are created. These bootstraps $x^b$ are resampled versions of the initial data set $x$, obtained by applying a resampling with replacement scheme. Afterwards, the cumulative sum charts of the bootstrapped and the original data are created. The spans of the single cumulative sum (CSUM) charts of the bootstrapped data are collected in a list $L$, sorted from the minimum to the maximum. The decision threshold $h$ equals the element of $L$ with index $\lfloor \alpha N_\mathrm{B} \rfloor$, with $\alpha$ called the detection sensitivity. Finally, the decision threshold $h$ is compared to the span $R$ of the CSUM chart obtained from the initial data set $x$. For $R > h$ the initial data set contains a change point. In that case the initial data set has to be split at the change point and the algorithm is applied recursively to both sequences. This divide and conquer procedure is continued until no further change points with spans larger than the threshold are detected in the subsequences [MWC26].
Together with the assumption of normally distributed samples with the probability density function
\[ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \]
the log-likelihood ratio for emission events and $d > 0$ is
\[ s_i = \frac{d}{\sigma^2}\left(x_i - \mu_0 - \frac{d}{2}\right), \]
and the change point position is at the sample with index
\[ \hat{n} = \arg\min_n \sum_{i=1}^{n} \left(x_i - \mu_0 - \frac{d}{2}\right). \]
In the case of trapping events, a negative mean shift is obtained and the log-likelihood ratio for $d < 0$ reads
\[ s_i = \frac{d}{\sigma^2}\left(x_i - \mu_0 - \frac{d}{2}\right), \]
with the change point at the sample index position
\[ \hat{n} = \arg\max_n \sum_{i=1}^{n} \left(x_i - \mu_0 - \frac{d}{2}\right). \]
Since the change of the mean is unknown, the detection of either positive or negative changes has to be provided by the algorithm. Combining both cases, $d > 0$ and $d < 0$, the log-likelihood ratio for normally distributed samples can be formulated as
\[ \tilde{s}_i = x_i - \mu_0 - \frac{d}{2}, \]
where the threshold value already considers the prefactor $d/\sigma^2$. With the choice of $\mu_0 + d/2 \approx \bar{x}$, the arithmetic mean of the data set, the CSUM chart function is
\[ C_n = \sum_{i=1}^{n} \left(x_i - \bar{x}\right), \qquad C_0 = 0. \]
The sample index of the change point is given by
\[ \hat{n} = \arg\max_n |C_n|. \]
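The CSUM chart and the change point estimate can be sketched in a few lines. The synthetic signal below (one negative step, mimicking a trapping event) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy signal with one negative mean shift at index tau (illustrative values).
N, tau = 300, 180
x = np.concatenate([rng.normal(1.0, 0.25, tau),
                    rng.normal(0.4, 0.25, N - tau)])

# CSUM chart: C_0 = 0, C_n = sum_{i<=n} (x_i - mean(x)). The estimated
# change point is the index where |C_n| is maximal.
C = np.concatenate([[0.0], np.cumsum(x - x.mean())])
n_hat = int(np.argmax(np.abs(C)))
```

For a downward step the chart drifts upward before the change and downward after it, so `n_hat` lands near the true change point `tau`.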
A frequently used method for parameter estimation from a set of data samples with an unknown underlying distribution is bootstrapping. Its big advantage is that it is fully automatic, regardless of how complicated the mathematical model for the probability distribution is. For a given set of samples $x$ of length $N$, a bootstrap sample $x^b$ is obtained by randomly choosing $N$ samples from $x$ with replacement. Since bootstrapping is a resampling technique with replacement, the samples out of $x$ can occur never, once, or more often in the bootstrap sample $x^b$ [171]. The statistical nature of bootstrapping requires a huge number of bootstrap samples $N_\mathrm{B}$, which inevitably leads to computationally expensive procedures.
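Resampling with replacement is a one-liner with NumPy. A minimal sketch (the data set, its length, and the number of bootstraps are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.normal(0.0, 1.0, 50)   # original data set of length N (assumed)
N_B = 1000                     # number of bootstrap samples (assumed)

# Resampling with replacement: each bootstrap sample draws N values from x,
# so individual samples may appear never, once, or several times.
bootstraps = rng.choice(x, size=(N_B, x.size), replace=True)

# Example use: bootstrap estimate of the standard error of the mean.
se_boot = bootstraps.mean(axis=1).std()
```

The bootstrap standard error converges to the classical estimate $\hat\sigma/\sqrt{N}$ as $N_\mathrm{B}$ grows, which is why a large number of bootstraps is needed.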
The BCSUM algorithm is a Monte Carlo based combination of bootstraps and cumulative sum charts to detect changes in the signal mean. The divide and conquer procedure to detect one change point in the measurement data set is:

1. Calculate the bootstrap samples $x^b$ from $x$, with $b = 1, \dots, N_\mathrm{B}$.
2. Create the cumulative sum chart $C_n^b$ for each bootstrap sample $x^b$.
3. Estimate the span $R^b = \max_n C_n^b - \min_n C_n^b$ of each chart and collect the spans into a list $L$, sorted from the minimum to the maximum.
4. With the detection sensitivity $\alpha$, the decision threshold is obtained as $h = L[\lfloor \alpha N_\mathrm{B} \rfloor]$, where $L[\,\cdot\,]$ denotes the list index operator, returning the element of $L$ at the given position.
5. Create the CSUM chart from the initial data set $x$ and evaluate its span $R = \max_n C_n - \min_n C_n$.
6. A change point in the underlying sequence occurs if $R > h$. Split the data set at the change point position $\hat{n}$ and recursively continue with 1) for each subsequence until all change points are detected.
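The six steps above can be condensed into a recursive function. This is a minimal sketch, not the original implementation: the function and parameter names are our own, the `min_len` guard is an added practical safeguard, and the two-step test signal is illustrative:

```python
import numpy as np

def csum_span(x):
    """Span (max - min) of the CSUM chart C_n = cumsum(x - mean(x)), C_0 = 0,
    plus the index of the largest |C_n| (the change point estimate)."""
    C = np.concatenate([[0.0], np.cumsum(x - x.mean())])
    return C.max() - C.min(), int(np.argmax(np.abs(C)))

def bcsum(x, alpha=0.95, n_boot=1000, min_len=4, rng=None, offset=0):
    """Recursive BCSUM sketch (steps 1-6): returns detected change indices."""
    rng = np.random.default_rng() if rng is None else rng
    if len(x) < min_len:                       # guard against tiny segments
        return []
    span, n_hat = csum_span(x)
    # Steps 1-3: bootstrap spans, sorted from minimum to maximum.
    boots = rng.choice(x, size=(n_boot, len(x)), replace=True)
    spans = np.sort([csum_span(b)[0] for b in boots])
    # Step 4: decision threshold from the detection sensitivity alpha.
    h = spans[min(int(alpha * n_boot), n_boot - 1)]
    # Steps 5-6: compare spans; split at the change point and recurse.
    if span <= h or n_hat in (0, len(x)):
        return []
    return (bcsum(x[:n_hat], alpha, n_boot, min_len, rng, offset)
            + [offset + n_hat]
            + bcsum(x[n_hat:], alpha, n_boot, min_len, rng, offset + n_hat))

# Illustrative usage on a synthetic signal with two steps (at 100 and 200):
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0.0, 0.2, 100),
                    rng.normal(1.0, 0.2, 100),
                    rng.normal(0.5, 0.2, 100)])
change_points = bcsum(x, alpha=0.95, n_boot=200, rng=rng)
```

As in Figure 10.2, each pass detects at most the biggest change point, and the recursion recovers the remaining ones from the subsequences.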
Critical parameters for the detection outcome of the BCSUM algorithm are the number of bootstraps $N_\mathrm{B}$ and the detection sensitivity $\alpha$. For statistical relevance, the bootstrapping and the calculation of the decision threshold $h$ have to be performed many times, so a large number of bootstraps is recommended. A choice of $\alpha$ close to one means that just the most dominant changes in mean are detected. For very fast events, which often span just two or three measurement data samples, the detection sensitivity is a crucial parameter; to extract small and very fast changes a smaller $\alpha$ is recommended. Moreover, the position of the change point in the measurement data is directly accessible via the CSUM chart obtained from the initial data series.
Last but not least, it has to be noted that the algorithm can be applied to uniformly as well as non-uniformly sampled data. This property is very important in the context of the TDDS. To cover several decades in time, a compromise between the sampling rate and the accumulated recovery time is necessary in order to obtain a feasible amount of measurement data.