1. Introduction
Dynamic time warping (DTW) is a dynamic programming algorithm that calculates the optimal alignment between two sequences under certain assumptions. Though designed to handle an unknown amount of time warping, in practice DTW’s performance often degrades when the two sequences differ substantially in scale [1,2,3], resulting in a global time warp factor that deviates from one. Scale differences arise naturally in many domains: query-by-humming systems must handle singing at different tempos [4], gait recognition involves different walking speeds [5], query-by-sketch systems must handle differences in scale [6], and similar issues arise in multimedia [7] and bioinformatics [8]. This paper studies the effect of global time warping conditions on the alignment accuracy of DTW in a systematic manner and experimentally explores several ways to improve the robustness of DTW to varying levels of global time warping. To make our study concrete, we focus on an audio–audio alignment scenario in which the goal is to accurately estimate the temporal alignment between two different audio recordings of the same piece of music (e.g., two different piano performances of a composition).
Previous works on DTW generally fall into one of four groups. The first group focuses on speeding up exact DTW (or subsequence DTW), often in the context of a database search. This has been accomplished in many ways, including lower bounds [9,10], early abandoning [11,12], parallelizing across multiple cores [13,14], or using specialized hardware [15,16]. Several recent works have utilized GPUs to reduce the runtime of computing exact DTW on long sequences [17,18]. The second group focuses on reducing the quadratic memory and computation costs through approximations of DTW. These include approximate lower bounds [19,20], imposing bands in the cost matrix to limit severe time warping [21,22], performing alignments at multiple different resolutions [23,24], parallelizable approximations of DTW [18,25], and estimating alignments within strict memory limitations [26]. The third group focuses on extending the behavior of DTW to make it more flexible. Examples in the music information retrieval literature include handling structural differences due to repeats or jumps in music [27,28,29], aligning sequences in an online setting [30,31,32], handling partial alignments [33,34], utilizing multiple performances of a piece to improve alignment accuracy [35], and handling pitch drift in a cappella music [36]. The fourth group focuses on integrating DTW into modern neural network models. These include differentiable approximations of DTW suitable for backpropagation [37,38,39] or adopting a hard alternating scheme [40,41,42]. We note that our present study falls into the third group (more flexible behavior), as our focus is on improving robustness under certain conditions.
This paper poses the following question: “How can we make DTW more robust to global time warping conditions?” There are two practices that are commonly used to handle differences in scale. The first practice is to select an appropriate set of allowable transitions to handle the amount of expected time warping. For example, the set $\{(1,1), (1,2), (2,1)\}$ allows for a maximum time warping factor of 2, while the set $\{(1,1), (1,0), (0,1)\}$ theoretically allows for an infinite amount of time warping. This is a design decision that must be made whenever DTW is invoked. The second practice is to re-scale the sequences to be the same length before invoking DTW [1,2]. Though these two practices are commonly adopted, we are not aware of any studies that systematically examine their effectiveness as a function of the global time warping factor. This paper compares these two approaches—and others—to determine how well they perform across a range of global time warping conditions. To study this question in a systematic manner, we adopt a dataset commonly used to study audio synchronization, modify it to construct benchmarks in which the average global time warping conditions are carefully controlled, and then use our controlled benchmarks to study the effectiveness of several strategies for handling global time warping. Our goal is to understand the effect of global time warping on the alignment accuracy of DTW and to identify a set of best practices for handling varying global time warping conditions.
This paper has three main contributions. First, we introduce a framework for systematically studying the effect of global time warping conditions in an audio–audio alignment task. Second, we explore several ways to improve the robustness of DTW to varying levels of global time warping, and we characterize their effectiveness in our controlled benchmarks. Third, we provide a clear recommendation for best practice in handling global time warping conditions: sequence length normalization with downsampling. This method achieves the best alignment accuracy across a wide range of global time warping conditions, while maintaining or reducing runtime compared to standard usages of DTW. Code for reproducing our experiments can be found at https://github.com/HMC-MIR/ExtremeTimeWarping (accessed on 5 February 2024).
The rest of the paper is organized as follows. Section 2 introduces four different methods for handling or mitigating the effect of global time warping. Section 3 describes our experimental setup to study the effect of global time warping under controlled conditions and presents our empirical results and findings. Section 4 conducts two additional analyses to gain a deeper insight into the results. Section 5 concludes the work.
2. Materials and Methods
In this section, we describe several methods for dealing with global time warping. First, we explain the standard DTW algorithm for completeness, and then we describe four different ways to handle global time warping.
Standard DTW estimates the alignment between two sequences $x_1, x_2, \ldots, x_N$ and $y_1, y_2, \ldots, y_M$ in the following manner. First, a pairwise cost matrix $C \in \mathbb{R}^{N \times M}$ is computed, where $C[i,j]$ indicates the distance between $x_i$ and $y_j$ under a particular cost metric (e.g., Euclidean distance, cosine distance). Next, a cumulative cost matrix $D \in \mathbb{R}^{N \times M}$ is computed with dynamic programming, where $D[i,j]$ indicates the optimal cumulative path cost from $(1,1)$ to $(i,j)$ under a pre-defined set of allowable transitions and transition weights. For example, with a set of allowable transitions $\{(1,1), (1,2), (2,1)\}$ and corresponding transition weights $\{2, 3, 3\}$, the elements of D can be computed using the following recursion:

$$D[i,j] = \min\left\{\, D[i-1,\,j-1] + 2 \cdot C[i,j],\;\; D[i-1,\,j-2] + 3 \cdot C[i,j],\;\; D[i-2,\,j-1] + 3 \cdot C[i,j] \,\right\} \quad (2)$$

During this dynamic programming stage, a backtrace matrix B is also computed, where $B[i,j]$ indicates the optimal transition ending at $(i,j)$. Once D and B have been computed using dynamic programming, we can determine the optimal path through the cost matrix by following the backpointers in B starting at position $(N, M)$. The optimal path defines the predicted alignment between the two sequences.
2.1. Different Transitions and Weights
One obvious way to deal with global time warping is to simply select a set of allowable transitions to explicitly handle a specified amount of time warping. This is a design decision that must be made whenever DTW is invoked. For example, the transition set $\{(1,1), (1,0), (0,1)\}$ can theoretically handle an infinite amount of time warp. In practice, however, these transitions often lead to degenerate alignments and unstable or undesirable behavior. A more conservative way to handle global time warping is to use transitions like $(1,2)$ and $(2,1)$, which limit the maximum amount of time warping that can be handled. A commonly used set of transitions in the audio synchronization literature is $\{(1,1), (1,2), (2,1)\}$, which imposes a maximum allowable time warp factor of 2. In a similar manner, the set of transitions $\{(1,1), (1,2), (2,1), (1,3), (3,1)\}$ imposes a maximum allowable time warp factor of 3, and the set $\{(1,1), (1,2), (2,1), (1,3), (3,1), (1,4), (4,1)\}$ imposes a maximum allowable time warp factor of 4. These additional transitions come with a significant computational cost, however, since each step of the dynamic programming stage needs to consider every possible transition (cf. Equation (2)). In Section 3 and Section 4, we will compare both the alignment accuracy and the runtime of several different settings for transition types and weights.
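As a concrete illustration, the sketch below builds these three transition sets and invokes the dtw() function from the sketch above on random chroma-like features. The weights shown are placeholder values for illustration, not the tuned settings used in our experiments; note how each added transition pair enlarges the inner minimization of Equation (2).

```python
import numpy as np

# Transition sets imposing maximum allowable warp factors of 2, 3, and 4.
STEPS_WARP2 = ((1, 1), (1, 2), (2, 1))
STEPS_WARP3 = STEPS_WARP2 + ((1, 3), (3, 1))
STEPS_WARP4 = STEPS_WARP3 + ((1, 4), (4, 1))

# Random 12-dimensional chroma-like features; y is ~2.8x longer than x,
# so the warp-2 set cannot reach the end corner, but the warp-3 set can.
x, y = np.random.rand(12, 50), np.random.rand(12, 140)
C = 1 - (x.T @ y) / np.outer(np.linalg.norm(x, axis=0),
                             np.linalg.norm(y, axis=0))    # cosine distance
path = dtw(C, steps=STEPS_WARP3, weights=(2, 3, 3, 4, 4))  # placeholder weights
```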
2.2. Normalizing Sequence Length
Another common way to deal with global time warping is to normalize the sequence lengths before estimating the alignment. This technique has been explored in many forms in previous works [43,44,45]. Let $x_1, x_2, \ldots, x_N$ and $y_1, y_2, \ldots, y_M$ be the two sequences that we would like to align, where $N \leq M$. Before computing the pairwise cost matrix C, we can downsample the sequence $y_1, y_2, \ldots, y_M$ to match the length of the other sequence, yielding a modified sequence $\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_N$. One simple way to perform this downsampling is to simply calculate a weighted combination of the two nearest neighbors. In this approach, we first create N linearly spaced indices between 0 and $M-1$, and then use these indices to compute values as a weighted combination. For example, a desired sample at a fractional index $k$ would be calculated as $(1-\alpha) \cdot y_{\lfloor k \rfloor} + \alpha \cdot y_{\lceil k \rceil}$, where $\alpha = k - \lfloor k \rfloor$ (treating the sequence as 0-indexed). Once both sequences have been normalized in length, we can use standard DTW to estimate the alignment between $x_1, x_2, \ldots, x_N$ and $\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_N$, and then account for the global downsampling factor to infer the alignment between $x_1, x_2, \ldots, x_N$ and $y_1, y_2, \ldots, y_M$.
There are multiple ways one might normalize sequence length. In our experiments, we consider two different dimensions of behavior. One dimension is to either (a) downsample the longer sequence to match the length of the shorter sequence, or (b) upsample the shorter sequence to match the length of the longer sequence. The second dimension is to perform upsampling/downsampling by either (a) using linear interpolation between the two nearest neighbors (as described in the previous paragraph) or (b) simply using the nearest neighbor (e.g., a desired sample at a fractional index $k$ would be represented as $y_{\mathrm{round}(k)}$). In Section 3 and Section 4, we characterize the performance and runtime of the four possible combinations: downsampling with linear interpolation, downsampling with nearest neighbor, upsampling with linear interpolation, and upsampling with nearest neighbor.
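A sketch of both resampling variants is given below, assuming features are stored as a d × M matrix with one column per frame; the helper names are our own, and the commented usage shows how an alignment computed on the normalized sequences would be mapped back to original frame indices.

```python
import numpy as np

def normalize_length(Y, target_len, mode="linear"):
    """Resample a feature matrix Y (d x M) to target_len frames.

    mode="linear":  weighted combination of the two nearest frames.
    mode="nearest": copy the single nearest frame.
    """
    d, M = Y.shape
    idx = np.linspace(0, M - 1, target_len)  # fractional source indices
    if mode == "nearest":
        return Y[:, np.round(idx).astype(int)]
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, M - 1)
    frac = idx - lo
    return Y[:, lo] * (1 - frac) + Y[:, hi] * frac

# Usage sketch: downsample the longer sequence y (d x M) to the length of
# x (d x N), align, then undo the global downsampling factor.
#   y_ds = normalize_length(y, N)
#   path = dtw(pairwise_cost(x, y_ds))   # pairwise_cost: any chosen metric
#   path = [(i, j * (M - 1) / (N - 1)) for i, j in path]
```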
2.3. Adaptive Weighting Schemes
A third way to deal with severe time warping is to select transition weights adaptively at run time rather than using a fixed set of transition weights tuned on a training set. Let $x_1, x_2, \ldots, x_N$ and $y_1, y_2, \ldots, y_M$ be the two sequences that we would like to align, where $N \leq M$. If we use standard DTW with allowable transitions $\{(1,2), (2,1)\}$ and corresponding weights $\{w_1, w_2\}$, then we can adaptively set $w_1 = \frac{1}{N} + \frac{2}{M}$ and $w_2 = \frac{2}{N} + \frac{1}{M}$ for each pair of sequences to be aligned. This weighting scheme ensures that both axes contribute the same weighted Manhattan distance cost from $(1,1)$ to $(N,M)$, regardless of the values of N and M. (Note that if $N < M$, the axis along the shorter sequence contributes less to the total path cost simply because it is shorter in length.) This prevents one axis from dominating the total path cost. Similarly, we can use standard DTW with allowable transitions $\{(1,1), (1,2), (2,1)\}$ and corresponding weights $\{w_1, w_2, w_3\}$, where we can adaptively set $w_1 = \frac{1}{N} + \frac{1}{M}$, $w_2 = \frac{1}{N} + \frac{2}{M}$, and $w_3 = \frac{2}{N} + \frac{1}{M}$. This preserves the property that both axes contribute equally to the total path cost, and it allows $(1,1)$ transitions as well.
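The sketch below shows one plausible instantiation of this scheme, in which each unit step along the length-N axis is weighted by 1/N and each unit step along the length-M axis by 1/M; the exact constants may differ in practice, but any per-axis inverse-length weighting yields the stated equal-contribution property.

```python
def adaptive_weights(steps, N, M):
    """Adaptive weight for transition (a, b): a unit step along the
    length-N axis costs 1/N and along the length-M axis costs 1/M,
    so a full path contributes roughly 1 per axis for any N and M."""
    return tuple(a / N + b / M for (a, b) in steps)

# AdaptiveWeight1 uses two transitions; AdaptiveWeight2 adds (1, 1).
w1 = adaptive_weights(((1, 2), (2, 1)), N=500, M=2000)  # (0.003, 0.0045)
w2 = adaptive_weights(((1, 1), (1, 2), (2, 1)), N=500, M=2000)
```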
2.4. Non-Uniform Transition Patterns
A fourth way to deal with severe time warping is to use non-uniform transition patterns at different positions in the cost matrix. This approach is identical to standard DTW except that the set of allowable transitions at each position $(i,j)$ in the pairwise cost matrix may be different. Consider a nominal set of allowable transitions $\{(1,0), (0,1), (1,1)\}$ with corresponding weights {1, 1, 2} and a desired maximum time warping factor $\gamma$. If we remove the $(1,0)$ transition from all positions $(i,j)$ where $i$ is a multiple of $\gamma$, and likewise remove the $(0,1)$ transition from all positions $(i,j)$ where $j$ is a multiple of $\gamma$, then any run of consecutive vertical or horizontal steps is capped at $\gamma - 1$, which effectively imposes a maximum time warping factor of $\gamma$. Figure 1 shows an illustration of this construction. The benefit of this approach is that it can handle more extreme time warping while limiting horizontal/vertical degenerate paths, and it avoids the computational cost of adding additional transition types.
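A minimal sketch of this idea is shown below, under the removal rule assumed above (the exact pattern illustrated in Figure 1 may differ); the function returns the transitions allowed into a given cell and would replace the fixed transition set inside the dynamic programming loop.

```python
def allowed_steps(i, j, gamma):
    """Transitions (with weights) allowed into cell (i, j), assuming the
    vertical step (1, 0) is removed wherever i is a multiple of gamma and
    the horizontal step (0, 1) wherever j is a multiple of gamma. Runs of
    pure vertical/horizontal moves are then capped at gamma - 1, forcing
    a diagonal at least once per gamma steps along either axis."""
    steps = [((1, 1), 2)]          # diagonal is always allowed
    if i % gamma != 0:
        steps.append(((1, 0), 1))  # vertical step
    if j % gamma != 0:
        steps.append(((0, 1), 1))  # horizontal step
    return steps

# Inside the DP loop, replace the fixed transition set with:
#   for (di, dj), w in allowed_steps(i, j, gamma): ...
```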
In Section 3, we will characterize how effective the above methods are in dealing with varying global time warping conditions.
4. Discussion
In this section, we conduct two additional analyses to gain a deeper insight into our experimental results.
The first analysis is to characterize the runtime of different algorithms. We can do this in two ways: a theoretical analysis and an empirical analysis. The theoretical runtime can be considered in the following manner. Let N and M represent the lengths of the two sequences to be aligned, where we assume $N \leq M$ without loss of generality. Let T represent the number of possible transition types at each location in the cost matrix. Then, the computational complexity of standard DTW is $O(T \cdot N \cdot M)$. This applies to the seven systems that explore the use of different transition types and weights (DTW1-5, DTW1-add3, DTW1-add4) and the adaptive weighting schemes (AdaptiveWeight1-2). For the sequence length normalization methods, the computational complexity is $O(T \cdot N^2)$ when downsampling and $O(T \cdot M^2)$ when upsampling. For example, with N = 30k and M = 120k, downsampling reduces the number of cost matrix cells by a factor of 4 relative to standard DTW, while upsampling increases it by a factor of 4. For the non-uniform transition patterns, the average number of transition types per location is reduced to $3 - 2/\gamma$ (the vertical and horizontal transitions are each removed at a $1/\gamma$ fraction of positions), so the computational complexity is $O((3 - 2/\gamma) \cdot N \cdot M)$.
We used the following procedure to measure the empirical runtime. Given two sequence lengths, N and M, we randomly initialize two feature matrices of size $12 \times N$ and $12 \times M$, where the 12 is selected to simulate a chroma feature representation. We begin profiling after the feature matrices have been generated, and we include feature preprocessing (e.g., downsampling for sequence length normalization), pairwise cost matrix computation, dynamic programming, backtracking, and any postprocessing (e.g., compensating for downsampling). For each setting of $(N, M)$, we repeat this process 10 times and report the average runtime. All experiments were run on a 2.1 GHz Intel Xeon processor with 192 GB of DDR4 RAM, and all alignment algorithms are implemented in Cython.
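A simplified version of this measurement harness is sketched below (the actual experiments profile our Cython implementations; align_fn here stands in for any one of the alignment systems under test).

```python
import time
import numpy as np

def profile_alignment(align_fn, N, M, d=12, repeats=10, seed=0):
    """Average wall-clock runtime of align_fn on random d x N and d x M
    feature matrices (d = 12 mimics chroma features). align_fn should
    cover preprocessing, cost matrix computation, dynamic programming,
    backtracking, and postprocessing; feature generation is excluded."""
    rng = np.random.default_rng(seed)
    times = []
    for _ in range(repeats):
        x, y = rng.random((d, N)), rng.random((d, M))
        t0 = time.perf_counter()  # profiling starts after generation
        align_fn(x, y)
        times.append(time.perf_counter() - t0)
    return float(np.mean(times))
```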
Figure 4 compares the runtime of all 17 alignment algorithms. Each group along the horizontal axis corresponds to one setting of $(N, M)$, where N and M indicate the lengths of the shorter and longer sequence, respectively. We have included results for N = 1k, 3k, 10k, and 30k, with the global time warp factor $M/N$ ranging from 1 to 4. These settings allow us to see the effect of both the sequence length and the global time warp factor. For ease of viewing, the groups have been split into two subplots. Within each group, the different bars correspond to the 17 alignment algorithms; we use the same color scheme as in Figure 3. The height of each bar indicates the average runtime across 10 runs, where runtime is shown on a log scale. Note that the results for (30k, 120k) are missing for the two upsampling systems—this is because the upsampling approach creates a 120k × 120k pairwise cost matrix and cumulative cost matrix, which exceeded the RAM limits of our server.
There are several things to notice about Figure 4. First, the DTW1 through DTW5 systems have similar runtimes, since all are standard DTW with three allowable transitions. This is a baseline that represents typical DTW usage. Now, we consider the remaining groups. DTW1-add3 and DTW1-add4 (orange) have increased runtimes compared to the baseline, since they require more transition possibilities to be considered at each step of the dynamic programming stage. The sequence normalization methods (purple) have the most interesting runtime trends. When $N = M$, the runtimes of all four sequence normalization methods are similar to the baseline. However, as the global time warping factor $M/N$ increases, the downsampling variants show a significant decrease in runtime compared to the baseline, while the upsampling variants show a significant increase in runtime. This makes sense: when N = 30k and M = 120k, the downsampling variants will perform DTW on a 30k × 30k pairwise cost matrix while the upsampling variants will perform DTW on a 120k × 120k pairwise cost matrix. The adaptive weighting schemes (green) show similar performance to the baseline, with a decrease in runtime for the variant that only has two allowable transitions. Interestingly, the systems with non-uniform transition patterns (red) show a significant increase in runtime compared to the baselines. While in theory these methods require fewer transition possibilities to be considered, in practice the implementation of non-uniform transition patterns requires additional logic within the nested loops, which results in an increase in runtime.
The main takeaway from our runtime analysis is this: the downsampling sequence normalization methods are the clear winner. They have the best alignment accuracy across all global time warping conditions, and they also have the best runtime characteristics. Their runtime roughly matches that of standard DTW when $N = M$, and it drops substantially under more extreme global time warping conditions.
The second analysis is a sanity check. We have performed extensive experiments under carefully controlled global time warping conditions, and we have found a clear winner: sequence length normalization with downsampling. But the real question is this: does this method help in practice under real conditions? To answer this question, we compared the performance of the best standard DTW configurations (DTW1, DTW2) with their corresponding sequence normalized methods (DTW1-downsampleNN, DTW1-downsampleLin, DTW2-downsampleNN, DTW2-downsampleLin) on the original Mazurka Dataset. Table 5 compares the error rates of these six systems at three different error tolerances: 100 ms, 200 ms, and 500 ms.
There are two things to notice about Table 5. First, we see that sequence normalization consistently improves alignment accuracy. This improvement holds across different weighting schemes (DTW1, DTW2) and across all error tolerances. This suggests that the sequence normalization method may be useful as a general best practice, and not just when there is extreme global time warping. Second, we observe that the improvement is more pronounced for larger error tolerances. For example, when applying sequence normalization to DTW2, the error rate at a 100 ms tolerance decreases from 14.1% to 13.2% (a 6.4% reduction in errors), and at a 500 ms tolerance decreases from 3.6% to 2.6% (a 27.8% reduction in errors). Intuitively, we do not expect that downsampling a sequence will greatly improve its fine-grained alignment accuracy since downsampling throws away fine-grained information. But the main effect we are observing here is that sequence normalization reduces the likelihood that the ground truth alignment path will exceed a maximum allowable time warping factor (imposed by the selected set of allowable transitions). This results in a substantial improvement in alignment accuracy at mid- and coarse-level error tolerances.
It is important to point out the limitations of our study. The audio–audio alignment task we have selected has several distinctive characteristics that may limit the generalizability of our findings: (a) sequence elements (e.g., chroma features) have strong correlations over time, (b) sequence elements are continuous valued rather than discrete symbols, and (c) the true alignment path is (roughly) monotonically increasing, meaning that the alignment is a one-to-one mapping (rather than a many-to-one mapping). We cannot say whether our conclusions and findings will necessarily generalize to settings in which these characteristics are different. For example, if the sequences consist of discrete tokens with no temporal correlation, then downsampling may have a very different effect. Nonetheless, we are eager to study to what extent these conclusions may generalize to other domains in future work. Furthermore, we also point out that the sequence normalization method has two restrictions. First, it requires knowing the average global time warping factor a priori, usually through a boundary assumption that both sequences begin and end together. For situations like subsequence DTW, in which the boundary conditions are not known a priori, sequence normalization methods cannot be applied. Second, it only provides benefit in handling global (rather than local) time warping conditions. For example, if the true alignment path has an average global time warping factor close to one but contains a local section with extreme time warping, the sequence normalization method will not provide any benefit. This is also an area to explore in future work.