1. Introduction

As stated by Paulus et al. (), “[…] it is the structure, or the relationships between the sound events that create musical meaning”. Accordingly, researchers in MIR developed the Music Structure Analysis (MSA) task, which focuses on retrieving the structure of a song. Music structure is ill-defined, but is generally viewed as a hierarchical description, from the level of notes to the level of the song itself (; ). A tentative definition is that structure is a simplified representation of the organization of the song.

In that sense, motifs, which arise from the organization of notes, are a first level of structure. These motifs create patterns, progressions and phrases. In general, the highest level of structure defines musical sections, corresponding to “chorus”, “verse” and “solo”, which is a macroscopic description of music (). Some work focuses on estimating structure in its hierarchical nature (e.g. ; ; ), whereas the present work focuses on a “flat” level of segmentation, i.e. a macroscopic level, corresponding to musical sections. Given the high diversity of music, and the many ways structure can be conceived, we restrict this work to the study of Western modern (and in particular Western Popular) music. In particular, this work relies on both the RWC Pop () and the SALAMI () datasets, which are open-source and standard datasets in MSA.

MSA is subdivided into two subtasks, not necessarily mutually exclusive: the boundary retrieval task and the segment labelling task. The boundary retrieval task consists in estimating the boundaries between different sections, hence partitioning music into several non-overlapping segments, covering the entire song. The segment labelling task consists in grouping similar segments with the same label, typically letters such as ‘A’, ‘B’, ‘C’, etc. In this article, only the boundary retrieval task is considered. A schematic example of musical structure is presented in Figure 1.

Figure 1 

A schematic example of musical structure.

Algorithms aimed at solving MSA are designed according to one or several criteria among the following: homogeneity, novelty, repetition and regularity (). The homogeneity criterion assumes that a section consists of similar musical elements (notes, chords, tonality, timbre, …). Novelty is the counterpart of homogeneity: this criterion considers that boundaries are primarily located between consecutive musical elements that are highly dissimilar. Novelty is most salient at the junction between two distinct homogeneous zones (a “break” in homogeneity); conversely, homogeneous zones are delimited by points of high novelty. The third criterion, repetition, relies on a global approach to the song. The rationale is that a motif (e.g. a melodic line) may be time-varying (and thus heterogeneous), but can define a segment if it is repeated across the song (for example, a chorus). The repetition criterion may also be used to partition long segments into smaller repeated segments. Finally, the regularity criterion assumes that, within a song (and even within a musical genre), segments should be of comparable size.

1.1 Related Work

Many MSA algorithms make use of matrices representing the similarity and dissimilarity in music, sometimes referred to as “self-distance matrices” (), “self-similarity matrices” (), “recurrence matrices” or “pair-wise frame similarities” (). These representations differ in their details, but share the same conceptual idea: computing some form of similarity (or, conversely, some form of distance) between the different frames of music, and representing it in a square matrix (whose size is the number of frames). In this work, we use the term “self-similarity matrix”.

As a particular example, the novelty kernel (), which is among the earliest work on audio MSA, estimates boundaries as points of high dissimilarity between the recent past and the near future, by applying a square kernel matrix along the diagonal of the self-similarity matrix. This kernel works ideally when the recent past and the near future are each homogeneous in their respective neighborhoods, but highly dissimilar from each other. In practice, this kernel is convolved with the self-similarity matrix of the song, centered on the diagonal, which yields a “novelty” value for each temporal sample of the song, finally post-processed into boundaries with a thresholding operation.
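As an illustration, the following minimal sketch (assuming a precomputed self-similarity matrix as a numpy array) shows the core of this novelty computation; the plain checkerboard kernel and the absence of Gaussian tapering or peak-picking are simplifications compared to actual implementations such as MSAF.

```python
import numpy as np

def checkerboard_kernel(size):
    """Checkerboard kernel: +1 on the two 'homogeneous' quadrants,
    -1 on the two 'cross' quadrants, following Foote's design."""
    sign = np.ones(size)
    sign[:size // 2] = -1
    return np.outer(sign, sign)

def novelty_curve(ssm, kernel_size=16):
    """Correlate the kernel along the main diagonal of a self-similarity
    matrix; peaks of the resulting curve are boundary candidates."""
    n = ssm.shape[0]
    half = kernel_size // 2
    kernel = checkerboard_kernel(kernel_size)
    padded = np.pad(ssm, half, mode="constant")  # zero-pad the borders
    novelty = np.empty(n)
    for i in range(n):
        # window of the SSM centered on frame i, on the main diagonal
        window = padded[i:i + kernel_size, i:i + kernel_size]
        novelty[i] = np.sum(window * kernel)
    return novelty
```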

While this technique is rather simple, it is still used as a standard segmentation tool in recent work (e.g. ; ) focusing on improving the boundary retrieval performance by enhancing the self-similarity matrix. In particular, both of these works belong to the domain of representation learning (), which consists of designing machine learning algorithms that learn relevant representations instead of focusing on solving a particular task. In that context, using prior knowledge, both McCallum () and Wang et al. () design neural network architectures and optimization schemes with the objective of obtaining enhanced (nonlinear) similarity functions, better suited to highlighting the structure in the self-similarity matrices.

Notably, McCallum () develops an unsupervised learning scheme where the prior knowledge enforced in the representation is based on the proximity of samples: the closer the frames in the song, the more probable they belong to the same segment. In the same spirit, Wang et al. () develop a supervised learning scheme: the neural network learns representations where segments annotated with the same label are close, and segments annotated differently are far apart. The rationale for both methods is to learn a similarity function which is not only representing the feature-wise correspondence of two music frames, but can also discover frequent patterns in the learning samples.

McFee and Ellis () propose an algorithm based on spectral clustering, aiming at interpreting the repetitive patterns in a song as principally connected vertices in a graph. The structure is then obtained by studying the eigenvectors of the Laplacian of this graph, forming cluster classes for segmentation. This technique is amongst the best-performing unsupervised techniques nowadays, and was improved by recent work by Salamon et al. (), which replaces or enhances the acoustic features on which spectral clustering is applied with nonlinear embeddings, learned by means of a neural network.

Serrà et al. () develop “Structural Features”, which, by design, encode both repetitive and homogeneous parts. The rationale of these features is to compute the similarity between bags of instances, composed of several consecutive frames. In that sense, the similarity encodes the repetition of any sequence, which can be stationary (homogeneity) or varying (repetition). Boundaries are obtained as points of high novelty between consecutive structural features.

Finally, Grill and Schlüter () develop a Convolutional Neural Network (CNN) which outputs estimated boundaries. This CNN is one of the few techniques which does not compute a self-similarity matrix to later post-process it into boundaries, but it still uses self-similarities as input. The network is supervised on two-level annotations, on the SALAMI dataset (), and, according to the authors, using these two levels of annotations is beneficial to the performance.

While many algorithms are devoted to the task of boundary retrieval (see for instance the literature reviews of Paulus et al. () and Nieto et al. ()), research is still conducted towards more effective estimation algorithms. As presented above, in the past decade, research has mainly shifted from unsupervised to supervised algorithms, i.e. from low-informed estimation algorithms, generally designed with strong hypotheses, to algorithms which take advantage of (generally huge) annotated databases to learn mappings between musical features and annotated structural elements. While this shift has resulted in more effective algorithms, it has the disadvantages of requiring large training datasets, and of reproducing potential biases in the annotations, which are known to be prone to subjectivity and ambiguity ().

1.2 Contributions

In order to improve unsupervised algorithms, we propose in this article a novel approach based on the Correlation “Block-Matching” (CBM) algorithm. This algorithm was briefly introduced in previous work (, ) and deserves a more detailed presentation, which is one of the objectives of this work. Firstly, in line with the findings in (, ), we conjecture that the bar-scale is the most appropriate temporal scale from which to infer structure in Western modern music, and we present a framework which inherently processes music at this temporal scale. To the best of our knowledge, only a few works have relied on such a hypothesis (e.g. ; , ). This hypothesis is supported by experiments which compare segmentation performance when aligning state-of-the-art algorithms and the CBM algorithm on either the beat- or the bar-scale. We show a consistent advantage for the bar-scale alignment approach.

The CBM algorithm estimates the musical structure based on the principles introduced in the work of Jensen (), later extended by Sargent et al. (). In a nutshell, the CBM algorithm is based on the definition of a score function (further denoted as u) applied to segments, with the overall segmentation of the song resulting in the maximum total score of the set of segments. This defines an optimization problem, which can be solved by dynamic programming.

The novelty of the CBM algorithm lies in its extension of previous work by incorporating new hypotheses in the design of the score function u. As a consequence, the algorithm is highly customizable and can be tailored to specific hypotheses and applications, which is a potential area for future research. We also present a study of different similarity functions to account for the similarity between musical features, notably the Radial Basis Function (RBF), which, to the best of our knowledge, had not previously been used for MSA. Finally, we present experimental results which appear competitive with the most effective algorithm known to date ().

The CBM algorithm is unsupervised in the sense that the segment boundaries are estimated as solutions of an optimization problem which does not depend explicitly on annotated examples. Nonetheless, in order to accurately tune internal hyperparameters, the following experiments are carried out by separating the data into a “train” and a “test” dataset. In addition, we acknowledge that we use a learning-based toolbox for bar estimation, but this toolbox is independent from our work. In that sense, although we label this algorithm as “unsupervised”, it could also arguably be qualified as “weakly-” or “semi-”supervised.

This article is organized as follows: Section 2 presents in more detail the hypotheses and framework to process music in a barwise setting, Section 3 presents the CBM algorithm and Section 4 presents an evaluation of the CBM algorithm on the boundary retrieval task, along with a comparison with state-of-the-art algorithms.

2. Barwise Music Analysis

In most work in MSA (), the signal of a song is represented as time-sampled features, related to some extent to the frequency content of the song’s signal. In previous work on MSA, features have either been computed with a fixed hop length, typically between 0.1s and 1s according to Paulus et al. (), or, in more recent work, aligned on beats (; ; ). Beat alignment is musically relevant because it aligns the features and the estimations with a time segmentation consistent with the music performance. In this work, we hypothesize that the bar-scale is more relevant than the beat-scale to study MSA in Western modern music.

Bars seem well suited to express patterns and sections in Western modern music. Indeed, in Western musical notation, note lengths are expressed relative to beats, and beats are combined to form bars. Bars segment musical scores (with vertical bar lines), and similarities generally occur across different bars (as is particularly visible in the use of repeat signs, or of symbols such as “Dal Segno” and “Da Capo”). In addition, the intuition that musical sections are synchronized with downbeats is experimentally confirmed by Mauch et al. () and Fuentes et al. (), where the use of structural information improves the estimation of downbeats. Experiments supporting this hypothesis are presented in Section 2.3.

The direct drawback of barwise alignment is the need for a powerful tool to estimate bar boundaries. In this work, we use the madmom toolbox (), which performs bar estimation with a neural network (). In the 2016 MIREX contest, which was the last edition of the contest comparing downbeat estimation algorithms, this neural network obtained the best performance, and can hence be considered one of the state-of-the-art algorithms for the task. Even though some algorithms have since obtained better performance (e.g. ; ; ), we consider that the madmom toolbox achieves a satisfactory level of performance for our intended application.

2.1 Barwise TF Matrix

In this work, we represent music as barwise spectrograms, and more particularly as a Barwise TF matrix, following Marmoret et al. (). The Barwise TF matrix consists of a matrix of size B × TF, B being the number of bars in the song (i.e. a dimension accounting for the bar-scale), and TF the vectorization of both time (at bar-scale) and feature dimensions (representing the frequency to some extent) into a unique Time-Frequency dimension. The number of time frames per bar is fixed to T = 96, as in Marmoret et al. (). Following the work of Grill and Schlüter (), the signal is represented in log mel features, i.e. the logarithm of mel coefficients, expressed with F = 80 mel coefficients, but any other feature representation could be used instead. The rationale for using log mel spectrograms is that they lead to high segmentation performance (; ) while constituting a compact spectral representation, suited for music analysis.
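For concreteness, here is a hedged sketch of how such a Barwise TF matrix could be computed, assuming downbeat times (in seconds) estimated beforehand (e.g. with madmom); the function name, the use of librosa, and the nearest-neighbour resampling to T = 96 frames per bar are illustrative choices, not necessarily those of the reference implementation.

```python
import numpy as np
import librosa

def barwise_tf_matrix(signal, sr, downbeats, t=96, f=80):
    """Build a B x (t*f) Barwise TF matrix: each bar is resampled to
    t frames of f log-mel coefficients, then vectorized."""
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=f)
    log_mel = librosa.power_to_db(mel)  # (f, n_frames)
    frame_times = librosa.frames_to_time(np.arange(log_mel.shape[1]), sr=sr)
    bars = []
    for start, end in zip(downbeats[:-1], downbeats[1:]):
        # frames belonging to the current bar (assumed non-empty)
        idx = np.where((frame_times >= start) & (frame_times < end))[0]
        # nearest-neighbour resampling of the bar to exactly t frames
        picked = idx[np.linspace(0, len(idx) - 1, t).astype(int)]
        bars.append(log_mel[:, picked].T.flatten())  # (t*f,) TF vector
    return np.array(bars)  # (B, t*f)
```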

2.2 Barwise Self-Similarity Matrix

As stated in Section 1.1, a common representation in MSA is the self-similarity matrix, representing the similarities at the scale of the song. An idealized self-similarity matrix, extracted from Paulus et al. (), is presented in Figure 2. Similar passages are identified by two typical shapes: blocks and stripes. A block is a square (or a rectangle) in the self-similarity matrix, representing a zone of high inner-similarity, i.e. several consecutive frames which are highly similar, hence corresponding to the homogeneity criterion. A stripe is a line parallel to the main diagonal representing a repetition of the content, i.e. a pattern of several frames repeated in the same order, hence corresponding to the repetition criterion. As a general trend, the segmentation algorithms using self-similarity matrices are designed so as to retrieve segments based on blocks and stripes.

Figure 2 

An idealized self-similarity matrix, extracted from Paulus et al. ().

Given a Barwise TF matrix X ∈ ℝ^{B × TF}, the self-similarity matrix of X is defined as A(X) ∈ ℝ^{B × B}, where each coefficient (i, j) represents the similarity between vectors Xi, Xj ∈ ℝ^{TF}. Self-similarity matrices are computed from the Barwise TF representation of the song; therefore, each coefficient in the self-similarity matrix represents the feature-wise similarity of a pair of bars.

The similarity between two vectors is subject to a similarity function (the dot product for instance), and, as a consequence, different self-similarity matrices can be constructed. The main diagonal in a self-similarity matrix represents the self-similarity of each vector, and is in general (and in this work in particular) normalized to one. This work studies three different similarity functions, namely the Cosine, Autocorrelation and RBF similarity functions. The latter two represent novel contributions compared to our previous work ().

2.2.1 Cosine self-similarity matrix

The Cosine similarity function computes the normalized dot product between two vectors, and leads to the Cosine self-similarity matrix, denoted as Acos(X). Practically, denoting as X̃ the row-wise ℓ2-normalized version of X (i.e. the matrix X where each row has been divided by its ℓ2-norm), the Cosine self-similarity matrix is defined as Acos(X) = X̃X̃ᵀ, or, elementwise, for 1 ≤ i, j ≤ B:

(1)
$$A_{\cos}(X)_{ij} = \frac{\langle X_i, X_j \rangle}{\|X_i\|_2 \, \|X_j\|_2} = \sum_{k=1}^{TF} \tilde{X}_{ik} \tilde{X}_{jk}.$$

2.2.2 Autocorrelation self-similarity matrix

The Autocorrelation similarity function is defined for two bars Xi and Xj as corr(Xi, Xj) = (Xi − x̄)ᵀ(Xj − x̄), denoting as x̄ ∈ ℝ^{TF} the mean of all bars in the song.

The barwise Autocorrelation similarity function yields the Autocorrelation self-similarity matrix Acorr(X) as:

(2)
$$A_{\mathrm{corr}}(X)_{ij} = \frac{\langle X_i - \bar{x}, X_j - \bar{x} \rangle}{\|X_i - \bar{x}\|_2 \, \|X_j - \bar{x}\|_2}.$$

In other words, the Autocorrelation matrix is exactly the Cosine self-similarity matrix of the centered matrix X − 𝟙B x̄ᵀ (𝟙B denoting the all-ones column vector of size B), i.e. Acorr(X) = Acos(X − 𝟙B x̄ᵀ).

2.2.3 RBF self-similarity matrix

Kernel functions are symmetric positive definite or semi-definite functions. In machine learning, kernel functions are generally used to represent data in a high-dimensional space (sometimes infinite), enabling a nonlinear processing of data with linear methods (e.g. nonlinear classification with Support Vector Machines, SVM).

The Radial Basis Function (RBF) kernel is a kernel function defined as RBF(Xi, Xj) = exp(−γ ‖Xi − Xj‖₂²), γ being a user-defined parameter. The RBF can be used as a similarity function between two bars Xi and Xj, hence defining the RBF self-similarity matrix ARBF(X) as:

(3)
$$A_{\mathrm{RBF}}(X)_{ij} = \mathrm{RBF}(\tilde{X}_i, \tilde{X}_j) = \exp\left( -\gamma \left\| \frac{X_i}{\|X_i\|_2} - \frac{X_j}{\|X_j\|_2} \right\|_2^2 \right).$$

Bars are normalized by their ℓ2-norm in the computation of ARBF, in order to limit the impact of variations of power between bars. The self-similarity of a bar with itself is hence equal to e⁰ = 1.

Parameter γ is set relative to the standard deviation of the pairwise squared Euclidean distances between all (ℓ2-normalized) bars in the song (self-distances excluded), to adapt the shape of the exponential function to the distribution of distances in the song. Hence, denoting as

$$\sigma = \operatorname{std}_{\substack{1 \leq i, j \leq B \\ i \neq j}} \left( \left\| \frac{X_i}{\|X_i\|_2} - \frac{X_j}{\|X_j\|_2} \right\|_2^2 \right),$$

we set γ = 1/(2σ).

The RBF function may be useful for MSA, as the similarity between dissimilar elements fades rapidly due to the properties of the exponential function. Hence, the RBF similarity function emphasizes similar components (i.e. homogeneous zones). The three self-similarities are presented in Figure 3, computed on the Barwise TF matrix of song POP01 from RWC Pop.
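These three similarity functions can be computed from a Barwise TF matrix in a few lines of numpy; the following sketch mirrors Equations 1 to 3 and the above choice of γ (function names are ours).

```python
import numpy as np

def cosine_ssm(barwise_tf):
    # Equation (1): normalized dot products between all pairs of bars
    x = barwise_tf / np.linalg.norm(barwise_tf, axis=1, keepdims=True)
    return x @ x.T

def autocorrelation_ssm(barwise_tf):
    # Equation (2): Cosine SSM of the matrix centered on the mean bar
    centered = barwise_tf - barwise_tf.mean(axis=0)
    return cosine_ssm(centered)

def rbf_ssm(barwise_tf):
    # Equation (3): RBF on l2-normalized bars, gamma = 1 / (2 sigma)
    x = barwise_tf / np.linalg.norm(barwise_tf, axis=1, keepdims=True)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    off_diag = sq_dists[~np.eye(len(x), dtype=bool)]  # exclude self-distances
    gamma = 1 / (2 * np.std(off_diag))
    return np.exp(-gamma * sq_dists)
```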

Figure 3 

Cosine, Autocorrelation and RBF self-similarities for the song POP01 of RWC Pop.

2.3 Barwise MSA Experiments

Section 2 is based on the hypothesis that the bar-scale is more relevant than other time discretizations (in particular the beat-scale) to study MSA in Western modern music. To support this hypothesis, we present hereafter three experiments studying the differences in performance between beat-aligned and barwise-aligned estimations.

2.3.1 Aligning the annotations on the downbeats

As a preliminary experiment to study the impact of barwise alignment on the segmentation quality, we evaluate the loss in performance when aligning annotations on downbeats for both the RWC Pop () and SALAMI () datasets. In this experiment, each annotation is aligned with the closest estimated downbeat, and the barwise-aligned annotations are compared with the initial annotation using the standard metrics P0.5s, R0.5s, F0.5s and P3s, R3s, F3s (detailed in Section 4.1). Results are presented in Table 1. The RWC Pop annotations are barely impacted by the barwise alignment, suggesting that annotations are precisely located on downbeats. A loss in performance with short tolerances is observed on the annotations of the SALAMI dataset, either suggesting imprecise bar estimations or boundaries not located on the downbeats. Still, the levels of performance exhibited in Table 1 largely outperform the current state-of-the-art (≈ 80% vs ≈ 54% for the F0.5s metric, respectively for the downbeat-aligned annotations in Table 1 and for Grill and Schlüter (), whose results are presented in Figure 12). In that sense, the loss in performance induced by downbeat alignment may be compensated if estimations are indeed more precise due to this alignment.

Table 1

Standard metrics (see Section 4.1) when aligning the reference annotations on the downbeats (compared to the original annotations).


Dataset | P0.5s | R0.5s | F0.5s | P3s | R3s | F3s
SALAMI (Annotation 1) | 82.47% | 82.14% | 82.30% | 99.94% | 99.56% | 99.74%
SALAMI (Annotation 2) | 80.97% | 80.92% | 80.94% | 99.92% | 99.84% | 99.88%
RWC Pop | 96.46% | 96.21% | 96.33% | 100% | 99.73% | 99.86%

2.3.2 Downbeat-alignment for several state-of-the-art algorithms

A second experiment consists of post-processing the boundary estimations of three unsupervised state-of-the-art algorithms (; ; ), computed with the MSAF toolbox (), by aligning each boundary with the closest estimated downbeat. As these algorithms originally use beat-aligned features (resulting in beat-aligned estimations), this experiment compares beat-aligned estimations with downbeat-aligned estimations. Segmentation scores are presented in Figure 4 for both SALAMI and RWC Pop datasets.

Figure 4 

Segmentation results of state-of-the-art algorithms on the SALAMI-test and RWC Pop datasets, for beat-aligned (original) vs. downbeat-aligned boundaries. The SALAMI-test dataset is defined by Ullrich et al. (), and introduced in Section 4.2.1.

Results show that aligning estimated boundaries on downbeats results in a strong increase in performance for F0.5s, and in comparable results for F3s, on both datasets. Hence, when aligned on downbeats, estimations are more accurate, while the F3s metric is not significantly impacted by this alignment. These results suggest that downbeat alignment is beneficial on these datasets.

2.3.3 Focusing on Foote’s algorithm

Finally, a third experiment compares the results obtained with different time discretizations for Foote’s algorithm (), implemented in the MSAF toolbox (). In particular, this experiment compares the results when self-similarities are computed with beat-aligned and downbeat-aligned features.

As presented in Section 1.1, Foote’s algorithm estimates boundaries as points of high novelty. A novelty score is computed at each point of the self-similarity matrix by applying a kernel matrix, and this novelty score is finally post-processed into boundaries with a thresholding operation. In the original implementation, the Cosine self-similarity is computed on beat-synchronized features, i.e. one feature per beat. In this experiment, beats are estimated with the algorithm of Böck et al. (), which is one of the state-of-the-art algorithms in beat estimation, and is implemented in the madmom toolbox.

As in Section 2.3.2, the original results are compared with the ones obtained when aligning the estimations on downbeats. In addition, we compare the beat-synchronized results with two new feature processing approaches: bar-synchronized features, i.e. one feature per bar (instead of one per beat), and the Barwise TF matrix, introduced in Section 2.1.

The kernel is of size 66 for the beat-synchronized features (as originally set in MSAF), and of size 16 for both the bar-synchronized features and the Barwise TF matrix. Results are presented in Tables 2 and 3, respectively for the SALAMI-test and the RWC Pop dataset. In order to fairly compare the algorithms, we also fitted two hyperparameters of the original MSAF implementation for the bar-scale, namely the size of the median filtering applied to the input spectrogram and the standard deviation of the Gaussian filter applied to the novelty curve. These parameters were fitted in a train/test fashion, as detailed in Section 4.2.1.

Table 2

Different time synchronizations for the Foote (2000) algorithm on the SALAMI-test dataset. The SALAMI-test dataset is defined by Ullrich et al. (2014), and introduced in Section 4.2.1.


Time synchronization | P0.5s | R0.5s | F0.5s | P3s | R3s | F3s
Beat-synchronized (original) | 26.98% | 34.58% | 29.21% | 50.10% | 63.30% | 54.02%
Beat-synchronized (re-aligned on downbeats) | 31.05% | 39.15% | 33.33% | 50.08% | 62.95% | 53.78%
Bar-synchronized | 37.68% | 36.36% | 35.97% | 58.06% | 56.11% | 55.57%
Barwise TF Matrix | 39.22% | 42.66% | 39.67% | 59.60% | 64.82% | 60.36%

Table 3

Different time synchronizations for the Foote (2000) algorithm on the RWC Pop dataset.


Time synchronization | P0.5s | R0.5s | F0.5s | P3s | R3s | F3s
Beat-synchronized (original) | 31.86% | 24.38% | 27.29% | 67.21% | 51.92% | 57.95%
Beat-synchronized (re-aligned on downbeats) | 42.30% | 32.82% | 36.52% | 66.67% | 51.44% | 57.44%
Bar-synchronized | 43.53% | 26.32% | 32.46% | 69.25% | 42.22% | 51.97%
Barwise TF Matrix | 53.09% | 37.19% | 43.30% | 79.35% | 56.03% | 65.04%

Tables 2 and 3 point in the same direction: the best performance for the F0.5s and F3s metrics is obtained with the Barwise TF matrix. Similarly to Section 2.3.2, aligning beat-aligned estimations to downbeats in post-processing increases the performance for the F0.5s metric, indicating more precise estimations. It is worth noting that using bar-synchronized instead of beat-synchronized features increases the performance on the SALAMI dataset, while it decreases it on the RWC Pop dataset.

However, the Barwise TF matrix representation appears to be beneficial for MSA on both datasets. In the following experiments, we denote as “Foote-TF” the condition where Foote’s algorithm is applied to the Barwise TF matrix, whose results are shown in Tables 2 and 3.

3. Correlation “Block-Matching” Algorithm

With a self-similarity matrix as input, the Correlation “Block-Matching” segmentation algorithm (CBM) estimates boundaries by means of dynamic programming. This algorithm is detailed in this section, along with a study of important parameter settings which were not discussed in the previous work (, ), such as the block weighting kernels and the penalty functions. The CBM algorithm estimates boundaries based on the homogeneity/novelty and regularity criteria. The principles of dynamic programming are presented first, followed by the definition of a score function u applied on segments.

3.1 Dynamic Programming for Boundary Retrieval

3.1.1 Boundary retrieval problem

Given a music piece (song) sampled in time as N time steps, the subtask of boundary retrieval can be defined as finding a segmentation (set of boundaries) Z representing the start of each segment, i.e. Z = {ζi ∈ ⟦1, N⟧, i ∈ ⟦1, E⟧}, E representing the number of boundaries estimated in this song. The set of admissible segmentations is denoted as Θ, i.e. Z ∈ Θ. Each segment Si is composed of the time steps between two consecutive boundaries, i.e. Si = {l ∈ ⟦1, N⟧ | ζi ≤ l < ζi+1}. The second bound is exclusive, as it represents the start of the next segment Si+1. By definition, E boundaries define E − 1 segments. Boundary ζi is called the antecedent of boundary ζi+1.

3.1.2 Barwise boundary retrieval problem

In the proposed barwise paradigm, the song is discretized into B bars using N = B + 1 bar boundaries. Hence, the first boundary is the start of the song, i.e. ζ1 = 1, the last boundary is the end of the last bar in the song, i.e. ζE = B + 1, and each boundary is located on a bar, i.e. ∀i, ζi ∈ ⟦1, B + 1⟧. Each segment Si is composed of the bar indices between two consecutive boundaries.

As a consequence, there exist $\binom{B-1}{E-2}$ different sets of boundaries composed of exactly E boundaries (the first and last boundaries being fixed), and, more generally, at most $\sum_{k=0}^{B-1} \binom{B-1}{k} = 2^{B-1}$ segmentations for each song. Hence, the segmentation problem admits a finite number of solutions, and could theoretically be solved in a combinatorial way. In practice though, evaluating all possible segmentations leads to an algorithm of exponential complexity O(2^B), considered intractable in practice.
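This count can be checked by brute-force enumeration for a small number of bars, as in the illustrative sketch below; such an exhaustive enumeration is precisely what the dynamic programming scheme presented next avoids.

```python
from itertools import combinations

def all_segmentations(n_bars):
    """Enumerate every admissible segmentation: boundaries 1 and B + 1
    are fixed, the B - 1 interior positions 2..B are free."""
    interior = range(2, n_bars + 1)
    for k in range(n_bars):
        for chosen in combinations(interior, k):
            yield (1, *chosen, n_bars + 1)

# 2^(B-1) segmentations for a B-bar song, e.g. 16 for B = 5
assert sum(1 for _ in all_segmentations(5)) == 2 ** 4
```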

3.1.3 Dynamic programming

The boundary retrieval problem can be approached as an optimization problem (; ). In particular, by associating a score u(S) to each potential segment S, the optimal segmentation Z* is the segmentation maximizing the sum of all its segment scores:

(4)
$$Z^* = \underset{Z \in \Theta}{\operatorname{arg\,max}} \; \sum_{i=1}^{E-1} u\big(\llbracket \zeta_i, \zeta_{i+1} - 1 \rrbracket\big) = \underset{Z \in \Theta}{\operatorname{arg\,max}} \; u(Z)$$

by extending notation u for a set of segments.

The problem can be solved using a dynamic programming algorithm (; ), the principle of which is to solve a combinatorial optimization problem by dividing it into several independent subproblems. The independent subproblems are formulated recursively, and their solutions can be stitched together to form a solution to the original problem. Notice that in the current formulation of the segmentation problem, defined in Equation 4, each potential segment is evaluated independently, via its score, and is never compared with the others. In other words, repetitions of the same section are not considered, although they could inform the overall structure, typically through the repetition criterion. Thus, the segmentation problem defined in Equation 4 is a relaxation of the general segmentation problem. This relaxation is considered because it allows the use of dynamic programming principles, by evaluating the scores of all segments as independent subproblems. In particular, this relaxed problem is said to exhibit “optimal substructure” ().

3.1.4 Longest-path on a directed acyclic graph

Following the formulation of Jensen (), the segmentation problem can be reframed as finding the longest path in a Directed Acyclic Graph (DAG). The rationale of the solution algorithm is that the optimal segmentation up to any given bar bk can be found exactly by recursively evaluating the optimal segmentations up to each antecedent of bk, i.e. (without any constraint) all bars bl < bk, together with the scores of the segments ⟦bl, bk − 1⟧. Formally, denoting as Z*[1:bk] the optimal segmentation up to bar bk, each iteration of the CBM algorithm consists of:

  1. Looking up {u(Z*[1:bl]), bl < bk}, i.e. the scores of the optimal segmentations up to each antecedent, which are stored in an array when first computed,
  2. Computing {u(⟦bl, bk − 1⟧), bl < bk}, i.e. the score of each segment starting at bar bl and ending at bar bk − 1,
  3. Finding the best antecedent of bk, denoted as ζ*bk−1, with the following equation:
(5)
$$\zeta^*_{b_k - 1} = \underset{b_l < b_k}{\operatorname{arg\,max}} \left( u\big(Z^*_{[1:b_l]}\big) + u\big(\llbracket b_l, b_k - 1 \rrbracket\big) \right).$$

Finally, at the last iteration, the algorithm computes the best antecedent for B + 1, i.e. the last downbeat of the song. Then, recursively, the algorithm is able to backtrack the best antecedent of this antecedent, and so on and so forth back to the first bar of the song, thus providing the optimal segmentation. A graph visualization for a 4-bar example is presented in Figure 5. Pseudo-code for the CBM algorithm, assuming that the score function u is given, is detailed in the appendix (Algorithm 1).

Figure 5 

Example of computing an optimal segmentation with 4 bars.

In the end, for any bar bk, the optimal segmentation up to bk can be computed in O(bk − 1) operations, i.e. parsing each antecedent only once. Hence, the solution algorithm boils down to O(B(B + 1)/2) evaluations, which corresponds to a polynomial complexity. In practice, we further limit the size of admissible segments to at most 32 bars (set empirically), which further reduces the complexity.
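A minimal sketch of this dynamic program is given below; score_fn stands for the segment score u (e.g. Equations 7 and 12 below), the 32-bar limit is the one mentioned above, and the 0-based indexing is an implementation convenience (boundary b corresponds to bar b + 1 in the text).

```python
import numpy as np

def cbm_segmentation(n_bars, score_fn, max_size=32):
    """Optimal segmentation by dynamic programming; score_fn(start, end)
    returns the score u of the segment [start, end), in 0-based bars."""
    best_score = np.full(n_bars + 1, -np.inf)
    best_score[0] = 0.0
    antecedent = np.zeros(n_bars + 1, dtype=int)
    for end in range(1, n_bars + 1):
        for start in range(max(0, end - max_size), end):
            # optimal score up to `start`, plus the segment [start, end)
            candidate = best_score[start] + score_fn(start, end)
            if candidate > best_score[end]:
                best_score[end] = candidate
                antecedent[end] = start
    boundaries = [n_bars]  # backtrack from the last boundary
    while boundaries[-1] > 0:
        boundaries.append(antecedent[boundaries[-1]])
    return boundaries[::-1]
```

In a typical use, score_fn would crop the self-similarity matrix to the candidate segment and apply a weighting kernel, as described in the next section.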

3.2 Score Function

Finally, the segmentation problem boils down to the definition of the score function u(⟦ζi, ζi+1 − 1⟧) for a segment. In the CBM algorithm, and following (), the score of each segment is defined as a mixed score function, presented in Equation 6 as the balanced sum of two terms:

(6)
$$u\big(\llbracket \zeta_i, \zeta_{i+1} - 1 \rrbracket\big) = u_K\big(\llbracket \zeta_i, \zeta_{i+1} - 1 \rrbracket\big) - \lambda \, p(\zeta_{i+1} - \zeta_i).$$

The first term, uK(⟦ζi, ζi+1 − 1⟧), is based on the homogeneity criterion, and is presented in Section 3.2.1; the second one, p(ζi+1 − ζi), is based on the regularity criterion, and is presented in Section 3.2.2. Parameter λ balances the two terms.

3.2.1 Block weighting kernels

The first term uK of the score function in Equation 6 is obtained from the self-similarity values within a segment. Practically, given a self-similarity matrix A(X), the score uK(Si) of segment Si = ⟦ζi, ζi+1 − 1⟧ (of size n = ζi+1 − ζi) is computed by evaluating the self-similarity values restricted to Si, i.e. A(XSi) = A(X)[ζi : ζi+1 − 1]. It can be understood as cropping the self-similarity matrix A(X) to this particular segment, around the diagonal.

The CBM algorithm aims at favoring the homogeneity of estimated segments, i.e. favoring sections composed of similar elements. Thus, the score function uK is defined so as to measure the inner similarity of a segment. In practice, this is obtained through weighting local self-similarity values, by using a (fixed) weighting kernel matrix K, such as:

(7)
$$u_K : \mathbb{R}^{n \times n} \to \mathbb{R}, \qquad A(X_{S_i}) \mapsto \frac{1}{n} \sum_{k=1}^{n} \sum_{l=1}^{n} A(X_{S_i})_{kl} \, K_{kl}.$$

The kernel is called a “weighting kernel”. A first observation is that the weighting kernel needs to adapt to the size of the segment. A very simple kernel is a matrix full of ones, i.e. K = 𝟙n×n, resulting in a score function equal to the sum of every element in the (cropped) self-similarity matrix, normalized by the size of the segment. The normalization by the size of the segment turns the quadratic dependence on the segment size (n² self-similarity values) into a linear dependence. A linear dependence is desired, as it ensures that a segment of length n contributes to the sum of segment scores comparably to n segments of length 1.

The design of the weighting kernel defines how to transform bar similarities into segment homogeneity, which is of particular importance for segmentation. The remainder of this section presents two types of kernels, namely the “full” kernel and the “band” kernel. We consider that the main diagonal in the self-similarity matrix is not informative regarding the overall similarity in the segment, as its values are normalized to one. Hence, for every weighting kernel K used in the CBM algorithm, Kii = 0, ∀i.

Full Kernel The first kernel is called the “full” kernel, because it is full of 1s (except on the diagonal, where it is equal to 0). The full kernel captures the average value of the similarities in the segment, excluding the self-similarity values. Practically, denoting as Kf the full kernel:

(8)
$$K^{f}_{ij} = \begin{cases} 1 & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}$$

Hence, the score function associated with the full kernel is equal to:

(9)
$$u_{K^f}(S_i) = \frac{1}{n} \sum_{k=1}^{n} \sum_{l=1}^{n} A(X_{S_i})_{kl} \, K^{f}_{kl} = \frac{1}{n} \sum_{k=1}^{n} \sum_{\substack{l=1 \\ l \neq k}}^{n} A(X_{S_i})_{kl}.$$

A full kernel of size 10 is presented in Figure 6.

Figure 6 

Full kernel of size 10.

Band Kernels A second class of kernels, called “band” kernels, are considered in order to emphasize short-term similarity. Indeed, in band kernels, the weighting score is computed on the pairwise similarities of a few bars in the segment only, depending on their temporal proximity: only close bars are considered. In practice, this can be obtained by defining a kernel with entries equal to 0, except on some upper- and sub-diagonals. The number of upper- and sub-diagonals is a parameter, corresponding to the maximal number of bars considered to evaluate the similarity, i.e. an upper bound on |bi–bj| for a pair of bars (bi, bj).

Hence, a band kernel is defined according to its number of bands, denoted as v, defining the v-band kernel Kvb such that:

(10)
$$K^{vb}_{ij} = \begin{cases} 1 & \text{if } 1 \leq |i - j| \leq v \\ 0 & \text{otherwise (} i = j \text{ or } |i - j| > v \text{)} \end{cases}$$

Three band kernels, of size 10, are represented in Figure 7. Section 4 presents experiments which compare quantitatively the impact of the number of bands on the segmentation performance.

Figure 7 

Band kernels, of size 10.
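To make these kernel definitions concrete, here is a minimal numpy sketch of the full and band kernels, together with the score of Equation 7 (an illustration; the toolbox implementation may differ in details such as normalization):

```python
import numpy as np

def full_kernel(n):
    # ones everywhere except the main diagonal (Equation 8)
    return np.ones((n, n)) - np.eye(n)

def band_kernel(n, v):
    # ones on the v upper- and sub-diagonals only (Equation 10)
    kernel = np.zeros((n, n))
    for offset in range(1, min(v, n - 1) + 1):
        kernel += np.diag(np.ones(n - offset), k=offset)
        kernel += np.diag(np.ones(n - offset), k=-offset)
    return kernel

def kernel_score(ssm_segment, kernel):
    # Equation (7): weighted sum of similarities, normalized by segment size
    n = ssm_segment.shape[0]
    return np.sum(ssm_segment * kernel) / n
```

With, for instance, v = 7 and segments of at most 32 bars, these scores plug directly into the dynamic program of Section 3.1.4.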

3.2.2 Penalty functions

Sargent et al. () extended the score function of Jensen () to take into account both the homogeneity and the regularity criteria, resulting in Equation 6. In practice, this is obtained through defining a regularity penalty function p(n), corresponding to the second term in Equation 6, and penalizing segments according to their size n, to favor particular sizes.

The penalty function is based on prior knowledge, and aims at enforcing particular sizes of segments, which are known to be typical in a number of music genres, notably Pop music. In particular, Figure 8 presents the distributions of the sizes of segments, in terms of number of bars, in the annotations of both RWC Pop and SALAMI datasets. It appears that some sizes of segments are much more frequent in the annotations. Hence, penalty functions p can be derived from these distributions.

Figure 8 

Distribution of segment sizes in terms of number of bars, in the annotations.

Two different penalty functions p are studied in this section, namely the “target-deviation” and “modulo” functions. In what follows, n denotes the size of the segment, i.e. n = ζi+1 − ζi.

Target-Deviation Functions The first set of penalty functions, called “target-deviation” and denoted as ptd, is defined by Sargent et al. (). Target-deviation functions compute the difference between the size of the current estimated segment and a target size τ, raised to the power of a parameter α, i.e. ptd(n) = |n − τ|^α, where parameter α takes typical values in {0.5, 1, 2}. The target size is set by Sargent et al. () to 32, to favor segments of 32 beats, in line with their evaluation of the most frequent segment sizes. In our barwise context, τ = 8, which is the most frequent segment size in both the RWC Pop and SALAMI datasets.

This penalty function is adapted to enforcing one size in particular, and tends to disadvantage all the others. Hence, this function suits datasets where one size is predominant, which seems true for RWC Pop with MIREX 10 annotations (more than half of the annotated segments are 8 bars long), but less clear-cut for the SALAMI dataset, where segment sizes are more balanced between 4, 8, 12 and 16 bars, as presented in Figure 8. In particular, segments of size 16 are strongly penalized (|8 − 16|^α = 8^α).

Modulo function The second set of penalty functions, called “modulo functions”, is designed to favor particular segment sizes, directly based on prior knowledge. In this study, we only present the “modulo 8” function pm8(n), based on both the RWC Pop and SALAMI annotations. Indeed, in both datasets, most segments are of size 8, and the remaining segments are generally of size 4, 12 or 16. Finally, outside of these sizes, segments of even size are more frequent than segments of odd size. Hence, the modulo 8 function models this distribution as:

(11)
$$p_{m8}(n) = \begin{cases} 0 & \text{if } n = 8 \\ 1/4 & \text{else, if } n \equiv 0 \pmod{4} \\ 1/2 & \text{else, if } n \equiv 0 \pmod{2} \\ 1 & \text{otherwise} \end{cases}$$

Penalty values for the different cases were set quite intuitively, and would benefit from further investigation.
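Both penalty families are straightforward to implement; the sketch below follows Equation 11 and the target-deviation definition, with τ = 8 as in our barwise context.

```python
def modulo_8_penalty(n):
    """Modulo-8 penalty of Equation (11): favor 8-bar segments, then
    multiples of 4, then even sizes."""
    if n == 8:
        return 0.0
    if n % 4 == 0:
        return 0.25
    if n % 2 == 0:
        return 0.5
    return 1.0

def target_deviation_penalty(n, tau=8, alpha=1.0):
    """Target-deviation penalty of Sargent et al.: |n - tau|^alpha."""
    return abs(n - tau) ** alpha
```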

In order to balance the weighting score function uK and the penalty function p, we implemented an additional normalization step based on the weighting values obtained in each song, resulting in the score function defined in Equation 12.

(12)
$$u\big(\llbracket \zeta_i, \zeta_{i+1} - 1 \rrbracket\big) = \frac{u_K\big(\llbracket \zeta_i, \zeta_{i+1} - 1 \rrbracket\big)}{u_{\max}^{K_8}} - \lambda \, p(\zeta_{i+1} - \zeta_i),$$

In Equation 12, umaxK8 is the maximal weighting value obtained by sliding a kernel of size 8 along the diagonal of the self-similarity matrix, i.e. the highest score among all possible segments of size 8. The size of 8 is chosen as it is the most frequent segment size, in terms of number of bars, in both the RWC Pop and SALAMI datasets, as presented in Figure 8. Parameter λ is a constant, which is fitted as detailed in Section 4.
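The normalization constant umaxK8 can be obtained by a simple sliding evaluation along the diagonal, as sketched below; segment_score stands for the kernel score of Equation 7 applied to a cropped self-similarity matrix, and the song is assumed to contain at least 8 bars.

```python
def max_kernel8_score(ssm, segment_score, size=8):
    """Highest score among all segments of exactly `size` bars, used to
    normalize u_K in Equation (12)."""
    n_bars = ssm.shape[0]
    return max(segment_score(ssm[s:s + size, s:s + size])
               for s in range(n_bars - size + 1))
```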

Finally, in the CBM algorithm, the score of each segment is defined as in Equation 12. The first term, uK(⟦ζi, ζi+1 − 1⟧), is a weighting score, measuring the inner similarity of the segment. The second term, p(ζi+1 − ζi), penalizes or favors the segment depending on its size. Both scores are subject to design choices, which are studied and compared in the next section.

4. Experiments

4.1 Evaluation Metrics

The quality of the estimation obtained with the CBM algorithm is evaluated with the Hit-Rate metrics, comparing a set of estimated boundaries with a set of annotations by intersecting them with respect to a tolerance t (Ong and Herrera, 2005; ). In practice, given two sets of boundaries Ze and Za (respectively the sets of estimated and annotated boundaries), an estimated boundary ζie ∈ Ze is considered correct if it is close enough to an annotated boundary ζja ∈ Za (“close enough” meaning that the gap is no larger than the tolerance t), i.e. if ∃ ζja ∈ Za such that |ζie − ζja| ≤ t. Each estimated boundary can be coupled with at most one annotated boundary, and vice versa. The set of correct boundaries subject to the tolerance t, denoted as Ct, contains at most as many elements as the annotations or the estimations, i.e. 0 ≤ |Ct| ≤ min(|Ze|, |Za|). In case of perfect concordance between Ze and Za, Ct = Ze = Za. In practice, the concordance of Ct with Ze and Za is evaluated by the precision Pt, recall Rt and F-measure Ft:

  • Pt = |Ct| / |Ze|, i.e. the proportion of accurately estimated boundaries among the total number of estimated boundaries.
  • Rt = |Ct| / |Za|, i.e. the proportion of accurately estimated boundaries among the total number of annotated boundaries.
  • Ft = 2PtRt / (Pt + Rt) is the harmonic mean of both aforementioned measures. The harmonic mean is less sensitive to large values than the arithmetic (standard) mean, and is conversely more strongly penalized by low values. Hence, a high F-measure requires both a high recall and a high precision.

These metrics are computed using the mir_eval toolbox ().
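For reference, these metrics can be computed with mir_eval as follows; the boundary values are purely illustrative.

```python
import numpy as np
import mir_eval

# annotated and estimated boundaries, in seconds (illustrative values)
ref_boundaries = np.array([0.0, 12.5, 30.1, 45.8, 60.0])
est_boundaries = np.array([0.0, 12.4, 31.0, 60.0])

# convert boundary lists to (start, end) intervals, as expected by mir_eval
ref_intervals = np.column_stack([ref_boundaries[:-1], ref_boundaries[1:]])
est_intervals = np.column_stack([est_boundaries[:-1], est_boundaries[1:]])

# Hit-Rate precision, recall and F-measure at the two standard tolerances
p05, r05, f05 = mir_eval.segment.detection(ref_intervals, est_intervals,
                                           window=0.5)
p3, r3, f3 = mir_eval.segment.detection(ref_intervals, est_intervals,
                                        window=3)
```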

4.1.1 Tolerances in absolute time

In the boundary retrieval subtask, conventions for the tolerance values are 0.5s () and 3s (). The 3-second tolerance, citing Ong and Herrera (), is justified as being equal to “approximately 1 bar for a song of quadruple meter [NB: 4 beats per bar, e.g. a 4/4 meter] with 80 bpm in tempo”, while the 0.5-second tolerance is of the order of magnitude of a beat. In this work, we use both tolerance values to compare our algorithm with the standard algorithms, leading to 6 metrics: P0.5s, R0.5s, F0.5s and P3s, R3s, F3s.

4.1.2 Barwise-aligned tolerances

In this work, estimated boundaries are located on downbeat estimates, as explained and motivated in Section 2. In that sense, rather than evaluating the estimates in absolute time, we align each annotation with the closest estimated downbeat, leading to barwise-aligned annotations. This allows us to introduce additional metrics: P0bar, R0bar, F0bar and P1bar, R1bar, F1bar. The first three metrics (e.g. F0bar) consider that the tolerance is set to 0 bars, i.e. expecting estimates and annotations to fall precisely on the same downbeat, and the latter three metrics (e.g. F1bar) set the tolerance to exactly one bar between estimates and annotations. In particular, these metrics will be used to compare different settings of our algorithm.

4.2 Parametrization of the Algorithm

The CBM algorithm is evaluated on the boundary retrieval task on the entire RWC Pop dataset (), and on the test subset of SALAMI (), defined by Ullrich et al. (). The three similarity functions defined in Section 2.2 are used to compute the self-similarity matrices, namely Cosine, Autocorrelation and RBF. The CBM algorithm itself is subject to the choice of kernel, and particularly to the number of bands when using a band kernel. In addition, the score function depends on the design of the penalty function. Rather than studying all of these parameters at the same time, experiments focus on each aspect independently. In particular, the experiments aim at answering the following three questions:

  • Which similarity function is the most suitable for boundary retrieval in our context?
  • Which weighting kernel is the most suitable for boundary retrieval in our context?
  • Which penalty function is the most suitable for boundary retrieval in our context?

Each question is addressed sequentially, and the conclusion of each question serves as the basis to study the next ones.

4.2.1 Train/test datasets

These questions are addressed by comparing several parameters in a train/test fashion: a subset of the SALAMI dataset, called “SALAMI-train”, is used to evaluate several parameters, and the best one on this subset is evaluated on the remainder of the SALAMI dataset, called “SALAMI-test”, and on the entire RWC Pop dataset. The division between SALAMI-train and SALAMI-test is defined by Ullrich et al. (), based on the MIREX evaluation dataset. The details are available online, along with the experimental Notebooks, in the open-source toolbox. The SALAMI-train dataset contains 849 songs, and the SALAMI-test dataset contains 485 songs. The entire RWC Pop dataset contains 100 songs, resulting in a total of 585 songs for testing.

4.2.2 Self-similarity matrices

Firstly, we study the impact of the design of the similarity function on the performance of the CBM algorithm. To do so, we use the CBM algorithm with the full kernel, as it does not require fitting the number of bands, and without a penalty function. The boundary retrieval performance on the train dataset is presented in Table 4.

Table 4

Boundary retrieval performance with the different self-similarities on the train dataset (Full kernel, no penalty function).


Self-similarity | P0bar | R0bar | F0bar | P1bar | R1bar | F1bar
Cosine | 50.83% | 30.82% | 36.77% | 62.80% | 37.72% | 45.19%
Autocorrelation | 32.59% | 64.69% | 41.30% | 42.10% | 83.73% | 53.41%
RBF | 50.27% | 45.38% | 45.84% | 64.79% | 58.81% | 59.30%

The RBF self-similarity is the best-performing self-similarity in terms of F-measure (with both tolerances), hence suggesting a better boundary estimation on average than the other similarity functions. The results obtained with the RBF similarity function on the test datasets are presented in Table 5.

Table 5

Boundary retrieval performance with the RBF self-similarity, on both test datasets (Full kernel, no penalty function).


Dataset | P0bar | R0bar | F0bar | P1bar | R1bar | F1bar
SALAMI-test | 48.52% | 48.65% | 46.68% | 62.76% | 63.09% | 60.51%
RWC Pop | 60.72% | 53.61% | 56.01% | 77.68% | 67.62% | 71.09%

The precision/recall trade-offs depend on the self-similarity matrices, and deserve to be studied to give further information on the quality of the estimated segmentations. The Cosine self-similarity exhibits a higher precision than recall on average, which suggests an under-segmentation, i.e. estimating too few boundaries. Conversely, the Autocorrelation self-similarity results in a higher recall than precision, suggesting over-segmentation. The RBF self-similarity performance is more balanced between both metrics.

These conclusions can be confirmed by studying the distribution of the sizes of the estimated segments, as presented in Figure 9 on the SALAMI-train dataset. These distributions must be compared with the distribution of segment sizes in the annotations, presented in Figure 8.

Figure 9 

Distribution of segment sizes, with the full kernel, according to the self-similarity matrix. Results on the SALAMI-train dataset.

The distribution of segment sizes with the RBF self-similarity is visually the closest one to the distribution of annotations, which we confirm numerically by studying the Kullback-Leibler (KL) divergences between the distribution of the sizes of the estimated segments and of the annotated ones. The KL-divergences are respectively equal to 2.25, 0.85 and 0.35 for the Cosine, Autocorrelation and RBF similarity functions. Again, this suggests that the RBF similarity function is the most suitable.

4.2.3 Block weighting kernels

Secondly, an important parameter in the CBM algorithm is the design of the kernel. We thus compare the full kernel with band kernels, with the number of bands varying from 1 to 16. Results on the SALAMI-train dataset, computed on the RBF self-similarity matrices and focusing on the F-measures, are presented in Figure 10. The 7-band kernel stands out as the best-performing kernel, even if its performance is close to that of the 15-band kernel.

Figure 10 

Boundary retrieval performance (F-measures only) according to the full and band kernels (with different numbers of bands). Results on the train dataset with RBF self-similarity matrices.

The differences in performance between the different kernels may be explained by Figure 11, which presents the distribution of segment sizes according to the number of bands. The 7-band kernel leads to a majority of estimated segments of size 8 (more than 50%, twice as much as in the annotations), which is the most common segment size in the annotations, while the 15-band kernel mostly yields segments of size 16, and the full kernel spreads its estimates across the different segment sizes. The annotations are mostly composed of segments of size 8, then 4, 12 and 16. Hence, while the 7-band kernel does not accurately reproduce the annotated distribution, it obtains better boundary retrieval performance than the other kernels, indicating that this distribution is beneficial to the boundary estimation overall.

Figure 11 

Distribution of estimated segment sizes, according to different kernels, on the train dataset.

As an additional conclusion, the number of bands in the kernel largely influences the distribution of segment sizes, in particular the most frequent segment size. As a general trend, it seems that a kernel with v bands favors segments of size v + 1. We assume that this behavior stems from the fact that, for a v-band kernel and a large segment of size n > v, the number of kernel elements equal to 0 is large, while the normalization remains adapted to kernels with n² nonzero values. We found in practice that this effect could be dampened by normalizing the score associated with each kernel by the number of nonzero values plus the number of elements on the diagonal, instead of the size of the kernel, as () did, but this resulted in all kernels performing similarly to the full kernel, hence performing worse than the 7-band one.

Finally, as the 7-band kernel is the best-performing one, we fixed this kernel for both test datasets. Results obtained with this kernel are presented in Table 6.

Table 6

Boundary retrieval performance with the 7-band kernel, on both test datasets (RBF self-similarity, no penalty function).


Dataset | P0bar | R0bar | F0bar | P1bar | R1bar | F1bar
SALAMI-test | 37.24% | 59.80% | 44.33% | 50.38% | 80.52% | 59.88%
RWC Pop | 59.41% | 68.19% | 62.82% | 75.53% | 86.56% | 79.81%

4.2.4 Penalty functions

Finally, the last experiments focus on the penalty functions. In this set of experiments, we compare the target-deviation functions, with α ∈ {0.5, 1, 2}, with the modulo 8 function. The CBM algorithm is parametrized with the 7-band kernel, and is applied on the RBF self-similarity matrices. The parameter λ, balancing the penalty function, takes values between 0.01 and 0.2, with a step of 0.01. This parameter is fitted on the SALAMI-train dataset. Results are presented in Table 7.

Table 7

Boundary retrieval performance depending on the penalty function, for the SALAMI-train dataset, with the RBF self-similarity and the 7-band kernel.


Penalty function | Best λ | P0bar | R0bar | F0bar | P1bar | R1bar | F1bar
Without penalty | – | 40.26% | 57.38% | 45.81% | 54.26% | 77.67% | 61.81%
Target deviation, α = 1/2 | 0.01 | 40.38% | 57.36% | 45.88% | 54.37% | 77.57% | 61.84%
Target deviation, α = 1 | 0.01 | 40.45% | 56.98% | 45.81% | 54.61% | 77.20% | 61.89%
Target deviation, α = 2 | 0.01 | 39.75% | 54.32% | 44.43% | 54.93% | 75.31% | 61.46%
Modulo 8 | 0.04 | 41.04% | 58.34% | 46.63% | 54.25% | 77.44% | 61.72%

The modulo 8 function appears to slightly improve boundary retrieval performance for the metrics with a tolerance of 0 bar, indicating a more accurate estimation, but results with a tolerance of 1 bar are not strongly impacted by the choice of the penalty function. Results are close between the different penalty functions, except for the target deviation with a large α, which results in worse performance than the other conditions.

Overall, it seems that the modulo 8 function is the most suitable penalty function to estimate segments accurately. Hence, we use this penalty function for the test results, presented in Table 8, with parameter λ = 0.04, as optimized on the SALAMI-train dataset.

Table 8

Boundary retrieval performance with the modulo 8 penalty function (λ = 0.04), on both test datasets (RBF self-similarity, 7-band kernel).


Dataset | P0bar | R0bar | F0bar | P1bar | R1bar | F1bar
SALAMI-test | 38.36% | 60.96% | 45.44% | 50.76% | 80.51% | 60.09%
RWC Pop | 62.11% | 70.05% | 65.17% | 77.35% | 86.95% | 81.02%

4.2.5 Experimental conclusions

In light of these results, we finally sum up the situation regarding the choice of settings for the CBM algorithm.

  1. In our context, the RBF self-similarity matrix is the most suitable self-similarity matrix for boundary retrieval.
  2. In our context, the 7-band kernel is the most suitable kernel for boundary retrieval.
  3. In our context, the modulo 8 penalty function is the most suitable penalty function for boundary retrieval.

4.2.6 Metrics with tolerance in absolute time

As mentioned in Section 4.1, standard metrics for boundary retrieval performance consider the tolerance in absolute time (e.g. the F0.5s and F3s metrics), while we opted for barwise-aligned metrics in our experiments. Hence, Table 9 compares the boundary retrieval performance obtained when the tolerance is defined relative to the bars and in absolute time, which allows comparison with state-of-the-art algorithms. Boundary retrieval performance is almost equivalent on the RWC Pop dataset, and slightly altered for the metrics with short tolerances on the SALAMI dataset (F0bar and F0.5s). These discrepancies may be explained by the less precise downbeat alignment of annotations in the SALAMI dataset, presented in Table 1. Overall though, results remain similar, which tends to confirm the hypothesis of Ong and Herrera (2005) that a tolerance of 3 seconds corresponds approximately to a tolerance of 1 bar.

Table 9

Boundary retrieval performance, comparing the F-measures with tolerance expressed barwise and in absolute time.


Dataset | F0bar | F0.5s | F1bar | F3s
SALAMI-test | 45.44% | 42.00% | 60.09% | 60.61%
RWC Pop | 65.17% | 64.44% | 81.02% | 80.64%

4.3 Comparison with State-of-the-Art Algorithms

We compare the boundary retrieval performance obtained by the Foote-TF (introduced in Section 2.3.3) and the CBM algorithms with state-of-the-art algorithms. The performance of the CBM algorithm is obtained using the hyperparameters learned in Section 4.2.

This work considers seven different algorithms as state-of-the-art, categorized as either unsupervised or supervised algorithms, i.e. algorithms that either estimate boundaries without the use of training examples or analyze annotated examples before making predictions: four unsupervised algorithms (; ; ; ) and three supervised algorithms (; ; ). We additionally use previous work on the CBM algorithm () as a baseline.

All state-of-the-art algorithms use beat-aligned features, except Grill and Schlüter (), who use a fixed hop length, and Wang et al. (), who use downbeat-aligned features. Results for Foote (), McFee and Ellis () and Serrà et al. () are computed with the MSAF toolbox (), and realigned on downbeats in post-processing. Results for the CNN () are extracted from the 2015 MIREX contest. Results for McCallum (), Wang et al. () and Salamon et al. () are copied from the respective articles.

Figures 12 and 13 compare the results obtained with the CBM and the Foote-TF algorithms with those of the state-of-the-art algorithms. In this comparison, the CBM algorithm globally outperforms the other unsupervised segmentation methods and most of the supervised algorithms, and is competitive on the F3s metric with the global (supervised) state-of-the-art (). These results are promising and show the potential of the CBM algorithm, which performs well despite its relative simplicity. Additionally, the results of the CBM algorithm are on par with those of Foote-TF on the SALAMI dataset, further demonstrating the interest of the Barwise TF representation.

Figure 12 

Boundary retrieval performance of the CBM algorithm on the SALAMI dataset, compared to state-of-the-art algorithms. Hatched bars correspond to supervised algorithms. The star * represents algorithms where the evaluation subset is not exactly the same as ours, thus preventing accurate comparison.

Figure 13 

Boundary retrieval performance of the CBM algorithm on the RWC Pop dataset, compared to state-of-the-art algorithms. Hatched bars correspond to supervised algorithms.

4.4 CBM: Bar-Scale vs. Beat-Scale

In order to distinguish the impact of barwise alignment from that of the CBM algorithm itself on the segmentation results, we compare results obtained on the Barwise TF matrix, presented above, with results obtained on the “Beatwise TF matrix”, i.e. the equivalent of the Barwise TF matrix at the beat-scale. Following the intuition that most bars in Western modern music contain 4 beats, the Beatwise TF matrix is sampled with T = 24 samples per beat, i.e. 96/4. Beats are estimated with the algorithm of Böck et al. (), implemented in the madmom toolbox, as in Section 2.3. For a fairer comparison, the results at both the beatwise and barwise scales are computed without penalty function. Corresponding results are presented in Tables 10 and 11.

Table 10

CBM algorithm, performed on Barwise TF matrix vs. Beatwise TF matrix, on the SALAMI-test dataset. For fairer comparison, results at both scales are computed without penalty function.


SALAMI-test | P0.5s | R0.5s | F0.5s | P3s | R3s | F3s
Beatwise (Cosine, 63-band kernel) | 35.90% | 41.61% | 37.36% | 55.75% | 64.52% | 58.03%
Barwise (RBF, 7-band kernel) | 34.49% | 54.56% | 41.04% | 50.70% | 80.78% | 60.51%

Table 11

CBM algorithm, performed on Barwise TF matrix vs. Beatwise TF matrix, on RWC Pop. For fairer comparison, results at both scales are computed without penalty function.


RWC Pop | P0.5s | R0.5s | F0.5s | P3s | R3s | F3s
Beatwise (Cosine, 63-band kernel) | 46.22% | 44.38% | 44.57% | 72.54% | 68.85% | 69.51%
Barwise (RBF, 7-band kernel) | 59.09% | 67.13% | 62.28% | 75.17% | 85.90% | 79.47%

Results confirm that barwise-alignment is beneficial for the performance of the CBM algorithm, especially on the RWC Pop dataset, but the CBM algorithm with beat-aligned features obtains similar or better performance than the unsupervised state-of-the-art algorithms (; ; ; ), and is competitive with the supervised algorithm of Wang et al. () on both datasets for the metric F3s.

5. Conclusions

This article presented the CBM algorithm for performing Music Structure Analysis on audio signals, where boundaries between musical sections are computed by maximizing the homogeneity of each segment composing the segmentation, using dynamic programming with a penalty function. Moreover, barwise processing of music, using the Barwise TF matrix, is shown to increase segmentation performance. This work has also investigated several metrics to represent similarities between pairs of bars in a song. While the CBM algorithm has room for improvement, it achieves a level of performance which is competitive with the state-of-the-art, and therefore appears as a meaningful approach to investigate a variety of music representations without needing large collections of training data.

The design of the kernel clearly impacts boundary retrieval performance. Hence, future work could focus on studying alternative types of kernels. The kernel values could depend on the particular song or dataset considered, or follow particular statistical distributions. Of particular interest could be the learning of such kernels instead of an (empirical) definition. These latter comments are also valid for the penalty functions, whose values were set quite empirically, and which would benefit from deeper investigation. The number of bands in the weighting kernels seems to enforce particular segment sizes. This effect can be mitigated with normalization, or, conversely, further exploited, for instance by using different kernels concurrently, each one accounting for a different level of structure, hence studying segmentation hierarchically.

Weighting kernels presented in this article focus on the homogeneity of each segment, but other kernels could be considered in order to account for repetition in the song. In fact, the proposed framework is highly customizable with respect to weighting kernels, and could be adapted to the expected shape of segments. In particular, we expect that bridging this work with previous work () could further enhance performance.

6. Reproducibility

All the code used in this article is contained in the open-source toolbox (), along with the experimental Notebooks used to compute the experimental results.

A. Algorithm in Detail

The detailed algorithm, in pseudo-code, is presented hereafter.

In practice, the advantage of the algorithm is that both the optimal antecedent of each bar and the score of the optimal segmentation up to each bar are stored in memory when first computed, in the arrays respectively denoted as A* and U in Algorithm 1.


Algorithm 1 CBM algorithm, computing the optimal segmentation given a score function u(·).

Input: Bars {bk, k ∈ ⟦1, B⟧}, score function u
Output: Optimal segmentation Z* = {ζi}
U[1] ← 0
for bk ← 2 to B + 1 do
    A*[bk] ← argmax over bl < bk of (U[bl] + u(⟦bl, bk − 1⟧))
    U[bk] ← U[A*[bk]] + u(⟦A*[bk], bk − 1⟧)
Z* ← backtracking through A*, from B + 1 down to the first bar