1. Introduction

Classification experiments are arguably the most widespread tool for the evaluation of Music Information Retrieval (MIR) systems and methods (; ). Lack of proper control in such experiments leads to conclusions of questionable validity, yielding results that may fail to generalise beyond the experiment (). This hampers progress by obfuscating which research paths are worth pursuing, and demands revising conventional experimental practices (). We propose and illustrate a procedure for assessing how failing to control for particular sources of information in evaluation collections affects experimental results.

The partitioning of collections into training and testing materials affects validity in classification experiments, as the MIR community has long acknowledged. For instance, the presence of the same artists or albums in both training and testing materials artificially inflates performance estimates; these are known as artist and album effects, respectively (; ). Performance can also decrease if one tests using a separate collection () or manipulates recordings in presumably irrelevant ways (; ).

Pampalk et al. () introduced artist “filters” to counteract artist effects in music similarity experiments. Their approach, which we call “filtered partitioning”, creates training and testing collections not sharing a level of the factor one aims to control (e.g., artist information). This provides a single “regulated” testing condition (all testing instances follow a particular rule), alleviating the impact of the replication of that factor on performance estimates. Comparing regulated results from filtered partitioning with those from a conventional random partitioning enables assessing the impact of leaving a factor unregulated. Using this approach, studies (e.g., Flexer (); Sturm ()) show not only that unregulated collections might bias performance estimates, but also that the magnitude of such bias varies across feature representations and learning algorithms.

A major limitation of filtered partitioning for assessing the effect of leaving collections unregulated is that the regulated training and testing collections it creates likely contain different instances than those included in their unregulated counterparts. No single trained system is exposed to both regulated and unregulated testing conditions, which impedes disentangling the effects of training and testing. Moreover, as Marques et al. () note, the makeup of some collections constrains how many disjoint regulated partitions one can create (e.g., the number of cross-validation folds cannot exceed the number of artists per class). This may conflate the effect of the particular instances — their “difficulty” — with that of the (lack of) regulation.

Apart from altering the collection partitioning strategy, manipulating the raw data can also create regulated evaluation conditions (, ). This avoids the aforementioned limitations, as instances in both conditions match, but cannot regulate all factors (e.g., artists). Previous studies combine filtered partitioning with manipulations to control multiple factors simultaneously (), but inherit the limitations of filtered partitioning.

In this article, we describe both partitioning and manipulation approaches as alternative, but complementary, types of interventions in the experimental pipeline. These interventions create regulated evaluation conditions that can be used to characterise how the outcomes of music classification experiments are affected by “confounding”, a validity threat we examine in Sec. 2. We then introduce in Sec. 3 a procedure for combining multiple interventions that overcomes the limitations of filtered partitioning, including a novel resampling strategy aimed at gauging confounding effects. We focus on the effects of particular sources of confounding information on test results, as this is paramount for MIREX and similar evaluation exchanges, but the approach could be extended to assess effects in training. We illustrate our approach in Sec. 4 by analysing two known confounders in the GTZAN music genre collection (): artist replication and infrasonic content. This could be adapted to other domains with minimal adjustments. We finally discuss in Sec. 5 the main limitations and broader implications of our work, and provide concluding remarks in Sec. 6.

2. Confounding in Classification Experiments

Classification experiments dominate evaluation in both pure and applied machine learning research (; ). A classification experiment essentially involves measuring how well a prediction system, or family of systems, reproduces the annotations of a collection, which acts as a proxy for success in some real-world problem (). The diagram in Fig. 1 represents a simplified pipeline of a music classification experiment, introducing notation used later in this article.

Figure 1 

Pipeline of a single iteration k of a classification experiment evaluating a system construction method m (combination of feature extraction and learning algorithm) on a music collection D. Square-shaped nodes represent data structures; diamond-shaped nodes represent processes. A double border indicates a treatment factor with a fixed level. Solid lines indicate information flow; dashed lines join components of the same data structure. π is a data assignment/partitioning function. Dt is the training collection; Dp is the testing collection, with Rp the raw data (e.g., recordings) and Ap the corresponding annotations. (Rt and At omitted for simplicity.) s is the trained system, A^p the predicted annotations, ϕ the performance metric function, and y^ an estimate of the theoretical performance y – i.e., given the true distribution.

Any empirical study is subject to diverse validity threats that challenge the veracity and generality of its outcomes (; ). Among these, confounding is particularly relevant as it leads to invalid conclusions about causal relationships (). Two variables potentially influencing measurements are confounded if the experimental design cannot disentangle their effects (). Many experimental and quasi-experimental designs thus alleviate confounding by controlling extraneous variables other than the target of the study – explicitly setting or accounting for their values in the different experimental conditions – to avoid them impacting the measurements (; ).

Simple experimental design choices overcome the most obvious risks of confounding in classification experiments (). For instance, if one measures the performance of multiple systems each on different instances, the influence of such systems – the outcome of interest – becomes confounded with the selection of instances – an extraneous variable. This is easily avoided by comparing measurements on the same instances, a standard evaluation practice.

Subtler forms of confounding affecting the conclusions of classification experiments are receiving increasing attention in the applied machine learning literature (e.g., Chen and Asch (); Charalambous and Bharath ()). In particular, information not intrinsically linked with the problem of interest might incidentally relate with the annotations of evaluation collections, providing alternative means for systems to predict annotations in classification experiments. Causes of this phenomenon include selection bias (e.g., Mendelson et al. ()) and leakage (), which induce confounding by conflating success in addressing the target problem – the outcome of interest – with the exploitation of auxiliary information – an extraneous influence (). In this article, we focus on identifying and analysing the effects of these forms of confounding information.

If a collection is used in the evaluation of diverse problems and use cases, each case implicitly determines which content is potentially confounding. For instance, tempo information in a collection may be legitimate for identifying dance style, as the speed of a piece influences which dance moves are feasible, but not for identifying rhythmic patterns, as these should be invariant to reasonable variations in speed (; ). Artists tend to compose or perform music pieces of one or a few genres, yet artist properties are not essential to those genres (). If one’s sole aim is to attach genre tags to a fixed set of recordings, artist information will likely help; if one aims to assess whether a system captures the defining characteristics of music genres instead, then artist-specific content is extraneous. Other properties, however, such as the infrasonic content present in GTZAN (; ), are unlikely to be legitimately informative for any real problem.

The main risk of confounding in classification experiments is that conclusions fail to generalise. Systems might not succeed when deployed if they rely on information about a potential confounder being present, as there is no guarantee that the observed association will remain outside the experimental setting. The MIR community has adopted evaluation practices to counter this pitfall. The aforementioned filtered partitioning approach yields performance estimates free of the influence of the regulated potential confounder (). Others suggest leveraging data augmentation to avoid confounding information influencing the training process (; ). This synthetically generates combinations of background information and target categories that force systems to learn general concepts rather than incidental correlations.

As a homage to Clever Hans (), some MIR publications refer to systems exploiting confounding information as “horses” (, ). To assess whether a system is indeed a “horse”, one might test it on a completely separate collection from the one used for training (). This, however, does not reveal the source of the discrepancy. Others propose to illuminate the behaviour of trained systems through interpretable explanations of predictions () or interventions in the experimental pipeline (). We extend the latter approach to gauge how confounding impacts the outcomes of classification experiments.

3. Characterising Confounding Effects

We propose a simple procedure that uses interventions to characterise the effects of confounders in performance measurements from classification experiments, overcoming the limitations of filtered partitioning via a novel resampling strategy. We here focus on the effects in testing, but the procedure could be easily adapted to assess the effects in the training of systems, or a factorial combination of both.

3.1. Interventions on the Experimental Pipeline

In empirical studies, an intervention is the act of explicitly fixing a factor to one of its levels (). A conventional music classification experiment involves intervening on the system construction method, as Fig. 1 represents with a double-bordered node. This specifies evaluation conditions to compare, each with different feature extraction and/or learning algorithms, yielding estimates of differences in performance. Apart from this conventional intervention, one might also intervene on other steps of the pipeline to create further evaluation conditions. These may reveal information unavailable otherwise, such as the impact of a potential confounder.

Consider the train/test pipeline of a classification experiment, with training and testing materials drawn from a collection D. Let z be a potential confounder. If z correlates with the classes in some way within D, legitimately or not, then such correlation should appear in both training and testing instances unless a regulation is introduced, making z available for both training and prediction. Interventions regulating z thus impede its availability in such steps by breaking its correlation with the classes.

A classification experiment pipeline offers many opportunities for intervening. One might intervene on training or prediction, altering methods and systems to avoid relying on z. For instance, knowing which dimensions of the feature representations capture information related with z, one might regulate by removing or masking such dimensions in the feature extractor. This is the case in the tempo-invariant features of Dixon et al. (). Previous studies, however, often intervene on the creation of training and testing materials, through either “instance assignment” or “data manipulation” interventions.

Instance Assignment interventions regulate π, the criterion for assigning instances to either training or testing, taking z into account. These interventions thus require knowledge of z, i.e., the value that z takes for each instance. Properties such as artist, album, file format, or recording device are suitable for this approach.

Filtered partitioning belongs to this category, with the intervention involving an assignment function π′(D) that creates D′t and D′p, both containing different instances than their unregulated counterparts. Other strategies may distinguish between regulated and unregulated conditions only for testing, using the exact same training materials in both (i.e., π′(D) = (Dt, D′p)). This enables isolating the potential effect of z in the evaluation of fixed systems. If one aims to estimate the impact of z in system construction instead, a suitable intervention might fix the testing collection and create regulated and unregulated conditions distinguished only in the selection of training instances (i.e., π′(D) = (D′t, Dp)).
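
For concreteness, the snippet below sketches the testing-only variant in Python; the Instance structure and field names are hypothetical, assuming each instance stores its value of z (e.g., the artist) alongside the raw data reference and annotation:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Instance:
    recording: str  # hypothetical reference to the raw data
    label: str      # class annotation
    z: str          # value of the potential confounder (e.g., artist)

def regulated_test(D_t: List[Instance], D_p: List[Instance]) -> List[Instance]:
    """Keep D_t fixed and derive a regulated testing collection D'_p by
    discarding test instances whose value of z also appears in training."""
    z_train = {inst.z for inst in D_t}
    return [inst for inst in D_p if inst.z not in z_train]
```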

Data Manipulation interventions alter the raw data (e.g., audio recordings) in a way that preserves their membership to a class, but modifies the correlation between z and the classes. Manipulations such as pitch-preserving time-stretching () and high-pass filtering () have been used to this end. These interventions do not require instance-level knowledge of z, and they permit comparing predictions on the same instances (manipulated and not). Nevertheless, they require identifying and implementing suitable manipulations.

Similar to instance assignment interventions, data manipulation interventions may create regulated conditions in different ways. Given a class-preserving manipulation, one might transform instances in both Dt and Dp in the same way, thus obtaining a pair of regulated collections (D′t, D′p). This, however, may not break correlations if the manipulation is deterministic, failing to regulate z. It is more appropriate to keep either Dt or Dp unaltered and manipulate the other.

These types of interventions are complementary, as they affect different steps of the experimental pipeline, but it is feasible to stack various interventions affecting the same step (e.g., time-stretching and filtering recordings). They might be integrated into the experiment using a factorial design (), where each intervention creates an additional treatment factor with at least two levels: regulated and unregulated. Comparing measurements under combinations of such levels reveals the marginal and joint impact of the interventions, illuminating the effects of the potential confounders.
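
As an illustration of such a factorial layout, the sketch below enumerates all combinations of regulated/unregulated levels for J interventions applied to the same testing collection; each intervention is represented here by a hypothetical function that maps a testing collection to its regulated version:

```python
from itertools import product

def factorial_conditions(D_p, interventions):
    """Yield (labels, collection) pairs for the 2**J regulation conditions
    obtained by switching each of J interventions on or off over D_p.

    `interventions` maps a name (e.g., 'artist', 'infrasonic') to a function
    returning the regulated version of a testing collection.
    """
    names = list(interventions)
    for levels in product((False, True), repeat=len(names)):
        collection = D_p
        for name, regulated in zip(names, levels):
            if regulated:
                collection = interventions[name](collection)
        labels = {n: ("regulated" if r else "unregulated")
                  for n, r in zip(names, levels)}
        yield labels, collection
```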

3.2. Analysing Confounding with Interventions

To date, interventions on the experimental pipeline have been used to reveal whether a potential confounder affects the evaluation of particular methods or systems. Given an annotated music collection D, we now describe the steps we propose to extend this approach to assess how such a potential confounder impacts evaluations conducted on D over multiple methods, and how several potential confounders interact.

a) Identify potential confounders

As a prerequisite of the analysis, one should determine which potential confounders are worth considering for the collection and problem at hand. This may come from exploratory analyses of collections, published systems and/or domain knowledge.

b) Design interventions

For each identified potential confounder z, one should specify at least one suitable intervention to distinguish regulated and unregulated evaluation conditions with respect to z. The adequate type of intervention depends on the nature of z.

c) Create train/test materials

Let Dt be a training collection drawn from D, and (Dp, D′p) a pair of testing collections associated with Dt that differ only in whether they regulate a potential confounder z. In particular, Dp is drawn from D (usually D\Dt), and D′p comes from an intervention on the experimental pipeline. For instance, D′p might be a pruned version of Dp with instances whose value of z appears in Dt removed, or the result of a manipulation on the recordings in Dp for regulating z. If the analysis considers J interventions simultaneously, then one creates (at least) 2^J testing collections associated with Dt, one for each combination of regulation conditions.

To avoid the performance estimates being confounded with the selection of instances, it is advisable to create multiple training collections through a resampling strategy (). In this case, one would draw K training collections Dt,k and derive the testing collections associated with each as above. Conventional resampling strategies, however, cannot ensure testing collections from instance assignment interventions fulfil the intended regulation. The strategy we propose later in Sec. 3.3 addresses this issue.

d) Select methods

Characterising the impact of a potential confounder z requires a wide range of performance estimates. One may then train multiple systems on each Dt,k using diverse combinations of feature extraction and learning algorithms. We denote the total number of combined methods as M. These methods should cover a broad spectrum of modelling approaches and expected performance values. Optimisation is not essential if the goal is to gauge how different approaches behave when exposed to particular perturbations on the data and not to maximise performance, but plays an important role if this procedure is integrated into real evaluations.

e) Obtain performance estimates

For each trained system sn, 1 ≤ n ≤ KM, one can then compute figures of merit (e.g., accuracy, mean recall) in the corresponding testing collections. For simplicity, we call y^ and y^′ the generic unregulated and regulated performance estimates, respectively.

f) Relate regulated and unregulated estimates

As y^ and y^′ differ only in their regulation of z, one assumes any observed difference reflects an effect of z. Given enough (y^, y^′) pairs, one might estimate the expected relationship between regulated and unregulated measurements, y^′ ~ f(y^). Fitting a model of f(y^) from data pairs (y^, y^′) describes the confounding effect of z in evaluations on D. This reflects how a potential confounder tends to affect performance estimates of trained systems evaluated in the collection. For simplicity, we may use a linear model, such as

(1)
\hat{y}' \sim f(\hat{y}) = \alpha \cdot \hat{y} + \kappa

though other relationships (e.g., quadratic, exponential) could be preferable. If α ≈ 1 and |κ| ≫ 0, we say the confounding effect of z is mostly additive (i.e., the relationship between y^ and y^′ appears as a fixed effect); if α ≉ 1 and κ ≈ 0, we say it is mostly multiplicative (i.e., a gain). To estimate κ in the former case, one could average performance differences between conditions per iteration. Denote y^m,k the performance of a system trained with Dt,k using method m measured on a test collection Dp,k, and y^′m,k the measurement on the associated regulated test collection D′p,k; then:

(2)
\hat{\kappa} = \frac{1}{KM} \sum_{k=1}^{K} \sum_{m=1}^{M} \left( \hat{y}_{m,k} - \hat{y}'_{m,k} \right)

with K and M defined as above.
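
A minimal sketch of both estimates, assuming the mean-recall measurements are arranged in two K × M arrays (unregulated and regulated); the helper names and the synthetic data below are illustrative assumptions:

```python
import numpy as np

def additive_estimate(y_unreg, y_reg):
    """Eq. (2): average drop across the K iterations and M methods."""
    return float(np.mean(np.asarray(y_unreg) - np.asarray(y_reg)))

def linear_fit(y_unreg, y_reg):
    """Eq. (1): least-squares estimates of the gain alpha and offset kappa."""
    alpha, kappa = np.polyfit(np.ravel(y_unreg), np.ravel(y_reg), deg=1)
    return alpha, kappa

# Synthetic illustration with K = 40 iterations and M = 96 methods
rng = np.random.default_rng(0)
y_unreg = rng.uniform(0.2, 0.9, size=(40, 96))
y_reg = 0.7 * y_unreg + rng.normal(0.0, 0.02, size=(40, 96))
print(additive_estimate(y_unreg, y_reg))  # average decrease
print(linear_fit(y_unreg, y_reg))         # (alpha, kappa) close to (0.7, 0.0)
```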

In the general case, y^ and y^′ will not keep a simple relationship over all observations. Different system-construction methods can exploit a potential confounder differently, and the effect might also differ across classes. One may thus analyse the data marginally to identify clearly distinct behaviours.

If the analysis involves multiple interventions, comparing marginal and joint measurements can elucidate whether the different confounders (or approaches to the same confounder) interact. Let y^ be the performance estimated in the original testing collection, y^′1 and y^′2 the performances in testing collections from two different interventions, and y^′1,2 the performance on a testing collection subjected to both interventions. Apart from relating y^ with both y^′1 and y^′2 to analyse the effects of each confounder separately, one might compare the sum of those two marginal effects with the difference between y^ and y^′1,2. Let ΔA be the “accumulated” variation, defined as:

(3)
\Delta_A = (\hat{y} - \hat{y}'_1) + (\hat{y} - \hat{y}'_2)

and ΔR be the “real” variation:

(4)
\Delta_R = \hat{y} - \hat{y}'_{1,2}.

The difference ΔR − ΔA indicates whether the two confounding effects under study reinforce each other, do not interact, or overlap. This can be generalised to higher-order interactions if more interventions coexist.
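
A worked numeric sketch of this comparison, with made-up mean-recall values for a single trained system:

```python
# Hypothetical mean recalls for one trained system
y_hat    = 0.70   # original testing collection
y_hat_1  = 0.60   # regulated for confounder 1 only
y_hat_2  = 0.62   # regulated for confounder 2 only
y_hat_12 = 0.58   # regulated for both confounders

delta_A = (y_hat - y_hat_1) + (y_hat - y_hat_2)   # Eq. (3): 0.18
delta_R = y_hat - y_hat_12                        # Eq. (4): 0.12

# delta_R - delta_A < 0 here, suggesting the two effects partially overlap
print(delta_R - delta_A)   # -0.06
```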

3.3. Regulated Bootstrap Resampling

The procedure above requires multiple distinct train/test pairs. Various resampling strategies address this, but none is entirely suitable for instance assignment interventions. In particular, the fixed size of the partitions in k-fold cross-validation (kCV) impedes adjusting to imbalances in the presence of the potential confounder z. Bootstrap sampling (), drawing |D| training instances with replacement from the whole collection D, overcomes this issue. Sampling with replacement is often preferred in the statistical learning literature (; ), as it enhances the statistical properties of the generated samples over kCV, such as reducing the variance of the derived estimates (; ). Nevertheless, training collections generated with bootstrap sampling may not permit suitable regulations if, e.g., too many instances in Dp = D\Dt have values of z also in Dt.

To address these issues, we propose regulated bootstrap, a multi-phase resampling strategy expressed in Alg. 1. The algorithm takes as input a collection D (sequence of instances, each a tuple (r, a, z)i of data element ri, class annotation ai from the set A, and attribute zi from the set Z) and the desired number of recordings per class nr. It first attempts to create a pair (Dt, Dp) using stratified bootstrap – sampling with replacement from each class separately. If this cannot derive a regulated testing collection D′p of at least nr instances, it then proceeds to a partially-curated approach. This may be repeated an arbitrary number of times. The output of each sampling run can then be used to generate a D′p through pruning: removing all instances in Dp whose z also appears in Dt. Although the pruned instances do not appear in D′p, they cannot be added to Dt as they remain in Dp. Supplementary material S1 describes a simple illustrative example of regulated bootstrap.

Some aspects of the algorithm deserve clarification. First, it does not immediately accept the pair generated after Step (3h), as instances might relate with more than one value of z (e.g., a song might be a collaboration between two artists), making different dz overlap. In that case, the number of unique elements of dh might fall short of the specified minimum, requiring multiple attempts until the algorithm finally succeeds. Second, the algorithm does not restrict the same value of z from appearing across different classes, so as to avoid benefiting systems that exploit z. Finally, the sampling is performed at instance level to favour the scalability of the algorithm, allowing regulations over multiple z simultaneously in the future.

Algorithm 1

Regulated Bootstrap resampling strategy, given a collection D and a threshold nr ∈ ℕ.


RegulatedBootstrap(D, nr):
- Initialise: Dt ← (∅), Dp ← (∅)
- For each a ∈ A:
    0. Define Da as the instances in D with ai = a;
    1. Phase 1: Stratified Bootstrap Sampling
        (a) Create dt by uniformly sampling with replacement |Da| instances from Da;
        (b) Create dp ← Da\dt;
    2. Phase 2: Size Verification
        (a) Define Zt as the union of all zi in dt;
        (b) Create d′p by selecting all instances (r, a, z)i in dp with no element of zi in Zt;
        (c) If |d′p| < nr, proceed to Phase 3, as dp lacks enough regulated instances; otherwise, go to Phase 4;
    3. Phase 3: Curated Sampling
        (a) Define Za as the union of all zi in Da;
        (b) Initialise a hold-out collection dh ← (∅);
        (c) Randomly select a z ∈ Za, and remove it from Za;
        (d) Define dz as the instances in Da with z ∈ zi;
        (e) Append dz to dh: dh ← dh ⌢ dz;
        (f) If |dh| < nr, go to (3c), as dh still lacks enough instances;
        (g) Create dt by uniformly sampling with replacement |Da| instances from Da\dh;
        (h) Create dp ← Da\dt;
        (i) Go to Phase 2 to check size requirements;
    4. Phase 4: Concatenation
        (a) Append dt to Dt: Dt ← Dt ⌢ dt;
        (b) Append dp to Dp: Dp ← Dp ⌢ dp;
- Return: train/test pair (Dt, Dp)

Although class-wise computations ensure stratification in the training collections, the associated testing collections will likely be imbalanced and of different size across iterations. Moreover, pruning causes regulated and unregulated testing collections to differ in size. If these issues raise reliability concerns, it might prove useful to randomly prune test collections under both conditions to a fixed size per class, such as nr or a larger value if suitable. The choice of nr depends on the context, but aiming at a number of regulated instances at least equal to the size of a fold in 10CV might be a good rule of thumb, both overcoming these issues and avoiding sample size concerns. In case nr is too high and it becomes impossible to create Dp, it is trivial to include an exit condition in the algorithm. Along with collecting instance-level information about z, if missing, only the choice of nr requires human involvement in this otherwise automated resampling strategy.
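
The following Python sketch implements Alg. 1 and the subsequent pruning step, under the assumption that each instance stores its confounder values as a set (accommodating, e.g., collaborations between artists); the exit condition against unattainable values of nr discussed above is omitted for brevity:

```python
import random
from collections import namedtuple

# Hypothetical instance representation: raw data reference, class annotation,
# and a frozenset of confounder values (e.g., artists).
Instance = namedtuple("Instance", ["r", "a", "z"])

def regulated_bootstrap(D, n_r, rng=None):
    """Sketch of Alg. 1. Returns an unpruned (D_t, D_p) pair; pruning D_p of
    instances sharing confounder values with D_t yields the regulated D'_p."""
    rng = rng or random.Random()
    D_t, D_p = [], []
    for a in {inst.a for inst in D}:
        D_a = [inst for inst in D if inst.a == a]
        # Phase 1: stratified bootstrap sampling
        d_t = [rng.choice(D_a) for _ in range(len(D_a))]
        d_p = [inst for inst in D_a if inst not in d_t]
        while True:
            # Phase 2: verify that a regulated test set of size n_r is derivable
            Z_t = set().union(*(inst.z for inst in d_t))
            if len([inst for inst in d_p if not (inst.z & Z_t)]) >= n_r:
                break
            # Phase 3: curated sampling, holding out whole values of z
            Z_a = list(set().union(*(inst.z for inst in D_a)))
            rng.shuffle(Z_a)
            d_h = []
            while len(d_h) < n_r and Z_a:
                z = Z_a.pop()
                d_h += [inst for inst in D_a if z in inst.z]
            d_t = [rng.choice([i for i in D_a if i not in d_h])
                   for _ in range(len(D_a))]
            d_p = [inst for inst in D_a if inst not in d_t]
        # Phase 4: concatenation
        D_t += d_t
        D_p += d_p
    return D_t, D_p

def prune(D_t, D_p):
    """Derive the regulated testing collection D'_p from an output pair."""
    Z_t = set().union(*(inst.z for inst in D_t))
    return [inst for inst in D_p if not (inst.z & Z_t)]
```

In the case study below, a routine like this would be invoked once per iteration (K = 40 draws with nr = 10 in Sec. 4.1.2), with the pruning step deriving the regulated testing collection from each returned pair.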

4. Application to GTZAN

We now illustrate the analysis procedure proposed in Sec. 3, applying it to investigate the confounding effects of artist replication and infrasonic content in classification experiments involving the GTZAN music genre collection (). The presence of multiple known confounders that can be regulated using different intervention types makes this collection ideal to showcase the factorial analysis approach we propose. The code is available online.

4.1. Data and Machine Learning Methods

4.1.1. About the GTZAN Collection

GTZAN () is the most widely used public collection for music genre recognition. It contains 100 30-second music recordings of each of 10 categories: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. GTZAN has been used in the evaluation of over a hundred published studies (), and remains a benchmark collection in recent publications (e.g., Choi et al. ()).

Sturm () provides a thorough analysis of the contents of GTZAN, reporting repetitions, distortions and mislabellings, highlighting the replication of artists in many classes. At the moment of writing, all but 23 of the 1000 recordings in GTZAN have been identified. (An updated index is included with the code.) Fig. 2 summarises the artist distribution for each class in GTZAN, assuming all artists from still unidentified excerpts are unique. Queen is the only artist known to appear across classes in the collection (rock and metal). blues remains the class with highest artist replication, with all but one artist appearing in more than 10 excerpts. In reggae, a single artist (Bob Marley) appears in more than a third of the excerpts. This complicates creating conventional artist filters.

Figure 2 

Artist distribution across classes in GTZAN, showing the number of unique artists (Top) and the quartiles of the number of excerpts per artist (Bottom) in each class. Dots indicate outliers.

Rodríguez-Algarra et al. () highlight a further issue in GTZAN. Some recordings contain acoustic information at frequencies below 20 Hz associated with genre annotations, although it is not yet clear which. Such infrasonic information is arguably extraneous for the problem of genre recognition.

4.1.2. Evaluation Conditions

We draw training and testing instances from GTZAN using the regulated bootstrap resampling strategy described in Sec. 3.3, regulating over artist metadata. In particular, we draw K = 40 pairs with nr = 10. This ensures that at least 10 recordings per GTZAN class in each testing collection feature no artist that appears in its corresponding training collection. Table 1 includes estimates of the proportion of train/test samples that require curated sampling to achieve this.

Table 1 

Estimated proportion of train/test samples requiring curated sampling for each GTZAN class if drawn using Alg. 1 to regulate over artists, from 100,000 simulations with nr = 10.

Fig. 3 shows the distribution of the number of unique excerpts per class across iterations. Although all training collections contain exactly 100 excerpts per class, some of them are repeated. The expected number of unique instances in a bootstrap sample drawn from 100 elements is 63.2 (), approximately what Fig. 3 (Top) shows for the training collections despite the curation. The size of the testing collections (with and without pruning) matches their number of unique excerpts, as they contain no duplicates. Fig. 3 (Top) also shows that training collections generally include more unique excerpts than their corresponding testing collections. Some outliers appear in reggae due to the large proportion of Bob Marley recordings. Fig. 3 (Bottom) highlights the expected decrease in artist variety after pruning. As suggested by Fig. 2, blues suffers from the lowest variety in all collections.

Figure 3 

Distribution of the number of unique excerpts (Top) and artists (Bottom) per class in the training and testing collections sampled from GTZAN using bootstrap regulated over artists.

We also manipulate every recording in GTZAN similarly to the audio filtering intervention by Rodríguez-Algarra et al. (). We design a high-pass IIR filterbank, with stop-band frequency at 19 Hz, pass-band frequency at 20 Hz, 60 dB attenuation in the stop-band, and a maximum 1 dB ripple allowed in the pass-band. Combining which recordings are included in the collections with their audio filtering status defines six distinct evaluation conditions for each iteration. We refer to these conditions as train, test, and pr. test, appending “(filt.)” to their name (e.g., train (filt.)) if the recordings have been high-pass filtered.
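
A minimal sketch of a filter meeting this specification with scipy is shown below; the elliptic design and the direct (non-zero-phase) application are assumptions rather than an exact record of the implementation, while 22050 Hz is GTZAN's sampling rate:

```python
import numpy as np
from scipy import signal

FS = 22050  # GTZAN recordings are sampled at 22050 Hz

# High-pass design matching the stated specification: stop-band edge at 19 Hz,
# pass-band edge at 20 Hz, at least 60 dB stop-band attenuation, and at most
# 1 dB pass-band ripple (elliptic prototype assumed here).
sos = signal.iirdesign(wp=20, ws=19, gpass=1, gstop=60,
                       ftype="ellip", output="sos", fs=FS)

def remove_infrasound(audio: np.ndarray) -> np.ndarray:
    """Attenuate sub-20 Hz content in a mono recording."""
    return signal.sosfilt(sos, audio)
```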

4.1.3. Feature Extraction and Learning Algorithms

We train prediction systems using multiple combinations of feature representations and learning algorithms. The learning algorithms we employ cover a wide range of supervised learning approaches, from parametric to non-parametric. In particular, we use scikit-learn implementations of: Naive Bayes (NB), 1- and 5-Nearest Neighbours (1-NN and 5-NN), Decision Trees with and without AdaBoost (ABDT and DT), Random Forests (RF), Support Vector Machines (SVM), and Multi-layer Perceptrons (MLP). In order to gauge how confounding affects measurements, we need a variety of modelling approaches whose performance on GTZAN spans the whole axis, including its lower end, and not necessarily the best-performing ones. We thus use out-of-the-box implementations and avoid hyperparameter tuning, which allows us to increase the number of methods and iterations considered. Therefore, the reported performances should not be taken as representative of the potential of each method.
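
The sketch below illustrates such a pool of out-of-the-box learners and the figure of merit used later; the specific scikit-learn classes (e.g., Gaussian NB) and their default hyperparameters are assumptions rather than an exact record of our setup:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import recall_score

# Out-of-the-box learners, deliberately left untuned.
LEARNERS = {
    "NB": GaussianNB(),
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "DT": DecisionTreeClassifier(),
    "ABDT": AdaBoostClassifier(),  # boosted decision trees/stumps by default
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
    "MLP": MLPClassifier(),
}

def mean_recall(y_true, y_pred):
    """Macro-averaged recall, the figure of merit used in Sec. 4."""
    return recall_score(y_true, y_pred, average="macro")
```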

We select multiple feature representations, focusing on different aspects of the audio signals, from two sources: the essentia music extractor () and the scattering-based audio features by Andén and Mallat (). We group the features extracted from essentia into 8 disjoint sets: Rhythm, Tonal, Tim+Dyn (i.e., timbre plus dynamics), MFCC, GFCC, Barkbands, Melbands, and Erbbands, referred to jointly as non-scattering features hereinafter. Regarding the scattering-based features, we compute Mel-scaled (Mel Sc.), first-layer (1-L Sc.), and joint first- and second-layer time-scattering features (1&2-L Sc.). Unlike non-scattering features, these express frame-level information, so we add excerpt-level summary statistics of first-layer time-scattering features (Des. 1-L Sc.).

4.2. Instance Assignment: Artist Information

We first compare measurements obtained in test and pr. test to assess the effect of artist replication. Other than size, these conditions differ only in whether their artist content is regulated. We train systems using every combination of the selected feature extractors and learning algorithms on each of the K training collections drawn, yielding 40 × 12 × 8 = 3840 distinct systems. Fig. 4 shows performance statistics across iterations, using mean recall as metric to compensate for class imbalances derived from the resampling strategy employed. We see systematically lower performance in pr. test than in the other conditions, agreeing with results in Sturm ().

Figure 4 

Mean recall (± standard deviation) in train, test, and pr. test for each regulated bootstrap iteration over all combinations of feature extraction and learning algorithms on original GTZAN recordings. Position 0 represents the mean recall over all iterations.

Only 12.8% of all measurements in pr. test are greater than or equal to their counterparts in test. From 100 simulations using randomly generated subsets of test with the same class sizes as in pr. test, we find that figure to be on average 53.7% (±2.3) without the regulation. Moreover, 15.6% (±0.5) of measurements in pr. test are greater than or equal to their counterparts in the simulations, compared to an average of 54.4% (±2.3) between simulations (see Supplementary Material S3). This suggests the performance differences arise from the regulation rather than from differences in size.

An estimate of κ according to Eq. (2) yields a decrease in mean recall of approximately κ^   ≈   0.085 (8.5 percentage points). A closer look at the measurements reveals the naivety of this approach. Fig. 5 shows that, despite consistently lower results in pr. test than in test, the distribution of the performance metrics varies widely when marginalised over class, feature set or learning algorithm. This suggests the confounding effect of artist replication in GTZAN does not impact performance measurements as an additive fixed effect, i.e., that the effect interacts with the classes, features, and learning algorithms.

Figure 5 

Quartiles of (mean) recall distribution obtained in train, test, and pr. test, marginalised over GTZAN class (Top), feature set (Middle), and learning algorithm (Bottom).

For each GTZAN class, we see clear differences in the distributions of recall. The largest difference by far occurs in blues, with an average drop of 19 percentage points — a relative decrease of more than 53%. This behaviour might be expected, as blues is the class in GTZAN with the least artist variety. Similarly, the average recall in reggae drops 9.7 percentage points (almost 30% relative decrease), which may relate to one artist dominating the class. The relative decrease in pop is even higher (32.4%), and might arise from duplicate recordings in that class ().

At the other end of the spectrum, we find metal, classical and disco suffer average relative decreases in recall below 10% (7.7%, 8.1%, and 9.6%, respectively). Fig. 2 shows disco is the class in GTZAN with the largest artist variety. Despite having less than half the number of unique artists, however, metal and classical not only suffer the smallest relative average decreases, but also yield the highest average recall in both test and pr. test. This suggests these classes are so different from the others in GTZAN that they are distinguished even without artist-specific information.

Marginalising over feature extraction method, Fig. 5 shows systems using scattering-based features tend to obtain higher performances than those using non-scattering features, both in test and pr. test. The variance of frame-level approaches is substantially lower than that of those computing whole-excerpt summaries, even in train. Overall, differences in mean recall between test and pr. test are highest for Mel Sc. and 1-L Sc. features, with a decrease of approximately 15.8 percentage points in both — a 27.7% relative decrease from test. The lowest drop, both in absolute and relative terms, occurs in Tim+Dyn (4 percentage points, a 12% decrease from test).

Marginalising over learning algorithm also reveals clear differences in performance distribution. Systems constructed using the suboptimal MLP architecture tend to perform close to the random baseline of 0.1 mean recall. For every single learning algorithm, including MLP, performance decreases between train and test, and between test and pr. test. Apart from MLP, NB is the only other algorithm that shows an average relative difference in mean recall between test and pr. test below 20%. It is also the algorithm that seems to suffer the least from overfitting. Despite a far lower performance in train, NB systems perform on average equivalently to 1-NN systems in test, and slightly better in pr. test, with substantially lower variance in both cases. Systems from all other algorithms decrease on average by around 20.5% to 23.5% between test and pr. test, with DT having the largest drop.

Fig. 6 relates the performance trained systems achieve in test with that in pr. test, both individually (left) and grouped by feature representation and learning algorithm (right). A linear fit gives a slope α^ = 0.712 ± 0.003 and an intercept κ^ = 0.034 ± 0.001 (R² = 0.929). The slope is thus lower than in the case of no confounding, represented with a dashed line in Fig. 6. This suggests regulating by artist in GTZAN attenuates the estimated performance to around 70% of its unregulated value. This equates to considering the confounding effect of artist replication in GTZAN as a gain in mean recall of approximately 1/0.712 ≈ 1.4.

Figure 6 

Relationship between mean recall in test and pr. test obtained by systems constructed with different combinations of feature representations and learning algorithms on training collections sampled from GTZAN with bootstrap regulated over artists, represented both as individual values for each system (Left) and averages across iterations (Right). The dashed line indicates the case of equal mean recall in test and pr. test; the solid line indicates the linear regression model fitting the data as in Eq. (1).

The data points at the higher end of performance measurements in Fig. 6 deviate from the estimated regression line. This may suggest using more complex models, but exponential and polynomial models up to third degree do not substantially improve the fit. A model including both third-degree polynomial and exponential terms increases R² to 0.932, but at the cost of hard-to-interpret coefficients and the risk of overfitting.

4.3. Data Manipulation: Infrasonic Content

The analysis by Rodríguez-Algarra et al. () suggests that infrasonic content in GTZAN recordings affects performance estimates of scattering-based SVM systems. We here include non-scattering feature representations and a wider range of learning algorithms to gauge the extent of this effect. We compare performance measurements from the same systems as in Sec. 4.2 in test and test (filt.), which differ exclusively in sub-20 Hz content. Overall, the average decrease in mean recall between these two conditions, calculated as in Eq. (2), is κ^   ≈   0.098, slightly larger than the one we observe for artist replication.

Fig. 7 shows the observed performances, marginalised by GTZAN class, feature representation, and learning algorithm. The figure includes measurements on the training recordings and their filtered equivalents, revealing that performance estimates decrease between train and train (filt.) across system-construction methods and classes. Overall, the average decrease in mean recall between these two conditions is 28 percentage points. Regardless of whether they exploit class-specific patterns of infrasonic content to predict annotations in unseen instances, systems trained on GTZAN often seem to rely on such content (or related information, such as the overall energy level) to identify recordings previously seen during training and predict their class.

Figure 7 

Quartiles of (mean) recall distribution obtained in train, train (filt.), test, and test (filt.), marginalised over GTZAN class (Top), feature set (Middle), and learning algorithm (Bottom). Note that the colours in this figure do not match those in Figs. 3, 4 and 5, as they correspond to different evaluation conditions.

The GTZAN class with the largest relative average decrease in recall between test and test (filt.) is jazz, with 37.2%, followed by pop (the largest drop in absolute terms) and blues, with 34.9% and 33.7%, respectively. The smallest decrease by far occurs in hiphop, with an average 5.5% relative decrease; the next smallest occur in reggae and classical, both with over 16.5% relative decrease on average. Some might speculate that these reductions in performance originate from removing information legitimately characteristic of some music genres, such as sub-bass kick drums in Hip-Hop recordings. That measurements in GTZAN’s hiphop are barely affected by the intervention, while classes that should present no pattern at those frequencies (such as jazz) are strongly affected, seems to disprove this explanation.

Marginal analysis of measurements by feature representation reveals two clearly distinct behaviours, and suggests models such as Eq. (1) might not apply in this case. The mean recall of scattering-based systems decreases on average between 41% (1&2-L Sc.) and 57% (1-L Sc.) when comparing test and test (filt.). On the other hand, no average decrease of non-scattering features exceeds 4%, one order of magnitude lower. This brings the average performance of all scattering-based systems except those using 1&2-L Sc. to the bottom of the list in test (filt.), despite appearing substantially more successful than any non-scattering feature set in test. Feature representations such as MFCC discard infrasonic information, with all filters centered at frequencies above the human hearing threshold (). Scattering-based features, even those supposedly Mel-scaled, have multiple filters centered below 20 Hz (). Fig. 8 shows the distinct behaviour of each group, where measurements from systems using non-scattering feature representations follow quite closely the ideal behaviour indicated by the dashed line, whereas those from scattering-based systems tend to create clusters away from that line.

Figure 8 

Relationship between mean recall in test and test (filt.) obtained by systems constructed with different combinations of feature representations and learning algorithms using training collections sampled from GTZAN with bootstrap regulated over artists, grouped by the source of feature set. Non-Scattering features are extracted with essentia. Instance-level scattering features correspond to Des. 1-L Sc.; the rest are frame-level. The dashed line indicates the case of equal mean recall in test and test (filt.).

Among the considered learning algorithms, SVM is the one with the largest drop in performance between test and test (filt.) – an average decrease of 42.6% in mean recall. Other than MLP, NB is the algorithm that suffers the lowest average decrease (10.5%), with the remaining algorithms decreasing between 16.7% and 31.7% in mean recall on average.

Fig. 8 separates measurements from systems using Des. 1-L Sc. because the clusters they form suggest interactions with learning algorithms different from those of frame-level scattering systems. Leaving MLP systems aside, the clusters close to the dashed line in the middle panel only contain measurements from NB systems. Their average decrease in mean recall is 9 percentage points, corresponding to a 19% drop. NB systems with Des. 1-L Sc. feature representations, however, suffer an average 52% decrease. Conversely, the clusters closest to the ideal case for Des. 1-L Sc. systems correspond to algorithms of a similar kind: DT, ABDT, and RF. The average drop in performance for these algorithms is between 15% and 25% with Des. 1-L Sc. feature representations, but DT is the algorithm with the largest drop for the rest of the scattering-based representations, with an average 61.5% decrease in mean recall; ABDT follows with a 55.8% decrease.

4.4. Factorial Integration of Interventions

The separate analyses above highlight the particularities of each confounding effect. We now conduct both interventions simultaneously in a factorial way: we expose each trained system to all evaluation conditions. In particular, pr. test (filt.) contains the same instances as pr. test but high-pass filtered.

Fig. 9 summarises the performance distributions in test and pr. test, both under original and filtered audio conditions, marginalised by GTZAN class, feature representation and learning algorithm. We see the distribution in pr. test (filt.) is centered around lower values than in any other evaluation condition for scattering-based representations. Systems using non-scattering features only suffer drops when regulating over artists, not from high-pass filtering.

Figure 9 

Quartiles of (mean) recall distribution obtained in test, test (filt.), pr. test, and pr. test (filt.), marginalised over GTZAN class (Top), feature set (Middle), and learning algorithm (Bottom). Note that the colours in this figure do not match those in Figs. 3, 4, 5 and 7, as they correspond to different evaluation conditions.

Combining multiple interventions allows us to analyse interactions between confounders. Using the notation in Sec. 3.2, let y^, y^′1, y^′2, and y^′1,2 be the mean recall a system obtains in test, pr. test, test (filt.), and pr. test (filt.), respectively. Let ΔA be the “accumulated” variation of mean recall, defined as in Eq. (3), and ΔR be the “real” variation, defined as in Eq. (4). Fig. 10 shows the distribution of ΔR − ΔA, grouped by origin of feature set. We see the difference is centered around 0 for systems using non-scattering feature representations, as the overall confounding effect in those systems originates mainly from artist replication. On the other hand, we see the difference tends to be negative for systems using scattering-based feature representations. This suggests the two confounding effects overlap for those systems, which stands to reason as the recording conditions of excerpts from the same artist are likely to be similar.

Figure 10 

Distribution of differences between the real variation ΔR and the accumulated variation ΔA in mean recall for artist and infrasonic regulation interventions in GTZAN, grouped by the source of feature set.

Confounders not only impact the magnitude of performance estimates, as we saw before, but can also alter the relative ranking of methods. For instance, Fig. 11 shows that, for systems trained using 1&2-L Sc., NB goes from the lowest (ignoring MLP) to the highest position depending on whether one applies a data manipulation intervention; Sturm () reaches the same conclusion. Similar interactions arise for other methods (see Supplementary Material S3).

Figure 11 

Interaction between learning algorithm and evaluation condition in average mean recall for systems constructed using training collections sampled from GTZAN with bootstrap regulated over artists and 1&2-L Sc. feature representations.

Kendall’s τ provides estimates of concordance between rankings, with 1 meaning an exact match, –1 a completely reversed match, and 0 no correlation (). The value of τ between test and pr. test is fairly high (0.91), which aligns with our interpretation that artist information biases performance estimates in a similar way across methods (i.e., without substantially altering their ordering). τ decreases between test and test (filt.) (0.52) and between test and pr. test (filt.) (0.45), reflecting the fact that infrasonic content affects the ranking to a higher degree.
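
Computing the concordance itself is a one-liner with scipy; the scores below are made up purely to illustrate the call:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical average mean recalls of five methods under two conditions;
# only their relative ordering matters for tau.
scores_test = np.array([0.62, 0.55, 0.71, 0.48, 0.66])
scores_pr_test = np.array([0.51, 0.47, 0.58, 0.40, 0.53])

tau, p_value = kendalltau(scores_test, scores_pr_test)
print(tau)  # 1.0 here: the ordering of the methods is unchanged
```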

5. Discussion

Our procedure for characterising confounding effects in music classification experiments facilitates understanding how particular confounders impact evaluation outcomes. It extends well-established practices in MIR, such as filtered partitioning, overcoming their limitations. In particular, our approach enables integrating multiple types of interventions, targeted at the same or at distinct potential confounders, although not necessarily multiple interventions of the same type. Introducing a suitable resampling strategy, such as the regulated bootstrap we describe, is key to this integration. This provides a distribution of regulated/unregulated measurement pairs instead of single-sample comparisons, such as those found in previous studies (e.g., Rodríguez-Algarra et al. ()). It also enables disentangling the effects of confounding between training and prediction.

The example application using GTZAN showcases the benefits of our procedure. The factorial structure across runs of the experiment enables both marginal and joint analyses, revealing distinct behaviours when systems are exposed to each potential confounder, as well as their interactions. These observations, however, are subject to the caveats we discuss next.

Systems in our case study underperform due to the lack of hyperparameter tuning. We deliberately prioritise variety over optimisation to gather performance estimates of different magnitude and susceptibility to confounding. The evidently unsuitable MLP architecture chosen is a clear example of this, allowing us to obtain measurements close to the random baseline that could still be affected by the regulations. Alternatives to achieve measurements in the lower end, such as random or systematic classifiers, would by definition remain unaffected regardless of the condition. Tuning model hyperparameters, while relevant in real benchmarking studies, would likely concentrate performances at the high end of the axis, thus hampering the intended illustration of the proposed methodology. Further studies could incorporate optimisations as additional treatment conditions in the experimental design to illuminate how tuning impacts the susceptibility of systems to confounding effects.

Our analysis suggests the confounding effect of artist replication in GTZAN appears multiplicative rather than additive. This might seem obvious knowing that the performance metric used is bounded between 0 and 1. As Carterette () mentions, additive effects could easily make predicted values exceed those boundaries. In fact, current proposals for modelling measurements from classification experiments (e.g., Alpaydin (); Eugster ()) assume additive effects for all components of the experiment, ignoring the boundary problem. This motivates revising those models, potentially using logit transformations to convert multiplicative effects into unbounded additive components, although it might be unnecessary if one’s only concern is the ranking between systems.

The clear divergence between the proposed linear model and the observations at the highest end of performance measurements in Fig. 6 might require collecting further data, either from methods not yet considered or through the optimisation of existing ones. That divergence, however, illuminates a substantial difference in slope between observations using a particular feature representation and the overall trend. This seems to reflect Simpson’s paradox (; ), in which the behaviour per group diverges from, or even completely reverses, the aggregated pattern. Together with the clusters suggested in Fig. 8 for the case of infrasonic content, this highlights the need to study interactions between learning algorithms and feature representations under various potentially confounding environments.

A general limitation of our method regards its scope, as it neither illuminates previously unknown confounders nor prevents confounding from affecting performance estimates. It is actually impossible to guarantee confounding does not appear at all, as there might be a plethora of yet unknown potential confounders still affecting observations to some extent. Devoted exploratory analyses informed by both domain knowledge and system analysis are necessary to uncover further potential confounders before assessing their impact using intervention-type approaches. This enables one to design or improve system-construction methods accounting for that risk and devise train/test mechanisms that prevent them from appearing. To this end, it is of paramount importance for MIR researchers to devote efforts to expose such potential confounders and assess their effects.

The current study does not consider all possible effects of confounding, focusing on characterising its effects on evaluation results, but leaving aside other equally relevant research questions for the moment. In particular, by introducing and comparing new conditions only at the prediction stage, we ignore the effects of confounding on the training of systems. This might be easily addressed for data manipulation interventions by adding training conditions with manipulated recordings, thus multiplying the number of models to consider and experimental conditions to analyse. In the case of instance assignment interventions, however, it would require modifying the regulated bootstrap resampling strategy to enable creating regulated and unregulated collections for both training and testing simultaneously. This is a promising research path for future work.

Some may argue that the curation process inherent to regulated bootstrap resampling introduces biases in the performance estimates, and thus in the comparisons between conditions, bringing into question the validity of the extracted conclusions. However, this process increases control over the measurements, not unlike the stratification performed in conventional classification experiments, as well as blocking in the statistical design of experiments (). In particular, stratification preserves the distribution of annotations present in the original collection, thus facilitating performance estimates within the collection that approximate what systems would have achieved had they used the whole collection, but it does not account for the likely imbalances that real-life data could have. This favours internal over external validity, a methodological trade-off often encouraged to create experimental conditions that differ only in the factor under study and to guard against external factors affecting the conclusions ().

The size of the testing collections generated might also cause concern, as there is no guarantee that the original class balance remains and, by definition, the number of instances decreases after pruning. The use of mean recall as performance metric should compensate for imbalances, and, in the case study we conduct, the differences in performance between collections of the same iteration clearly exceed the differences across iterations. This suggests unequal size should not affect our conclusions. As mentioned before, in the general case, one might want to introduce a further control step that forces all original and pruned testing collections, and all classes within those collections, to match in size, such as randomly selecting a fixed number of instances. This might also alleviate the likely lack of independence between instances arising from the curation involved in their sampling. Due to the infeasibility of pure random sampling from the whole population, evaluation collections are often constructed through convenience sampling (), hampering independence in the first place. Curation thus does not necessarily affect conclusions in this regard.

The analysis approach we describe and exemplify in this article can be applied to a wider range of collections, machine learning methods and potential confounders than the ones we show here. Published studies and evaluation exchanges, such as MIREX, could incorporate similarly extended pipelines to assess the susceptibility of proposed systems to a set of interventions. Domains other than music would also benefit from similar analysis approaches. Despite its caveats, the insights obtained through this kind of analysis should help building more robust systems and obtaining performance estimates that generalise to deployment scenarios.

6. Conclusion

In this article, we explored the nature of confounding in music classification experiments and described a procedure for assessing its impact on the evaluation of MIR systems and methods. We used interventions in the experimental pipeline and proposed a novel resampling strategy that introduces regulations into conventional bootstrap sampling. Using our approach, we analysed the effects of artist replication and infrasonic content in GTZAN on performance estimates of a range of feature extraction methods and learning algorithms. We found the effects of artist replication appear to be multiplicative, while those from infrasonic content depend on the system-construction method employed. We also showed that these two potential confounders appear to partially overlap, and that their effect might alter the ranking of different solutions with respect to their average performance. Further improvements of the approach could include introducing evaluation conditions through interventions on the training data, controlling the testing collection size, and analysing the effect of optimisation. We hope that future MIR research will focus not only on maximising performance estimates, but also on developing and assessing solutions with regard to their susceptibility to confounding effects.

Additional File

The additional file for this article can be found as follows:

Supplementary Material

Example of regulated bootstrap resampling and auxiliary figures. DOI: https://doi.org/10.5334/tismir.24.s1