1. Introduction

Music similarity sparked interest early in the Music Information Retrieval community (; ), and has since become a central concept for music discovery and recommendation in commercial music streaming services.

There is, however, no agreed-upon notion of ground truth for music similarity, as several viewpoints are relevant (). For instance, music similarity can be considered at several levels of granularity; musical items of interest can be musical phrases, tracks, artists, genres, to name a few. Furthermore, the perception of similarity between two musical items can focus either on (1) descriptive (or content-based) aspects, such as the melody, harmony, timbre (in acoustic or symbolic form), or (2) relational (sometimes called cultural) aspects, such as listening patterns in user-item data, frequent co-occurrences of items in playlists, web pages, et cetera.

In this paper—which is an extended version of our previous work ()—we focus on artist-level similarity, and formulate the problem as a retrieval task: given an artist, we want to retrieve the most similar artists, where the ground truth for similarity is cultural. More specifically, artist similarity is defined by music experts in some experiments, and by the “wisdom of the crowd” in other experiments.

In this sense, we aim at bridging the semantic gap () between content (music) and context (culture). Connecting these two disparate views is crucial for music recommendation: the user’s perception of similarity is driven by cultural aspects, but reliable context-related data (such as ratings) is available only for established artists; for the undiscovered long tail, we may only have content-based features. Thus, if we want to build a similarity model for everyone—popular, upcoming, and niche artists—we need our system to consider both content and context. Neglecting context, we would miss the cultural perspective of listeners; neglecting content, our model would only work well for the selected few.

2. Related Work

A variety of methods have been devised for computing artist similarity, from the use of audio descriptors to measure similarity (), to leveraging text sources by measuring artist similarity as a document similarity task (). A significant effort has been dedicated to the study of graphs that interconnect musical entities with semantic relations as a proxy to compute artist similarity. For instance, Celma and Serra () combine user profiles, music descriptions and audio features in a domain-specific ontology to compute artist similarity, whereas Oramas et al. () extract semantic graphs of artists from artist biographies.

Other approaches use deep neural networks to learn artist embeddings from heterogeneous data sources and then compute similarity in the resulting embedding space (). Furthermore, metric learning approaches trained with triplet loss have been applied to learn the embedding space where similarity is computed (; ; ; ; ; ). While these models work well in objective tasks (e.g. genre classification, artist and song version identification), they do not consider cultural aspects of similarity.

Most recently, graph neural networks (GNNs) successfully improved upon metric-learning-based approaches: Salha-Galvan et al. () train a GNN for directed link prediction between artists, considering both artist similarity and popularity, with a focus on cold-start artists with no known connections; our previous work (), on the other hand, used a metric-learning objective to learn a GNN for artist similarity, and provided an overall evaluation for all artists, long-tail or not. Compared to older graph embedding methods such as node2vec (), GNNs easily leverage both graph structure and node features.

Our artist similarity model thus combines graph approaches and embedding approaches using GNNs. The proposed model, described in detail in Section 3, uses content-based features (audio descriptors, or musicological attributes) together with explicit similarity relations between artists made by human experts (or extracted from listener feedback). These relations are represented in a graph of artists; the topology of this graph thus reflects the contextual aspects of artist similarity. The proposed graph neural network is trained using triplet loss to learn a function that embeds artists using both content features and graph connections. In this embedding space, similar artists are close to each other, while dissimilar ones are further apart.

We use two datasets (described in-depth in Section 4) to evaluate our approach: the OLGA dataset, which is collected from publicly available sources, comprising 17,673 artists; and a larger, proprietary dataset, consisting of 136,731 artists. Our experiment setup—metrics, models, data partitioning, etc.—is detailed in Section 5.

Beyond overall results, we take a deeper look at the model’s performance on long-tail artists. Both are presented in Section 6. In contrast to Salha-Galvan et al. (), we consider not only artists with no known connections, but evaluate the change in performance at different levels of known connectivity. We find that performance suffers for artists with fewer known connections. To counteract this effect, we devise a simple and effective training method—which we call connection dropout—that drastically mitigates the problem. This deeper evaluation, and the presentation of a novel method which improves results, constitute the main extension of our previous work on this topic ().

3. Modelling

The goal of an artist similarity model is to define a function s(a, b) that estimates the similarity of two artists—i.e., yields a large number if artist a is considered similar to artist b, and a small number if not.

Many content-based methods for similarity estimation have been developed in the last decades of MIR research (see Section 2). The field has closely followed the state-of-the-art in machine learning research, with general improvements coming from the latter translating well into improvements in the former. Acknowledging this fact, we select our baselines based on the most recent developments: Siamese neural networks trained with variants of the triplet loss (; ; ; ; ). Building and training this type of model falls under the umbrella of metric learning.

3.1 Metric Learning

The fundamental idea of metric learning is to learn a projection yv = f(xv) of the input features xv of an item v into a new vector space; this vector space should be structured in a way such that the distances between points reflect the task at hand. In our case, we want similar artists to be close together in this space, and dissimilar artists far away.

There is an abundance of methods that embed items into a vector space, many rooted in statistics, that have been applied to music similarity (). In this paper, we use a neural network for this purpose. The idea of using neural networks to embed similar items close to each other in an embedding space was pioneered by Bromley et al. (), with several improvements developed in the following decades. Most notably, the contrastive learning objective—where two items are compared to each other as a training signal—was replaced by the triplet loss (; ). Here, we observe three items simultaneously: the anchor item xa is compared to a positive sample xp and a negative sample xn. With the following loss formulation, the network is trained to pull the positive close to the anchor, while pushing the negative further away from it:

\mathcal{L}(t) = \left[ d(y_a, y_p) - d(y_a, y_n) + \Delta \right]_+ ,

where t denotes the triplet (y_a, y_p, y_n), d(·) is a distance function (usually Euclidean or cosine), Δ is the maximum margin enforced by the loss, and [·]_+ is the ramp function.
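To make this concrete, the following is a minimal PyTorch sketch of the triplet loss (illustrative code, not our exact training implementation); it assumes l2-normalized embeddings and Euclidean distance, as used in our experiments (Section 5.1):

```python
import torch
import torch.nn.functional as F

def triplet_loss(y_a, y_p, y_n, margin=0.2):
    """Triplet loss [d(y_a, y_p) - d(y_a, y_n) + margin]_+ averaged over a batch.

    y_a, y_p, y_n: (batch, dim) embeddings of anchor, positive, and negative items.
    """
    y_a = F.normalize(y_a, dim=1)                      # l2-normalize embeddings
    y_p = F.normalize(y_p, dim=1)
    y_n = F.normalize(y_n, dim=1)
    d_ap = torch.norm(y_a - y_p, dim=1)                # anchor-positive distance
    d_an = torch.norm(y_a - y_n, dim=1)                # anchor-negative distance
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()  # ramp function [.]_+
```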

As mentioned before, state-of-the-art music similarity models are almost exclusively based on learning deep neural networks using the triplet loss. We thus adopt this method as our baseline model, which will serve as a comparison point to the graph neural network we propose in the following sections.

3.2 Graph Neural Networks

A set of artists and their known similarity relations can be seen as a graph, where the artists represent the nodes, and the similarity relations their (undirected) connections. Graph methods thus naturally lend themselves to model the artist similarity problem (). A particular class of graph-based models that has been gaining traction recently is graph neural networks (GNNs), specifically convolutional GNNs. Pioneered by Bruna et al. (), convolutional GNNs have become increasingly popular for modelling different tasks that can be interpreted as graphs. We refer the interested reader to Wu et al. () for a comprehensive and historical overview of GNNs. For brevity, we will focus on the one specific model our work is based on—the GraphSAGE model introduced by Hamilton et al. () and refined by Ying et al. ()—and use the term GNNs for convolutional GNNs.

3.2.1 Model Overview

The GNN we use in this paper comprises two parts: first, a block of graph convolutions (GC) processes each node’s features and combines them with the features of adjacent nodes; then, another block of fully connected layers projects the resulting feature representation into the target embedding space. See Figure 1 for an overview.

Figure 1 

Overview of the graph neural network we use in this paper. First, the input features xv are passed through a front-end of graph convolution layers (see Section 3.2.2 for details); then, the output of the front-end is passed through a traditional deep neural network back-end to compute the final embeddings yv of artist nodes. Based on these embeddings, we use the triplet loss to train the network to project similar artists (positive, green) closer to the anchor, and dissimilar ones (negative, red) further away.

We train the model using the triplet loss, in a setup identical to that of the baseline model. Viewed from this angle, the only difference between the GNN and a standard embedding network is the additional Graph Convolutional Frontend. In other words, if we remove all graph convolution layers, we arrive at our baseline model, a fully connected Deep Neural Network (DNN).
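The following PyTorch sketch illustrates this two-part structure (the class name and constructor arguments are illustrative; the front-end is passed in as an arbitrary module):

```python
import torch.nn as nn

class ArtistEmbedder(nn.Module):
    """Sketch of the two-part model: GC front-end, then a fully connected back-end."""

    def __init__(self, frontend, frontend_out_dim, hidden_dim=256, embed_dim=100):
        super().__init__()
        self.frontend = frontend  # e.g. a stack of graph convolution layers (Section 3.2.2)
        self.backend = nn.Sequential(
            nn.Linear(frontend_out_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
            nn.Linear(hidden_dim, embed_dim),      # final 100-d embedding y_v
        )

    def forward(self, *frontend_inputs):
        h = self.frontend(*frontend_inputs)        # node features combined with neighbors
        return self.backend(h)                     # project into the embedding space
```

With `frontend=nn.Identity()` and the raw artist features as input, this reduces to the baseline DNN; plugging in graph convolution layers yields the GNN.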

3.2.2 Graph Convolutions

The graph convolution algorithm, as defined by Hamilton et al. () and Ying et al. (), features two operations which are not found in classic neural networks: a neighborhood function N(·), which yields the set of neighbors of a given node; and an aggregation function, which computes a vector-valued aggregation of a set of input vectors.

As a neighborhood function, most models use guided or uniform sub-sampling of the graph structure (; ; ). This limits the number of neighbors to be processed for each node, and is often necessary to adhere to computational limits. As aggregation functions, models commonly apply pooling operators, LSTM networks, or (weighted) point-wise averages ().

In this work, we take a simple approach: we use point-wise weighted averaging to aggregate neighbor representations, and select the strongest 25 connections as neighbors. If weights are not available, we use the simple average of 25 random (but fixed) connections. This enables us to use a single sparse dot-product with an adjacency matrix to select and aggregate neighborhood embeddings. Note that this is not the full adjacency matrix of the complete graph, as we select only the parts of the graph which are necessary for computing embeddings for the nodes in a mini-batch.
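As an illustration, neighbor aggregation via a single sparse matrix product might look as follows (a PyTorch sketch with illustrative names; node representations are stored in rows rather than columns here for convenience):

```python
import torch

def aggregate_neighbors(x, adj_sparse):
    """Weighted-average neighbor aggregation as one sparse matrix product.

    x:          dense (V, D) matrix of node representations (one row per node).
    adj_sparse: sparse (B, V) matrix; each row holds the (normalized) weights of
                a batch node's up-to-25 strongest connections, zeros elsewhere.
    Returns a dense (B, D) matrix of aggregated neighborhood representations.
    """
    return torch.sparse.mm(adj_sparse, x)

# Hypothetical usage: a tiny mini-batch of 2 nodes in a 10-node graph.
indices = torch.tensor([[0, 0, 1], [3, 7, 2]])   # (batch-row, neighbor) index pairs
weights = torch.tensor([0.7, 0.3, 1.0])          # connection weights; each row sums to 1
adj = torch.sparse_coo_tensor(indices, weights, size=(2, 10))
x = torch.randn(10, 16)                          # 10 nodes with 16-d representations
aggregated = aggregate_neighbors(x, adj)         # shape (2, 16)
```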

Algorithm 1 describes the inner workings of the graph convolution block of our model. Here, the matrix $X \in \mathbb{R}^{D \times V}$ stores the D-dimensional features of all V nodes, the symmetric sparse matrix $A \in \mathbb{R}^{V \times V}$ defines the connectivity of the graph, and N(v) is a neighborhood function which returns all connected nodes of a given node v (here, all non-zero elements in the v-th row of A).

To compute the output of a graph convolution layer for a node, we need to know its neighbors. Therefore, to compute the embeddings for a mini-batch of nodes V, we need to know which nodes are in their joint neighborhood. Thus, before the actual processing, we first need to trace the graph to find the node features necessary to compute the embeddings of the nodes in the mini-batch. This is shown in Figure 2, and formalized in lines 1–4 of Algorithm 1.
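The tracing step can be sketched as a simple breadth-first expansion (illustrative Python, not the exact Algorithm 1; function and variable names are our own):

```python
def trace_neighborhood(batch_nodes, neighbors, num_layers):
    """Collect all nodes whose features are needed to embed `batch_nodes`
    when using `num_layers` graph convolution layers.

    batch_nodes: iterable of node ids in the mini-batch.
    neighbors:   function N(v) returning the ids of v's neighbors.
    """
    needed = set(batch_nodes)
    frontier = set(batch_nodes)
    for _ in range(num_layers):
        # each additional GC layer requires tracing one more step in the graph
        frontier = {u for v in frontier for u in neighbors(v)} - needed
        needed |= frontier
    return needed

# Hypothetical usage with a small adjacency-list graph:
graph = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(trace_neighborhood({0}, lambda v: graph[v], num_layers=2))  # {0, 1, 2, 3}
```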

Figure 2 

Tracing the graph to find the necessary input nodes for embedding the target node (orange). Each graph convolution layer requires tracing one step in the graph. Here, we show the trace for a stack of two such layers. To compute the embedding of the target node in the last layer, we need the representations from the previous layer of itself and its neighbors (green). In turn, to compute these representations, we need to expand the neighborhood by one additional step in the preceding GC layer (blue). Thus, the features of all colored nodes must be fed to the first graph convolution layer.

At the core of each graph convolution layer $k \in [1 \ldots K]$ there are two non-linear projections, parameterized by projection matrices $Q_k \in \mathbb{R}^{H_Q^k \times D}$ and $W_k \in \mathbb{R}^{H_W^k \times (H_Q^k + D)}$, and a point-wise non-linear activation function σ, in our case the Exponential Linear Unit (ELU). Here, $H_Q^k$ and $H_W^k$ are the output dimensions of the respective projections. The last output, $X^K \in \mathbb{R}^{H_W^K \times V}$, holds the l2-normalized representations of each node in the mini-batch in its columns. It is fed into the following fully connected layers, which then compute the output embedding yv of a node. Finally, these embeddings are used to compute the triplet loss and back-propagate it through the GNN.
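A single graph convolution layer consistent with these dimensions could be sketched as follows (a PyTorch sketch; the class name is illustrative, and the exact ordering of projection and aggregation in Algorithm 1 is simplified here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    """One graph convolution layer: project the aggregated neighborhood with Q,
    concatenate with the node's own representation, project with W, l2-normalize.
    """
    def __init__(self, in_dim, q_dim, w_dim):
        super().__init__()
        self.q = nn.Linear(in_dim, q_dim)            # Q_k: H_Q x D
        self.w = nn.Linear(q_dim + in_dim, w_dim)    # W_k: H_W x (H_Q + D)
        self.act = nn.ELU()                          # sigma

    def forward(self, x_self, x_neigh_agg):
        # x_self:      (B, in_dim) representations of the batch nodes
        # x_neigh_agg: (B, in_dim) weighted average of each node's neighbors
        h_n = self.act(self.q(x_neigh_agg))
        h = self.act(self.w(torch.cat([h_n, x_self], dim=1)))
        return F.normalize(h, p=2, dim=1)            # l2-normalization
```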

3.2.3 Connection Dropout

As we observed in our experiments (see Section 5), the GNNs learned to rely excessively on the graph topology. This is because—given enough GC layers—graph topology trumps features when it comes to predicting similarity. To alleviate this issue, we introduce a tweak during training: each time we consult the neighborhood of a node k, we return a randomly sampled subset of its neighbors. This is achieved by dropping each connection to k with a given probability p. Concretely, in Algorithm 1, line 7, we randomly set each element in the k-th row of A to 0 with probability p. During evaluation, we do not drop any connections, and use the maximum of 25 neighbors. As we will see when discussing our results, this method greatly improves the GNN’s performance on artists with few known connections.
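A minimal sketch of connection dropout, applied to a dense mini-batch adjacency for clarity (the function name is illustrative; re-normalization of the surviving weights is omitted here):

```python
import torch

def connection_dropout(adj, p, training=True):
    """Randomly drop known connections during training.

    adj: (B, V) adjacency weights for the batch nodes (one row per node).
    p:   probability of dropping each individual connection.
    """
    if not training or p == 0.0:
        return adj
    keep_mask = (torch.rand_like(adj) >= p).float()
    return adj * keep_mask   # surviving connections keep their original weights
```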

Connection Dropout can be seen as sub-sampling the neighborhoods in the graph. Sub-sampling has been previously used in GNNs, but for a different purpose: to condense neighborhoods and to control the computational burden. Indeed, Ying et al. () find importance-weighted, dense neighborhoods using short random walks; these random walks were carried out until convergence criteria determined that the neighborhoods are stable enough (). This is in stark contrast to our method, which randomly destabilizes and sparsifies neighborhood structures on purpose to achieve better generalization. Future work could aim at combining these two purposes; it is, however, out of the scope of this work.

4. Datasets

Many published studies on the topic of artist similarity are limited by data: datasets including artists, their similarity relations, and their features comprise at most hundreds to a few thousand artists. In addition, the ground truth provided is often based on third-party APIs with undisclosed similarity methods, such as the last.fm API, rather than on data curated by human experts.

For instance, Oramas et al. () provide two datasets: one with ~2k artists and similarity based on last.fm relations, and another with only 268 artists, but with relations curated by human experts. Schedl et al. () use a dataset of 1,677 artists based on last.fm similarity relations for evaluation. Also, the dataset used in the Audio Music Similarity and Retrieval (AMS) MIREX task, which was manually curated, contains data about only 602 artists. Others, like Lee et al. (), use tag data shared among tracks or artists as a proxy for similarity estimation—which can be considered a weak signal of similarity—and use a small set of 879 human-labeled triplets for evaluation.

Due to all these issues regarding existing datasets, we compiled a new dataset, the OLGA Dataset, which we describe in the following.

4.1 The OLGA Dataset

For the OLGA (“Oh, what a Large Graph of Artists”) dataset, we bring together content-based low-level features from AcousticBrainz (), and similarity relations from AllMusic, as curated by their music editors. Assembling the data works as follows:

  1. Select a common pool of artists based on the unique artists in the Million Song Dataset ().
  2. Map the available MusicBrainz IDs of the artists to AllMusic IDs using mapping available from MusicBrainz.
  3. For each artist, obtain the list of “related” artists from AllMusic; this data can be licensed and accessed on their website. Use only related artists who can be mapped back to MusicBrainz.
  4. Using MusicBrainz, select up to 25 tracks for each artist using their API, and collect the low-level features of the tracks from AcousticBrainz.
  5. Compute the track feature centroid of each artist.

In total, the dataset comprises 17,673 artists connected by 101,029 similarity relations. On average, each artist is connected to 11.43 other artists. The quartiles are at 3, 7, and 16 connections per artist. The lower 10% of artists have only one connection, the top 10% have at least 27.

While the dataset size is still small compared to industrial catalog sizes, it is significantly bigger than other datasets available for this task. Its size and available features permit the application of more data-driven machine learning methods to the problem of artist similarity.

For our experiments, we partition the artists following an 80/10/10 split into 14,139 training, 1,767 validation, and 1,767 test artists.

4.2 Proprietary Dataset

We also use a larger proprietary dataset to demonstrate the scalability of our approach. Here, explicit feedback from listeners of a music streaming service is used to define whether two artists are similar or not: we derive similarity connections based on the co-occurrence of positive feedback for two artists.

For artist features, we use the centroid of an artist’s track features. These track features are musicological attributes annotated by experts, and comprise hundreds of content-based characteristics such as “amount of electric guitar”, or “prevalence of groove”.

In total, this dataset consists of 136,731 artists connected by 3,277,677 similarity relations. The number of connections per artist follows a top-heavy distribution with a few artists sharing most of the connections: the top 10% are each connected to more than 134 others, while the bottom 10% are connected to only one. The quartiles are at 2, 5, and 48 connections per artist.

We follow the same partition strategy as for the OLGA dataset, which results in 109,383 training, 13,674 validation, and 13,674 test artists.

5. Experiments

Our experiments aim to evaluate how well the embeddings produced by our model capture artist similarity. To this end, we set up a ranking scenario: given an artist, we collect its K nearest neighbors sorted by ascending distance, and evaluate the quality of this ranking. To quantify this, we use normalized discounted cumulative gain () with a high cut-off at K = 200 (“NDCG@200”). We prefer this metric over others, because it was shown that at high cut-off values, it provides better discriminative power, as well as robustness to sparsity bias (and, to a moderate degree, popularity bias) (). Formally, given an artist a with an ideal list of similar artists s (sorted by relevance), the NDCGK of a predicted list of similar artists ŝ is defined as:

\mathrm{NDCG}_K(a, \hat{s}, s) = \frac{\sum_{k=1}^{K} g(\hat{s}_k, a)\, d(k)}{\sum_{k=1}^{K} g(s_k, a)\, d(k)},

where g(·, a), the gain, is 1 if an artist is indeed similar to a, and 0 otherwise, and $d(k) = 1 / \log_2(k+1)$, the discounting factor, weights top rankings higher than the tail of the list.
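For reference, a small NumPy sketch of this metric (function and variable names are illustrative):

```python
import numpy as np

def ndcg_at_k(ranked_ids, relevant_ids, k=200):
    """NDCG@K with binary gain, following the formula above.

    ranked_ids:   list of retrieved artist ids, sorted by ascending distance.
    relevant_ids: set of artists that are truly similar to the query artist.
    """
    discounts = 1.0 / np.log2(np.arange(2, k + 2))        # d(k) = 1 / log2(k + 1)
    gains = np.array([1.0 if a in relevant_ids else 0.0
                      for a in ranked_ids[:k]])
    dcg = np.sum(gains * discounts[:len(gains)])
    ideal_hits = min(len(relevant_ids), k)                # ideal list: all hits ranked first
    idcg = np.sum(discounts[:ideal_hits])
    return dcg / idcg if idcg > 0 else 0.0
```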

In the following, we first explain the models, their training details, the features, and the evaluation data used in our experiments. Then, we show, compare and analyze the results.

5.1 Models

As explained in Section 3.2.1, a GNN with no graph convolution layers is identical to our baseline model (i.e. a DNN trained using triplet loss). This allows us to fix hyper-parameters between the baseline and the proposed GNN, and isolate the effect of adding graph convolutions to the model. For each dataset, we thus train and evaluate four models with 0 to 3 graph convolution layers.

The other hyper-parameters remain fixed: each layer in the graph convolutional front-end consists of 256 ELUs (); the back-end comprises two layers of 256 ELUs each, and one linear output layer with 100 dimensions; we train the networks using the ADAM optimizer () with a linear learning-rate warm-up () for the first epoch, and following a cosine learning rate decay () for the remaining 49 epochs (in contrast to Loshchilov and Hutter (), we do not use warm-restarts); for selecting triplets, we apply distance-weighted sampling (), and use a margin of Δ = 0.2 in the loss; finally, as distance measure, we use Euclidean distance between l2-normalized embeddings.
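The learning-rate schedule can be sketched as follows (the peak learning rate is illustrative and not part of the reported hyper-parameters):

```python
import math

def learning_rate(step, steps_per_epoch, base_lr=1e-3, warmup_epochs=1, total_epochs=50):
    """Linear warm-up for the first epoch, then cosine decay (no warm restarts)."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps                 # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))    # cosine decay to 0
```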

We are able to train the largest model with 3 graph convolution layers within 2 hours on the proprietary dataset, and under 5 minutes on OLGA, using a Tesla P100 GPU and 8 CPU threads for data loading, which includes tracing the graph to find the relevant neighborhood as explained in Section 3.2.2.

5.2 Features

We build artist-level features by averaging track-level features of the artist’s tracks. Depending on the dataset, we have different types of features at hand.

In the OLGA dataset, we have low-level audio features extracted by the AcousticBrainz project using the Essentia library. These features represent track-level statistics about the loudness, dynamics and spectral shape of the signal, but they also include more abstract descriptors of rhythm and tonal information, such as BPM and the average pitch class profile.

Although AcousticBrainz also provides high-level features such as mood and genre predictions, we refrain from using them. The reason is twofold: first, they are derived from the low-level features themselves, and as such, do not provide complementary information; second, as stated on the AcousticBrainz website itself, the high-level features may be subject to change if and when the models predicting them are changed, re-trained or improved.

We select all numeric features and pre-process them as follows: we apply element-wise standardization, discard features with missing values, and flatten all numbers into a single vector of 2613 elements.
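A sketch of this pre-processing in NumPy (illustrative; we assume missing values are encoded as NaN, and the ordering of the two steps is simplified here):

```python
import numpy as np

def preprocess_features(raw_vectors):
    """Drop dimensions with missing values, then standardize each remaining dimension.

    raw_vectors: (num_artists, num_dims) array of flattened numeric features,
                 with NaN marking missing values.
    """
    x = np.asarray(raw_vectors, dtype=np.float64)
    complete = ~np.isnan(x).any(axis=0)     # keep only dimensions without missing values
    x = x[:, complete]
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0.0] = 1.0                   # guard against constant dimensions
    return (x - mean) / std
```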

In the proprietary dataset, we use numeric musicological descriptors annotated by experts (for example, “the nasality of the singing voice”). We apply the same pre-processing for these, resulting in a total of 170 values.

Using two different types of content features gives us the opportunity to evaluate the utility of our graph model under different circumstances, or more precisely, features of different quality and signal-to-noise ratio. The low-level audio-based features available in the OLGA dataset are undoubtedly noisier and less specific than the high-level musical descriptors manually annotated by experts, which are available in the proprietary dataset. Experimenting with both permits us to gauge the effect of using the graph topology for different data representations.

In addition, we also train models with random vectors as features. For each artist, we uniformly sample a random vector of the same dimension as the real features, and keep it constant throughout training and testing. This way, we can differentiate between the performance of the real features and the performance of using the graph topology in the model: the results of a model with no graph convolutions are due only to the features, while the results of a model with graph convolutions but random features are due only to the use of the graph topology.

5.3 Evaluation Data

As described in Section 4, we partition artists into a training, validation and test set. When evaluating on the validation or test sets, we only consider artists from these sets as candidates and potential true positives. Specifically, let Veval be the set of evaluation artists; we compute embeddings only for these artists, retrieve nearest neighbors from this set, and consider only ground-truth similarity connections within Veval.

This notion is more nuanced in the case of GNNs. Here, we want to exploit the known artist graph topology (i.e., which artists are connected to each other) when computing the embeddings. To this end, we use all connections between artists in Vtrain (the training set) and connections between artists in Vtrain and Veval. This process is outlined in Figure 3.

Figure 3 

Artist nodes and their connections used for training (green) and evaluation (orange). During training, only green nodes and connections are used. When evaluating, we extend the graph with the orange nodes, but only add connections between validation and training artists. Connections among evaluation artists (dotted orange) remain hidden. We then compute the embeddings of all evaluation artists, and evaluate based on the hidden evaluation connections.

Note that this does not leak information between train and evaluation sets; the features of evaluation artists have not been seen during training, and connections within the evaluation set—these are the ones we want to predict—remain hidden.
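The construction of the evaluation graph can be summarized in a few lines (illustrative Python; names are our own):

```python
def evaluation_edges(all_edges, train_artists, eval_artists):
    """Keep train-train and train-eval connections; hide eval-eval connections,
    which are the ones to be predicted (cf. Figure 3).

    all_edges: iterable of (a, b) similarity connections.
    """
    train, evaluation = set(train_artists), set(eval_artists)
    visible, hidden = [], []
    for a, b in all_edges:
        if a in evaluation and b in evaluation:
            hidden.append((a, b))     # used only as ground truth for the metric
        elif a in train or b in train:
            visible.append((a, b))    # used to compute embeddings of evaluation artists
    return visible, hidden
```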

5.4 Emulating Long-Tail Artists

Overall evaluations portray a model’s performance from a birds-eye view. Beyond that, we are interested in the performance of our model for the segment of long-tail artists. Such artists usually have few known connections, which not only limits the information a GNN is able to leverage, but also limits our capability to evaluate how well the GNN is able to leverage existing information. Since the ground truth for these artists is sparse, retrieved lists of similar artists can contain relevant items that are not labeled as such; we cannot quantitatively distinguish a list of bad recommendations from a list of good recommendations whose quality we simply cannot verify.

To circumvent this problem, we collect a subset of well-connected artists for which we will then artificially sparsify known evaluation connections (i.e., connections between validation artists and training artists, see Figure 3). This will enable us to emulate a long-tail artist for which the GNN cannot use a dense neighborhood to compute embeddings. At the same time, we retain the ability to quantify the quality of retrieved similar artists, since we have a lot of unseen evaluation connections at hand. In particular, we will sweep the number of known evaluation connections from zero to 25 (the maximum number of connections), and inspect the results for each degree of connectivity.

Depending on the dataset, we use different criteria to select these artists. Since each dataset differs in size and connection density, parameters that work for one would not work for the other. For the proprietary dataset, which is large and densely connected, we use artists with at least 25 connections to the training graph (known evaluation connections), and 50 unseen connections (“evaluation connections” in Figure 3); this results in 207 artists. For the OLGA dataset, we also require 25 known evaluation connections, but are satisfied with at least 5 unseen evaluation connections; this gives us 44 artists to look at.

6. Results and Discussion

We will first discuss the overall results in the following section. Then, we will use the subsets of artists selected in Section 5.4 to evaluate the sensitivity of our model to decaying connectivity, as observed with less popular artists.

6.1 Overall Evaluation

Table 1 compares the baseline model with the proposed GNN, trained without connection dropout. We can see that the GNN easily out-performs the DNN. It achieves an NDCG@200 of 0.55 vs. 0.24 on the OLGA dataset, and 0.57 vs. 0.44 on the proprietary dataset. The table also demonstrates that the graph topology is more predictive of artist similarity than content-based features: the GNN, using random features, achieves better results than a DNN using informative features for both datasets (0.45 vs. 0.24 on OLGA, and 0.52 vs. 0.44 on the proprietary dataset).

Table 1

NDCG@200 for the baseline (DNN) and the proposed model with 3 graph convolution layers (GNN), using features or random vectors as input. The GNN with real features as input gives the best results. Most strikingly, the GNN with random features—using only the known graph topology—out-performs the baseline DNN with informative features.


DATASET        FEATURES         DNN     GNN

OLGA           Random           0.02    0.45
               AcousticBrainz   0.24    0.55
Proprietary    Random           0.00    0.52
               Musicological    0.44    0.57

Additionally, the results indicate—perhaps unsurprisingly—that the low-level audio features in the OLGA dataset are less informative than the manually annotated high-level features in the proprietary dataset. Although the proprietary dataset poses a more difficult challenge due to the much larger number of candidates (14k vs. 1.8k), the DNN—which can only use the features—improves more over the random baseline on the proprietary dataset (+0.44) than on OLGA (+0.22). These are only indications; for a definitive analysis, we would need to use the exact same features in both datasets.

Similarly, we could argue that the topology in the proprietary dataset seems more coherent than in the OLGA dataset. We can judge this by observing the performance gain obtained by a GNN with random features—which can only leverage the graph topology to find similar artists—compared to a completely random baseline (random features without GC layers). In the proprietary dataset, this performance gain is +0.52, while in the OLGA dataset, only +0.43. Again, while this is not a definitive analysis (other factors may play a role), it indicates that the large amounts of user feedback used to generate ground truth in the proprietary dataset give stable and high-quality similarity connections.

Figure 4 depicts the results for each model and feature set depending on the number of graph convolution layers used. (Recall that a GNN with 0 graph convolutions corresponds to the baseline DNN.) In the OLGA dataset, we see the scores increase with every added layer. This effect is less pronounced in the proprietary dataset, where adding graph convolutions does help significantly, but results plateau after the first graph convolution layer. We believe this is due to the quality and informativeness of the features: the low-level features in the OLGA dataset provide less information about artist similarity than high-level expertly annotated musicological attributes in the proprietary dataset. Therefore, exploiting contextual information through graph convolutions results in more uplift in the OLGA dataset than in the proprietary one.

Figure 4 

Results on the OLGA (top) and the proprietary (bottom) dataset with different numbers of graph convolution layers, using either the given features (left) or random vectors as features (right). Error bars indicate 95% confidence intervals computed using bootstrapping.

Looking at the scores obtained using random features (where the model depends solely on exploiting the graph topology), we observe two remarkable results. First, whereas one graph convolution layer suffices to out-perform the feature-based baseline in the OLGA dataset (0.28 vs. 0.24), using only one GC layer does not produce meaningful results (0.05) in the proprietary dataset. We believe this is due to the different sizes of the respective test sets: 14k in the proprietary dataset, while only 1.8k in OLGA. Using only a very local context seems to be enough to meaningfully organize the artists in a smaller dataset.

Second, most performance gains are obtained with two GC layers, while adding a third GC layer improves the results to a much lesser degree. Our explanation for this effect is that most similar artists are connected through at least one other, common artist. In other words, most artists form similarity cliques with at least two other artists. Within these cliques, in which every artist is connected to all others, missing connections are easily retrieved by no more than 2 graph convolutions.

In fact, in the OLGA dataset, ~71% of all cliques fulfill this requirement. This means that, for any hidden similarity link in the data, in 71% of cases, the true similar artist is within 2 steps in the graph—which corresponds to using two GC layers.

6.2 Evaluation of Long-Tail Artists

Let us now focus on the results specific for long-tail artists. As explained in Section 5.4, we will not use actual long-tail artists for this, since data sparsity prevents a solid evaluation. Instead, we emulate the long-tail condition by removing known connections of well-connected artists, while keeping all their unseen evaluation connections. From the OLGA dataset, we collected 44 artists with at least 25 known connections and at least 5 unseen ones; for the proprietary dataset, we found 207 artists with at least 25 known connections, and at least 50 unseen ones (the proprietary dataset is larger and more densely connected).

We train the largest models with 3 graph convolution layers using varying connection dropout probabilities: 0.0, 0.25, 0.5, 0.75, 0.95 and 0.99; a connection dropout probability of 0.0 corresponds to the baseline GNN model with no connection dropout. Once these models are trained, we use them to evaluate the resulting artist embeddings in different connectivity settings: we sweep the known evaluation connections between 25 and zero (the cold-start scenario) for each evaluation artist, dropping the weakest connection at each step. To reiterate, we do not re-train the models; we only manipulate the connectivity of validation artists when computing artist embeddings.

Figure 5 shows the results. We see that for the baseline model (blue, no connection dropout), results degrade significantly with decreasing connectivity. We observe this effect—though with different intensity—on both datasets. Indeed, the baseline model with no connection dropout performs poorly for cold-start artists: it needs 2 known connections in the OLGA dataset to be on-par with a simple DNN without GC layers (and even 5 in the proprietary one).

Figure 5 

Evaluation of the long-tail performance of a 3-GC-layer model on the OLGA dataset (top) and the proprietary dataset (bottom). The different bars represent models trained with different probabilities of connection dropout. The gray line in the background represents the baseline model with no graph convolution layers, with the shaded area indicating the 95% confidence interval. We see that for the standard model (blue, no connection dropout), performance degrades with fewer connections. Introducing connection dropout significantly reduces this effect.

We also observe how connection dropout greatly reduces that degradation, without negatively impacting results for well-connected artists: in the OLGA dataset, we can use very high dropout rates such as 0.95 to achieve better-than-baseline results for cold-start artists without significantly sacrificing results for others; in the proprietary dataset, we achieve this with a lower dropout probability of 0.75. The optimal ratio of connection dropout clearly depends on the dataset, and is a hyper-parameter to be tuned. However, values of 0.5 or 0.75 seem to be good starting points.

Using connection dropout achieves better results for sparsely connected artists because it prevents the GNN from relying too much on the graph connectivity when computing the embedding. To substantiate this claim, we examine the stability of artist embeddings while manually removing known connections, using the same subset of artists as before. We consider the embedding computed using all 25 known connections to be the true embedding of an artist. We then remove known connections one by one, compute a new artist embedding at each level of connectivity, and calculate the cosine distance of these embeddings to the true embedding.
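The measurement procedure can be sketched as follows (illustrative Python; `embed_fn` stands in for running the trained GNN on an artist with a given set of known connections):

```python
import numpy as np

def embedding_stability(embed_fn, artist, connections_sorted):
    """Cosine distance between the 'true' embedding (all 25 known connections)
    and embeddings computed with progressively fewer connections.

    embed_fn:           function(artist, connections) -> embedding vector.
    connections_sorted: known connections sorted by strength, weakest last.
    """
    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    reference = embed_fn(artist, connections_sorted)          # 'true' embedding
    distances = []
    for n in range(len(connections_sorted), -1, -1):          # 25, 24, ..., 0 connections
        emb = embed_fn(artist, connections_sorted[:n])
        distances.append(cosine_distance(reference, emb))
    return distances
```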

The results are shown in Figure 6: the distance stays close to zero at first, but increases quickly after a certain number of connections have been dropped. We also see how connection dropout greatly reduces this effect. In the OLGA dataset, the cosine distance between the embedding using no connections and the one using all 25 connections is more than 0.5 on average when no connection dropout is used; this number drops to less than 0.2 if we employ a connection dropout rate of 0.99. Similar effects can be observed for the proprietary dataset.

Figure 6 

Cosine distance between embeddings computed using reduced connectivity and the “true” embedding (computed using all 25 known connections). Without connection dropout, the GNNs learn to rely too much on the graph connectivity to compute the artist embedding: the distance between an embedding computed using fewer connections and the “true” embedding grows quickly. With connection dropout, we can strongly curb this effect.

7. Summary and Future Work

In this paper, we described a hybrid approach to computing artist similarity, which uses graph neural networks to combine content-based features with explicit relations between artists.

To evaluate our approach, we assembled a dataset with 17,673 artists, their features, and their similarity relations. Additionally, we used a much larger proprietary dataset to show the scalability of our method. The results showed that leveraging known connections between artists can be more effective for understanding their similarity than high-quality features, and that combining both gives the best results.

The introduction of Connection Dropout in training was shown to be effective in decreasing the model’s reliance on the number of known artist connections, which was detrimental for sparsely connected long-tail artists. The proposed method significantly improves results for such artists without negatively affecting densely connected ones.

Our work is a first step towards models that directly use known relations between musical entities, like tracks, albums, artists, or even genres. Future work could investigate how to employ multi-modality in this context; for example, we could build a multi-modal graph by using connections between different types of entities (e.g. tracks, albums, artists), or different types of connections between the same entities (e.g. artist collaborations, band memberships). Another avenue of research could focus on collecting and using better and/or higher-level features for the OLGA dataset. This would provide a better judgement of the importance of feature quality in the proposed model.