1 Introduction
Singing contains both textual and musical information. As an important component of singing voice analysis, automatic transcription of singing voice comprises automatic lyric transcription (ALT) and automatic music transcription (AMT). The former is the task of recognizing textual information, while the latter is the task of identifying musical information, including the onsets, offsets, and pitches of note events. These two tasks facilitate solving many downstream music information retrieval problems. For instance, ALT can be applied to lyric alignment [29], query by singing [32], audio indexing [20], music subtitling [18], and singing pronunciation evaluation [58]. AMT can be applied to sight-singing evaluation [79], music therapy [71], and human–computer interaction [57, 76]. Furthermore, both can be employed in singing voice synthesis [36, 44], a topic that has recently been actively studied in the singing field.
Traditionally, ALT and AMT systems are built only on the audio modality and treated as separate tasks with distinct objectives. However, they face certain common challenges, which motivates the development of a generalized solution.
Insufficient robustness to noise. Audio recordings of singing may be accompanied by noise, e.g., background music. In challenging signal-to-noise ratio (SNR) environments, the intelligibility of singing in the audio modality is drastically reduced, which hampers the retrieval of lyrics and musical note events. In our previous work [27], we showed that attempting the ALT task solely on audio recordings in noisy environments yields unsatisfactory performance. Additionally, [26, 35] showed that low-SNR environments greatly harm the performance of pitch estimation from speech. Considering that singing and speech share similarities in their sound production mechanisms, it is reasonable to surmise that audio-only AMT from singing voices would face a similar noise-robustness challenge.
Limited data for complex tasks. Singing transcription is notably more difficult than speech-related recognition tasks due to the scarcity of labeled data and the intricate intertwining of textual and musical information within singing. Speech recognition benefits from large-scale annotated datasets such as LibriSpeech [62], which comprises 960 hours of annotated speech recordings. In contrast, DSing [8, 13], a widely used ALT dataset, has about 150 hours of data, and the largest AMT dataset, MIR-ST500 [75], contains only around 30 hours. The scarcity of labeled data arises from the time-consuming process of manual annotation, which requires extensive musical knowledge. Additionally, singers inevitably adjust or compromise certain linguistic features, such as word stress and articulation, to accommodate properties or constraints of singing that are not present in regular speech, such as melody, tempo, or deliberate timbre adjustments. As a result, singing tends to be less intelligible than speech [66], which further complicates transcription.
The perception of both speech and singing extends beyond the auditory realm, as exemplified by the McGurk effect [49]. This phenomenon highlights the significant impact of visual information on auditory perception. Inspired by this, we assume that incorporating more modalities in singing will enhance the performance of both ALT and AMT systems, particularly with respect to noise robustness. In our previous work [27], we developed the first multimodal ALT system, MM-ALT, capable of processing audio, video, and IMU inputs. Comparative analyses between MM-ALT and its single-modal counterparts revealed that supplementary modalities, especially videos of lip movements, contribute significantly to noise robustness. However, AMT from multimodal singing has not yet been explored. A position paper [76] mentioned the potential of multimedia fusion approaches for improving AMT from music or singing. To address this research gap and validate our assumption, we extend our previous work [27] to accommodate both multimodal ALT and AMT. In developing our multimodal system, we propose adapting self-supervised learning (SSL) models, e.g., wav2vec 2.0 [3] and AV-HuBERT [67], from the speech domain to the singing domain. This approach addresses the limited data availability for audio-only ALT and AMT by harnessing the abundance of speech data. Furthermore, to better integrate representations from various modalities, we introduce a residual cross-attention (RCA) mechanism, which combines self-attention and cross-attention to effectively utilize the strengths of each modality and exploit the complementary relationships among modalities. To summarize, our contributions are four-fold:
—
We present a general framework for ALT and AMT from multimodal singing. Our framework incorporates both audio and video modalities. To support the development of these systems, we curate the first multimodal singing dataset, consisting of N20EMv1 for ALT and N20EMv2 for AMT. By introducing the video modality, our systems demonstrate increased noise robustness. Under severe perturbations from musical accompaniment (-10 dB SNR), our systems outperform their audio-only counterparts by large margins.
—
We adapt SSL models from the speech domain to the singing domain using our proposed adaptation method. Consequently, our audio-only systems achieve state-of-the-art (SOTA) performance for both ALT and AMT on widely used benchmark singing datasets, including DSing [8, 13], DALI [53, 54], Jamendo [69], Hansen [30], Mauch [48], MIR-ST500 [75], TONAS [25], and ISMIR2014 [56].
—
We introduce the new tasks of lyric lipreading and note lipreading, which use only video information. Our systems are capable of extracting language-related information (lyrics) and music-related information (note events) from the video modality alone.
—
We introduce RCA, a new feature fusion method that leverages both self-attention and cross-attention mechanisms to better fuse multimodal singing features.
Our previous work [27] focused on the construction and evaluation of the multimodal ALT system. This article extends it in the following aspects: (1) We propose a generalized problem setting for both ALT and AMT from multimodal singing voice, focusing on the audio and video modalities. (2) Based on the data collected in [27], we curate a new dataset named N20EMv2 with annotations tailored for AMT. (3) We propose a novel adaptation strategy for AMT. (4) We conduct extensive experiments with single-modal and multimodal AMT systems. (5) We include more comparison experiments and ablation studies to demonstrate the effectiveness of our methods.
4 Methodology
4.1 Problem Formulation
We consider a general setting for both ALT and AMT from singing. Specifically, given synchronized singing recordings from multiple modalities (in this work, we consider the audio and video modalities, \(\mathbf {x}^{A}\) and \(\mathbf {x}^{V}\); our framework can be seamlessly extended to scenarios with more modalities), the ALT target is a sequence of tokens \(\mathbf {y}^{L}=\lbrace y_1^{L}, y_2^{L}, \ldots , y_{N_1}^{L}\rbrace , y_n^{L}\in \mathbb {V}\), where \(N_1\) is the length of the output sequence and \(\mathbb {V}\) represents the vocabulary comprising all possible tokens. Since lyrics belong to the textual modality, various tokenizers, such as characters, words, subwords, or phonemes, can be used to represent tokens. In this work, we use a character tokenizer; the vocabulary then contains 26 English letters and four special characters (beginning of sentence \(\lt \text{bos}\gt\), end of sentence \(\lt \text{eos}\gt\), quotation \(\lt ^{\prime }\gt\), and word boundary \(\lt \quad \gt\)). AMT aims to produce a sequence of note events \(\mathbf {y}^{M}=[(o_1, f_1, p_1),(o_2, f_2, p_2), \ldots ,(o_{N_2}, f_{N_2}, p_{N_2})]\), where \(o_n\) and \(f_n\) are the onset and offset times of the \(n\)th note, \(0\le o_1\lt f_1\le o_2\lt f_2\le \ldots \le o_{N_2}\lt f_{N_2}\), \(p_n\) is the note pitch value, and \(N_2\) is the number of note events. Consequently, the multimodal ALT system is a function that maps \(\mathbf {x}^{A}\) and \(\mathbf {x}^{V}\) into \(\mathbf {y}^{L}\), while the multimodal AMT system is a function that maps \(\mathbf {x}^{A}\) and \(\mathbf {x}^{V}\) into \(\mathbf {y}^{M}\).
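To make the input–output contract concrete, the following minimal Python sketch (ours, not from the paper's codebase) illustrates one way to represent the ALT target \(\mathbf {y}^{L}\) and the AMT target \(\mathbf {y}^{M}\); the helper names and example values are hypothetical.

```python
# Minimal sketch (not the paper's code) of the two target representations.
from dataclasses import dataclass
from typing import List

# 4 special characters + 26 English letters = 30 tokens, as described in the text.
CHAR_VOCAB = ["<bos>", "<eos>", "'", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]
assert len(CHAR_VOCAB) == 30

@dataclass
class NoteEvent:
    onset: float   # o_n, seconds
    offset: float  # f_n, seconds
    pitch: int     # p_n, MIDI note number (C2 = 36 ... B5 = 83)

def encode_lyrics(lyrics: str) -> List[int]:
    """Map a lyric line to y^L, a sequence of character token ids (illustrative)."""
    tokens = ["<bos>"] + list(lyrics.lower()) + ["<eos>"]
    return [CHAR_VOCAB.index(t) for t in tokens if t in CHAR_VOCAB]

# Example targets for a short phrase sung on two notes (dummy values).
y_L = encode_lyrics("hello world")
y_M = [NoteEvent(0.10, 0.55, 60), NoteEvent(0.60, 1.20, 62)]
```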
As presented in Figure 2, each system consists of a feature representation learning frontend and a task-specific backend. Initially, modality-specific encoders \(\phi ^{A}\) and \(\phi ^{V}\) are employed to extract feature representations for each input modality. The modality feature fusion module \(\psi\) first aligns the features from different modalities so that they have the same number of frames and the same dimensions. Afterward, \(\psi\) projects the features from different modalities into a shared latent space and integrates them to obtain more informative representations. Finally, the task-specific backends \(\theta ^{L}\) and \(\theta ^{M}\) transform the fused representations into lyrics and note events, respectively.
Considering that the lengths of the input and output sequences do not have a fixed relationship, we formulate multimodal ALT and multimodal AMT as two sequence-to-sequence (S2S) problems. While the two systems share the same encoder architectures (but not their parameter weights), they are trained separately. It is worth noting that (1) our systems can accommodate a single input modality or multiple input modalities and (2) our systems can be extended to output both lyrics and note events simultaneously. We direct readers to Section 6 for further discussion.
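The overall composition \(\theta \big(\psi (\phi ^{A}(\mathbf {x}^{A}), \phi ^{V}(\mathbf {x}^{V}))\big)\) can be sketched as follows; the module classes are schematic stand-ins of our own, not the actual wav2vec 2.0 / AV-HuBERT implementations.

```python
import torch
import torch.nn as nn

class MultimodalTranscriber(nn.Module):
    """Schematic frontend/fusion/backend composition; the real encoders are
    wav2vec 2.0 (audio) and AV-HuBERT (video)."""
    def __init__(self, audio_encoder: nn.Module, video_encoder: nn.Module,
                 fusion: nn.Module, backend: nn.Module):
        super().__init__()
        self.phi_a, self.phi_v = audio_encoder, video_encoder
        self.psi, self.theta = fusion, backend

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        z_a = self.phi_a(x_a)      # (B, T, 1024) acoustic features, ~50 Hz
        z_v = self.phi_v(x_v)      # (B, T', 1024) visual features
        z = self.psi(z_a, z_v)     # fused features aligned to (B, T, 1024)
        return self.theta(z)       # lyric token logits or frame-level note targets
```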
4.2 Modality-specific Encoders
The audio encoder \(\phi ^{A}\) is designed to learn acoustic representations for the audio modality. We propose adapting SSL models, specifically wav2vec 2.0 LARGE [3], from the speech domain to the singing domain. The rationale behind this choice is that SSL models pretrained on abundant speech data exhibit strong generalization capabilities even when provided with low-resource labeled data in new domains. wav2vec 2.0 consists of a CNN-based feature encoder and a Transformer-based context network. The feature encoder has seven temporal 1D convolutional blocks; it takes the raw waveform of the singing audio and produces latent singing representations, which are then fed into the context network. By capturing global temporal information, the context network transforms the latent singing representations into contextual singing representations. The resulting output \({\bf z}^{A}\) has a frame rate of approximately 49.8 Hz (equivalent to a frame length of about 20 ms), with each frame having 1,024 dimensions.
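As an illustration of the acoustic frontend (ours, using the HuggingFace interface rather than the paper's training code; the checkpoint name is an assumption), the ~49.8 Hz, 1,024-dimensional features can be extracted as follows.

```python
# Sketch: extracting ~50 Hz, 1,024-dim acoustic features with a pretrained wav2vec 2.0 LARGE.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-large-lv60"  # assumed LARGE checkpoint, not necessarily the paper's
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt).eval()

waveform = torch.randn(16000 * 5)  # 5 s of 16 kHz singing audio (dummy)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    z_a = model(inputs["input_values"]).last_hidden_state  # (1, 249, 1024)

# The CNN feature encoder strides by 320 samples (about 20 ms at 16 kHz),
# so 5 s of audio yields 249 frames, i.e., roughly 49.8 frames per second.
print(z_a.shape, z_a.shape[1] / 5.0)
```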
The video encoder \(\phi ^{V}\) is designed to learn visual representations of singing from videos of lip movements. We adopt AV-HuBERT LARGE [67], one of the SOTA approaches for lipreading. Similar to wav2vec 2.0, AV-HuBERT consists of a CNN-based image encoder and a Transformer-based encoder. The image encoder is constructed from a 3D convolutional frontend followed by a modified ResNet-18 block [31]; it extracts latent visual representations, which can be regarded as embeddings of the video frames. The Transformer encoder then operates on these video embeddings and captures contextual visual representations by considering the relationships among video frames over a large context. The frame rate of the final output \(\mathbf {z}^{V}\) matches that of the input video clips, with each frame having 1,024 dimensions. In the original AV-HuBERT setup, the input video frame rate is 25 Hz. For ALT, we retain this frame rate given the task's similarity to ASR. However, transcribing note events requires higher temporal resolution, so we use an input frame rate of 50 Hz for our AMT systems.
4.3 Modality Feature Fusion
The modality feature fusion module \(\psi\) aims to exploit the complementary relationships and redundancy present in the different modalities. Before fusing the acoustic representations \(\mathbf {z}^{A}\) and the visual representations \(\mathbf {z}^{V}\), we unify the frame rates to about 50 Hz and the frame dimensions to 1,024 where necessary. Specifically, we up-sample \(\mathbf {z}^V\) using nearest-neighbor interpolation with a scale factor of 2. Afterward, we introduce a new attention module called RCA for fusing the unified features, as illustrated in Figure 3. RCA is built upon the Transformer block architecture, and its illustration can be found in Appendix D. There are \(M\) RCA blocks when considering \(M\) input modalities. Every RCA block takes input representations from all modalities. Within each block, one modality is designated as the source, providing keys and values, while the remaining modalities serve as references, providing queries. In addition to the multi-head self-attention (MHSA) [74] operation applied to the source modality, each RCA block adds extra shortcuts by performing the multi-head cross-attention (MHCA) operation between the source and each reference. The outputs of all RCA blocks are then aggregated to yield the final fused features \(\mathbf {z}\). RCA can thus be expressed in terms of these MHSA and MHCA operations combined with layer normalization (\(\text{LN}\)) and a position-wise feed-forward network (\(\text{FFN}\)).
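As a concrete, simplified illustration, the following PyTorch sketch implements one plausible RCA block consistent with the description above (residual MHSA on the source plus MHCA shortcuts from each reference, followed by LN and an FFN). The normalization placement and the summation used to aggregate the block outputs are our assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCABlock(nn.Module):
    """One residual cross-attention block: the source modality attends to itself (MHSA)
    and receives cross-attention shortcuts from every reference modality (MHCA)."""
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, src: torch.Tensor, refs: list) -> torch.Tensor:
        h = src + self.mhsa(src, src, src)[0]        # residual self-attention on the source
        for ref in refs:                             # cross-attention shortcuts
            h = h + self.mhca(ref, src, src)[0]      # queries from reference, keys/values from source
        h = self.ln1(h)
        return self.ln2(h + self.ffn(h))

# Fuse audio (~50 Hz) and video (25 Hz) features: upsample video, one block per modality, sum.
z_a = torch.randn(2, 250, 1024)                      # (B, T, D) acoustic features
z_v = torch.randn(2, 125, 1024)                      # (B, T/2, D) visual features
z_v = F.interpolate(z_v.transpose(1, 2), scale_factor=2, mode="nearest").transpose(1, 2)

block_a, block_v = RCABlock(), RCABlock()
z = block_a(z_a, [z_v]) + block_v(z_v, [z_a])        # aggregated fused features z
```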
4.4 Automatic Lyric Transcription Backend
For our ALT systems, we design a hybrid CTC-Attention backend to address the S2S problem, inspired by [77], as presented in Figure 4(a). Initially, the ground-truth lyrics are converted into a sequence of tokens \(\mathbf {y}^{L}=\lbrace y_1^{L}, y_2^{L}, \ldots , y_{N_1}^{L}\rbrace , y_n^{L}\in \mathbb {V}\), where \(\mathbb {V}\) represents the character vocabulary comprising 30 tokens. The ALT backend \(\theta ^{L}\) aims to predict \(p(\mathbf {y}^{L}|\mathbf {z})\) and consists of a two-layer MLP, a CTC linear layer, and an S2S decoder. First, the MLP with 1,024 hidden neurons further encodes the fused features \(\mathbf {z}\) into \(\mathbf {e}\in \mathbb {R}^{T\times 1024}\), where \(T\) denotes the number of frames. Subsequently, two network branches compute \(p(\mathbf {y}^{L}|\mathbf {z})\), or equivalently \(p(\mathbf {y}^{L}|\mathbf {e})\).
The first branch is a CTC linear layer, which maps \(\mathbf {e}\) to per-frame output probabilities \(p_{\text{CTC}}(\pi _t|e_t), \pi _t\in \mathbb {V}\cup \lbrace \lt \text{blank}\gt \rbrace , t=1,2, \ldots ,T\), where \(\lt \text{blank}\gt\) is the blank token. In CTC, each frame's prediction is considered independent, so the probability of an alignment \(\pi _{1:T}\) is \(p(\pi _{1:T}|\mathbf {e})=\prod _{t=1}^Tp(\pi _t|e_t)\). The final prediction for the output sequence \(\mathbf {y}^{L}\) is derived from the alignment \(\pi _{1:T}\) by eliminating repeated tokens and \(\lt \text{blank}\gt\) tokens; this operation is denoted \(\mathcal {B}\). To supervise the CTC predictions, the ground-truth labels must be converted into all possible CTC alignments. We use \(\mathcal {B}^{-1}(\mathbf {y}^{L})\) to represent all CTC paths mapped from \(\mathbf {y}^{L}\), so that \(p(\mathbf {y}^{L}|\mathbf {e})=\sum _{\pi _{1:T}\in \mathcal {B}^{-1}(\mathbf {y}^{L})}p(\pi _{1:T}|\mathbf {e})\). Therefore, the CTC loss is written as
\[
\mathcal {L}_{\text{CTC}} = -\log p(\mathbf {y}^{L}|\mathbf {e}) = -\log \sum _{\pi _{1:T}\in \mathcal {B}^{-1}(\mathbf {y}^{L})}\prod _{t=1}^{T}p_{\text{CTC}}(\pi _t|e_t).
\]
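As a concrete illustration (ours, using PyTorch's built-in CTC loss rather than the paper's implementation), the CTC branch can be supervised as follows; shapes and the blank-token index are conventions we choose for the sketch.

```python
import torch
import torch.nn as nn

vocab_size = 30                        # character vocabulary
num_classes = vocab_size + 1           # + <blank>
ctc_linear = nn.Linear(1024, num_classes)
ctc_loss = nn.CTCLoss(blank=vocab_size, zero_infinity=True)  # blank id is our convention

e = torch.randn(4, 500, 1024)          # (B, T, 1024) encoded fused features
log_probs = ctc_linear(e).log_softmax(dim=-1).transpose(0, 1)  # (T, B, C), as CTCLoss expects

targets = torch.randint(0, vocab_size, (4, 60))                # dummy token ids y^L
input_lengths = torch.full((4,), 500, dtype=torch.long)
target_lengths = torch.full((4,), 60, dtype=torch.long)

loss_ctc = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```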
The second branch is parameterized by a location-aware attention-based GRU decoder [6]. In contrast to the CTC formulation, the S2S formulation does not assume independence among predictions. Instead, it directly computes \(p(\mathbf {y}^{L}|\mathbf {e})=\prod _{n=1}^{N_1}p(y_n^{L}|y_{1:n-1}^{L},\mathbf {e})\) following the chain rule. To predict each target token \(y_n^{L}\), the S2S decoder takes the previously predicted tokens \(y_{1:n-1}^{L}\) as input and utilizes a location-aware attention mechanism to derive a contextually weighted \(\mathbf {e}\). This attention mechanism enables the model to focus on the parts of \(\mathbf {e}\) that are relevant for predicting the current token \(y_n^{L}\). The S2S loss is then written as
\[
\mathcal {L}_{\text{S2S}} = -\log p(\mathbf {y}^{L}|\mathbf {e}) = -\sum _{n=1}^{N_1}\log p(y_n^{L}|y_{1:n-1}^{L},\mathbf {e}).
\]
As we employ a hybrid system, the overall loss function is a weighted sum of the two aforementioned loss terms: \(\mathcal {L}^{L}=(1-\lambda)\mathcal {L}_{\text{S2S}} + \lambda \mathcal {L}_{\text{CTC}}\). To balance the two losses, we set \(\lambda =0.2\) in this work.
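Continuing the illustrative variables above, the hybrid objective is a one-line weighted sum (the S2S term below is a placeholder standing in for the attention decoder's negative log-likelihood):

```python
import torch

lam = 0.2                                   # lambda from the text
loss_ctc = torch.tensor(2.3)                # placeholder values for illustration
loss_s2s = torch.tensor(1.7)
loss_alt = (1 - lam) * loss_s2s + lam * loss_ctc   # L^L = (1 - lambda) L_S2S + lambda L_CTC
```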
During inference, in addition to the hybrid CTC-Attention structure described above, we leverage a character-level LSTM LM. We predict the most likely lyrics via beam search over a weighted combination of the three log-probability terms (S2S, CTC, and LM), where the hyper-parameters \(\alpha\) and \(\beta\) balance the three terms. We set the beam size to 512. To evaluate the performance of our ALT systems, we report the word error rate (WER), a widely used metric for this task.
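For reference, WER can be computed with an off-the-shelf package such as jiwer; this is our illustration, and the paper does not specify which WER implementation it uses.

```python
from jiwer import wer

reference = "you raise me up so i can stand on mountains"
hypothesis = "you raise me up so i can stand on mountain"
print(wer(reference, hypothesis))  # 1 substitution over 10 reference words -> 0.1
```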
4.5 Automatic Music Transcription Backend
For our AMT systems, we reformulate the S2S problem as a frame-level classification problem, inspired by [75]. The ground-truth note events \(\mathbf {y}^{M}=[(o_1, f_1, p_1),(o_2, f_2, p_2), \ldots ,(o_{N_2}, f_{N_2}, p_{N_2})]\) are transformed into frame-level onset/silence/pitch-name/octave targets, represented as \(\mathbf {w}^{1}, \mathbf {w}^{2}, \mathbf {w}^{3}, \mathbf {w}^{4}\). This transformation enables us to classify each frame of the fused features \(\mathbf {z}\in \mathbb {R}^{T\times 1024}\) into the corresponding labels, as visualized in Figure 4(b). Since directly predicting offsets is challenging, our AMT backend predicts silence instead, and the offsets \(f_1, f_2, \ldots , f_{N_2}\) are determined as the beginnings of silence frames. We use a pitch name and an octave to denote each note pitch.
To construct \(\mathbf {w}^1\), frames covering the onsets \(o_1, o_2,\ldots ,o_{N_2}\) are labeled as 1, while other frames are labeled as 0. Similarly, silence frames are assigned a label of 1 in \(\mathbf {w}^{2}\), while other frames are assigned a label of 0. As a result, we can use binary values to indicate the state of each frame in \(\mathbf {w}^{1}, \mathbf {w}^{2}\). In conventional practice, pitch values \(p_1, p_2,\ldots ,p_{N_2}\) are represented as MIDI note numbers ranging from C2 (MIDI number 36, 65.41 Hz) to B5 (MIDI number 83, 987.77 Hz). Here "B" and "C" are the pitch names, while "2" and "5" are the octaves. According to music theory, there are 12 notes (\(C, D\flat , D, E\flat , E, F, G\flat , G, A\flat , A, B\flat , B\)) in each octave. We consider a pitch range from C2 to B5, resulting in a total of four octaves. Additionally, we introduce an octave class and a pitch name class to represent silence. Consequently, each frame of \(\mathbf {w}^{3}\) has 13 possible values, and each frame of \(\mathbf {w}^{4}\) has 5 possible values. During inference, the frame-level predictions are transformed back into note events. Note that the transformation between note events and frame-level targets introduces temporal quantization errors; therefore, the frame resolution significantly impacts the AMT accuracy.
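The conversion from note events to the four frame-level targets can be sketched as follows (our illustration; the silence class indices, rounding scheme, and 50 Hz frame rate are assumptions consistent with the text).

```python
import numpy as np

FPS = 50                       # ~20 ms frames, matching the encoder frame rate
LOW_MIDI, HIGH_MIDI = 36, 83   # C2 ... B5

def notes_to_frame_targets(notes, num_frames):
    """Convert note events (onset, offset, midi_pitch) into w^1..w^4 frame-level targets."""
    w_onset = np.zeros(num_frames, dtype=np.float32)
    w_silence = np.ones(num_frames, dtype=np.float32)
    w_name = np.full(num_frames, 12, dtype=np.int64)    # class 12 = silence pitch name
    w_octave = np.full(num_frames, 4, dtype=np.int64)   # class 4 = silence octave
    for onset, offset, pitch in notes:
        on, off = int(round(onset * FPS)), int(round(offset * FPS))
        w_onset[on] = 1.0
        w_silence[on:off] = 0.0
        w_name[on:off] = (pitch - LOW_MIDI) % 12
        w_octave[on:off] = (pitch - LOW_MIDI) // 12
    return w_onset, w_silence, w_name, w_octave

# Two dummy notes over a 2 s (100-frame) excerpt.
w1, w2, w3, w4 = notes_to_frame_targets([(0.10, 0.55, 60), (0.60, 1.20, 62)], num_frames=100)
```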
The AMT backend \(\theta ^{M}\) consists of a linear layer with 20 output neurons, allocating 1, 1, 13, and 5 neurons to \(\mathbf {w}^{1}, \mathbf {w}^{2}, \mathbf {w}^{3}, \mathbf {w}^{4}\), respectively. The output probabilities can be expressed as \(p(\mathbf {w}^{i}|\mathbf {z})=\prod _{t=1}^T p(w_t^{i}|z_t), i=1,2,3,4\). To train the AMT system, we combine the loss terms for the four targets, where we employ the binary cross-entropy (BCE) loss for targets \(\mathbf {w}^{1}, \mathbf {w}^{2}\) and the cross-entropy loss for targets \(\mathbf {w}^{3}, \mathbf {w}^{4}\). Notably, we set a positive weight of 15.0 in the BCE loss for onset prediction to compensate for the imbalanced distribution in \(\mathbf {w}^{1}\).
In Figure 4(b), we provide a visualization of the post-processing step that converts the predictions for \(\mathbf {w}^1, \mathbf {w}^2, \mathbf {w}^3, \mathbf {w}^4\) into note events; the details are given in Appendix B. At a high level, we first identify pairs of onsets and offsets and then identify the pitch between them. Unless otherwise stated, we use a fixed onset threshold of 0.4 and an offset threshold of 0.5. AMT systems are typically evaluated using the F1-scores of COnPOff (correct onset, pitch, and offset), COnP (correct onset and pitch), and COn (correct onset); their definitions and implementations can be found in [56, 63]. To ensure fair comparisons with previous approaches, such as [19, 33, 41, 46, 75], we set the pitch tolerance to 50 cents, the onset tolerance to 50 ms, and the offset tolerance to the maximum of 50 ms and \(0.2\times\) the note duration. Additionally, we use the F1-score of the COff (correct offset) metric to evaluate the performance of offset detection.
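The following sketch shows a simplified version of this decoding step; it is our approximation of the procedure detailed in the paper's Appendix B (e.g., the majority-vote pitch assignment is an assumption).

```python
import numpy as np

def frames_to_notes(p_onset, p_silence, name_pred, octave_pred,
                    onset_thr=0.4, offset_thr=0.5, fps=50):
    """Simplified decode: an onset peak opens a note, the next silence frame closes it,
    and the note pitch is the majority pitch class/octave in between."""
    notes = []
    onset_frames = [t for t in range(len(p_onset))
                    if p_onset[t] >= onset_thr and (t == 0 or p_onset[t - 1] < onset_thr)]
    for i, on in enumerate(onset_frames):
        end = onset_frames[i + 1] if i + 1 < len(onset_frames) else len(p_onset)
        off = next((t for t in range(on + 1, end) if p_silence[t] >= offset_thr), end)
        if off <= on:
            continue
        name = np.bincount(name_pred[on:off], minlength=13).argmax()
        octave = np.bincount(octave_pred[on:off], minlength=5).argmax()
        if name < 12 and octave < 4:                  # skip segments voted as silence
            notes.append((on / fps, off / fps, 36 + 12 * octave + name))
    return notes
```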
4.6 Training Strategy
We developed several training strategies for our multimodal ALT and AMT systems to address the following challenges. One key challenge is adapting SSL models from the speech domain to the singing domain. In our approach, we utilize SSL models, namely wav2vec 2.0 [3] as the audio encoder and AV-HuBERT [67] as the video encoder. Originally, these models are pretrained on unlabeled speech data with SSL objectives and then finetuned on labeled speech data with ASR objectives. As mentioned before, these SSL models have demonstrated the ability to generalize well to new domains, even in low-resource labeled scenarios, which can be attributed to their unsupervised learning on rich speech data. Given the similarities between speech and singing data, we hypothesize that these SSL models can also generalize effectively to our setting. For the ALT task, we initialize our audio and video encoders with the SSL models pretrained and finetuned on speech data. This choice is motivated by the fact that ALT and ASR are analogous tasks with similar input-output pairs, so we expect both the pretraining and the finetuning on speech data to benefit the ALT task. However, the targets of the AMT task are note events rather than text as in ALT and ASR. Hence, a question arises regarding the adaptation of the SSL models: will finetuning on speech data be advantageous for the AMT task?
Inspired by [42], we speculate that finetuning on speech data may distort the pretrained features of SSL models and bias them toward ASR, thus hindering their generalization to AMT. To address this concern, we propose a new adaptation strategy specifically tailored to the AMT task: we skip the finetuning step on speech data with ASR objectives and instead conduct linear probing of the AMT backend \(\theta ^{M}\), followed by full finetuning of the entire system. To further compare the two adaptation strategies, we outline the training pipelines for the single-modal ALT system and the single-modal AMT system in Algorithm 1 and Algorithm 2, respectively (for a single-modal system, the feature fusion module \(\psi\) can be omitted). Typically, we use a learning rate \(\gamma _2\) smaller than \(\gamma _1\) to preserve the pretrained features of the modality-specific encoders.
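A minimal sketch of this adaptation strategy is shown below; the module classes, optimizer choice, and learning-rate values are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(1024, 1024)        # stands in for the SSL-pretrained encoder
backend = nn.Linear(1024, 20)          # stands in for the AMT backend theta^M
gamma_1, gamma_2 = 1e-3, 1e-5          # assumed values; gamma_2 < gamma_1, as in the text

# Stage 1: linear probing -- freeze the encoder, train only the backend with gamma_1.
for p in encoder.parameters():
    p.requires_grad = False
probe_optimizer = torch.optim.Adam(backend.parameters(), lr=gamma_1)

# Stage 2: full finetuning -- unfreeze the encoder and train the whole system with the
# smaller rate gamma_2 to preserve the pretrained features.
for p in encoder.parameters():
    p.requires_grad = True
finetune_optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(backend.parameters()), lr=gamma_2
)
```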
Both wav2vec 2.0 and AV-HuBERT in our multimodal systems are large-scale models. Consequently, to mitigate the high GPU memory demands, we propose a two-stage training approach similar to [61]. In the first stage, we train the single-modal systems independently, each consisting of a modality-specific encoder and a task-specific backend. In the second stage, we freeze the modality-specific encoders and train only the feature fusion module and the task-specific backend. In this way, we avoid loading and updating all model weights simultaneously while taking advantage of the powerful singing representations learned by the single-modal systems. For more details, we refer readers to Appendix B.
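The second stage can be sketched as follows; the module classes and learning rate are illustrative stand-ins (ours), not the actual encoders, fusion module, or hyper-parameters.

```python
import torch
import torch.nn as nn

# Stage 2: the single-modal encoders are frozen; only fusion and backend are optimized.
audio_encoder = nn.Linear(1024, 1024)   # trained in stage 1, now frozen
video_encoder = nn.Linear(1024, 1024)   # trained in stage 1, now frozen
fusion = nn.Linear(2048, 1024)          # stands in for the RCA fusion module psi
backend = nn.Linear(1024, 20)           # stands in for theta^L or theta^M

for module in (audio_encoder, video_encoder):
    module.eval()
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    list(fusion.parameters()) + list(backend.parameters()), lr=1e-4  # lr is an assumption
)
```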
7 Conclusion
In this work, we proposed a unified multimodal framework for transcribing lyrics and note events from singing voices. To develop our systems, we carefully curated the multimodal singing ALT dataset N20EMv1 and the multimodal singing AMT dataset N20EMv2. We then adapted SSL models from the speech domain to the singing domain as acoustic encoders, yielding SOTA performance. Additionally, we adapted SSL models originally used for lipreading to serve as visual encoders, allowing us to introduce two novel tasks: lyric lipreading and note lipreading. Our results demonstrated that the video modality can contribute significantly to both ALT and AMT, despite the inherent challenges posed by ambiguity. Finally, we introduced RCA, a new feature fusion method, to fuse features from different modalities and produce the final transcription. Through comprehensive experiments, we demonstrated the advantages of incorporating additional modalities, which led to improved transcription performance and enhanced robustness against sound contamination and perturbations.