1 Introduction
Singing contains both textual and musical information. As an important component of singing voice analysis, automatic transcription of singing voice comprises automatic lyric transcription (ALT) and automatic music transcription (AMT). The former is the task of recognizing textual information, while the latter is the task of identifying musical information, including the onsets, offsets, and pitches of note events. These two tasks facilitate solving many downstream music information retrieval problems. For instance, ALT can be applied to lyric alignment [29], query by singing [32], audio indexing [20], music subtitling [18], and singing pronunciation evaluation [58]. AMT can be applied to sight-singing evaluation [79], music therapy [71], and human–computer interaction [57, 76]. Furthermore, both can be employed in singing voice synthesis [36, 44], a topic that has recently been actively studied in the singing field.
Traditionally, ALT and AMT systems are built only on the audio modality and treated as separate tasks with distinct objectives. However, they face certain common challenges, which motivates the development of a generalized solution.
Insufficient robustness to noise. Audio recordings of singing may be accompanied by noise, e.g., background music. In challenging signal-to-noise ratio (SNR) environments, the intelligibility of singing in the audio modality is drastically reduced, which hampers the retrieval of lyrics and musical note events. In our previous work [27], we showed that attempting the ALT task solely on audio recordings in noisy environments yields unsatisfactory performance. Additionally, [26, 35] showed that low-SNR environments greatly harm the performance of pitch estimation from speech. Considering that singing and speech share similarities in their sound production mechanisms, it is reasonable to surmise that audio-only AMT from singing voices would face a similar noise-robustness challenge.
Limited data for complex tasks. Singing transcription is notably more difficult than speech-related recognition tasks due to the scarcity of labeled data and the intricate intertwining of textual and musical information within singing. Speech recognition benefits from large-scale annotated datasets such as LibriSpeech [62], which comprises 960 hours of annotated speech recordings. In contrast, DSing [8, 13], a widely used ALT dataset, has about 150 hours of data, and the largest AMT dataset, MIR-ST500 [75], contains only around 30 hours. The scarcity of labeled data arises from the time-consuming process of manual annotation, which requires extensive musical knowledge. Additionally, singers inevitably adjust or compromise certain linguistic features, such as word stress and articulation, to accommodate properties or constraints of singing that are not present in regular speech, such as melody, tempo, or deliberate timbre adjustments. As a result, singing tends to be less intelligible than speech [66], which further complicates transcription.
The perception of both speech and singing extends beyond the auditory realm, as exemplified by the McGurk effect [49]. This phenomenon highlights the significant impact of visual information on auditory perception. Inspired by this, we assume that incorporating more modalities in singing will enhance the performance of both ALT and AMT systems, particularly with respect to noise robustness. In our previous work [27], we developed the first multimodal ALT system, MM-ALT, capable of processing audio, video, and IMU inputs. Comparative analyses between MM-ALT and its single-modal counterparts revealed that supplementary modalities, especially videos of lip movements, contribute significantly to noise robustness. However, AMT from multimodal singing has not yet been explored. A position paper [76] mentioned the potential of multimedia fusion approaches for improving AMT from music or singing. To address this research gap and validate our assumption, we extend our previous work [27] to accommodate both multimodal ALT and AMT. In developing our multimodal system, we propose adapting self-supervised learning (SSL) models, e.g., wav2vec 2.0 [3] and AV-HuBERT [67], from the speech domain to the singing domain. This approach addresses the limited data availability for audio-only ALT and AMT by harnessing the abundance of speech data. Furthermore, to better integrate representations from various modalities, we introduce a residual cross-attention (RCA) mechanism, which combines self-attention and cross-attention to effectively utilize the strengths of each modality and exploit the complementary relationships among modalities. To summarize, our contributions are four-fold:
—
We present a general framework for ALT and AMT from multimodal singing. Our framework incorporates both audio and video modalities. To support the development of these systems, we curate the first multimodal singing dataset, consisting of N20EMv1 for ALT and N20EMv2 for AMT. By introducing the video modality, our systems demonstrate increased noise robustness. Under severe perturbations from musical accompaniment (-10 dB SNR), our systems outperform their audio-only counterparts by large margins.
—
We adapt SSL models from the speech domain to the singing domain using our proposed adaptation method. Consequently, our audio-only systems achieve state-of-the-art (SOTA) performance for both ALT and AMT on widely used benchmark singing datasets, including DSing [8, 13], DALI [53, 54], Jamendo [69], Hansen [30], Mauch [48], MIR-ST500 [75], TONAS [25], and ISMIR2014 [56].
—
We introduce the new tasks of lyric lipreading and note lipreading, which use only video information. Our systems are capable of extracting language-related information (lyrics) and music-related information (note events) from the video modality alone.
—
We introduce RCA, a new feature fusion method that leverages both self-attention and cross-attention mechanisms to better fuse multimodal singing features.
Our previous work [27] focused on the construction and evaluation of the multimodal ALT system. This article extends it in the following aspects: (1) We propose a generalized problem setting for both ALT and AMT from multimodal singing voice, focusing on the audio and video modalities. (2) Based on the data collected in [27], we curate a new dataset named N20EMv2 with annotations tailored for AMT. (3) We propose a novel adaptation strategy for AMT. (4) We conduct extensive experiments with single-modal and multimodal AMT systems. (5) We include more comparison experiments and ablation studies to demonstrate the effectiveness of our methods.
4 Methodology
4.1 Problem Formulation
We consider a general setting for both ALT and AMT from singing. Specifically, given synchronized singing recordings from multiple modalities (in this work, we consider the audio and video modalities, \(\mathbf {x}^{A}\) and \(\mathbf {x}^{V}\); our framework can be seamlessly extended to scenarios with more modalities), the ALT target is a sequence of tokens \(\mathbf {y}^{L}=\lbrace y_1^{L}, y_2^{L}, \ldots , y_{N_1}^{L}\rbrace , y_n^{L}\in \mathbb {V}\), where \(N_1\) is the length of the output sequence and \(\mathbb {V}\) represents the vocabulary comprising all possible tokens. Since lyrics belong to the textual modality, various tokenizers, such as characters, words, subwords, or phonemes, can be used to represent tokens. In this work, we use a character tokenizer; the vocabulary then contains 26 English letters and four special characters (beginning of sentence \(\lt \text{bos}\gt\), end of sentence \(\lt \text{eos}\gt\), quotation \(\lt ^{\prime }\gt\), and word boundary \(\lt \quad \gt\)). AMT aims to produce a sequence of note events \(\mathbf {y}^{M}=[(o_1, f_1, p_1),(o_2, f_2, p_2), \ldots ,(o_{N_2}, f_{N_2}, p_{N_2})]\), where \(o_n\) and \(f_n\) are the onset and offset times of the \(n\)th note, \(0\le o_1\lt f_1\le o_2\lt f_2\le \ldots \le o_{N_2}\lt f_{N_2}\), \(p_n\) is the note pitch value, and \(N_2\) is the number of note events. Consequently, the multimodal ALT system is a function that maps \(\mathbf {x}^{A}\) and \(\mathbf {x}^{V}\) into \(\mathbf {y}^{L}\), while the multimodal AMT system is a function that maps \(\mathbf {x}^{A}\) and \(\mathbf {x}^{V}\) into \(\mathbf {y}^{M}\).
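To make the input–output contract concrete, the following minimal Python sketch (ours, not from the paper's codebase) illustrates one way to represent the ALT target \(\mathbf {y}^{L}\) and the AMT target \(\mathbf {y}^{M}\); the helper names and example values are hypothetical.

```python
# Minimal sketch (not the paper's code) of the two target representations.
from dataclasses import dataclass
from typing import List

# 4 special characters + 26 English letters = 30 tokens, as described in the text.
CHAR_VOCAB = ["<bos>", "<eos>", "'", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]
assert len(CHAR_VOCAB) == 30

@dataclass
class NoteEvent:
    onset: float   # o_n, seconds
    offset: float  # f_n, seconds
    pitch: int     # p_n, MIDI note number (C2 = 36 ... B5 = 83)

def encode_lyrics(lyrics: str) -> List[int]:
    """Map a lyric line to y^L, a sequence of character token ids (illustrative)."""
    tokens = ["<bos>"] + list(lyrics.lower()) + ["<eos>"]
    return [CHAR_VOCAB.index(t) for t in tokens if t in CHAR_VOCAB]

# Example targets for a short phrase sung on two notes (dummy values).
y_L = encode_lyrics("hello world")
y_M = [NoteEvent(0.10, 0.55, 60), NoteEvent(0.60, 1.20, 62)]
```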
As presented in Figure 2, each system consists of a feature representation learning frontend and a task-specific backend. Initially, modality-specific encoders \(\phi ^{A}\) and \(\phi ^{V}\) are employed to extract feature representations for each input modality. The modality feature fusion module \(\psi\) first aligns the features from different modalities so that they have the same number of frames and the same dimensions. Afterward, \(\psi\) projects the features from different modalities into a shared latent space and integrates them to obtain more informative representations. Finally, the task-specific backends \(\theta ^{L}\) and \(\theta ^{M}\) transform the fused representations into lyrics and note events, respectively.
Considering that the lengths of the input and output sequences do not have a fixed relationship, we formulate multimodal ALT and multimodal AMT as two sequence-to-sequence (S2S) problems. While the two systems share the same encoder architectures (but not their parameter weights), they are trained separately. It is worth noting that (1) our systems can accommodate a single input modality or multiple input modalities and (2) our systems can be extended to output both lyrics and note events simultaneously. We direct readers to Section 6 for further discussion.
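The overall composition \(\theta \big(\psi (\phi ^{A}(\mathbf {x}^{A}), \phi ^{V}(\mathbf {x}^{V}))\big)\) can be sketched as follows; the module classes are schematic stand-ins of our own, not the actual wav2vec 2.0 / AV-HuBERT implementations.

```python
import torch
import torch.nn as nn

class MultimodalTranscriber(nn.Module):
    """Schematic frontend/fusion/backend composition; the real encoders are
    wav2vec 2.0 (audio) and AV-HuBERT (video)."""
    def __init__(self, audio_encoder: nn.Module, video_encoder: nn.Module,
                 fusion: nn.Module, backend: nn.Module):
        super().__init__()
        self.phi_a, self.phi_v = audio_encoder, video_encoder
        self.psi, self.theta = fusion, backend

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        z_a = self.phi_a(x_a)      # (B, T, 1024) acoustic features, ~50 Hz
        z_v = self.phi_v(x_v)      # (B, T', 1024) visual features
        z = self.psi(z_a, z_v)     # fused features aligned to (B, T, 1024)
        return self.theta(z)       # lyric token logits or frame-level note targets
```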
4.2 Modality-specific Encoders
The audio encoder \(\phi ^{A}\) is designed to learn acoustic representations for the audio modality. We propose adapting SSL models, specifically wav2vec 2.0 LARGE [3], from the speech domain to the singing domain. The rationale behind this choice is that SSL models pretrained on abundant speech data exhibit strong generalization capabilities even when provided with low-resource labeled data in new domains. wav2vec 2.0 consists of a CNN-based feature encoder and a Transformer-based context network. The feature encoder has seven temporal 1D convolutional blocks; it takes the raw waveform of the singing audio and produces latent singing representations, which are then fed into the context network. By capturing global temporal information, the context network transforms the latent singing representations into contextual singing representations. The resulting output \({\bf z}^{A}\) has a frame rate of approximately 49.8 Hz (equivalent to a frame length of about 20 ms), with each frame having 1,024 dimensions.
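As an illustration of the acoustic frontend (ours, using the HuggingFace interface rather than the paper's training code; the checkpoint name is an assumption), the ~49.8 Hz, 1,024-dimensional features can be extracted as follows.

```python
# Sketch: extracting ~50 Hz, 1,024-dim acoustic features with a pretrained wav2vec 2.0 LARGE.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-large-lv60"  # assumed LARGE checkpoint, not necessarily the paper's
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt).eval()

waveform = torch.randn(16000 * 5)  # 5 s of 16 kHz singing audio (dummy)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    z_a = model(inputs["input_values"]).last_hidden_state  # (1, 249, 1024)

# The CNN feature encoder strides by 320 samples (about 20 ms at 16 kHz),
# so 5 s of audio yields 249 frames, i.e., roughly 49.8 frames per second.
print(z_a.shape, z_a.shape[1] / 5.0)
```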
The video encoder \(\phi ^{V}\) is designed to learn visual representations of singing from videos of lip movements. We adopt AV-HuBERT LARGE [67], one of the SOTA approaches for lipreading. Similar to wav2vec 2.0, AV-HuBERT consists of a CNN-based image encoder and a Transformer-based encoder. The image encoder is constructed from a 3D convolutional frontend followed by a modified ResNet-18 block [31]; it extracts latent visual representations, which can be regarded as embeddings of the video frames. The Transformer encoder then operates on these video embeddings and captures contextual visual representations by considering the relationships among video frames over a large context. The frame rate of the final output \(\mathbf {z}^{V}\) matches that of the input video clips, with each frame having 1,024 dimensions. In the original AV-HuBERT setup, the input video frame rate is 25 Hz. For ALT, we retain this frame rate given the task's similarity to ASR. However, transcribing note events requires higher temporal resolution, so we use an input frame rate of 50 Hz for our AMT systems.
4.3 Modality Feature Fusion
The modality feature fusion module \(\psi\) aims to exploit the complementary relationships and redundancy present in the different modalities. Before fusing the acoustic representations \(\mathbf {z}^{A}\) and the visual representations \(\mathbf {z}^{V}\), we unify the frame rates to about 50 Hz and the frame dimensions to 1,024 where necessary. Specifically, we up-sample \(\mathbf {z}^V\) using nearest-neighbor interpolation with a scale factor of 2. Afterward, we introduce a new attention module called RCA for fusing the unified features, as illustrated in Figure 3. RCA is built upon the Transformer block architecture, and its illustration can be found in Appendix D. There are \(M\) RCA blocks when considering \(M\) input modalities. Every RCA block takes input representations from all modalities. Within each block, one modality is designated as the source, providing keys and values, while the remaining modalities serve as references, providing queries. In addition to the multi-head self-attention (MHSA) [74] operation applied to the source modality, each RCA block adds extra shortcuts by performing the multi-head cross-attention (MHCA) operation between the source and each reference. The outputs of all RCA blocks are then aggregated to yield the final fused features \(\mathbf {z}\). RCA can thus be expressed in terms of these MHSA and MHCA operations combined with layer normalization (\(\text{LN}\)) and a position-wise feed-forward network (\(\text{FFN}\)).
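As a concrete, simplified illustration, the following PyTorch sketch implements one plausible RCA block consistent with the description above (residual MHSA on the source plus MHCA shortcuts from each reference, followed by LN and an FFN). The normalization placement and the summation used to aggregate the block outputs are our assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCABlock(nn.Module):
    """One residual cross-attention block: the source modality attends to itself (MHSA)
    and receives cross-attention shortcuts from every reference modality (MHCA)."""
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, src: torch.Tensor, refs: list) -> torch.Tensor:
        h = src + self.mhsa(src, src, src)[0]        # residual self-attention on the source
        for ref in refs:                             # cross-attention shortcuts
            h = h + self.mhca(ref, src, src)[0]      # queries from reference, keys/values from source
        h = self.ln1(h)
        return self.ln2(h + self.ffn(h))

# Fuse audio (~50 Hz) and video (25 Hz) features: upsample video, one block per modality, sum.
z_a = torch.randn(2, 250, 1024)                      # (B, T, D) acoustic features
z_v = torch.randn(2, 125, 1024)                      # (B, T/2, D) visual features
z_v = F.interpolate(z_v.transpose(1, 2), scale_factor=2, mode="nearest").transpose(1, 2)

block_a, block_v = RCABlock(), RCABlock()
z = block_a(z_a, [z_v]) + block_v(z_v, [z_a])        # aggregated fused features z
```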
4.4 Automatic Lyric Transcription Backend
For our ALT systems, we design a hybrid CTC-Attention backend to address the S2S problem, inspired by [77], as presented in Figure 4(a). Initially, the ground-truth lyrics are converted into a sequence of tokens \(\mathbf {y}^{L}=\lbrace y_1^{L}, y_2^{L}, \ldots , y_{N_1}^{L}\rbrace , y_n^{L}\in \mathbb {V}\), where \(\mathbb {V}\) represents the character vocabulary comprising 30 tokens. The ALT backend \(\theta ^{L}\) aims to predict \(p(\mathbf {y}^{L}|\mathbf {z})\) and consists of a two-layer MLP, a CTC linear layer, and an S2S decoder. First, the MLP with 1,024 hidden neurons further encodes the fused features \(\mathbf {z}\) into \(\mathbf {e}\in \mathbb {R}^{T\times 1024}\), where \(T\) denotes the number of frames. Subsequently, two network branches compute \(p(\mathbf {y}^{L}|\mathbf {z})\), or equivalently \(p(\mathbf {y}^{L}|\mathbf {e})\).
The first branch is a CTC linear layer, which maps \(\mathbf {e}\) to per-frame output probabilities \(p_{\text{CTC}}(\pi _t|e_t), \pi _t\in \mathbb {V}\cup \lbrace \lt \text{blank}\gt \rbrace , t=1,2, \ldots ,T\), where \(\lt \text{blank}\gt\) is the blank token. In CTC, each frame's prediction is considered independent, so the probability of an alignment \(\pi _{1:T}\) is \(p(\pi _{1:T}|\mathbf {e})=\prod _{t=1}^Tp(\pi _t|e_t)\). The final prediction for the output sequence \(\mathbf {y}^{L}\) is derived from the alignment \(\pi _{1:T}\) by eliminating repeated tokens and \(\lt \text{blank}\gt\) tokens; this operation is denoted \(\mathcal {B}\). To supervise the CTC predictions, the ground-truth labels must be converted into all possible CTC alignments. We use \(\mathcal {B}^{-1}(\mathbf {y}^{L})\) to represent all CTC paths mapped from \(\mathbf {y}^{L}\), so that \(p(\mathbf {y}^{L}|\mathbf {e})=\sum _{\pi _{1:T}\in \mathcal {B}^{-1}(\mathbf {y}^{L})}p(\pi _{1:T}|\mathbf {e})\). Therefore, the CTC loss is written as
\[
\mathcal {L}_{\text{CTC}} = -\log p(\mathbf {y}^{L}|\mathbf {e}) = -\log \sum _{\pi _{1:T}\in \mathcal {B}^{-1}(\mathbf {y}^{L})}\prod _{t=1}^{T}p_{\text{CTC}}(\pi _t|e_t).
\]
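As a concrete illustration (ours, using PyTorch's built-in CTC loss rather than the paper's implementation), the CTC branch can be supervised as follows; shapes and the blank-token index are conventions we choose for the sketch.

```python
import torch
import torch.nn as nn

vocab_size = 30                        # character vocabulary
num_classes = vocab_size + 1           # + <blank>
ctc_linear = nn.Linear(1024, num_classes)
ctc_loss = nn.CTCLoss(blank=vocab_size, zero_infinity=True)  # blank id is our convention

e = torch.randn(4, 500, 1024)          # (B, T, 1024) encoded fused features
log_probs = ctc_linear(e).log_softmax(dim=-1).transpose(0, 1)  # (T, B, C), as CTCLoss expects

targets = torch.randint(0, vocab_size, (4, 60))                # dummy token ids y^L
input_lengths = torch.full((4,), 500, dtype=torch.long)
target_lengths = torch.full((4,), 60, dtype=torch.long)

loss_ctc = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```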
The second branch is parameterized by a location-aware attention-based GRU decoder [6]. In contrast to the CTC formulation, the S2S formulation does not assume independence among predictions. Instead, it directly computes \(p(\mathbf {y}^{L}|\mathbf {e})=\prod _{n=1}^{N_1}p(y_n^{L}|y_{1:n-1}^{L},\mathbf {e})\) following the chain rule. To predict each target token \(y_n^{L}\), the S2S decoder takes the previously predicted tokens \(y_{1:n-1}^{L}\) as input and utilizes a location-aware attention mechanism to derive a contextually weighted \(\mathbf {e}\). This attention mechanism enables the model to focus on the parts of \(\mathbf {e}\) that are relevant for predicting the current token \(y_n^{L}\). The S2S loss is then written as
\[
\mathcal {L}_{\text{S2S}} = -\log p(\mathbf {y}^{L}|\mathbf {e}) = -\sum _{n=1}^{N_1}\log p(y_n^{L}|y_{1:n-1}^{L},\mathbf {e}).
\]
As we employ a hybrid system, the overall loss function is a weighted sum of the two aforementioned loss terms: \(\mathcal {L}^{L}=(1-\lambda)\mathcal {L}_{\text{S2S}} + \lambda \mathcal {L}_{\text{CTC}}\). To balance the two losses, we set \(\lambda =0.2\) in this work.
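Continuing the illustrative variables above, the hybrid objective is a one-line weighted sum (the S2S term below is a placeholder standing in for the attention decoder's negative log-likelihood):

```python
import torch

lam = 0.2                                   # lambda from the text
loss_ctc = torch.tensor(2.3)                # placeholder values for illustration
loss_s2s = torch.tensor(1.7)
loss_alt = (1 - lam) * loss_s2s + lam * loss_ctc   # L^L = (1 - lambda) L_S2S + lambda L_CTC
```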
During inference, in addition to the hybrid CTC-Attention structure described above, we leverage a character-level LSTM LM. We predict the most likely lyrics via beam search over a weighted combination of the three log-probability terms (S2S, CTC, and LM), where the hyper-parameters \(\alpha\) and \(\beta\) balance the three terms. We set the beam size to 512. To evaluate the performance of our ALT systems, we report the word error rate (WER), a widely used metric for this task.
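For reference, WER can be computed with an off-the-shelf package such as jiwer; this is our illustration, and the paper does not specify which WER implementation it uses.

```python
from jiwer import wer

reference = "you raise me up so i can stand on mountains"
hypothesis = "you raise me up so i can stand on mountain"
print(wer(reference, hypothesis))  # 1 substitution over 10 reference words -> 0.1
```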
4.5 Automatic Music Transcription Backend
For our AMT systems, we reformulate the S2S problem as a frame-level classification problem, inspired by [75]. The ground-truth note events \(\mathbf {y}^{M}=[(o_1, f_1, p_1),(o_2, f_2, p_2), \ldots ,(o_{N_2}, f_{N_2}, p_{N_2})]\) are transformed into frame-level onset/silence/pitch-name/octave targets, represented as \(\mathbf {w}^{1}, \mathbf {w}^{2}, \mathbf {w}^{3}, \mathbf {w}^{4}\). This transformation enables us to classify each frame of the fused features \(\mathbf {z}\in \mathbb {R}^{T\times 1024}\) into the corresponding labels, as visualized in Figure 4(b). Since directly predicting offsets is challenging, our AMT backend predicts silence instead, and the offsets \(f_1, f_2, \ldots , f_{N_2}\) are determined as the beginnings of silence frames. We use a pitch name and an octave to denote each note pitch.
To construct \(\mathbf {w}^1\), frames covering the onsets \(o_1, o_2,\ldots ,o_{N_2}\) are labeled as 1, while other frames are labeled as 0. Similarly, silence frames are assigned a label of 1 in \(\mathbf {w}^{2}\), while other frames are assigned a label of 0. As a result, we can use binary values to indicate the state of each frame in \(\mathbf {w}^{1}, \mathbf {w}^{2}\). In conventional practice, pitch values \(p_1, p_2,\ldots ,p_{N_2}\) are represented as MIDI note numbers ranging from C2 (MIDI number 36, 65.41 Hz) to B5 (MIDI number 83, 987.77 Hz). Here "B" and "C" are the pitch names, while "2" and "5" are the octaves. According to music theory, there are 12 notes (\(C, D\flat , D, E\flat , E, F, G\flat , G, A\flat , A, B\flat , B\)) in each octave. We consider a pitch range from C2 to B5, resulting in a total of four octaves. Additionally, we introduce an octave class and a pitch name class to represent silence. Consequently, each frame of \(\mathbf {w}^{3}\) has 13 possible values, and each frame of \(\mathbf {w}^{4}\) has 5 possible values. During inference, the frame-level predictions are transformed back into note events. Note that the transformation between note events and frame-level targets introduces temporal quantization errors; therefore, the frame resolution significantly impacts the AMT accuracy.
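The conversion from note events to the four frame-level targets can be sketched as follows (our illustration; the silence class indices, rounding scheme, and 50 Hz frame rate are assumptions consistent with the text).

```python
import numpy as np

FPS = 50                       # ~20 ms frames, matching the encoder frame rate
LOW_MIDI, HIGH_MIDI = 36, 83   # C2 ... B5

def notes_to_frame_targets(notes, num_frames):
    """Convert note events (onset, offset, midi_pitch) into w^1..w^4 frame-level targets."""
    w_onset = np.zeros(num_frames, dtype=np.float32)
    w_silence = np.ones(num_frames, dtype=np.float32)
    w_name = np.full(num_frames, 12, dtype=np.int64)    # class 12 = silence pitch name
    w_octave = np.full(num_frames, 4, dtype=np.int64)   # class 4 = silence octave
    for onset, offset, pitch in notes:
        on, off = int(round(onset * FPS)), int(round(offset * FPS))
        w_onset[on] = 1.0
        w_silence[on:off] = 0.0
        w_name[on:off] = (pitch - LOW_MIDI) % 12
        w_octave[on:off] = (pitch - LOW_MIDI) // 12
    return w_onset, w_silence, w_name, w_octave

# Two dummy notes over a 2 s (100-frame) excerpt.
w1, w2, w3, w4 = notes_to_frame_targets([(0.10, 0.55, 60), (0.60, 1.20, 62)], num_frames=100)
```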
The AMT backend \(\theta ^{M}\) consists of a linear layer with 20 output neurons, allocating 1, 1, 13, and 5 neurons to \(\mathbf {w}^{1}, \mathbf {w}^{2}, \mathbf {w}^{3}, \mathbf {w}^{4}\), respectively. The output probabilities can be expressed as \(p(\mathbf {w}^{i}|\mathbf {z})=\prod _{t=1}^T p(w_t^{i}|z_t), i=1,2,3,4\). To train the AMT system, we combine the loss terms for the four targets, where we employ the binary cross-entropy (BCE) loss for targets \(\mathbf {w}^{1}, \mathbf {w}^{2}\) and the cross-entropy loss for targets \(\mathbf {w}^{3}, \mathbf {w}^{4}\). Notably, we set a positive weight of 15.0 in the BCE loss for onset prediction to compensate for the imbalanced distribution in \(\mathbf {w}^{1}\).
In Figure 4(b), we provide a visualization of the post-processing step that converts the predictions for \(\mathbf {w}^1, \mathbf {w}^2, \mathbf {w}^3, \mathbf {w}^4\) into note events; the details are given in Appendix B. At a high level, we first identify pairs of onsets and offsets and then identify the pitch between them. Unless otherwise stated, we use a fixed onset threshold of 0.4 and an offset threshold of 0.5. AMT systems are typically evaluated using the F1-scores of COnPOff (correct onset, pitch, and offset), COnP (correct onset and pitch), and COn (correct onset); their definitions and implementations can be found in [56, 63]. To ensure fair comparisons with previous approaches, such as [19, 33, 41, 46, 75], we set the pitch tolerance to 50 cents, the onset tolerance to 50 ms, and the offset tolerance to the maximum of 50 ms and \(0.2\times\) the note duration. Additionally, we use the F1-score of the COff (correct offset) metric to evaluate the performance of offset detection.
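The following sketch shows a simplified version of this decoding step; it is our approximation of the procedure detailed in the paper's Appendix B (e.g., the majority-vote pitch assignment is an assumption).

```python
import numpy as np

def frames_to_notes(p_onset, p_silence, name_pred, octave_pred,
                    onset_thr=0.4, offset_thr=0.5, fps=50):
    """Simplified decode: an onset peak opens a note, the next silence frame closes it,
    and the note pitch is the majority pitch class/octave in between."""
    notes = []
    onset_frames = [t for t in range(len(p_onset))
                    if p_onset[t] >= onset_thr and (t == 0 or p_onset[t - 1] < onset_thr)]
    for i, on in enumerate(onset_frames):
        end = onset_frames[i + 1] if i + 1 < len(onset_frames) else len(p_onset)
        off = next((t for t in range(on + 1, end) if p_silence[t] >= offset_thr), end)
        if off <= on:
            continue
        name = np.bincount(name_pred[on:off], minlength=13).argmax()
        octave = np.bincount(octave_pred[on:off], minlength=5).argmax()
        if name < 12 and octave < 4:                  # skip segments voted as silence
            notes.append((on / fps, off / fps, 36 + 12 * octave + name))
    return notes
```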
4.6 Training Strategy
We developed several training strategies for our multimodal ALT and AMT systems to address the following challenges. One key challenge is adapting SSL models from the speech domain to the singing domain. In our approach, we utilize SSL models, namely wav2vec 2.0 [3] as the audio encoder and AV-HuBERT [67] as the video encoder. Originally, these models are pretrained on unlabeled speech data with SSL objectives and then finetuned on labeled speech data with ASR objectives. As mentioned before, these SSL models have demonstrated the ability to generalize well to new domains, even in low-resource labeled scenarios, which can be attributed to their unsupervised learning on rich speech data. Given the similarities between speech and singing data, we hypothesize that these SSL models can also generalize effectively to our setting. For the ALT task, we initialize our audio and video encoders with the SSL models pretrained and finetuned on speech data. This choice is motivated by the fact that ALT and ASR are analogous tasks with similar input-output pairs, so we expect both the pretraining and the finetuning on speech data to benefit the ALT task. However, the targets of the AMT task are note events rather than text as in ALT and ASR. Hence, a question arises regarding the adaptation of the SSL models: will finetuning on speech data be advantageous for the AMT task?
Inspired by [42], we speculate that finetuning on speech data may distort the pretrained features of SSL models and bias them toward ASR, thus hindering their generalization to AMT. To address this concern, we propose a new adaptation strategy specifically tailored to the AMT task: we skip the finetuning step on speech data with ASR objectives and instead conduct linear probing of the AMT backend \(\theta ^{M}\), followed by full finetuning of the entire system. To further compare the two adaptation strategies, we outline the training pipelines for the single-modal ALT system and the single-modal AMT system in Algorithm 1 and Algorithm 2, respectively (for a single-modal system, the feature fusion module \(\psi\) can be omitted). Typically, we use a learning rate \(\gamma _2\) smaller than \(\gamma _1\) to preserve the pretrained features of the modality-specific encoders.
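A minimal sketch of this adaptation strategy is shown below; the module classes, optimizer choice, and learning-rate values are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(1024, 1024)        # stands in for the SSL-pretrained encoder
backend = nn.Linear(1024, 20)          # stands in for the AMT backend theta^M
gamma_1, gamma_2 = 1e-3, 1e-5          # assumed values; gamma_2 < gamma_1, as in the text

# Stage 1: linear probing -- freeze the encoder, train only the backend with gamma_1.
for p in encoder.parameters():
    p.requires_grad = False
probe_optimizer = torch.optim.Adam(backend.parameters(), lr=gamma_1)

# Stage 2: full finetuning -- unfreeze the encoder and train the whole system with the
# smaller rate gamma_2 to preserve the pretrained features.
for p in encoder.parameters():
    p.requires_grad = True
finetune_optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(backend.parameters()), lr=gamma_2
)
```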
Both wav2vec 2.0 and AV-HuBERT in our multimodal systems are large-scale models. Consequently, to mitigate the high GPU memory demands, we propose a two-stage training approach similar to [61]. In the first stage, we train the single-modal systems independently, each consisting of a modality-specific encoder and a task-specific backend. In the second stage, we freeze the modality-specific encoders and train only the feature fusion module and the task-specific backend. In this way, we avoid loading and updating all model weights simultaneously while taking advantage of the powerful singing representations learned by the single-modal systems. For more details, we refer readers to Appendix B.
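The second stage can be sketched as follows; the module classes and learning rate are illustrative stand-ins (ours), not the actual encoders, fusion module, or hyper-parameters.

```python
import torch
import torch.nn as nn

# Stage 2: the single-modal encoders are frozen; only fusion and backend are optimized.
audio_encoder = nn.Linear(1024, 1024)   # trained in stage 1, now frozen
video_encoder = nn.Linear(1024, 1024)   # trained in stage 1, now frozen
fusion = nn.Linear(2048, 1024)          # stands in for the RCA fusion module psi
backend = nn.Linear(1024, 20)           # stands in for theta^L or theta^M

for module in (audio_encoder, video_encoder):
    module.eval()
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    list(fusion.parameters()) + list(backend.parameters()), lr=1e-4  # lr is an assumption
)
```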
7 Conclusion
In this work, we proposed a unified multimodal framework for transcribing lyrics and note events from singing voices. To develop our systems, we carefully curated the multimodal singing ALT dataset N20EMv1 and the multimodal singing AMT dataset N20EMv2. We then adapted SSL models from the speech domain to the singing domain as acoustic encoders, yielding SOTA performance. Additionally, we adapted SSL models originally used for lipreading to serve as visual encoders, allowing us to introduce two novel tasks: lyric lipreading and note lipreading. Our results demonstrated that the video modality can contribute significantly to both ALT and AMT, despite the inherent challenges posed by ambiguity. Finally, we introduced RCA, a new feature fusion method, to fuse features from different modalities and produce the final transcription. Through comprehensive experiments, we demonstrated the advantages of incorporating additional modalities, which led to improved transcription performance and enhanced robustness against sound contamination and perturbations.