LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation

Longshen Ou, Xichu Ma, and Ye Wang
School of Computing, National University of Singapore, 21 Lower Kent Ridge Road, Singapore
CONTACT Ye Wang. Email: wangye@comp.nus.edu.sg
Abstract

Despite previous efforts in melody-to-lyric generation research, there is still a significant singability gap between lyrics written by machines and human lyricists. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training (LOAF-M2L). After general-domain pretraining, our model acquires length awareness using unsupervised learning from a large text-only lyric corpus. Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric supervised training, which enables the model to learn the fine-grained format requirements of the melody. Our model achieves 3.75% and 21.44% absolute accuracy gains in the outputs’ number-of-line and syllable-per-line requirements compared to naive fine-tuning, without sacrificing text fluency. Furthermore, our model demonstrates a 42.15% and 74.18% relative improvement in overall quality in the subjective evaluation, compared to two leading melody-to-lyric generation models, highlighting the significance of formatting learning in lyric generation. Code available at https://github.com/Sonata165/BART-M2L

keywords:
Lyric generation; lyric singability; melody-to-lyric generation; melody–lyric compatibility; sequence-to-sequence model; conditional text generation; prompt-based control; automatic composing
articletype: RESEARCH ARTICLE

1 Introduction

Integrating natural language processing in music has seen expansive applications (Sergio Oramas & Serra, 2018). Recently, automatic lyric generation has garnered increasing attention from academic and industrial sectors. In particular, the melody-to-lyric generation (M2L) task aims to produce lyrics that harmonize with the song, making the outcome performable and streamlining the music creation process. An efficient automated system would considerably decrease music creation and production costs (N. Liu et al., 2022).

Figure 1: Comparative analysis of original lyrics with versions generated by various models, all using the same melody for input.
Figure 1 alt text: The top row shows the original lyrics, noted for their optimal singability. Rows 2 to 4 feature lyrics produced by SongMASS, AI-Lyricist, and ChatGPT, respectively, highlighting issues like fragmented sentences, syllable-note mismatches, and inappropriate musical emphasis on less meaningful words, all detrimental to singability. The final row presents lyrics from our system, devoid of the aforementioned problems and demonstrating a high degree of singability.

Lyrics are crafted to be sung. The ability to be sung with its melody, which we term singability in this paper, is an essential property of lyrics. While there is no universal consensus on the definition of singability, we believe the definition adopted here represents a common intersection and is likely to be broadly accepted. Inspired by Low (2003), we believe that creating highly singable lyrics involves efforts from both internal and external aspects: (1) clarity in pronunciation, which means adopting a lyrical language style and avoiding complex consonant clusters and undersized vowels, and (2) compatibility with the melody, ensuring words maintain natural tones when sung and emphasizing crucial words musically. With the help of pre-trained language models, we can handle the first requirement by fine-tuning them with a small amount of lyric text to adjust them to a lyrical output style. However, ensuring the second requirement, music–lyric compatibility, is more challenging. This paper focuses on enhancing this compatibility in M2L models without considering other content constraints such as keywords or emotions.

Without carefully handling the relationship between the melody and lyrics, lyrics that look good on paper will likely be awkward when sung. For example, Fig. 1 shows the original and generated lyrics for a phrase in the song Free as a Bird. Notable disparities exist between the original (1st row) and generated lyrics (rows 2–4) from SongMASS (Sheng et al., 2021), AI-Lyricist (Ma et al., 2021), and GPT-4 (OpenAI, 2023). SongMASS’s output has evident grammatical flaws. While AI-Lyricist and ChatGPT offer more fluent outputs, they still struggle to align the lyrics with the melody. For lyrics in the red boxes, the performer has to extend the duration of the words to align them with musical notes, while for lyrics in the purple boxes, musical notes have to be broken into pieces to align with the syllables in the text. In both cases, the rhythm pattern is changed, deviating from the composer’s intention. Further, in the orange boxes, unimportant words receive undue emphasis because they are aligned with longer musical notes, resulting in a jarring effect when sung. Generally, these systems fall short of producing singable lyrics that align with melodies.

Compared with these results, the lyric in the final row stands out in several respects. It is not only more fluent than the output from SongMASS, thanks to its grammatical accuracy and continuity, but it also aligns well with the melody. The number of syllables matches the number of musical notes, and their synchronization produces a natural pairing. A case in point is the word ‘alright’, where the emphasis on the second syllable corresponds with the note of the longest duration in its segment, which creates a close coupling and contributes to singability. These properties make it a substantially more performable version of the lyrics.

This paper proposes an innovative methodology to generate singable lyrics from melodies, aiming to bridge the quality disparities above. Our contributions are:

  • We underscore the unique demands of M2L in contrast to lyric-style text generation. Melodies inherently carry specific constraints regarding text structure, phonology, and syntax that guide lyric creation. Our experiments demonstrate that overlooking these specifications can produce grammatically correct lyrics that do not fit the melody well. Additionally, we are the first to use computable metrics to quantify the degree of fine-level compatibility between melody and lyrics.

  • We devise a hybrid training strategy for our M2L model, blending both unsupervised and supervised learning stages. We found that the sparsity of paired data limits the model’s length awareness ability, yet this can be effectively compensated through self-supervised training with a focus on length-conditioned text generation. Further, the supervised M2L training afterward helps the model identify and interpret the constraints imposed by a given melody. Combining the two stages enables the model to generate more singable lyrics.

  • We introduce a fresh training objective during the supervised phase, crafted to maximize the benefit of limited paired data. This approach leverages music–lyric pairs, decoding the implicit formatting guidelines inherent in melodies. This supplementary supervision allows our model to discern and apply the links between melodies and their corresponding constraints, thereby enhancing condition-aware generation.

  • Our model’s effectiveness is validated through both objective metrics and subjective evaluations. It crafts singable lyrics with an elevated level of music–lyric compatibility, surpassing earlier M2L models in text quality and alignment with music. Furthermore, it outperforms its unsupervised-learning-only counterparts regarding fluency, compatibility, and overall quality.

2 Related Work

2.1 Relationship between melody and lyrics

A defining feature that sets the M2L model apart from text-only lyric generation models is the conditions that guide its generation. For M2L, these constraints stem directly from the melody, influencing text structure, phonology, and syntax. While some guidelines can be flexible, human-composed lyrics generally adhere to them. Below are some key considerations.

Central to the constraints presented by a melody is the number of syllables per line. Essentially, lyrics need to be contained within a specific length range (Noske & Benton, 1988). Overloading the lyrics with syllables necessitates subdividing notes for each syllable, whereas fewer syllables demand extending some to match multiple notes. Both situations disrupt rhythm, making them unfavorable. Consequently, many studies have focused on incorporating number-of-syllable requirements into the lyric generation process (Lee et al., 2019; P. Li et al., 2020; Ma et al., 2021; N. Liu et al., 2022; Guo et al., 2022a).

Stressed syllable placement is also pivotal, especially in stress-timed languages like English (Low, 2003). To optimize singability, lyricists often align stressed syllables in the lyrics with the song’s pronounced notes. Recognizing this, several studies have explored methods to govern the positioning of stressed syllables to achieve rhythmic harmony (Ghazvininejad et al., 2018; Xue et al., 2021).

Achieving synchronicity between melody and lyrics is not just about syllable count and stress. As Nichols et al. (2009) note: (1) Syllable stress is tightly linked with metric position, melodic peaks, and note length. (2) Stopwords are strongly associated with metric position and melodic peaks. (3) Vowel duration often correlates with note length. These observations reveal a nuanced, intrinsic connection between melodies and lyrics.

Moreover, other factors come into play. For example, in tonal languages, there is an interaction between character tone and melody pitch (X. Zhang & Cross, 2021; Guo et al., 2022b). In pop music, bright vowels tend to occupy weak beat positions (Ammirante & Rovetti, 2021). For this paper, we focus on English and do not delve into specific genres or tonal languages.

The complexities do not end here. An essential observation is that the relationship is not strictly one-to-one; theoretically, an infinite number of melodies could fit a particular lyric, with the actual song melody being just one instance. Such loosely coupled associations between melodies and lyrics lead to data sparsity challenges. Given the limited melody–lyric parallel data, it can be challenging for a sequence-to-sequence model to accurately capture these seemingly loosely coupled properties between melody and lyrics. This issue urges us to consider whether there is a way to incorporate these inductive biases into the training process as guidance.

2.2 Melody-to-Lyric Generation

The challenge of melody-to-lyric (M2L) generation has seen various intriguing advancements. Watanabe et al. (2018) constructed a melody-conditioned lyric language model in Japanese using melody–lyric aligned data; Lee et al. (2019) tried to solve both M2L and its counterpart, the lyric-to-melody generation (L2M) problem, with Long Short-Term Memory models; Chen & Lerch (2020) developed a Seq-GAN-based model with theme and melody as conditions; Ma et al. (2021) added length, music structure, and keyword constraints to a Seq-GAN model; Sheng et al. (2021) employed a unified model to solve the M2L and L2M problems concurrently and showed the effectiveness of unsupervised masked pretraining; Qian et al. (2022) introduced a reconstruction loss in training the dual M2L and L2M model; J. Li et al. (2022) incorporated additional music-related information as input, such as beat and tempo.

The emergence of unsupervised training for M2L models has provided innovative approaches (Tian et al., 2023; Qian et al., 2023). Specifically, Tian et al. (2023) introduced an effective content control method, achieved by first planning keywords from the song’s theme and then expanding the keywords to a full line of lyrics. For format control of the generated lyrics, these works follow a common methodology: they first convert the melody into a rhythm pattern, which mainly describes the number-of-syllable requirement imposed by the melody, so that the generation model is aware of this length requirement during generation. This methodology again highlights the importance of the length condition in lyric generation and suggests that length awareness can be gained without paired data.

However, several gaps persist. Many models fall short of ensuring singability. As will be discussed in §5.2, leading M2L models often fail to align melody notes with lyric syllables satisfactorily (Ma et al., 2021; Sheng et al., 2021). While models trained with supervision could theoretically grasp the intricate correspondence between melody and lyrics, limited paired data hampers this. Furthermore, while unsupervised techniques can train models to recognize syllable counts, they sometimes oversimplify the melody’s demands. The interplay between syllable stress and melody, among other intricate compatibility requirements, has often been overlooked.

Additionally, while several innovative training strategies exist, an effective approach to help models discern the melody–lyric relationship in sparse data scenarios remains elusive. Although Qian et al. (2022) proposed a reconstruction loss method, its general applicability and diversity warrant further examination. After all, different melodies might enforce similar lyric constraints but not necessarily share resemblances.

2.3 Generate Lyrics with Other Inputs

Beyond M2L, other lyric generation models operate with diverse inputs and constraints. Some consider pre-defined lengths (Wu et al., 2019; P. Li et al., 2020; N. Liu et al., 2022), stress patterns (Barbieri et al., 2012; Xue et al., 2021), and rhyme schemes (Barbieri et al., 2012; Lingan, 2021; R. Zhang et al., 2022; P. Li et al., 2020; N. Liu et al., 2022). Others integrate keywords (Nikolov et al., 2020), melody emotions (Huang & You, 2021), styles (Lingan, 2021; Chang et al., 2021), structural patterns (Lu et al., 2019), textual passages (L. Zhang et al., 2022), and even musical accompaniments (Melistas et al., 2021; Watanabe & Goto, 2021). Advanced outputs, like songs with structural tags (Potash et al., 2015) or lyrics with hidden messages (Tong et al., 2019), have also been explored.

However, there are unresolved challenges in the current approaches. Firstly, the format constraints that contribute to music–lyric compatibility are not covered comprehensively. Specifically, fine-grained compatibility requirements, such as word importance and vowel duration, are not considered in previous works. Additionally, the current explicit stress pattern control mechanism is less than ideal. It requires users to specify a rhythmic pattern indicating which syllables should be stressed. However, when content conditions are coupled with a predefined output length, only a limited number of sentences can convey the intended meaning, so only a limited number of suitable prompts, unknown to the user, can lead to an output with the desired content and format. As a result, the stress patterns provided by the user are very likely to differ from the actual distribution of syllable stress positions in English (an extreme example: the user requires all syllables in the output to be weak syllables), which may lead to results that lack either naturalness or the desired properties. Furthermore, the challenge of deciphering syllable stress requirements from the melody remains. For instance, it is not always necessary for a long note to coincide with a stressed syllable. As evidenced in Section 5.1, only 82.45% of long notes align with stressed syllables in our corpus of human-created lyrics. Making an absolute assumption in this regard can make the outputs feel rigid.

2.4 Sequence-to-Sequence Denoising Pretraining

Given the limited paired melody–lyric data in our domain, a robust foundation model is essential for maintaining generation quality. We identified large-scale denoising sequence-to-sequence pretraining (Lewis et al., 2020) as well suited to our problem. This approach has proven effective in text generation tasks such as summarization (Akiyama et al., 2021) and translation (Y. Liu et al., 2020; Tang et al., 2020; Ou et al., 2023). Moreover, transferring models trained on general-domain data to a domain-specific task is a widely adopted strategy for improving performance with limited data (Gu et al., 2022; Ou et al., 2022).

2.5 Prompt-Based Methods

The paradigm of using prompt-based methods is gaining traction in NLP research (P. Liu et al., 2023). For conditional text generation tasks, leveraging prompts during fine-tuning has been shown to effectively dictate output attributes (Y. Liu et al., 2021; Grangier & Auli, 2018) and ensure the inclusion of specific lexicons (Susanto et al., 2020; Chousa & Morishita, 2021; Wang et al., 2022). Additionally, this method offers control over other dimensions like output length (Lakew et al., 2019) and the initial word of the output (Y. Li et al., 2022). Notably, in lyric generation, prompt-based techniques have been validated for controlling elements such as syllable count, stress patterns, and rhyme schemes (P. Li et al., 2020; Ma et al., 2021; Xue et al., 2021; Ormazabal et al., 2022; N. Liu et al., 2022). Given these advantages, we posit that prompt-based methods present a viable approach for length control in our specific context.

3 Methodology

We focus on improving the compatibility between lyrics and melody to bridge the singability gap. We initiate this by implementing length control at the paragraph level via a prompt-based approach. Subsequent steps involve aligning syllable stress, word importance, and vowel length by introducing specialized training objectives.

3.1 Length: the Basic Requirement for Compatibility

Achieving harmony between lyrics and melody fundamentally depends on synchronizing their lengths. When generating at the paragraph level, ‘length’ is twofold: the number of lines in a lyric paragraph should match the number of melody phrases, and each lyric line’s syllable count should correspond to the phrase’s musical note count.

In our approach, lyric length is strictly controlled, ensuring that each sentence in the generated lyrics exactly meets the desired syllable count. This design draws from prior format control attempts in NLP and adheres to the common practice. This strict control simplifies alignment between lyrics and musical notes, streamlining the process for more nuanced controls.

3.1.1 Prompt-Based Control

To control length during lyric generation, we employ a prompt-based fine-tuning approach. We devise a list of specialized tokens for each output paragraph, each signifying the requisite number of syllables for a given line. These tokens, in the format <len_i>, dictate the syllable count, while the list’s length corresponds to the number of lines. With the help of the CMU pronunciation dictionary (Carnegie Mellon University, 2022), we obtain the mapping between words and their syllable counts. In addition, we use an auxiliary special token <b> to represent sentence breaks. During training, these length tokens are integrated as supplementary input to enhance the model’s awareness of length constraints.
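As a minimal sketch of how such a length prompt could be assembled, the snippet below counts syllables from ARPAbet pronunciations and emits one <len_i> token per line. The tiny `TOY_PRONUNCIATIONS` table is a hypothetical stand-in for the CMU dictionary, the function names are ours, and placing <b> between lines is an assumption about the token's exact use.

```python
# Hypothetical miniature stand-in for the CMU pronouncing dictionary:
# each word maps to ARPAbet phonemes, and vowels carry a stress digit.
TOY_PRONUNCIATIONS = {
    "free": ["F", "R", "IY1"],
    "as": ["AE1", "Z"],
    "a": ["AH0"],
    "bird": ["B", "ER1", "D"],
    "it": ["IH1", "T"],
    "is": ["IH1", "Z"],
    "alright": ["AO2", "L", "R", "AY1", "T"],
}


def count_syllables(word):
    """Count syllables as the number of phonemes ending in a stress digit."""
    phonemes = TOY_PRONUNCIATIONS[word.lower()]
    return sum(1 for p in phonemes if p[-1].isdigit())


def build_length_prompt(lines):
    """Return one <len_i> token per lyric line, where i is its syllable count."""
    prompt = []
    for line in lines:
        total = sum(count_syllables(w) for w in line.split())
        prompt.append(f"<len_{total}>")
    return prompt


def join_lines(lines):
    """Join lyric lines with the <b> sentence-break token (assumed placement)."""
    return " <b> ".join(lines)


lyric = ["free as a bird", "it is alright"]
print(build_length_prompt(lyric))  # one token per line
print(join_lines(lyric))
```

The number of tokens in the prompt encodes the number of lines, and each token encodes a per-line syllable count, so a single list carries both length requirements.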

Our foundational model, chosen for its pre-training on a large general-domain corpus, undergoes further fine-tuning to heighten its sensitivity to length. We selected BART (Lewis et al., 2020) for its proficiency in sequence-to-sequence text generation. We expanded the BART model’s original vocabulary to incorporate the length prompt tokens. This required enlarging both the tokenizer’s vocabulary and the BART model’s embedding layer. Consequently, our length tokens are mapped to vectors matching the BART model’s hidden dimension. During the M2L supervised training, the list of length prompts is concatenated with the musical note information, hinting at the output length. In this way, we expect length awareness to be acquired jointly with M2L training.

3.1.2 Length-Aware Unsupervised Training

Our experiments found that the quantity of our paired music–lyric data is insufficient for the model to learn length awareness. We therefore leverage large-scale text-only lyric data, which is far easier to obtain than paired data. We add a training phase before the supervised M2L training, dedicated to fostering length awareness using this text-only data. This approach confers two benefits: it helps the model better understand the meaning of the length tokens, and it adapts the output style to the lyric domain.

Figure 2: Illustration of the unsupervised training for length-awareness. The $\otimes$ symbol refers to concatenation.
Figure 2 alt text: Syllable counts per sentence, derived using the CMU dictionary, and partially masked lyrics, are concatenated into a single input sequence. This sequence is fed into our Transformer-based encoder-decoder model, which is trained to reconstruct the original, uncorrupted lyrics utilizing the provided sentence lengths and corrupted text.

We integrate a masking approach in this phase by drawing inspiration from BART’s pre-training methodology. As in Figure 2, arbitrarily selected segments of the input sequence are concealed using a special token, posing the challenge for the model to reconstruct the masked portion. Meanwhile, the previously discussed length tokens are placed at the beginning of the masked input. This arrangement instructs the model on the anticipated sentence length throughout the generation process.
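The construction of one training pair for this phase can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it masks a single contiguous span with one `<mask>` token, and the mask ratio, seed handling, and function name are assumptions.

```python
import random


def corrupt_lyrics(lines, syllable_counts, mask_ratio=0.3, seed=0):
    """Build a (source, target) pair for length-aware denoising.

    The target is the clean lyric paragraph. The source is the list of
    length tokens followed by the paragraph with one span replaced by a
    single <mask> token (a simplification of BART-style text infilling).
    """
    rng = random.Random(seed)
    target = " <b> ".join(lines)
    tokens = target.split()

    # Choose one contiguous span to hide; mask_ratio is an assumed setting.
    n_mask = max(1, int(len(tokens) * mask_ratio))
    start = rng.randrange(0, len(tokens) - n_mask + 1)
    corrupted = tokens[:start] + ["<mask>"] + tokens[start + n_mask:]

    # Length tokens are placed at the beginning of the masked input.
    length_tokens = [f"<len_{n}>" for n in syllable_counts]
    source = " ".join(length_tokens + corrupted)
    return source, target


source, target = corrupt_lyrics(["free as a bird", "it is alright"], [4, 4])
print(source)
print(target)
```

Because the length tokens are never masked, the model can always consult the required syllable counts while reconstructing the hidden span.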

3.2 Finer-Level Compatibility

3.2.1 Overall Principle

Beyond the foundational requisites for compatibility, our focus shifts toward the nuanced interplay between lyrics and melody. Drawing inspiration from insights presented by Nichols et al. (2009), our methodological design embraces two guiding principles: (1) We aspire to synchronize important musical notes with important syllables. The prominence of notes arises from their metric position, duration, and melodic peak. Concurrently, syllabic importance stems from stress levels and the overall significance of the encompassing word. (2) We aim to align longer musical notes with longer vowels. These foundational principles have informed both our methodological approach and compatibility metrics.

3.2.2 Joint Learning of Output Formatting

While unsupervised learning proves potent for length-awareness, supervised learning is more apt for this segment. Given the intricate and fluid rules governing the link between lyrics and melody, which resist easy linguistic or statistical descriptions, we entrust the model with autonomous learning. For achieving finer compatibility, we opt for an end-to-end melody-to-lyric fine-tuning approach for the general-domain model, after length-aware unsupervised learning.

However, the constraints posed by data limitations and sparsity challenge the model’s ability to discern the lyric-melody relationship via supervised learning. As elaborated in Section 5.1, a supervised training phase confined to a language model training objective and paired data does not markedly amplify the compatibility between lyrics and melody in the outputs, compared with a solely unsupervised learning approach.

To foster finer compatibility, we consider embedding the inductive biases outlined earlier into our model. Recognizing the pivotal roles played by the placement of stressed syllables, vital words, and varied vowel lengths in music–lyric compatibility, our model is devised to predict the lyrical pattern necessitated by the input melody explicitly. During M2L training, it then crafts outputs based on these discerned patterns. Specifically, we incorporate three position-wise linear classifiers to predict the syllable stress $s$, word significance $i$, and vowel category $v$. Each classifier processes the hidden representation $h$ of the note sequence from the BART encoder to project the requisite properties for each note, aligning with the syllable position in the output. Our hypothesis posits that by actively learning these formatting nuances, the model’s encoder will internalize formatting information inside $h$, facilitating the decoder in crafting text in the desired format.

Labels for classification tasks.

Classification task      Label   Meaning
Syllable stress          0       Unstressed syllables
Syllable stress          1       Syllables with primary stress
Syllable stress          2       Syllables with secondary stress
Word importance level    0       Stop words
Word importance level    1       Non-stop words with lower 50% TF-IDF scores
Word importance level    2       Non-stop words with higher 50% TF-IDF scores
Vowel type               0       Short vowels
Vowel type               1       Long vowels
Vowel type               2       Diphthongs

In practice, each classification task is a 3-class classification. The distinct attributes of each class are detailed in Table 3.2.2. Despite the absence of labels in the original paired dataset used in our experiment, these attributes can be obtained from the target-side textual data, employing pronunciation dictionaries combined with the TF-IDF algorithm.
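One way such labels might be derived from the target-side text is sketched below. The stress digits follow the CMU dictionary convention (0 unstressed, 1 primary, 2 secondary), which matches the label table. Everything else is an assumption for illustration: the grouping of ARPAbet vowels into short, long, and diphthong classes, the tiny stop-word list, the median split, and the TF-IDF scores themselves.

```python
import statistics

# Assumed grouping of ARPAbet vowels; the paper does not list its mapping.
LONG_VOWELS = {"AA", "AO", "ER", "IY", "UW"}
DIPHTHONGS = {"AW", "AY", "EY", "OW", "OY"}
STOP_WORDS = {"a", "as", "is", "it", "the"}  # hypothetical tiny stop-word list


def syllable_labels(phonemes):
    """Return one (stress, vowel_type) label pair per syllable."""
    labels = []
    for p in phonemes:
        if not p[-1].isdigit():
            continue  # consonants carry no stress digit
        stress = int(p[-1])  # 0 unstressed, 1 primary, 2 secondary
        base = p[:-1]
        if base in DIPHTHONGS:
            vowel = 2
        elif base in LONG_VOWELS:
            vowel = 1
        else:
            vowel = 0
        labels.append((stress, vowel))
    return labels


def importance_labels(tfidf):
    """Map words to 0 (stop word), 1 (lower half) or 2 (upper half of TF-IDF)."""
    content = [score for word, score in tfidf.items() if word not in STOP_WORDS]
    median = statistics.median(content) if content else 0.0
    labels = {}
    for word, score in tfidf.items():
        if word in STOP_WORDS:
            labels[word] = 0
        else:
            labels[word] = 2 if score > median else 1
    return labels


print(syllable_labels(["AO2", "L", "R", "AY1", "T"]))  # the word "alright"
print(importance_labels({"free": 0.2, "bird": 0.9, "a": 0.01, "alright": 0.5}))
```

All three label sequences are aligned to syllable positions, so a word's importance label would be repeated for each of its syllables before being paired with the notes.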

These classifiers are simultaneously trained with the M2L task, with cross-entropy loss as the training objective. The resulting overall loss function of the model becomes:

\[
L = \text{CE}(\mathbf{y}, \hat{\mathbf{y}}) + \frac{1}{|\mathbf{y}|} \sum_{j=1}^{|\mathbf{y}|} \left[ \text{CE}(s_j, \hat{s}_j) + \text{CE}(i_j, \hat{i}_j) + \text{CE}(v_j, \hat{v}_j) \right], \tag{1}
\]

where $\mathbf{y}$ represents lyrics; $s_j$, $i_j$, and $v_j$ represent syllable stress, word importance, and vowel length, respectively, of the $j$-th syllable position; letters with and without hats represent predictions and ground truth, respectively; and CE refers to cross-entropy loss.
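To make the structure of Equation (1) concrete, the following plain-Python sketch combines a precomputed language-modeling loss with the three per-position classification losses. The function names and the input layout (one pair of predicted probabilities and true label per syllable position) are illustrative assumptions; a real implementation would operate on batched tensors.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(probs[label])


def joint_loss(lm_loss, stress, importance, vowel):
    """Language-modeling loss plus the mean of the three per-position CE terms.

    stress, importance and vowel are lists of (predicted_probs, true_label)
    pairs, one pair per syllable position.
    """
    n = len(stress)
    assert len(importance) == n and len(vowel) == n
    aux = sum(
        cross_entropy(*s) + cross_entropy(*i) + cross_entropy(*v)
        for s, i, v in zip(stress, importance, vowel)
    )
    return lm_loss + aux / n


# With uniform 3-class predictions, each CE term equals ln(3).
uniform = [([1 / 3, 1 / 3, 1 / 3], 0), ([1 / 3, 1 / 3, 1 / 3], 2)]
print(joint_loss(2.0, uniform, uniform, uniform))
```

The auxiliary terms are averaged over positions, so their contribution stays on a comparable scale regardless of paragraph length.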

3.2.3 Handling Music Input

Although our M2L training dataset contains valuable information, it lacks essential attributes such as tempo and quantized note durations. Instead, each note is represented by three attributes: onset, offset, and pitch, with the time values measured in seconds. This representation results in a sparse distribution of time-related attributes, potentially compromising the model’s learning efficacy. To address this, quantizing the time-related inputs and normalizing them with the tempo is crucial.

To handle melody inputs, we convert the detailed melody information into a sequence of note embeddings. First, we quantize pitches to their nearest MIDI note number, ranging from 1 to 128. For the time-related attributes of notes, we implement the following quantization procedure. Within each paragraph, we select the duration of the shortest note as the reference unit duration, denoted as $\text{dur}_u$. Subsequently, all time-related variables are divided by $\text{dur}_u$, multiplied by 10, and rounded down. Furthermore, each onset time is adjusted by subtracting the time of the first onset within the paragraph. Beyond note duration, we introduce an additional property, $\text{rest}_j$, to represent the duration of the pause preceding the $j$-th note.

\begin{align}
\text{dur}^{(s)}_{u} &= \min_{1 \leq j \leq |y|} \left\{ \text{off}^{(s)}_{j} - \text{on}^{(s)}_{j} \right\} \tag{2} \\
\text{on}^{(q)}_{j} &= \min\left( \left\lfloor 10 \cdot \frac{\text{on}^{(s)}_{j} - \text{on}^{(s)}_{1}}{\text{dur}^{(s)}_{u}} \right\rfloor, \text{vocab}_{\text{on}} \right), \quad \forall j \in [1, |y|], \tag{3} \\
\text{dur}^{(q)}_{j} &= \min\left( \left\lfloor 10 \cdot \frac{\text{off}^{(s)}_{j} - \text{on}^{(s)}_{j}}{\text{dur}^{(s)}_{u}} \right\rfloor, \text{vocab}_{\text{dur}} \right), \quad \forall j \in [1, |y|], \tag{4} \\
\text{rest}^{(q)}_{j} &= \begin{cases} \inf, & j = 1 \\ \min\left( \left\lfloor 10 \cdot \frac{\text{on}^{(s)}_{j} - \text{off}^{(s)}_{j-1}}{\text{dur}^{(s)}_{u}} \right\rfloor, \text{vocab}_{\text{rest}} \right), & 2 \leq j \leq |y| \end{cases} \tag{5}
\end{align}

where $\text{on}_j$, $\text{off}_j$, and $\text{dur}_j$ refer to the onset, offset, and duration of the $j$-th note in the melody; symbols with an '$(s)$' superscript are in seconds, while those with a '$(q)$' superscript represent quantized variables used as model inputs.

We allocate three distinct token sets representing different onsets, durations, and rest intervals. Their sizes are dictated by $\text{vocab}_{\text{on}}$, $\text{vocab}_{\text{dur}}$, and $\text{vocab}_{\text{rest}}$, which were determined to be 640, 640, and 240, respectively, in our experiments. Separate embedding layers transform these note element tokens into distinct embedding vectors, which are subsequently summed into a single note embedding sequence. The embedding of each individual note element and the compound note representation have the same dimension as BART's hidden states.
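The quantization in Equations 2 to 5 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `quantize_notes` and the `(onset, offset)` tuple layout are hypothetical, notes are assumed to be sorted in time and non-overlapping, and `math.inf` simply mirrors the $\inf$ case of Equation 5 (how that value is mapped to a token is not specified here).

```python
import math

# Caps reported for the experiments: vocab_on, vocab_dur, vocab_rest.
VOCAB_ON, VOCAB_DUR, VOCAB_REST = 640, 640, 240

def quantize_notes(notes, vocab_on=VOCAB_ON, vocab_dur=VOCAB_DUR, vocab_rest=VOCAB_REST):
    """Quantize one paragraph of notes given as (onset_sec, offset_sec) tuples.

    Returns a list of (onset_q, duration_q, rest_q) triples.
    """
    if not notes:
        return []
    # Eq. (2): the shortest note in the paragraph is the unit duration.
    unit = min(off - on for on, off in notes)
    if unit <= 0:
        raise ValueError("every note must have a positive duration")
    first_onset = notes[0][0]
    quantized = []
    for j, (on, off) in enumerate(notes):
        # Eq. (3): onset relative to the first onset, scaled, floored, capped.
        on_q = min(math.floor(10 * (on - first_onset) / unit), vocab_on)
        # Eq. (4): duration, scaled, floored, capped.
        dur_q = min(math.floor(10 * (off - on) / unit), vocab_dur)
        if j == 0:
            # Eq. (5), first case: no preceding note.
            rest_q = math.inf
        else:
            # Eq. (5), second case: gap since the previous note's offset.
            prev_off = notes[j - 1][1]
            rest_q = min(math.floor(10 * (on - prev_off) / unit), vocab_rest)
        quantized.append((on_q, dur_q, rest_q))
    return quantized

example = quantize_notes([(1.0, 1.5), (1.5, 1.75), (2.0, 3.0)])
```

In this example the shortest note lasts 0.25 s, so a quantized value of 10 corresponds to one unit duration.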

3.3 System Overview

Figure 3: Illustration of the supervised M2L training with paired melody–lyric data. $\oplus$ denotes addition; $\otimes$ denotes concatenation; $h$: hidden representation of the input sequence.
Figure 3 (alt text): Quantized note elements are transformed into vector representations and summed together to create a single sequence containing all note information. This sequence is concatenated with a sequence of encoded length prompts, constituting our model's input sequence. Our model is designed to produce lyrics that meet these specific input criteria. Meanwhile, three classification heads receive the encoder's output to predict syllable stress, word importance, and vowel length for each syllable position.

Our model’s architecture is illustrated in Fig. 3. The design is rooted in the Transformer encoder-decoder paradigm. Input originates from two sources: length prompts, which signify the syllable count in each paragraph line, and note details, encompassing attributes such as onset, duration, pitch, and rest intervals preceding notes. Each note attribute is fed to a distinct embedding layer, and the resulting embeddings are then added together to form the note embedding vector. Finally, we concatenate the length prompt and note embedding sequences together to form the final input sequence for our model.
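The input construction described above (one embedding table per note attribute, element-wise summation, then concatenation with the length-prompt sequence) can be illustrated with a toy sketch in plain Python. Everything here is an assumption for illustration: the helper names, the toy vocabulary of 16 ids per attribute, the embedding size of 4 (BART-base uses 768), the zero vectors standing in for encoded length prompts, and the choice to place the prompts before the notes.

```python
import random

def make_embedding(vocab_size, dim, rng):
    """Toy embedding table: one random vector of size `dim` per token id."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed_notes(notes, tables):
    """Look up each attribute in its own table and sum the vectors per note."""
    sequence = []
    for note in notes:
        rows = [tables[attr][note[attr]] for attr in tables]
        sequence.append([sum(values) for values in zip(*rows)])
    return sequence

def build_encoder_input(prompt_embeddings, note_embeddings):
    """Concatenate along the sequence axis (prompt-first order is an assumption)."""
    return prompt_embeddings + note_embeddings

rng = random.Random(0)
DIM = 4          # toy size; hypothetical
TOY_VOCAB = 16   # toy number of ids per attribute; hypothetical
tables = {name: make_embedding(TOY_VOCAB, DIM, rng)
          for name in ("pitch", "onset", "duration", "rest")}

# Two notes, each described by already-quantized token ids (made-up values).
notes = [
    {"pitch": 5, "onset": 0, "duration": 2, "rest": 15},
    {"pitch": 7, "onset": 2, "duration": 1, "rest": 0},
]
prompts = [[0.0] * DIM for _ in range(3)]  # stand-in for encoded length prompts
note_vectors = embed_notes(notes, tables)
encoder_input = build_encoder_input(prompts, note_vectors)
```

Because the attribute embeddings are summed rather than concatenated, the note sequence keeps one position per note, so note positions can be matched one-to-one with syllable positions.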

Taken together, these innovations form our final melody-to-lyric generation model, which generates the singable lyrics in the 4th line in Fig. 1. More case studies are featured in §5.3.

4 Experiments

4.1 Dataset

Previous studies frequently relied on the dataset from Yu et al. (2021), tailored for the L2M task. Upon closer inspection of this dataset, we identified several issues: many instances of misaligned lyrics and melodies, as well as word omissions in the lyrics. Given these discrepancies, this dataset is sub-optimal for training a model to produce high-quality, well-aligned lyrics.

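Table 4.1: Dataset size of different splits.

| Dataset | Unit | Train | Validation | Test | Total |
|---|---|---|---|---|---|
| Text-only | #paragraphs | 519,616 | 1,993 | 1,996 | 523,605 |
| Text-only | #lines | 7,046,894 | 26,684 | 27,109 | 7,100,687 |
| Paired | #paragraphs | 47,222 | 1,989 | 1,989 | 51,200 |
| Paired | #lines | 279,682 | 11,667 | 11,430 | 302,779 |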

For higher data quality and alignment, we turned to DALI v2 (Meseguer-Brocal et al., 2020), a paired dataset that offers higher-quality lyric text and alignment. For the text-only lyric corpus, we sourced a genre classification dataset available on Kaggle (https://www.kaggle.com/datasets/mateibejan/multilingual-lyrics-for-genre-classification). It is worth noting that this text-only dataset is substantially larger than the paired melody–lyric dataset.

We performed text normalization on the data, including removing non-English pieces, converting all letters to lowercase, and removing special symbols and blank lines. Then, deduplication was conducted, removing repeated paragraphs (e.g., repeated choruses) from both text-only and paired datasets. We split both datasets into training/validation/testing sets, ensuring no paragraph overlap across the splits. The dataset statistics of each split are presented in Table 4.1. For additional data processing, the CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) and NLTK (Loper & Bird, 2002) were employed to derive the classification labels, as described in Section 3.2.2.

4.2 Model Details

We chose BART-base (https://huggingface.co/facebook/bart-base; Lewis et al., 2020) as the foundation model, which has been pre-trained on general-domain texts. Unsupervised length-aware training and supervised M2L training are conducted sequentially.

During our experiments, we set the batch size to the largest value that could fit into an NVIDIA A5000 GPU (24 GB), which was 48 for both text-only and paired data training. We conducted a grid search to determine the learning rate, which yielded 2e-4 for text-only training and 1e-4 for paired data training. The learning rate was scheduled with linear decay in text-only training. AdamW (Loshchilov & Hutter, 2019) was used as the optimizer. The number of warm-up steps was set to 2,500 for text-only training and 200 for paired data training. We trained the model on the text-only and paired corpora for 15 and 10 epochs, respectively, and selected the best checkpoint based on validation loss. Dropout and label smoothing were not used during training.
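For readers who want to see what such a schedule looks like, the following is a minimal sketch of a warm-up followed by linear decay, using the text-only settings above (peak learning rate 2e-4, 2,500 warm-up steps). The linear shape of the warm-up, the decay to zero, and the value of `total_steps` are assumptions made for illustration; the function name `lr_at_step` is hypothetical.

```python
def lr_at_step(step, peak_lr=2e-4, warmup_steps=2500, total_steps=160_000):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warm-up from 0 to `peak_lr`, then linear decay to 0
    at `total_steps` (a hypothetical training length).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# A few sample points along the schedule.
samples = {s: lr_at_step(s) for s in (0, 1250, 2500, 80_000, 160_000)}
```

With these assumed values the rate peaks at step 2,500 and falls linearly afterwards.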

4.3 Evaluation

4.3.1 Comparative Models

We evaluate our model using both objective and subjective evaluation, and the original lyrics serve as the gold standard in both assessments.

For the objective evaluation, we assess the quality of the text and its compatibility with music, covering both coarse and fine levels. Our model is compared against several of its ablated versions, which are:

  1. Baseline: A BART-base model fine-tuned solely with paired M2L data. No length prompt.

  2. Adapted baseline: The baseline with length-unaware in-domain pretraining.

  3. Ours: The proposed model containing length-aware unsupervised training, supervised training, and additional classification objectives.

  4. Single-task: A model with length-aware unsupervised and supervised training.

  5. Unsupervised-only: A model trained to generate lyrics from only the length prompt.

  6. Supervised-only: The baseline with an additional length prompt in M2L training.

These ablated variants are the proposed model with one or more components removed, yet trained with the same dataset(s). This approach ensures a fair comparison that aids in understanding how our proposed training procedures and objectives contribute to performance improvement.

For the comparison with leading lyric generation systems, SongMASS (Sheng et al., 2021) and AI-Lyricist (Ma et al., 2021), we believe that a human evaluation would provide a more convincing assessment of comprehensive quality. Moreover, our preliminary observations have indicated sub-optimal performance by SongMASS and AI-Lyricist with regard to compatibility scores. Consequently, these systems were excluded from the objective evaluation that focuses on music–lyric compatibility.

4.3.2 Objective Metrics

For assessing the text quality of generated lyrics, we calculate the perplexity of each model on the test set. Given that we do not impose content constraints on the generated lyrics, discrepancies between outputs and original lyrics can be expected. Consequently, the BLEU score is not adopted as a metric in our problem setting.
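As a reminder of what the perplexity metric measures, the sketch below computes it from per-token log-probabilities as the exponential of the average negative log-likelihood. This is the standard definition, not necessarily the exact evaluation script used here: averaging over all tokens in the corpus is an assumption, and the input format (one list of natural-log token probabilities per paragraph) is hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Corpus-level perplexity from natural-log token probabilities.

    `token_log_probs` holds one list per paragraph; each entry is
    log p(token | context) under the model being evaluated.
    """
    num_tokens = sum(len(paragraph) for paragraph in token_log_probs)
    if num_tokens == 0:
        raise ValueError("at least one token is required")
    total_log_prob = sum(sum(paragraph) for paragraph in token_log_probs)
    return math.exp(-total_log_prob / num_tokens)

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform = perplexity([[math.log(0.25)] * 3, [math.log(0.25)] * 5])
```

Lower values mean the model assigns higher probability to the reference lyrics.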

Syllable Alignment

An integral aspect of our assessment is determining the syllable alignment between lyrics and melodies. We quantify this alignment by:

  1. #Line: The paragraph-level accuracy of the line count, i.e., the ratio of output paragraphs that meet the desired line count requirement. This metric gauges the adherence to the required number of sentences within paragraphs.

  2. Line len: The average accuracy of the syllable count per line. This metric measures the effectiveness of sentence length control.
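The two syllable-alignment metrics can be sketched as follows. The sketch assumes syllables have already been counted, so each paragraph is a list of per-line syllable counts; how lines are matched when the generated paragraph has the wrong number of lines is not spelled out above, so the position-wise comparison with missing lines counted as errors is an assumption, as are the function names.

```python
def line_count_accuracy(outputs, targets):
    """#Line: share of generated paragraphs with the required number of lines."""
    hits = sum(len(out) == len(tgt) for out, tgt in zip(outputs, targets))
    return hits / len(targets)

def line_length_accuracy(outputs, targets):
    """Line len: share of required lines whose syllable count is matched.

    Lines are compared position by position; a required line with no
    generated counterpart counts as incorrect (an assumption).
    """
    correct = total = 0
    for out, tgt in zip(outputs, targets):
        total += len(tgt)
        correct += sum(o == t for o, t in zip(out, tgt))
    return correct / total

# Per-line syllable counts: two target paragraphs and two generated ones.
targets = [[8, 6], [5, 5, 7]]
outputs = [[8, 7], [5, 5]]
n_line = line_count_accuracy(outputs, targets)     # 0.5
line_len = line_length_accuracy(outputs, targets)  # 0.6
```

In the example, only the first paragraph has the right number of lines, and three of the five required lines have the right syllable count.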

Fine-Grained Compatibility

We utilize sentence-level metrics to evaluate music–lyric compatibility at a granular level. Central to these metrics is the concept of ‘coexistence probability’, which we define as follows.

Let $A$ and $B$ be a note property and a syllable property, respectively. We define the coexistence probability of note property $A$ and syllable property $B$ as:

\begin{align}
\Pr(A\text{-}B) &= \frac{1}{|X|} \sum_{\{(x, y) \,\mid\, |x| = |y|\}} \Pr(A\text{-}B; x, y), \tag{6} \\
\Pr(A\text{-}B; x, y) &= \frac{\text{JointCount}(A, B; x, y)}{\text{Count}(A; x)}, \tag{7}
\end{align}

where:

  • ($x$, $y$) is a paired input note sequence and output text sequence.

  • $\text{Count}(A; x)$ counts occurrences of note $x_i$ with property $A$.

  • $\text{JointCount}(A, B; x, y)$ counts occurrences of aligned pairs of musical note $x_i$ and lyric syllable $y_i$, where $x_i$ has property $A$ and $y_i$ has property $B$.

  • $|X|$ is the size of the testing set.
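Equations 6 and 7 can be sketched directly in Python. The sketch assumes each test pair has already been reduced to two boolean lists (one flag per note for property $A$, one flag per syllable for property $B$); the function names are hypothetical, and letting a pair with no $A$-notes contribute zero is an assumption, since that corner case is not covered by the definition.

```python
def pair_probability(note_has_a, syllable_has_b):
    """Eq. (7) for one aligned pair; None when no note has property A."""
    count_a = sum(note_has_a)
    if count_a == 0:
        return None
    joint = sum(a and b for a, b in zip(note_has_a, syllable_has_b))
    return joint / count_a

def coexistence_probability(pairs):
    """Eq. (6): sum over length-matched pairs, divided by the test-set size."""
    total = 0.0
    for note_has_a, syllable_has_b in pairs:
        if len(note_has_a) != len(syllable_has_b):
            continue  # only pairs with |x| = |y| enter the sum
        p = pair_probability(note_has_a, syllable_has_b)
        if p is not None:
            total += p
    return total / len(pairs)

pairs = [
    ([True, False, True, True], [True, True, False, True]),  # 2 of 3 A-notes match
    ([True, False], [False]),                                 # length mismatch
    ([False, True], [True, True]),                            # 1 of 1 A-notes match
]
score = coexistence_probability(pairs)
```

Because the denominator is the full test-set size $|X|$, an output whose syllable count does not match the note count lowers the score, so these metrics also reward correct length control.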

We apply Equation 6 to the following five music–lyric property pairs ($A, B$) as our fine-level music–lyric compatibility measures:

  1. Dur-str: Long notes (notes with top 50% duration), stressed syllable (either primary stress or secondary stress)

  2. Peak-str: Melody peak (notes with higher pitches than both preceding and following notes), stressed syllable

  3. Dur-imp: Long note, important words (non-stop words)

  4. Peak-imp: Melody peak, important words

  5. Dur-vow: Long note, long vowels (either long vowels or diphthongs)

Because previous works (Nichols et al., 2009) have shown the close relationship between those property pairs, we assume that higher scores on those metrics indicate higher chances of note-syllable property coexistence and, hence, a higher degree of music–lyric compatibility.
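The two note-side properties used by these metrics can be computed as in the sketch below. The definitions follow the list above, but two details are assumptions: "top 50% duration" is taken relative to the median within one note sequence (ties at the median count as long), and the first and last notes are never peaks because they lack one neighbour. The function names are hypothetical.

```python
import statistics

def long_note_flags(durations):
    """Flag notes whose duration is at or above the sequence median."""
    if not durations:
        return []
    median = statistics.median(durations)
    return [d >= median for d in durations]

def peak_flags(pitches):
    """Flag notes pitched strictly higher than both neighbouring notes."""
    flags = [False] * len(pitches)
    for i in range(1, len(pitches) - 1):
        flags[i] = pitches[i] > pitches[i - 1] and pitches[i] > pitches[i + 1]
    return flags

long_flags = long_note_flags([10, 20, 30, 40])
peaks = peak_flags([60, 64, 62, 62, 67, 65])
```

These flag lists are the kind of per-note input that a coexistence-probability computation for Dur-str, Peak-str, Dur-imp, Peak-imp, and Dur-vow would consume.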

4.3.3 Subjective Evaluation

Objective metrics may not capture the entirety of lyric quality. For instance, enhanced compatibility could occasionally result in a compromise on textual fluidity. However, neither objective metrics nor their collective average can accurately convey the model’s balance between these factors and the subsequent overall quality. This underscores the necessity for human evaluation.

We recruited ten university students proficient in amateur-level music performance or lyric composition. Each participant assigned a 5-point score to each lyric version, and the mean score of the same paragraph from different participants represented the final rating for that lyric paragraph. The assessment encompassed three facets:

  1. Fluency: Evaluating grammatical correctness and semantic coherence.

  2. Music–Lyric Compatibility: Gauging the unity between the lyric and the melody, and the consequent ease of singing.

  3. Overall Quality: A holistic assessment factoring in both text quality and its harmony with the music, encapsulating the ultimate aspiration of singable lyric generation.

Evaluations were performed at the paragraph level. From the Yu et al. (2021) dataset, we randomly selected ten paragraphs from six songs for the assessment. This exercise posed an out-of-domain test for our model. Each evaluator was presented with seven versions (one original and six generated) of each paragraph for rating. Variations of the same paragraph, including multiple generated versions and the original lyrics, were displayed concurrently for comparison. Reviewers were not informed of the origins of each version. For a comprehensive evaluation, we provided participants with sheet music containing the melody and lyrics, together with the corresponding singing audio. This audio incorporated a singing voice synthesized to match the original melody and the generated lyrics, mixed with the original musical accompaniments. The vocal component was synthesized using ACE Studio (https://ace-studio.huoyaojing.com/), while the background tracks were extracted via the source separation model Demucs v3 mdx_extra (Défossez, 2021).

5 Results

5.1 Main Results

Table 5.1: The main result of objective evaluation. The best results are bolded.

| No. | Model | PPL | #Line | Line len | Dur-str | Peak-str | Dur-imp | Peak-imp | Dur-vow |
|---|---|---|---|---|---|---|---|---|---|
| - | Original lyrics | - | 100.00 | 100.00 | 82.45 | 66.45 | 72.29 | 52.73 | 59.99 |
| 1 | Baseline | 10.84 | 95.40 | 75.67 | 64.61 | 47.96 | 45.94 | 30.30 | 50.10 |
| 2 | Adapted baseline | 10.98 | 14.28 | 13.15 | 10.44 | 8.88 | 8.26 | 5.08 | 7.20 |
| 3 | Ours | **7.93** | 99.15 | 97.11 | **77.78** | **61.36** | 54.33 | 40.04 | **55.96** |
| 4 | Single-task | 7.96 | **99.65** | **97.42** | 75.15 | 59.91 | 54.44 | **41.49** | 53.75 |
| 5 | Unsupervised-only | 7.99 | 99.35 | 97.15 | 75.25 | 58.99 | **54.78** | 41.18 | 54.06 |
| 6 | Supervised-only | 11.62 | 95.10 | 70.67 | 57.86 | 42.33 | 38.97 | 26.74 | 43.28 |

Table 5.1 shows the performance comparison on objective metrics. Upon analysis, several key observations supporting our claims emerge:

Adding constraints does not negatively impact text quality. We first compare the perplexity, which reflects the text fluency of these models. Our model (No. 3) achieves the lowest perplexity among all ablation variants. Contrary to the results from Sheng et al. (2021), adding length constraints to the model improves the perplexity performance. This enhancement is likely attributed to these constraints acting as supplementary guidance during training, enabling the model to learn the lyric language style more effectively.

Length prompt and unsupervised training are crucial for gaining length awareness. In terms of length control, the baseline model (No. 1) clearly struggles to manage the length of individual lines. Although note input might offer a reasonable hint regarding the number of lines, it does not sufficiently ensure the desired singability. Models that incorporate length-aware unsupervised training and maintain length prompts during supervised training (No. 3, 4, 5) notably outperform other variants in controlling both the number of lines and individual line lengths. Conversely, neglecting length prompts during in-domain unsupervised training (No. 6) even underperforms the baseline model (No. 1).
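As a quick check on the length-control numbers, the absolute gains of the full model (No. 3) over the naively fine-tuned baseline (No. 1) can be read off Table 5.1. The arithmetic below uses only the values reported there:

```latex
\begin{align*}
\Delta_{\#\text{Line}} &= 99.15 - 95.40 = 3.75 \\
\Delta_{\text{Line len}} &= 97.11 - 75.67 = 21.44
\end{align*}
```

These differences match the 3.75% and 21.44% absolute accuracy gains quoted in the abstract.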

Supervised training does not benefit performance unless the proposed objectives are added. When a supervised training phase is appended after unsupervised training with only the language-modeling objective (No. 4), perplexity and length control improve only marginally over the model with unsupervised training alone (No. 5), and the two models perform similarly on the fine-grained compatibility metrics. However, including our novel training objectives during M2L training (No. 3) yields a noticeable improvement over No. 4 in several fine-grained compatibility measures: +2.63% on dur-str, +1.45% on peak-str, and +2.21% on dur-vow, with comparable performance (-0.11%) on dur-imp. This signals an augmented potential of our model to ensure music–lyric compatibility in its generated output.

Peak-imp may not be a reliable compatibility metric in our corpus. In the peak-imp metric, our model scores lower than the single-task model (No. 4). However, this metric does not seem to be a good indicator of music–lyric compatibility in our corpus. Even the gold-standard original lyrics achieve a modest 52.73% on this metric (by Equation 7, nearly half of the melody peaks are aligned with stop words rather than important words), suggesting that the relationship between melody peaks and non-stop words is weak in this dataset. Therefore, it is acceptable for our model not to excel in this metric. Overall, while there is room for further optimization, our model demonstrates the best performance among all models on the property pairs with tighter relationships.

5.2 Human Evaluation

Table 5.2: Subjective evaluation results. The best results are bolded.

| No. | Model | Fluency | Compatibility | Overall Quality |
|---|---|---|---|---|
| - | Original lyrics | 4.02 | 4.24 | 4.16 |
| 1 | Baseline | 2.96 | 2.23 | 2.37 |
| 2 | AI-Lyricist | 2.24 | 2.41 | 2.23 |
| 3 | SongMASS | 1.70 | 1.94 | 1.82 |
| 4 | Ours | **3.43** | **3.18** | **3.17** |
| 5 | Single-task | 3.31 | 3.11 | 3.07 |
| 6 | Unsupervised-only | 3.42 | 3.07 | 3.05 |

Table 5.2 highlights that our model (No. 4) achieves the best scores among all generation systems in every facet of the subjective evaluation (fluency, music–lyric compatibility, and overall quality), even when tested outside its domain. In contrast, despite using the same training corpus as that of the test data, SongMASS (Sheng et al., 2021) (No. 3) notably underperforms across all metrics. While AI-Lyricist (Ma et al., 2021) (No. 2) fares better than SongMASS, it falls short of our baseline model (No. 1) in fluency and overall quality.
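The relative improvements in overall quality over the two external systems follow from the Overall Quality column of Table 5.2. The worked arithmetic below uses only the reported scores for our model (3.17), AI-Lyricist (2.23), and SongMASS (1.82):

```latex
\begin{align*}
\text{vs. AI-Lyricist:} \quad \frac{3.17 - 2.23}{2.23} &= \frac{0.94}{2.23} \approx 42.15\% \\
\text{vs. SongMASS:} \quad \frac{3.17 - 1.82}{1.82} &= \frac{1.35}{1.82} \approx 74.18\%
\end{align*}
```

These values agree with the 42.15% and 74.18% relative improvements stated in the abstract.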

Upon comparing our model with its ablation variants, it is evident that excluding the additional training objectives (No. 5) diminishes fluency and compatibility, thereby reducing overall quality. This illustrates the efficacy of these objectives in maximizing the utility of the limited paired data and enhancing the effectiveness of supervised M2L training. Interestingly, when we remove note information from the input and restrict the model to generating lyrics based solely on syllable and line counts (unsupervised-only model, No. 6), it achieves superior fluency compared to the model with both unsupervised and supervised training (No. 5). There is only a slight reduction in compatibility and overall quality scores, further highlighting the importance of length awareness through unsupervised training in crafting singable lyrics for songs.

5.3 Case Study

Figure 4 presents the lyrics generated by various models for a melody included in the subjective evaluation.

SongMASS’s output exhibits substantial issues in quality: its generated lyrics are incoherent and nonsensical. AI-Lyricist produces complete sentences but lacks semantic coherence, as each sentence is generated in isolation. In contrast, our LOAF-M2L model outputs lyrics that are both grammatically correct and semantically coherent across sentences.

In terms of music–lyric compatibility, our model perfectly aligns each syllable with a corresponding musical note, thereby establishing a solid foundation for compatibility. In contrast, SongMASS and AI-Lyricist show multiple misalignments between syllables and notes, as indicated by the red boxes in Figure 4.

Our LOAF-M2L model further demonstrates fine-grained compatibility. For example (marked with green boxes), the first syllable of the word 'hero', typically stressed in speech, aligns with a melody peak for emphasis. Likewise, the less important word 'the' in the third sentence is matched with a short musical note. SongMASS and AI-Lyricist, however, perform unsatisfactorily in these respects (marked with orange boxes). In SongMASS's generated lyrics, the second syllable of 'gotta' aligns with a longer note than its first syllable, resulting in unnatural pronunciation when sung. Similarly, AI-Lyricist pairs the less emphasized word 'of' with a quarter note, leading to awkward vocal delivery.

Figure 4: A case of melody with original lyrics, and generated lyrics from different models.
Figure 4 (alt text): The top row displays the original lyrics, while rows 2 and 3 contain versions by SongMASS and AI-Lyricist, exhibiting issues with length adherence and nuanced compatibility that compromise singability. In contrast, the final row highlights our system's output, which successfully circumvents these issues, offering lyrics with enhanced singability.

6 Conclusion

In this study, we introduced an innovative approach to address limited compatibility between machine-generated lyrics and the corresponding melodies. Our method enhanced music–lyric compatibility in the generation outputs by implementing prompt-based length control and designing a unique objective rooted in a quantitative analysis of melody–lyric relationships. Both objective and subjective assessments validated the increased compatibility and singability of the generated lyrics. This research holds potential for various music-centric applications, including aiding songwriters, advancing music education, and enabling personalized song creation.

However, we recognize that melody-to-lyric generation remains a challenging domain. While our system advances the field, it does not produce flawless results. For instance, as illustrated in the last row of Figure 1, the melody’s emphasis on the second syllable of the word ‘baby’ can make the pronunciation incongruous. Additionally, this paper did not address some challenges, including length control’s inflexibility. While influenced by melody, the human composition of lyrics is not strictly bound by it. Lyrics might deviate slightly in syllable count from the number of notes to achieve optimal fluency or meaning conveyance. While introducing some flexibility into length control via prompt-based measures is feasible, achieving a granular melody–lyric alignment and maintaining this compatibility during generation becomes less straightforward. We leave exploring these challenges and potential solutions for future research.

Acknowledgements

This project was funded by research grant A-0008150-00-00 from the Ministry of Education, Singapore.

References

  • Akiyama, K., Tamura, A., & Ninomiya, T. (2021, June). Hie-BART: Document summarization with hierarchical BART. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop (pp. 159–165). Online: Association for Computational Linguistics. https://aclanthology.org/2021.naacl-srw.20 doi:10.18653/v1/2021.naacl-srw.20
  • Ammirante, P., & Rovetti, J. (2021). Bright vowels are favoured on weak beats in popular music lyrics. Journal of New Music Research, 50(3), 259–265. https://doi.org/10.1080/09298215.2021.1936076
  • Barbieri, G., Pachet, F., Roy, P., & Degli Esposti, M. (2012). Markov constraints for generating lyrics with style. In ECAI (Vol. 242, pp. 115–120).
  • Carnegie Mellon University. (2022). The CMU Pronouncing Dictionary. Available online at http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
  • Chang, J.-W., Hung, J. C., & Lin, K.-C. (2021). Singability-enhanced lyric generator with music style transfer. Computer Communications, 168, 33–53.
  • Chen, Y., & Lerch, A. (2020). Melody-conditioned lyrics generation with seqGANs. In 2020 IEEE International Symposium on Multimedia (ISM) (pp. 189–196).
  • Chousa, K., & Morishita, M. (2021, August). Input augmentation improves constrained beam search for neural machine translation: NTT at WAT 2021. In Proceedings of the 8th Workshop on Asian Translation (WAT2021) (pp. 53–61). Online: Association for Computational Linguistics. https://aclanthology.org/2021.wat-1.3 doi:10.18653/v1/2021.wat-1.3
  • Défossez, A. (2021). Hybrid spectrogram and waveform source separation. In Proceedings of the ISMIR 2021 Workshop on Music Source Separation.
  • Ghazvininejad, M., Choi, Y., & Knight, K. (2018). Neural poetry translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 67–71).
  • Grangier, D., & Auli, M. (2018). QuickEdit: Editing text & translations by crossing words out. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 272–282).
  • Gu, X., Ou, L., Ong, D., & Wang, Y. (2022). MM-ALT: A multimodal automatic lyric transcription system. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 3328–3337).
  • Guo, F., Zhang, C., Zhang, Z., He, Q., Zhang, K., Xie, J., & Boyd-Graber, J. (2022a). Automatic song translation for tonal languages. In Findings of the Association for Computational Linguistics: ACL 2022 (pp. 729–743).
  • Guo, F., Zhang, C., Zhang, Z., He, Q., Zhang, K., Xie, J., & Boyd-Graber, J. (2022b, May). Automatic song translation for tonal languages. In Findings of the Association for Computational Linguistics: ACL 2022 (pp. 729–743). Dublin, Ireland: Association for Computational Linguistics. https://aclanthology.org/2022.findings-acl.60 doi:10.18653/v1/2022.findings-acl.60
  • Huang, Y.-F., & You, K.-C. (2021). Automated generation of Chinese lyrics based on melody emotions. IEEE Access, 9, 98060–98071.
  • Lakew, S. M., Di Gangi, M., & Federico, M. (2019). Controlling the output length of neural machine translation. arXiv preprint arXiv:1910.10408.
  • Lee, H.-P., Fang, J.-S., & Ma, W.-Y. (2019). iComposer: An automatic songwriting system for Chinese popular music. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations) (pp. 84–88).
  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7871–7880).
  • J. Li \BOthers. (\APACyear2022) \APACinsertmetastarli2022fuzzy{APACrefauthors}Li, J., Wang, P., Li, Z., Liu, X., Utiyama, M., Sumita, E.\BDBLAi, H.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleA Fuzzy Training Framework for Controllable Sequence-to-Sequence Generation A fuzzy training framework for controllable sequence-to-sequence generation.\BBCQ \APACjournalVolNumPagesIEEE Access1092467–92480. \PrintBackRefs\CurrentBib
  • P. Li \BOthers. (\APACyear2020) \APACinsertmetastarli2020rigid{APACrefauthors}Li, P., Zhang, H., Liu, X.\BCBL \BBA Shi, S.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleRigid formats controlled text generation Rigid formats controlled text generation.\BBCQ \BIn \APACrefbtitleProceedings of the 58th annual meeting of the association for computational linguistics Proceedings of the 58th annual meeting of the association for computational linguistics (\BPGS 742–751). \PrintBackRefs\CurrentBib
  • Y. Li \BOthers. (\APACyear2022) \APACinsertmetastarli2022prompt{APACrefauthors}Li, Y., Yin, Y., Li, J.\BCBL \BBA Zhang, Y.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitlePrompt-Driven Neural Machine Translation Prompt-driven neural machine translation.\BBCQ \BIn \APACrefbtitleFindings of the Association for Computational Linguistics: ACL 2022 Findings of the association for computational linguistics: Acl 2022 (\BPGS 2579–2590). \PrintBackRefs\CurrentBib
  • Lingan (\APACyear2021) \APACinsertmetastarLingan2021AMD{APACrefauthors}Lingan, G.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleA MODEL DESIGNED FOR AUTOMATIC GENERATED RAP LYRICS IN GIVEN GENDER AND STYLE A model designed for automatic generated rap lyrics in given gender and style.\BBCQ \BIn \APACrefbtitleISMIR 2021-Proceedings of the 23th International Society for Music Information Retrieval Conference, Late Breaking Demo. Ismir 2021-proceedings of the 23th international society for music information retrieval conference, late breaking demo. \PrintBackRefs\CurrentBib
  • N. Liu \BOthers. (\APACyear2022) \APACinsertmetastarliu2022chipsong{APACrefauthors}Liu, N., Han, W., Liu, G., Peng, D., Zhang, R., Wang, X.\BCBL \BBA Ruan, H.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleChipSong: A Controllable Lyric Generation System for Chinese Popular Song ChipSong: A controllable lyric generation system for Chinese popular song.\BBCQ \BIn \APACrefbtitleProceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022) Proceedings of the first workshop on intelligent and interactive writing assistants (in2writing 2022) (\BPGS 85–95). \PrintBackRefs\CurrentBib
  • P. Liu \BOthers. (\APACyear2023) \APACinsertmetastarliu2023pre{APACrefauthors}Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H.\BCBL \BBA Neubig, G.  \APACrefYearMonthDay2023. \BBOQ\APACrefatitlePre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.\BBCQ \APACjournalVolNumPagesACM Computing Surveys5591–35. \PrintBackRefs\CurrentBib
  • Y. Liu \BOthers. (\APACyear2021) \APACinsertmetastarliu2021refsum{APACrefauthors}Liu, Y., Dou, Z\BHBIY.\BCBL \BBA Liu, P.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleRefSum: Refactoring Neural Summarization Refsum: Refactoring neural summarization.\BBCQ \BIn \APACrefbtitleProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (\BPGS 1437–1448). \PrintBackRefs\CurrentBib
  • Y. Liu \BOthers. (\APACyear2020) \APACinsertmetastarliu2020multilingual{APACrefauthors}Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M.\BDBLZettlemoyer, L.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleMultilingual denoising pre-training for neural machine translation Multilingual denoising pre-training for neural machine translation.\BBCQ \APACjournalVolNumPagesTransactions of the Association for Computational Linguistics8726–742. \PrintBackRefs\CurrentBib
  • Loper \BBA Bird (\APACyear2002) \APACinsertmetastarloper2002nltk{APACrefauthors}Loper, E.\BCBT \BBA Bird, S.  \APACrefYearMonthDay2002. \BBOQ\APACrefatitleNltk: The natural language toolkit Nltk: The natural language toolkit.\BBCQ \APACjournalVolNumPagesarXiv preprint cs/0205028. \PrintBackRefs\CurrentBib
  • Loshchilov \BBA Hutter (\APACyear2019) \APACinsertmetastarDBLP:conf/iclr/LoshchilovH19{APACrefauthors}Loshchilov, I.\BCBT \BBA Hutter, F.  \APACrefYearMonthDay2019. \BBOQ\APACrefatitleDecoupled Weight Decay Regularization Decoupled weight decay regularization.\BBCQ \BIn \APACrefbtitle7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. 7th international conference on learning representations, ICLR 2019, new orleans, la, usa, may 6-9, 2019. \APACaddressPublisherOpenReview.net. {APACrefURL} \urlhttps://openreview.net/forum?id=Bkg6RiCqY7 \PrintBackRefs\CurrentBib
  • Low (\APACyear2003) \APACinsertmetastarlow2003singable{APACrefauthors}Low, P.  \APACrefYearMonthDay2003. \BBOQ\APACrefatitleSingable translations of songs Singable translations of songs.\BBCQ \APACjournalVolNumPagesPerspectives: Studies in Translatology11287–103. \PrintBackRefs\CurrentBib
  • Lu \BOthers. (\APACyear2019) \APACinsertmetastarlu2019syllable{APACrefauthors}Lu, X., Wang, J., Zhuang, B., Wang, S.\BCBL \BBA Xiao, J.  \APACrefYearMonthDay2019. \BBOQ\APACrefatitleA syllable-structured, contextually-based conditionally generation of Chinese lyrics A syllable-structured, contextually-based conditionally generation of Chinese lyrics.\BBCQ \BIn \APACrefbtitlePRICAI 2019: Trends in Artificial Intelligence: 16th Pacific Rim International Conference on Artificial Intelligence, Cuvu, Yanuca Island, Fiji, August 26-30, 2019, Proceedings, Part III 16 Pricai 2019: Trends in artificial intelligence: 16th pacific rim international conference on artificial intelligence, cuvu, yanuca island, fiji, august 26-30, 2019, proceedings, part iii 16 (\BPGS 257–265). \PrintBackRefs\CurrentBib
  • Ma \BOthers. (\APACyear2021) \APACinsertmetastarma2021ai{APACrefauthors}Ma, X., Wang, Y., Kan, M\BHBIY.\BCBL \BBA Lee, W\BPBIS.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleAI-Lyricist: Generating Music and Vocabulary Constrained Lyrics AI-Lyricist: Generating music and vocabulary constrained lyrics.\BBCQ \BIn \APACrefbtitleProceedings of the 29th ACM International Conference on Multimedia Proceedings of the 29th acm international conference on multimedia (\BPGS 1002–1011). \PrintBackRefs\CurrentBib
  • Melistas \BOthers. (\APACyear2021) \APACinsertmetastarmelistas2021lyrics{APACrefauthors}Melistas, T., Giannakopoulos, T.\BCBL \BBA Paraskevopoulos, G.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleLyrics and Vocal Melody Generation conditioned on Accompaniment Lyrics and vocal melody generation conditioned on accompaniment.\BBCQ \BIn \APACrefbtitleProceedings of the 2nd Workshop on NLP for Music and Spoken Audio (NLP4MusA) Proceedings of the 2nd workshop on nlp for music and spoken audio (nlp4musa) (\BPGS 11–16). \PrintBackRefs\CurrentBib
  • Meseguer-Brocal \BOthers. (\APACyear2020) \APACinsertmetastarmeseguer2020creating{APACrefauthors}Meseguer-Brocal, G., Cohen-Hadria, A.\BCBL \BBA Peeters, G.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleCreating DALI, a large dataset of synchronized audio, lyrics, and notes Creating DALI, a large dataset of synchronized audio, lyrics, and notes.\BBCQ \APACjournalVolNumPagesTransactions of the International Society for Music Information Retrieval31. \PrintBackRefs\CurrentBib
  • Nichols \BOthers. (\APACyear2009) \APACinsertmetastarnichols2009relationships{APACrefauthors}Nichols, E., Morris, D., Basu, S.\BCBL \BBA Raphael, C.  \APACrefYearMonthDay2009. \BBOQ\APACrefatitleRelationships between lyrics and melody in popular music Relationships between lyrics and melody in popular music.\BBCQ \BIn \APACrefbtitleISMIR 2009-Proceedings of the 11th International Society for Music Information Retrieval Conference Ismir 2009-proceedings of the 11th international society for music information retrieval conference (\BPGS 471–476). \PrintBackRefs\CurrentBib
  • Nikolov \BOthers. (\APACyear2020) \APACinsertmetastarnikolov2020rapformer{APACrefauthors}Nikolov, N\BPBII., Malmi, E., Northcutt, C.\BCBL \BBA Parisi, L.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleRapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders Rapformer: Conditional rap lyrics generation with denoising autoencoders.\BBCQ \BIn \APACrefbtitleProceedings of the 13th International Conference on Natural Language Generation Proceedings of the 13th international conference on natural language generation (\BPGS 360–373). \PrintBackRefs\CurrentBib
  • Noske \BBA Benton (\APACyear1988) \APACinsertmetastarnoske1988french{APACrefauthors}Noske, F.\BCBT \BBA Benton, R.  \APACrefYear1988. \APACrefbtitleFrench Song from Berlioz to Duparc: The Origin and Development of the M lodie French song from berlioz to duparc: The origin and development of the m lodie. \APACaddressPublisherCourier Corporation. \PrintBackRefs\CurrentBib
  • OpenAI (\APACyear2023) \APACinsertmetastaropenai2023gpt4{APACrefauthors}OpenAI.  \APACrefYearMonthDay2023. \APACrefbtitleGPT-4 Technical Report. Gpt-4 technical report. \PrintBackRefs\CurrentBib
  • Ormazabal \BOthers. (\APACyear2022) \APACinsertmetastarormazabal2022poelm{APACrefauthors}Ormazabal, A., Artetxe, M., Agirrezabal, M., Soroa, A.\BCBL \BBA Agirre, E.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitlePoelm: A meter-and rhyme-controllable language model for unsupervised poetry generation Poelm: A meter-and rhyme-controllable language model for unsupervised poetry generation.\BBCQ \APACjournalVolNumPagesarXiv preprint arXiv:2205.12206. \PrintBackRefs\CurrentBib
  • Ou \BOthers. (\APACyear2022) \APACinsertmetastarDBLP:conf/ismir/OuGW22{APACrefauthors}Ou, L., Gu, X.\BCBL \BBA Wang, Y.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleTransfer Learning of wav2vec 2.0 for Automatic Lyric Transcription Transfer learning of wav2vec 2.0 for automatic lyric transcription.\BBCQ \BIn P. Rao \BOthers. (\BEDS), \APACrefbtitleProceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022, Bengaluru, India, December 4-8, 2022 Proceedings of the 23rd international society for music information retrieval conference, ISMIR 2022, bengaluru, india, december 4-8, 2022 (\BPGS 891–899). {APACrefURL} \urlhttps://archives.ismir.net/ismir2022/paper/000107.pdf \PrintBackRefs\CurrentBib
  • Ou \BOthers. (\APACyear2023) \APACinsertmetastarou-etal-2023-songs{APACrefauthors}Ou, L., Ma, X., Kan, M\BHBIY.\BCBL \BBA Wang, Y.  \APACrefYearMonthDay2023\APACmonth07. \BBOQ\APACrefatitleSongs Across Borders: Singable and Controllable Neural Lyric Translation Songs across borders: Singable and controllable neural lyric translation.\BBCQ \BIn \APACrefbtitleProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) (\BPGS 447–467). \APACaddressPublisherToronto, CanadaAssociation for Computational Linguistics. {APACrefURL} \urlhttps://aclanthology.org/2023.acl-long.27 \PrintBackRefs\CurrentBib
  • Potash \BOthers. (\APACyear2015) \APACinsertmetastarpotash2015ghostwriter{APACrefauthors}Potash, P., Romanov, A.\BCBL \BBA Rumshisky, A.  \APACrefYearMonthDay2015. \BBOQ\APACrefatitleGhostWriter: Using an LSTM for automatic rap lyric generation GhostWriter: Using an LSTM for automatic rap lyric generation.\BBCQ \BIn \APACrefbtitleProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing Proceedings of the 2015 conference on empirical methods in natural language processing (\BPGS 1919–1924). \PrintBackRefs\CurrentBib
  • Qian \BOthers. (\APACyear2023) \APACinsertmetastarqian-etal-2023-unilg{APACrefauthors}Qian, T., Lou, F., Shi, J., Wu, Y., Guo, S., Yin, X.\BCBL \BBA Jin, Q.  \APACrefYearMonthDay2023\APACmonth07. \BBOQ\APACrefatitleUniLG: A Unified Structure-aware Framework for Lyrics Generation UniLG: A unified structure-aware framework for lyrics generation.\BBCQ \BIn \APACrefbtitleProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) (\BPGS 983–1001). \APACaddressPublisherToronto, CanadaAssociation for Computational Linguistics. {APACrefURL} \urlhttps://aclanthology.org/2023.acl-long.56 {APACrefDOI} 10.18653/v1/2023.acl-long.56 \PrintBackRefs\CurrentBib
  • Qian \BOthers. (\APACyear2022) \APACinsertmetastarqian2022training{APACrefauthors}Qian, T., Shi, J., Guo, S., Wu, P.\BCBL \BBA Jin, Q.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleTraining Strategies for Automatic Song Writing: A Unified Framework Perspective Training strategies for automatic song writing: A unified framework perspective.\BBCQ \BIn \APACrefbtitleICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Icassp 2022-2022 ieee international conference on acoustics, speech and signal processing (icassp) (\BPGS 4738–4742). \PrintBackRefs\CurrentBib
  • Sergio Oramas \BBA Serra (\APACyear2018) \APACinsertmetastardoi:10.1080/09298215.2018.1488878{APACrefauthors}Sergio Oramas, F\BPBIG., Luis Espinosa-Anke\BCBT \BBA Serra, X.  \APACrefYearMonthDay2018. \BBOQ\APACrefatitleNatural language processing for music knowledge discovery Natural language processing for music knowledge discovery.\BBCQ \APACjournalVolNumPagesJournal of New Music Research474365-382. {APACrefURL} \urlhttps://doi.org/10.1080/09298215.2018.1488878 {APACrefDOI} 10.1080/09298215.2018.1488878 \PrintBackRefs\CurrentBib
  • Sheng \BOthers. (\APACyear2021) \APACinsertmetastarsheng2021songmass{APACrefauthors}Sheng, Z., Song, K., Tan, X., Ren, Y., Ye, W., Zhang, S.\BCBL \BBA Qin, T.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleSongMASS: Automatic song writing with pre-training and alignment constraint SongMASS: Automatic song writing with pre-training and alignment constraint.\BBCQ \BIn \APACrefbtitleProceedings of the AAAI Conference on Artificial Intelligence Proceedings of the aaai conference on artificial intelligence (\BVOL 35, \BPGS 13798–13805). \PrintBackRefs\CurrentBib
  • Susanto \BOthers. (\APACyear2020) \APACinsertmetastarsusanto2020lexically{APACrefauthors}Susanto, R\BPBIH., Chollampatt, S.\BCBL \BBA Tan, L.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleLexically constrained neural machine translation with Levenshtein transformer Lexically constrained neural machine translation with levenshtein transformer.\BBCQ \APACjournalVolNumPagesarXiv preprint arXiv:2004.12681. \PrintBackRefs\CurrentBib
  • Tang \BOthers. (\APACyear2020) \APACinsertmetastartang2020multilingual{APACrefauthors}Tang, Y., Tran, C., Li, X., Chen, P\BHBIJ., Goyal, N., Chaudhary, V.\BDBLFan, A.  \APACrefYearMonthDay2020. \BBOQ\APACrefatitleMultilingual translation with extensible multilingual pretraining and finetuning Multilingual translation with extensible multilingual pretraining and finetuning.\BBCQ \APACjournalVolNumPagesarXiv preprint arXiv:2008.00401. \PrintBackRefs\CurrentBib
  • Tian \BOthers. (\APACyear2023) \APACinsertmetastartian-etal-2023-unsupervised{APACrefauthors}Tian, Y., Narayan-Chen, A., Oraby, S., Cervone, A., Sigurdsson, G., Tao, C.\BDBLPeng, N.  \APACrefYearMonthDay2023\APACmonth07. \BBOQ\APACrefatitleUnsupervised Melody-to-Lyrics Generation Unsupervised melody-to-lyrics generation.\BBCQ \BIn \APACrefbtitleProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) (\BPGS 9235–9254). \APACaddressPublisherToronto, CanadaAssociation for Computational Linguistics. {APACrefURL} \urlhttps://aclanthology.org/2023.acl-long.513 {APACrefDOI} 10.18653/v1/2023.acl-long.513 \PrintBackRefs\CurrentBib
  • Tong \BOthers. (\APACyear2019) \APACinsertmetastartong2019text{APACrefauthors}Tong, Y., Liu, Y., Wang, J.\BCBL \BBA Xin, G.  \APACrefYearMonthDay2019. \BBOQ\APACrefatitleText steganography on RNN-Generated lyrics Text steganography on RNN-generated lyrics.\BBCQ \APACjournalVolNumPagesMathematical Biosciences and Engineering1655451–5463. \PrintBackRefs\CurrentBib
  • Wang \BOthers. (\APACyear2022) \APACinsertmetastarwang-etal-2022-integrating{APACrefauthors}Wang, S., Tan, Z.\BCBL \BBA Liu, Y.  \APACrefYearMonthDay2022\APACmonth05. \BBOQ\APACrefatitleIntegrating Vectorized Lexical Constraints for Neural Machine Translation Integrating vectorized lexical constraints for neural machine translation.\BBCQ \BIn \APACrefbtitleProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers) (\BPGS 7063–7073). \APACaddressPublisherDublin, IrelandAssociation for Computational Linguistics. {APACrefURL} \urlhttps://aclanthology.org/2022.acl-long.487 {APACrefDOI} 10.18653/v1/2022.acl-long.487 \PrintBackRefs\CurrentBib
  • Watanabe \BBA Goto (\APACyear2021) \APACinsertmetastarwatanabe2021atypical{APACrefauthors}Watanabe, K.\BCBT \BBA Goto, M.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleAtypical lyrics completion considering musical audio signals Atypical lyrics completion considering musical audio signals.\BBCQ \BIn \APACrefbtitleMultiMedia Modeling: 27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part I 27 Multimedia modeling: 27th international conference, mmm 2021, prague, czech republic, june 22–24, 2021, proceedings, part i 27 (\BPGS 174–186). \PrintBackRefs\CurrentBib
  • Watanabe \BOthers. (\APACyear2018) \APACinsertmetastarwatanabe2018melody{APACrefauthors}Watanabe, K., Matsubayashi, Y., Fukayama, S., Goto, M., Inui, K.\BCBL \BBA Nakano, T.  \APACrefYearMonthDay2018. \BBOQ\APACrefatitleA melody-conditioned lyrics language model A melody-conditioned lyrics language model.\BBCQ \BIn \APACrefbtitleProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long papers) (\BPGS 163–172). \PrintBackRefs\CurrentBib
  • Wu \BOthers. (\APACyear2019) \APACinsertmetastarwu2019hierarchical{APACrefauthors}Wu, X., Du, Z., Guo, Y.\BCBL \BBA Fujita, H.  \APACrefYearMonthDay2019. \BBOQ\APACrefatitleHierarchical attention based long short-term memory for Chinese lyric generation Hierarchical attention based long short-term memory for Chinese lyric generation.\BBCQ \APACjournalVolNumPagesApplied Intelligence4944–52. \PrintBackRefs\CurrentBib
  • Xue \BOthers. (\APACyear2021) \APACinsertmetastarxue2021deeprapper{APACrefauthors}Xue, L., Song, K., Wu, D., Tan, X., Zhang, N\BPBIL., Qin, T.\BDBLLiu, T\BHBIY.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleDeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling DeepRapper: Neural rap generation with rhyme and rhythm modeling.\BBCQ \BIn \APACrefbtitleProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (\BPGS 69–81). \PrintBackRefs\CurrentBib
  • Yu \BOthers. (\APACyear2021) \APACinsertmetastaryu2021conditional{APACrefauthors}Yu, Y., Srivastava, A.\BCBL \BBA Canales, S.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleConditional LSTM-GAN for melody generation from lyrics Conditional LSTM-GAN for melody generation from lyrics.\BBCQ \APACjournalVolNumPagesACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)1711–20. \PrintBackRefs\CurrentBib
  • L. Zhang \BOthers. (\APACyear2022) \APACinsertmetastarzhang2022qiuniu{APACrefauthors}Zhang, L., Zhang, R., Mao, X.\BCBL \BBA Chang, Y.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleQiuNiu: A Chinese Lyrics Generation System with Passage-Level Input QiuNiu: A Chinese lyrics generation system with passage-level input.\BBCQ \BIn \APACrefbtitleProceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations Proceedings of the 60th annual meeting of the association for computational linguistics: System demonstrations (\BPGS 76–82). \PrintBackRefs\CurrentBib
  • R. Zhang \BOthers. (\APACyear2022) \APACinsertmetastarzhang2022youling{APACrefauthors}Zhang, R., Mao, X., Li, L., Jiang, L., Chen, L., Hu, Z.\BDBLHuang, M.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleYouling: an AI-assisted lyrics creation system Youling: an AI-assisted lyrics creation system.\BBCQ \APACjournalVolNumPagesarXiv preprint arXiv:2201.06724. \PrintBackRefs\CurrentBib
  • X. Zhang \BBA Cross (\APACyear2021) \APACinsertmetastarZhang:2021{APACrefauthors}Zhang, X.\BCBT \BBA Cross, I.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleAnalysing the relationship between tone and melody in Chaozhou songs Analysing the relationship between tone and melody in chaozhou songs.\BBCQ \APACjournalVolNumPagesJournal of New Music Research504299-311. {APACrefURL} \urlhttps://doi.org/10.1080/09298215.2021.1974490 {APACrefDOI} 10.1080/09298215.2021.1974490 \PrintBackRefs\CurrentBib

Appendix A Ethics Statement

In our human evaluation, we gathered evaluation scores without personal identifiers to ensure an objective and fair comparison. Participants provided only ratings; no other information was collected. Participation was entirely voluntary, and formal consent was obtained from each participant. After participation, evaluators were compensated based on the time they spent completing the questionnaire. We ensured that the questionnaire was free from offensive content. The process of collecting human annotations received a review exemption from the Institutional Review Board of the National University of Singapore (NUS-IRB), under Reference Code Number: 2022-813.