Zeyu Xie¹˒², Xuenan Xu¹, Zhizheng Wu²˒³, Mengyue Wu¹∗
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Abstract
Recently, audio generation tasks have attracted considerable research interest. Precise temporal controllability is essential to integrate audio generation with real applications. In this work, we propose a temporally controlled audio generation framework, PicoAudio. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation of fine-grained temporally-aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://zeyuxie29.github.io/PicoAudio.github.io.
Keywords: audio generation, data simulation, temporal control, timestamp control, occurrence frequency control
1 Introduction
∗Mengyue Wu is the corresponding author.
Recently, significant progress has been made in audio generation. With the advancement of diffusion models, we can now synthesize vivid and lifelike audio segments [1, 2, 3, 4, 5]. A single model can generate universal audio, including speech, sound effects, and music [6, 7]. Some researchers are focusing on controllability, such as text-based audio editing or style transfer [8, 9], scene control for speech and sound effects [6], attributes-driven generation [10, 11], and the generation of extended, variable-length spatial music and sound [12].
Although existing models can generate sound by following instructions, when using audio generation models in content creation applications, it’s important to control timestamps and the occurrence frequencies of acoustic events precisely. Existing models overlook the temporal controllability of the timestamp, interval, duration, occurrence frequency, and relations like overlap or precedence. For example, most models struggle to produce sound occurrences accurately when given text inputs like “dog barks three times” or timestamps such as “bird chirping during 4-6 seconds”. These limitations significantly affect the models’ practical use in generating temporally-controllable audio content.
We argue that the lack of precise controllability in existing audio generation models has its roots in the following two aspects. First, the deficiency of temporal control is partially due to insufficient temporally-aligned audio-text data. The commonly utilized audio-text datasets, such as AudioCaps [13] and Clotho [14], emphasize the fidelity of sound event descriptions and the linguistic sophistication of textual content, but they lack annotations pertaining to temporal aspects. In particular, in the largest audio captioning dataset AudioCaps, the phrase “xx times”, indicating frequency, appears only in annotations. Moreover, annotations regarding timestamps are scarce. High-quality temporally-aligned audio-text data is crucial for training temporally controllable models. The more meticulously annotated the data, the better the models can learn the precise correspondence between audio outputs and temporal textual conditions, thereby achieving finer-grained control. Second, the diffusion model has limited knowledge of timestamp information. Existing diffusion-based models aim to learn the relationship between the text description and the audio events in the audio signal. Although diffusion models can understand text instructions at a high level, precise control information (e.g. “event-1 at timing-1 … and event-N at timing-N”) is not taken into consideration. This is due to the current design of these diffusion models, which does not take temporal information into account.
In this work, we propose PicoAudio, which enables Precise tImestamp and frequency COntrollability of audio events by leveraging data simulation (simulated datasets for training and evaluation are available at https://github.com/zeyuxie29/PicoAudio), tailored model designs, and preprocessing with a large language model. We focus on timestamp and frequency control, while other temporal conditions (e.g., ordering and interval) can be converted into timestamps through textual reasoning, akin to transforming frequency into timestamps in our experiment. PicoAudio proposes a pipeline to simulate data with temporally-aligned annotations. The pipeline entails crawling data from the Internet, segmenting and filtering audio clips to gather high-quality audio segments, and simulating to synthesize realistic audio. PicoAudio introduces tailored modules for temporal control. (a) Timestamp control is accomplished by incorporating a customized input, namely the timestamp caption. With the assistance of a large language model (LLM) [15], (b) frequency control, (c) ordering via multi-event timestamp control and (d) multi-event frequency control can be implemented, as shown in Figure 1. Beyond (a)-(d), PicoAudio can achieve arbitrary precise temporal control as long as the LLM is capable of converting the requirement into timestamp captions, which is straightforward for an LLM when prompted with simulated data. Our contributions encompass the following:
1. A data simulation pipeline tailored specifically for temporally controllable audio generation frameworks;
2. A timestamp and frequency controllable generation framework, enabling precise control over sound events;
3. Achieving any temporal control by integrating an LLM.
2 Temporal Controllable Model
To enable temporal control in audio generation, we first design a simulation pipeline that automatically acquires data and a tailored text processor to enhance audio generative models’ temporal awareness, as shown in Figure 2.
2.1 Temporally-aligned Data Simulation
Data crawling, segmentation & filtering
(1) Audio clips are crawled from the Internet using event tags as search keywords. These weakly annotated clips possess only sound event tags and may contain noise. (2) A text-to-audio grounding model [16] is employed to segment the crawled data, as it can locate the temporal occurrence of events based on input text. Each localized segment encompasses one occurrence of a sound event, such as a “2-second cow mooing” segment. For generality, we also define a burst of continuous short sounds as one occurrence, such as a burst of “keyboard typing” or “door knocking”. (3) To ensure data quality, a contrastive language-audio pretraining (CLAP) model [17] is utilized for further filtering. Thus, we obtain a substantial number of high-quality one-occurrence segments, serving as a one-occurrence database.
Simulation
We randomly select events from the database and synthesize audio by randomly assigning occurrence on-set, following the approach of Xu et al. [18]. The timestamp of occurrence is annotated based on the on-set and the duration recorded in the grounding results. A simulated pair comprises a synthesized audio and a timestamp caption formatted as “event-1 at timing-1 … and event-N at timing-N”, as well as a frequency caption formatted as “event-1 j times … and event-N k times”.
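The caption side of this simulation step can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `simulate_pair`, the 10-second clip length, the cap of three occurrences, and the toy database of segment durations are all assumptions, and the actual mixing of audio segments is omitted. Occurrences of the same event may overlap in this simplified version.

```python
import random

def simulate_pair(database, clip_len=10.0, max_occurrences=3, seed=None):
    """Pick events, assign random onsets, and build both caption types.

    database maps an event name to the durations (seconds) of its
    one-occurrence segments. All defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_events = rng.randint(1, min(2, len(database)))
    events = rng.sample(sorted(database), n_events)
    layout, ts_parts, freq_parts = {}, [], []
    for event in events:
        k = rng.randint(1, max_occurrences)
        spans = []
        for _ in range(k):
            dur = rng.choice(database[event])
            onset = round(rng.uniform(0.0, clip_len - dur), 1)
            spans.append((onset, round(onset + dur, 1)))
        spans.sort()
        layout[event] = spans
        # Timestamp caption: "event-1 at timing-1 ... and event-N at timing-N"
        ts_parts.append(f"{event} at " + ", ".join(f"{s:g}-{e:g}" for s, e in spans))
        # Frequency caption: "event-1 j times ... and event-N k times"
        freq_parts.append(f"{event} {k} times")
    return layout, " and ".join(ts_parts), " and ".join(freq_parts)

if __name__ == "__main__":
    toy_db = {"dog barking": [1.0, 1.5], "door knocking": [2.0]}
    print(simulate_pair(toy_db, seed=0))
```

The returned `layout` holds the on- and off-sets that a mixing step would use to place the segments into the synthesized clip.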
2.2 Text Processor
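The standard format makes rule-based transformations very straightforward. The one-hot timestamp matrix $\mathcal{O} \in \mathbb{R}^{C \times T}$ is derived from the timestamp caption, where $C$ and $T$ denote the number of sound events and the time dimension, respectively.
$$\mathcal{O}_{c,t} = \begin{cases} 1, & \text{if event } c \text{ occurs at time } t \\ 0, & \text{otherwise} \end{cases} \tag{1}$$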
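A rule-based conversion from a timestamp caption to such a matrix might look like the sketch below. The function name `caption_to_matrix`, the class list, the 10-second clip length, and the frame rate of 25 frames per second are assumptions chosen for illustration; the time resolution used in the paper is specified in the experiment setup.

```python
import re

def caption_to_matrix(caption, classes, clip_len=10, fps=25):
    """Convert 'event at s-e, s-e and event at s-e' into a C x T 0/1 matrix.

    classes fixes the row order; clip_len (s) and fps are assumed values.
    """
    n_frames = clip_len * fps
    matrix = [[0] * n_frames for _ in classes]
    for part in caption.split(" and "):
        event, spans = part.rsplit(" at ", 1)
        row = matrix[classes.index(event.strip())]
        for start, end in re.findall(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)", spans):
            lo, hi = round(float(start) * fps), round(float(end) * fps)
            # Mark every frame in [onset, offset) as active for this event.
            for t in range(lo, min(hi, n_frames)):
                row[t] = 1
    return matrix

if __name__ == "__main__":
    classes = ["dog barking", "door knocking", "door slamming"]
    m = caption_to_matrix("door knocking at 1-4 and door slamming at 6-8", classes)
    print([sum(row) for row in m])  # active frames per class
```

Splitting on " and " assumes that event names themselves do not contain that substring, which holds for the caption examples quoted in this section.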
LLMs demonstrate excellent performance in text processing tasks. Thanks to the LLM, the PicoAudio framework can handle various input formats, for example, transforming the input “a dog barking occurred between two and three seconds” into the timestamp caption format “dog barking at 2-3”.
The LLM also empowers PicoAudio with more capabilities, such as (1) controlling occurrence frequency by transforming “a dog barks three times” into “dog barking at 1-2, 3-4, 7-9”, and (2) ordering by transforming “door knocking then door slamming” into “door knocking at 1-4 and door slamming at 6-8”. The duration of each occurrence is inferred by the LLM based on its own knowledge as well as the examples provided. We supplied GPT-4 with examples from the training set for learning, yielding an initial transformation error rate of and a refined second transformation error rate of . It can be observed that the transformation is straightforward for an LLM when prompted with simulated training data.
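One way to detect a faulty transformation is to check that the timestamp caption returned by the LLM contains exactly as many occurrences per event as the frequency caption requested. The sketch below is only an illustration of such a check; the paper does not describe how its error rates were measured, and the function names are hypothetical.

```python
import re

def counts_from_frequency_caption(caption):
    """Parse 'event-1 j times and event-2 k times' into {event: count}."""
    counts = {}
    for part in caption.split(" and "):
        m = re.fullmatch(r"(.+?) (\d+) times?", part.strip())
        if m is None:
            raise ValueError(f"not a frequency caption: {part!r}")
        counts[m.group(1)] = int(m.group(2))
    return counts

def counts_from_timestamp_caption(caption):
    """Parse 'event-1 at s-e, s-e and event-2 at s-e' into {event: count}."""
    counts = {}
    for part in caption.split(" and "):
        event, spans = part.rsplit(" at ", 1)
        counts[event.strip()] = len(re.findall(r"\d+(?:\.\d+)?-\d+(?:\.\d+)?", spans))
    return counts

def is_consistent(frequency_caption, timestamp_caption):
    """True if the LLM output keeps every event and its occurrence count."""
    return (counts_from_frequency_caption(frequency_caption)
            == counts_from_timestamp_caption(timestamp_caption))

if __name__ == "__main__":
    print(is_consistent("dog barking 3 times", "dog barking at 1-2, 3-4, 7-9"))
```

A failed check could trigger a second prompt to the LLM, which is one plausible reading of the refined second transformation mentioned above.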
PicoAudio employs a CLAP model [17] to extract event information beyond timestamps, denoted as the event embedding $\mathcal{E}$. This is possible because the timestamp caption also encompasses semantic information about the sound events, which can likewise be utilized as guidance.
2.3 Audio Representation
PicoAudio employs a Variational Autoencoder (VAE) for audio representation, given the inherent difficulty in directly generating spectrograms. The VAE encoder compresses the audio spectrogram $\mathcal{A} \in \mathbb{R}^{T \times M}$ into the latent representation $\mathcal{P} \in \mathbb{R}^{T/R \times M/R \times D}$, where $T$, $M$, $R$, $D$ denote the sequence length, the number of mel bands, the compression ratio and the latent dimension, respectively. $\mathcal{P}$ is divided into two halves, $\mathcal{P}_{\mu}$ and $\mathcal{P}_{\sigma}$, representing the mean and variance in the latent space.
The VAE decoder reconstructs the spectrogram based on samples from the distribution $\mathcal{N}(\mathcal{P}_{\mu}, \mathcal{P}_{\sigma})$. The vocoder following the VAE decoder converts the spectrogram back into a waveform.
2.4 Diffusion
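PicoAudio utilizes a diffusion model to predict the latent representation $\mathcal{P}$ based on the timestamp matrix $\mathcal{O}$ and event embedding $\mathcal{E}$, since it has demonstrated excellent capabilities in audio generation [1, 2, 3, 4, 5].
The diffusion model encompasses the forward steps that transform the representation $\mathcal{P}_0 = \mathcal{P}$ into a Gaussian distribution by noise injection, followed by the reverse steps that progressively denoise. A noise schedule $\{\beta_n\}_{n=1}^{N}$ defines the Markov chain’s transition probabilities in the forward steps:
$$q(\mathcal{P}_n \mid \mathcal{P}_{n-1}) = \mathcal{N}\left(\mathcal{P}_n;\ \sqrt{1-\beta_n}\,\mathcal{P}_{n-1},\ \beta_n \mathbf{I}\right) \tag{2}$$
$$\mathcal{P}_n = \sqrt{\bar{\alpha}_n}\,\mathcal{P}_0 + \sqrt{1-\bar{\alpha}_n}\,\epsilon \tag{3}$$
where $\bar{\alpha}_n = \prod_{i=1}^{n}(1-\beta_i)$, and $\epsilon$ follows the distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. At the last step $N$, $\mathcal{P}_N$ follows isotropic Gaussian noise. The model is trained to estimate the noise $\epsilon$ based on the input $\mathcal{P}_n$ and a weight $\lambda_n$ related to the Signal-to-Noise Ratio [19]:
$$\mathcal{L} = \sum_{n=1}^{N} \lambda_n\, \mathbb{E}_{\mathcal{P}_0, \epsilon}\left\| \epsilon - \epsilon_\theta\left(\mathcal{P}_n \oplus \mathcal{O},\ \mathcal{E}\right) \right\|_2^2 \tag{4}$$
where $\oplus$ denotes concatenation, $\mathcal{E}$ is fused by a cross-attention mechanism [20], and $\epsilon_\theta$ denotes the estimation network, which can be employed to reconstruct $\mathcal{P}_0$ from $\mathcal{P}_N$ in the reverse steps with $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\sigma_n^2 = \frac{1-\bar{\alpha}_{n-1}}{1-\bar{\alpha}_n}\beta_n$:
$$\mathcal{P}_{n-1} = \frac{1}{\sqrt{1-\beta_n}}\left(\mathcal{P}_n - \frac{\beta_n}{\sqrt{1-\bar{\alpha}_n}}\,\epsilon_\theta\left(\mathcal{P}_n \oplus \mathcal{O},\ \mathcal{E}\right)\right) + \sigma_n \epsilon \tag{5}$$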
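The forward noising process of this section can be illustrated numerically with the standard closed form used by denoising diffusion models, in which a noisy latent is a weighted sum of the clean latent and Gaussian noise. The sketch below is a minimal illustration under assumptions: the linear schedule with endpoints 1e-4 and 0.02 and the 1000 steps are common defaults, not values taken from the paper, and a flat list of floats stands in for the latent.

```python
import math
import random

def linear_betas(n_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise schedule (assumed defaults, not from the paper)."""
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]

def cumulative_alpha_bar(betas):
    """alpha_bar[n-1] is the product of (1 - beta_i) for i = 1..n."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(p0, n, alpha_bar, rng):
    """Jump from the clean latent p0 directly to step n (1-indexed)."""
    a = alpha_bar[n - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in p0]
    pn = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(p0, noise)]
    return pn, noise

if __name__ == "__main__":
    alpha_bar = cumulative_alpha_bar(linear_betas(1000))
    pn, noise = q_sample([1.0, -1.0, 0.5], 500, alpha_bar, random.Random(0))
    print(alpha_bar[0], alpha_bar[-1])  # close to 1 at the start, near 0 at the end
```

During training, the returned `noise` is the regression target of the estimation network, while `pn` (concatenated with the timestamp matrix) is its input.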
3 Experiment
3.1 Data Simulation
Audio clips are crawled from Freesound (https://freesound.org/) using sound event tags as search keywords. Segmentation and filtering are conducted by a text-to-audio grounding model [21] and LAION-CLAP [17] with threshold set to and , respectively. The collection process results in a total of high-quality one-occurrence segments containing sound events. During simulation, the sound events and on-set time are randomly assigned, with the proportion of , , and occurrences for each sound event being approximately . A total of clips are simulated for training, single-event testing and multi-event testing, respectively.
Four temporal control tasks are designed: (a) single-event timestamp control using timestamp caption as input; (b) single-event frequency control using the frequency caption “xx k times” as input, which is directly fed into the baseline models. GPT-4 predicts the duration of segments and subsequently converts frequency captions into timestamp captions before feeding them into PicoAudio. (c) multi-event timestamp and (d) multi-event frequency control employ captions with multiple events.
3.2 Experiment Setup
The time resolution in the timestamp matrix is set to ms, which implies that temporal control can be achieved with precision at the millisecond level. LAION-CLAP [17] is utilized as the event embedding extractor. PicoAudio adopts a pre-trained VAE model following Liu et al. [8]. The diffusion model employs a structure similar to Ghosal et al. [5] but with fewer parameters, with attention dimensions , block channels , and input channels ( for the timestamp matrix). A HiFi-GAN vocoder is used to transform the spectrogram back into a waveform.
PicoAudio is trained for epochs with a learning rate set to and decreasing according to a linear decay scheduler. The VAE, LAION-CLAP and HiFi-GAN vocoder are frozen during training. The AdamW optimizer is utilized. During inference, the classifier-free guidance scale is set to [22, 23].
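For readers unfamiliar with classifier-free guidance, the sketch below shows one common way the guidance scale combines a conditional and an unconditional noise estimate. The formula follows a widely used convention in which a scale of 1 recovers the conditional estimate; whether PicoAudio uses exactly this convention is an assumption, and the plain lists stand in for network outputs.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional towards the conditional estimate.

    scale = 0 returns eps_uncond, scale = 1 returns eps_cond, and larger
    values push the estimate further in the direction of the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

if __name__ == "__main__":
    print(classifier_free_guidance([0.0, 1.0], [1.0, 3.0], 3.0))
```

In a sampler, the guided estimate would replace the raw network output at every reverse step.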
Table 1
Condition | Model | Timestamp: F1 | Timestamp: MOS (control) | Timestamp: FAD | Timestamp: MOS (quality) | Frequency: error | Frequency: MOS (control) | Frequency: FAD | Frequency: MOS (quality)
---|---|---|---|---|---|---|---|---|---
Single Event | Ground Truth | 0.797 | 4.78 | 0 | 4.44 | 0.302 | 4.9 | 0 | 4.38
 | AudioLDM2 | 0.675 | 2.14 | 10.853 | 3.34 | 2.408 | 2.3 | 20.677 | 3.68
 | Amphion | 0.566 | 1.98 | 11.774 | 2.82 | 2.060 | 2.22 | 11.999 | 3.54
 | PicoAudio w/o T | 0.694 | 2.78 | 5.926 | 4.2 | 1.25 | 2.92 | 5.923 | 4.2
 | PicoAudio (Ours) | 0.783 | 4.58 | 3.175 | 4.16 | 0.537 | 4.92 | 2.295 | 4.1
Multiple Events | Ground Truth | 0.787 | 4.6 | 0 | 4.38 | 0.447 | 4.68 | 0 | 4.56
 | AudioLDM2 | 0.593 | 1.82 | 10.112 | 2.36 | 2.046 | 2.14 | 18.334 | 2.3
 | Amphion | 0.520 | 2.2 | 10.979 | 2.72 | 1.851 | 2.48 | 11.769 | 3.24
 | PicoAudio w/o T | 0.614 | 2.12 | 5.218 | 3.42 | 1.216 | 2.1 | 5.215 | 3.3
 | PicoAudio (Ours) | 0.772 | 4.84 | 2.863 | 4.12 | 0.713 | 4.6 | 2.1823 | 4.38
3.3 Evaluation
Both subjective and objective evaluation metrics are introduced to conduct comprehensive assessments.
Subjective
The Mean Opinion Score (MOS) is collected from two perspectives: audio quality and temporal controllability. Audio quality considers the naturalness, distortion, and event accuracy of the generated audio. Temporal controllability evaluates the accuracy of timestamp / frequency control. For each task, audio clips from each model are rated by evaluators, and the mean score is calculated. All evaluators are screened for hearing loss, have received university-level education at prestigious universities, and use designated headphones.
Objective
The FAD, commonly used in audio generation tasks, is utilized to assess the quality of generated audio [24]. The temporal condition in the timestamp / frequency caption is used as the ground truth for evaluation. A grounding model [21] is employed to detect the on- and off-sets of segments in generated audio. (a) For the timestamp control task, the accuracy of the detected segments is assessed by the segment F1 score [25], a commonly used metric in sound event detection. (b) For the frequency control task, accuracy is measured by the absolute difference between the specified frequency in the caption and the detected frequency in the audio. The difference is averaged over the $N$ test samples and the $C$ sound event classes, denoted as $L_1^{\text{freq}}$:
$$L_1^{\text{freq}} = \frac{1}{N C} \sum_{i=1}^{N} \sum_{c=1}^{C} \left| f_{i,c} - \hat{f}_{i,c} \right| \tag{6}$$
where $f_{i,c}$ and $\hat{f}_{i,c}$ are the specified and detected occurrence frequencies of class $c$ in sample $i$.
Simulated audio clips in the test set are utilized as the ground truth to obtain an objective upper bound, since the grounding model cannot detect and localize audio events with perfect accuracy.
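The two objective controllability metrics can be sketched as below. This is an illustrative approximation under assumptions, not the evaluation code behind the reported numbers: the segment length of 1 second, the 10-second clip length, and the choice to average the frequency error over a fixed class list are all assumed, and the detected spans would in practice come from the grounding model.

```python
import math

def frequency_error(specified, detected, classes):
    """Mean absolute difference of occurrence counts over samples and classes.

    specified / detected are lists of {event: count} dicts, one per sample.
    """
    total = 0
    for spec, det in zip(specified, detected):
        for c in classes:
            total += abs(spec.get(c, 0) - det.get(c, 0))
    return total / (len(specified) * len(classes))

def active_segments(spans, clip_len, seg_len):
    """Mark each fixed-length segment that overlaps any (onset, offset) span."""
    n = math.ceil(clip_len / seg_len)
    return [any(s < (i + 1) * seg_len and e > i * seg_len for s, e in spans)
            for i in range(n)]

def segment_f1(reference, estimate, clip_len=10.0, seg_len=1.0):
    """Segment-based F1 over all classes; inputs map event -> list of spans."""
    tp = fp = fn = 0
    for c in set(reference) | set(estimate):
        ref = active_segments(reference.get(c, []), clip_len, seg_len)
        est = active_segments(estimate.get(c, []), clip_len, seg_len)
        for r, e in zip(ref, est):
            tp += r and e
            fp += (not r) and e
            fn += r and (not e)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

if __name__ == "__main__":
    print(frequency_error([{"dog": 3}], [{"dog": 5}], ["dog"]))
    print(segment_f1({"dog": [(1, 3)]}, {"dog": [(2, 4)]}))
```

Applying the same functions to the simulated test audio instead of generated audio gives the kind of upper bound described above, since detection errors of the grounding model then appear as a non-zero error even for perfectly placed events.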
4 Result
Timestamp control and frequency control are evaluated separately on both single-event and multiple-event test sets. The results are presented in Table 1. Two mainstream audio generation models, AudioLDM2 [3] and Amphion [26, 9], are employed as baselines. Both subjective and objective metrics demonstrate that PicoAudio surpasses the baseline models.
4.1 Timestamp & Occurrence Frequency Control
The timestamp-controlled audio generated by PicoAudio is very close to the ground truth (upper bound), demonstrating the precision of control in both single-event and multi-event tasks. PicoAudio introduces tailored modules to convert the textual timestamp information into a timestamp matrix, achieving exact control of timestamp in the generated audio at a time resolution of ms. Equipped with prompted GPT-4, PicoAudio demonstrates outstanding performance in the frequency error metric. Even in the presence of grounding detection omission errors, it achieves an average error of 0.537 / 0.713 occurrences per sound event on the single-event / multi-event tasks, respectively (Table 1). Achieving less than , akin to the ground truth, implies that PicoAudio has demonstrated practicality in frequency control.
Mainstream generative baseline models, however, fall short in performance. They obtain lower F1 scores and produce a frequency error of roughly 2 occurrences per event (Table 1), as they tend to excessively replicate events when faced with temporal conditions. Furthermore, the ablation study employs a model trained on simulated data without using the timestamp matrix (“PicoAudio w/o T” in Table 1), which shares a similar framework with the baseline models. The ablation results lie between the baseline models and PicoAudio, indicating that achieving precise control requires not only temporally-aligned audio-text data but also specific model design.
4.2 Arbitrary Temporal Control Capabilities
With the powerful text processing capabilities of the LLM, PicoAudio’s precise timestamp control capability opens up a wide range of possibilities for temporal control. For instance, for temporal interval and duration control, expressions like “dog barks three times, with a 2-second interval / duration each time” can be transformed into single-event timestamp control. For event ordering, phrases like “dog barks then gunshot” can be transformed into multi-event timestamp control. Converting temporal control requirements into the timestamp caption format is straightforward for GPT-4 after being prompted. Therefore, PicoAudio can achieve arbitrary precise temporal control, provided the requirement can be expressed as timestamps.
However, due to constraints imposed by the audio sources, PicoAudio can currently exercise temporal control over only a limited number of events. Expanding the quantity of events and achieving comprehensive control beyond the temporal dimension are among our future research directions.
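To make the reduction to timestamp captions concrete, the sketch below converts an interval requirement and an ordering requirement into timestamp captions with fixed rules. In PicoAudio this conversion is performed by a prompted LLM, which also infers plausible durations; the rule-based functions, their names, and the default start time and gap are hypothetical stand-ins for illustration only.

```python
def interval_to_caption(event, times, duration, interval, start=0.0):
    """'event k times, with a given interval and duration' -> timestamp caption."""
    spans, t = [], start
    for _ in range(times):
        spans.append((t, t + duration))
        t += duration + interval
    return f"{event} at " + ", ".join(f"{s:g}-{e:g}" for s, e in spans)

def order_to_caption(events_with_durations, gap=1.0, start=0.0):
    """'event-1 then event-2 ...' -> multi-event timestamp caption."""
    parts, t = [], start
    for event, duration in events_with_durations:
        parts.append(f"{event} at {t:g}-{t + duration:g}")
        t += duration + gap
    return " and ".join(parts)

if __name__ == "__main__":
    print(interval_to_caption("dog barking", 3, duration=1, interval=2, start=1))
    # dog barking at 1-2, 4-5, 7-8
    print(order_to_caption([("door knocking", 3), ("door slamming", 2)], gap=2, start=1))
    # door knocking at 1-4 and door slamming at 6-8
```

Both outputs follow the timestamp caption format used throughout the paper, so they could be fed to the text processor unchanged.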
4.3 Audio Quality
Both the subjective metric MOS and the objective metric FAD demonstrate that PicoAudio outperforms the baseline models. On the one hand, PicoAudio benefits from having both the training and test sets derived from simulated data, whereas the baseline models have not been trained on such data. On the other hand, as mentioned earlier, baseline models tend to excessively replicate events when confronted with temporal control, leading to significant discrepancies with the distribution of the test set. The ablation experiment demonstrates that solely employing mainstream baseline frameworks trained on simulated data yields limited improvements in audio quality. Timestamp information helps the model better discern the distribution of audio.
5 Conclusion
Significant progress has been made in audio generation tasks, but performance in terms of temporal control remains subpar, primarily due to the lack of datasets with fine-grained annotations and of specific model designs. PicoAudio addresses this issue by acquiring data with fine-grained timestamp annotation through web crawling, segmentation, filtering and simulation. In terms of model design, PicoAudio utilizes tailored modules to handle temporal information. It converts captions into one-hot matrices, assisting the diffusion model in achieving ms-level control over timestamps. In evaluations encompassing controllability and quality, PicoAudio outperforms mainstream models in both subjective and objective metrics. With the support of GPT-4’s powerful text processing capabilities, PicoAudio can achieve a variety of temporal control capabilities, including frequency control, interval control, event ordering, etc. PicoAudio is currently limited to controlling a small number of events, and addressing this limitation is a direction for our future work.
References
- [1] F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y. Taigman, and Y. Adi, “Audiogen: Textually guided audio generation,” in The Eleventh International Conference on Learning Representations, 2022.
- [2] D. Yang, J. Yu, H. Wang, W. Wang, C. Weng, Y. Zou, and D. Yu, “Diffsound: Discrete diffusion model for text-to-sound generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
- [3] H. Liu, Q. Tian, Y. Yuan, X. Liu, X. Mei, Q. Kong, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,” arXiv preprint arXiv:2308.05734, 2023.
- [4] J. Huang, Y. Ren, R. Huang, D. Yang, Z. Ye, C. Zhang, J. Liu, X. Yin, Z. Ma, and Z. Zhao, “Make-an-audio 2: Temporal-enhanced text-to-audio generation,” arXiv preprint arXiv:2305.18474, 2023.
- [5] D. Ghosal, N. Majumder, A. Mehrish, and S. Poria, “Text-to-audio generation using instruction-tuned llm and latent diffusion model,” arXiv preprint arXiv:2304.13731, 2023.
- [6] A. Vyas, B. Shi, M. Le, A. Tjandra, Y.-C. Wu, B. Guo, J. Zhang, X. Zhang, R. Adkins, W. Ngan et al., “Audiobox: Unified audio generation with natural language prompts,” arXiv preprint arXiv:2312.15821, 2023.
- [7] D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, X. Chang, J. Shi, S. Zhao, J. Bian, X. Wu et al., “Uniaudio: An audio foundation model toward universal audio generation,” arXiv preprint arXiv:2310.00704, 2023.
- [8] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “Audioldm: Text-to-audio generation with latent diffusion models,” in International Conference on Machine Learning. PMLR, 2023, pp. 21450–21474.
- [9] Y. Wang, Z. Ju, X. Tan, L. He, Z. Wu, J. Bian et al., “Audit: Audio editing by following instructions with latent diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [10] Y. Chung, J. Lee, and J. Nam, “T-foley: A controllable waveform-domain diffusion model for temporal-event-guided foley sound synthesis,” arXiv preprint arXiv:2401.09294, 2024.
- [11] Z. Guo, J. Mao, R. Tao, L. Yan, K. Ouchi, H. Liu, and X. Wang, “Audio generation with multiple conditional diffusion model,” arXiv preprint arXiv:2308.11940, 2023.
- [12] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” arXiv preprint arXiv:2402.04825, 2024.
- [13] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 119–132.
- [14] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 736–740.
- [15] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
- [16] X. Xu, H. Dinkel, M. Wu, and K. Yu, “Text-to-audio grounding: Building correspondence between captions and sound events,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 606–610.
- [17] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023.
- [18] X. Xu, X. Xu, Z. Xie, P. Zhang, M. Wu, and K. Yu, “A detailed audio-text data simulation pipeline using single-event sounds,” arXiv preprint arXiv:2403.04594, 2024.
- [19] T. Hang, S. Gu, C. Li, J. Bao, D. Chen, H. Hu, X. Geng, and B. Guo, “Efficient diffusion training via min-snr weighting strategy,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2023, pp. 7407–7417.
- [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [21] X. Xu, Z. Ma, M. Wu, and K. Yu, “Towards weakly supervised text-to-audio grounding,” arXiv preprint arXiv:2401.02584, 2024.
- [22] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- [23] A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” in International Conference on Machine Learning. PMLR, 2022, pp. 16784–16804.
- [24] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in INTERSPEECH, 2019, pp. 2350–2354.
- [25] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016.
- [26] X. Zhang, L. Xue, Y. Gu, Y. Wang, H. He, C. Wang, X. Chen, Z. Fang, H. Chen, J. Zhang, T. Y. Tang, L. Zou, M. Wang, J. Han, K. Chen, H. Li, and Z. Wu, “Amphion: An open-source audio, music and speech generation toolkit,” arXiv preprint arXiv:2312.09911, 2024.