Abstract
Audio mixing is a crucial part of music production. Estimating the mixing parameters used to create a mixdown from music recordings, i.e., audio mixing inversion, is therefore of great importance for analyzing or recreating a mix. However, approaches to audio mixing inversion have rarely been explored. This paper presents a method for estimating mixing parameters from raw tracks and a stereo mixdown via embodied self-supervised learning. Several commonly used audio effects, including gain, pan, equalization, reverb, and compression, are taken into consideration. The method learns an inference neural network that takes a stereo mixdown and the raw audio sources as input and estimates the mixing parameters used to create the mixdown, by iteratively alternating between a sampling step and a training step. In the sampling step, the inference network predicts a set of mixing parameters, which are sampled and fed to an audio-processing framework to generate audio data for the training step. In the training step, the same network is optimized on the sampled data produced in the sampling step. In this way, the mixing process is modeled explicitly and interpretably rather than with a black-box neural network. A set of objective measures is used for evaluation, and experimental results show that the method outperforms current state-of-the-art approaches.
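The sampling/training alternation described above can be sketched in a deliberately simplified form. The snippet below is a toy illustration, not the authors' implementation: it assumes a single mono source, reduces the mixing chain to gain and pan (omitting the paper's equalization, reverb, and compression), and replaces the inference neural network with a linear least-squares model fit on hand-crafted loudness features. Only the overall scheme follows the paper: sample parameters around the current estimate, render mixdowns with them, then retrain the inference model on the sampled (mixdown, parameters) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_mix(tracks, params):
    """Toy mixing 'framework': per-track gain (in dB) and constant-power pan,
    summed into a stereo mixdown. The full chain in the paper also applies
    equalization, reverb, and compression."""
    mix = np.zeros((2, tracks.shape[1]))
    for track, (gain_db, pan) in zip(tracks, params):
        g = 10.0 ** (gain_db / 20.0)
        theta = (pan + 1.0) * np.pi / 4.0      # map pan in [-1, 1] to [0, pi/2]
        mix[0] += g * np.cos(theta) * track    # left channel
        mix[1] += g * np.sin(theta) * track    # right channel
    return mix

def features(mix, source):
    """Loudness-ratio features relating the mixdown to the raw source."""
    s = np.sqrt(np.mean(source ** 2)) + 1e-8
    return np.array([np.sqrt(np.mean(mix[0] ** 2)) / s,
                     np.sqrt(np.mean(mix[1] ** 2)) / s,
                     1.0])                     # constant term acts as a bias

def infer(W, mix, source):
    """Stand-in 'inference network': a linear map from features to parameters."""
    return W @ features(mix, source)           # -> (gain_db, pan) estimate

# The mixdown whose (unknown to the model) parameters we want to invert.
source = rng.standard_normal(1024)
target_mix = render_mix(source[None, :], [(-3.0, 0.4)])

W = np.zeros((2, 3))                           # untrained inference model
for _ in range(5):
    # Sampling step: perturb the current estimate, render the resulting
    # mixdowns, and collect (features, parameters) pairs as training data.
    X, Y = [], []
    for _ in range(64):
        p = infer(W, target_mix, source) + rng.normal(0.0, 1.0, size=2)
        p[1] = np.clip(p[1], -1.0, 1.0)        # keep pan in its valid range
        m = render_mix(source[None, :], [tuple(p)])
        X.append(features(m, source))
        Y.append(p)
    # Training step: refit the inference model on the sampled data.
    W = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)[0].T

est = infer(W, target_mix, source)             # estimated (gain_db, pan)
```

Because each round samples mixdowns near the current estimate, the inference model is retrained on data concentrated around the target mixdown; in the paper, this alternation is carried out with a neural network and the full effects chain rather than the linear stand-in above.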
Acknowledgements
This work was supported by the High-grade, Precision and Advanced Discipline Construction Project of Beijing Universities, the Major Projects of the National Social Science Fund of China (No. 21ZD19), and the National Culture and Tourism Technological Innovation Engineering Project of China.
Ethics declarations
The authors declare that they have no conflict of interest in this work.
Additional information
Colored figures are available in the online version at https://link.springer.com/journal/11633
Haotian Zhou received the B.Sc. degree in computer science from Peking University, China in 2018, and the M.Sc. degree in electronic and computer engineering from the University of Rochester, USA in 2020. He is currently a Ph.D. candidate in AI music and music information technology at the Central Conservatory of Music, China.
His research interests include audio and music signal processing, artificial intelligence for music, and automatic music mixing.
Feng Yu received the B.A. and M.A. degrees in conducting from the Central Conservatory of Music, China in 1988 and 1991, respectively, and the highest Artist Diploma in conducting from the Academy of Music Hanns Eisler Berlin, Germany in 1996. He is a former president of the China National Opera House, and is now the president of and a professor at the Central Conservatory of Music, China.
His research interests include artificial intelligence for music, automatic music generation, and human-computer interaction for music.
Xihong Wu received the Ph.D. degree in computer science from the Department of Radio Electronics, Peking University, China in 1995. He is currently a professor with the School of Intelligence Science and Technology, Peking University, and with the Department of AI Music and Music Information Technology, Central Conservatory of Music, China.
His research interests include artificial intelligence in music, machine perception, computational auditory scene analysis, speech signal processing and natural language processing.
About this article
Cite this article
Zhou, H., Yu, F. & Wu, X. Audio Mixing Inversion via Embodied Self-supervised Learning. Mach. Intell. Res. 21, 55–62 (2024). https://doi.org/10.1007/s11633-023-1441-9