
Audio Mixing Inversion via Embodied Self-supervised Learning

  • Research Article
  • Published in Machine Intelligence Research

Abstract

Audio mixing is a crucial part of music production. For analyzing or recreating mixes, it is important to estimate the mixing parameters used to create a mixdown from the raw recordings, i.e., to perform audio mixing inversion. However, approaches to audio mixing inversion have rarely been explored. This paper presents a method for estimating mixing parameters from raw tracks and a stereo mixdown via embodied self-supervised learning, covering several commonly used audio effects: gain, pan, equalization, reverb, and compression. The method learns an inference neural network that takes a stereo mixdown and the raw audio sources as input and estimates the mixing parameters used to create the mixdown, by alternating between a sampling step and a training step. In the sampling step, the inference network predicts a set of mixing parameters, which is sampled and fed to an audio-processing framework to generate audio data; in the training step, the same network is optimized on the data generated in the sampling step. In this way the mixing process is modeled explicitly and interpretably rather than with a black-box neural network. Evaluation with a set of objective measures shows that the method outperforms current state-of-the-art methods.
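To make the iterative procedure described above more concrete, the following is a minimal, hypothetical sketch of the sampling-and-training loop in Python. The network architecture (`MixInferenceNet`), the renderer (`render_mix`, here reduced to per-track gain), the noise-based sampling, and all parameter shapes are illustrative assumptions, not the authors' implementation; in the paper the mixdown is rendered with a full effects chain (gain, pan, equalization, reverb, compression) through an external audio-processing framework.

```python
# Minimal sketch (not the authors' implementation) of an embodied
# self-supervised sampling-and-training loop. All names are hypothetical.
import torch
import torch.nn as nn

class MixInferenceNet(nn.Module):
    """Hypothetical network: maps (raw tracks, stereo mixdown) to mixing parameters."""
    def __init__(self, n_tracks: int, n_params_per_track: int, n_samples: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear((n_tracks + 2) * n_samples, 512),
            nn.ReLU(),
            nn.Linear(512, n_tracks * n_params_per_track),
            nn.Sigmoid(),  # parameters normalized to [0, 1]
        )
        self.shape = (n_tracks, n_params_per_track)

    def forward(self, tracks, mixdown):
        # tracks: (batch, n_tracks, n_samples); mixdown: (batch, 2, n_samples)
        x = torch.cat([tracks.flatten(1), mixdown.flatten(1)], dim=1)
        return self.encoder(x).view(-1, *self.shape)

def render_mix(tracks, params):
    """Placeholder for the audio-processing framework; only per-track gain here."""
    gains = params[..., :1]                                  # first parameter as gain
    mono = (tracks * gains).sum(dim=1, keepdim=True)         # (batch, 1, n_samples)
    return mono.repeat(1, 2, 1)                              # duplicate to stereo

def training_iteration(net, optimizer, tracks, target_mix, noise_scale=0.1):
    # Sampling step: predict parameters, perturb (sample) them, render new mixes.
    with torch.no_grad():
        pred = net(tracks, target_mix)
        sampled = (pred + noise_scale * torch.randn_like(pred)).clamp(0.0, 1.0)
        synthetic_mix = render_mix(tracks, sampled)
    # Training step: the same network learns to recover the sampled parameters
    # from the raw tracks and the mixdown rendered from them.
    optimizer.zero_grad()
    estimate = net(tracks, synthetic_mix)
    loss = nn.functional.mse_loss(estimate, sampled)
    loss.backward()
    optimizer.step()
    return loss.item()
```

A full run would repeat this iteration over batches of raw tracks; at inference time the trained network is applied once to the target mixdown and its sources to estimate the mixing parameters, consistent with the procedure the abstract describes.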



Acknowledgements

This work was supported by the High-grade, Precision and Advanced Discipline Construction Project of Beijing Universities, the Major Projects of the National Social Science Fund of China (No. 21ZD19), and the National Culture and Tourism Technological Innovation Engineering Project of China.

Author information

Corresponding author

Correspondence to Xihong Wu.

Ethics declarations

The authors declare that they have no conflict of interest related to this work.

Additional information

Colored figures are available in the online version at https://link.springer.com/journal/11633

Haotian Zhou received the B.Sc. degree in computer science from Peking University, China in 2018, and the M.Sc. degree in electronic and computer engineering from the University of Rochester, USA in 2020. He is currently a Ph.D. candidate in AI music and music information technology at the Central Conservatory of Music, China.

His research interests include audio and music signal processing, artificial intelligence for music, and automatic music mixing.

Feng Yu received the B.A. and M.A. degrees in conducting from the Central Conservatory of Music, China in 1988 and 1991, respectively, and the highest Artist Diploma in conducting from the Academy of Music Hanns Eisler Berlin, Germany in 1996. He is a former president of the China National Opera House, and is now the president and a professor of the Central Conservatory of Music, China.

His research interests include artificial intelligence for music, automatic music generation, and human-computer interaction for music.

Xihong Wu received the Ph.D. degree in computer science from the Department of Radio Electronics, Peking University, China in 1995. He is currently a professor with the School of Intelligence and Technology, Peking University, and with the Department of AI Music and Music Information Technology, Central Conservatory of Music, China.

His research interests include artificial intelligence in music, machine perception, computational auditory scene analysis, speech signal processing and natural language processing.


Cite this article

Zhou, H., Yu, F. & Wu, X. Audio Mixing Inversion via Embodied Self-supervised Learning. Mach. Intell. Res. 21, 55–62 (2024). https://doi.org/10.1007/s11633-023-1441-9


Keywords