Abstract
Audio mixing is a crucial part of music production. Estimating the mixing parameters used to create a mixdown from music recordings, i.e., audio mixing inversion, is therefore of great importance for analyzing or recreating a mix. However, approaches to audio mixing inversion have rarely been explored. This paper presents a method for estimating mixing parameters from raw tracks and a stereo mixdown via embodied self-supervised learning. Several commonly used audio effects, including gain, pan, equalization, reverb, and compression, are taken into consideration. The method learns an inference neural network that takes a stereo mixdown and the raw audio sources as input and estimates the mixing parameters used to create the mixdown, by iteratively alternating between a sampling step and a training step. In the sampling step, the inference network predicts a set of mixing parameters, which are sampled and fed to an audio-processing framework to generate audio data for the training step. In the training step, the same network is optimized on the sampled data produced in the sampling step. In this way, the mixing process is modeled explicitly and interpretably rather than with a black-box neural network. A set of objective measures is used for evaluation, and experimental results show that the method outperforms current state-of-the-art approaches.
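The sampling/training alternation described above can be sketched in a deliberately simplified form. The snippet below is a toy illustration, not the authors' implementation: it assumes a single mono source, reduces the mixing chain to gain and pan (omitting the paper's equalization, reverb, and compression), and replaces the inference neural network with a linear least-squares model fit on hand-crafted loudness features. Only the overall scheme follows the paper: sample parameters around the current estimate, render mixdowns with them, then retrain the inference model on the sampled (mixdown, parameters) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_mix(tracks, params):
    """Toy mixing 'framework': per-track gain (in dB) and constant-power pan,
    summed into a stereo mixdown. The full chain in the paper also applies
    equalization, reverb, and compression."""
    mix = np.zeros((2, tracks.shape[1]))
    for track, (gain_db, pan) in zip(tracks, params):
        g = 10.0 ** (gain_db / 20.0)
        theta = (pan + 1.0) * np.pi / 4.0      # map pan in [-1, 1] to [0, pi/2]
        mix[0] += g * np.cos(theta) * track    # left channel
        mix[1] += g * np.sin(theta) * track    # right channel
    return mix

def features(mix, source):
    """Loudness-ratio features relating the mixdown to the raw source."""
    s = np.sqrt(np.mean(source ** 2)) + 1e-8
    return np.array([np.sqrt(np.mean(mix[0] ** 2)) / s,
                     np.sqrt(np.mean(mix[1] ** 2)) / s,
                     1.0])                     # constant term acts as a bias

def infer(W, mix, source):
    """Stand-in 'inference network': a linear map from features to parameters."""
    return W @ features(mix, source)           # -> (gain_db, pan) estimate

# The mixdown whose (unknown to the model) parameters we want to invert.
source = rng.standard_normal(1024)
target_mix = render_mix(source[None, :], [(-3.0, 0.4)])

W = np.zeros((2, 3))                           # untrained inference model
for _ in range(5):
    # Sampling step: perturb the current estimate, render the resulting
    # mixdowns, and collect (features, parameters) pairs as training data.
    X, Y = [], []
    for _ in range(64):
        p = infer(W, target_mix, source) + rng.normal(0.0, 1.0, size=2)
        p[1] = np.clip(p[1], -1.0, 1.0)        # keep pan in its valid range
        m = render_mix(source[None, :], [tuple(p)])
        X.append(features(m, source))
        Y.append(p)
    # Training step: refit the inference model on the sampled data.
    W = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)[0].T

est = infer(W, target_mix, source)             # estimated (gain_db, pan)
```

Because each round samples mixdowns near the current estimate, the inference model is retrained on data concentrated around the target mixdown; in the paper, this alternation is carried out with a neural network and the full effects chain rather than the linear stand-in above.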
Acknowledgements
This work was supported by the High-grade, Precision and Advanced Discipline Construction Project of Beijing Universities, the Major Projects of the National Social Science Fund of China (No. 21ZD19), and the National Culture and Tourism Technological Innovation Engineering Project of China.
Ethics declarations
The authors declare that they have no conflict of interest in this work.
Additional information
Colored figures are available in the online version at https://link.springer.com/journal/11633
Haotian Zhou received the B.Sc. degree in computer science from Peking University, China in 2018, and the M.Sc. degree in electronic and computer engineering from the University of Rochester, USA in 2020. He is currently a Ph.D. candidate in AI music and music information technology at the Central Conservatory of Music, China.
His research interests include audio and music signal processing, artificial intelligence for music, and automatic music mixing.
Feng Yu received the B.A. and M.A. degrees in conducting from the Central Conservatory of Music, China in 1988 and 1991, respectively, and the highest Artist Diploma in conducting from the Academy of Music Hanns Eisler Berlin, Germany in 1996. He is a former president of the China National Opera House, and is now the president of and a professor at the Central Conservatory of Music, China.
His research interests include artificial intelligence for music, automatic music generation, and human-computer interaction for music.
Xihong Wu received the Ph.D. degree in computer science from the Department of Radio Electronics, Peking University, China in 1995. He is currently a professor with the School of Intelligence Science and Technology, Peking University, and with the Department of AI Music and Music Information Technology, Central Conservatory of Music, China.
His research interests include artificial intelligence in music, machine perception, computational auditory scene analysis, speech signal processing and natural language processing.
About this article
Cite this article
Zhou, H., Yu, F. & Wu, X. Audio Mixing Inversion via Embodied Self-supervised Learning. Mach. Intell. Res. 21, 55–62 (2024). https://doi.org/10.1007/s11633-023-1441-9