
Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model

Published: 01 September 2024

Abstract

Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that generates music to match a provided video. We first curate a unique collection of music videos, then analyze them to obtain semantic, scene offset, motion, and emotion features. These distinct features are then employed as guiding input to our music generation model. We transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. This model includes a novel mechanism to enforce affective similarity between video and music. Finally, post-processing uses a biGRU-based regression model to estimate note density and loudness from the video features. This ensures a dynamic rendering of the generated chords with varying rhythm and volume. In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion. The musical quality, along with the quality of music-video matching, is confirmed in a user study. The proposed AMT model, along with the new MuVi-Sync dataset, presents a promising step for the new task of music generation for videos.
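To make the pipeline described in the abstract more concrete, the following minimal sketch shows how a decoder-style Affective Multimodal Transformer could condition chord generation on per-frame video features (semantic, scene offset, motion, emotion). All module names, feature dimensions, and the chord vocabulary size are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AffectiveMultimodalTransformer(nn.Module):
    """Decoder-style transformer that predicts the next chord token from
    previously generated chords, cross-attending to projected video features.
    Dimensions and vocabulary size are placeholder assumptions."""

    def __init__(self, video_feat_dim=776, d_model=512, n_heads=8,
                 n_layers=6, chord_vocab_size=160):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # fuse concatenated video features
        self.chord_emb = nn.Embedding(chord_vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, chord_vocab_size)

    def forward(self, video_feats, chord_tokens):
        # video_feats: (B, T_video, video_feat_dim); chord_tokens: (B, T_chord)
        memory = self.video_proj(video_feats)
        tgt = self.chord_emb(chord_tokens)
        T = chord_tokens.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(hidden)  # (B, T_chord, chord_vocab_size) next-chord logits


# Toy usage with random tensors standing in for extracted features.
model = AffectiveMultimodalTransformer()
video = torch.randn(2, 30, 776)           # 30 video frames of concatenated features
chords = torch.randint(0, 160, (2, 16))   # 16 chord tokens generated so far
print(model(video, chords).shape)         # torch.Size([2, 16, 160])
```

At generation time, such a decoder would be run autoregressively, sampling one chord per step while attending to the full video-feature sequence; the original model additionally enforces affective similarity between video and music, which is omitted here.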


Highlights

Pioneering generative music AI model with video emotion matching.
New MuVi-Sync dataset with matched video and music features.
Video2Music framework with Affective Multimodal Transformer.
Post-processing to adjust music dynamics to sync with video (see the sketch after this list).
Outperforms baseline in terms of Music Quality and Video Matching.
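As a companion to the post-processing highlight above, here is a hedged sketch of a bidirectional GRU regressor that maps video features to per-frame note density and loudness, so the generated chords can be rendered with varying rhythm and volume. Layer sizes and feature dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class BiGRURegressor(nn.Module):
    """Bidirectional GRU that regresses note density and loudness from
    video features; used here only as an illustrative stand-in."""

    def __init__(self, video_feat_dim=776, hidden_size=128):
        super().__init__()
        self.gru = nn.GRU(video_feat_dim, hidden_size,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_size, 2)  # [note_density, loudness] per frame

    def forward(self, video_feats):
        out, _ = self.gru(video_feats)  # (B, T, 2 * hidden_size)
        return self.head(out)           # (B, T, 2)


regressor = BiGRURegressor()
pred = regressor(torch.randn(2, 30, 776))  # toy batch of video-feature sequences
print(pred.shape)                          # torch.Size([2, 30, 2])
```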



Published In

Expert Systems with Applications: An International Journal, Volume 249, Issue PC
September 2024, 1587 pages

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 01 September 2024

Author Tags

  1. Generative AI
  2. Music generation
  3. Transformer
  4. Multimodal
  5. Affective computing
  6. Music video matching

Qualifiers

  • Research-article

