research-article

Will You Ever Become Popular? Learning to Predict Virality of Dance Clips

Published: 16 February 2022

    Abstract

    Dance challenges frequently go viral in video communities such as TikTok. Once a challenge becomes popular, thousands of short-form videos are uploaded within a couple of days. Virality prediction for dance challenges is therefore of great commercial value and has a wide range of applications, such as smart recommendation and popularity promotion. In this article, we propose a novel multi-modal framework that integrates skeletal, holistic appearance, facial, and scenic cues for comprehensive dance virality prediction. To model body movements, we propose a pyramidal skeleton graph convolutional network (PSGCN) that hierarchically refines spatio-temporal skeleton graphs. Meanwhile, we introduce a relational temporal convolutional network (RTCN) to exploit appearance dynamics with non-local temporal relations. Finally, an attentive fusion approach adaptively aggregates the predictions from the different modalities. To validate our method, we introduce a large-scale viral dance video (VDV) dataset, which contains over 4,000 dance clips from eight viral dance challenges. Extensive experiments on the VDV dataset demonstrate the effectiveness of our approach. Furthermore, we show that short-video applications such as multi-dimensional recommendation and action feedback can be derived from our model.
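
    The abstract says an attentive fusion step adaptively aggregates per-modality predictions, but it does not give the formulation. The sketch below is only one plausible reading, not the authors' implementation. It assumes each of the four modalities yields a scalar virality score and that the fused score is a softmax-weighted sum of those scores. The names `MODALITIES`, `softmax`, and `attentive_fusion` are hypothetical, and the attention logits are passed in by hand here, whereas a real model would presumably predict them from features.

```python
import math

# Hypothetical modality names taken from the cues listed in the abstract.
MODALITIES = ("skeletal", "appearance", "facial", "scenic")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_fusion(scores, attention_logits):
    """Fuse per-modality scores with softmax attention weights.

    Assumption: fusion is a convex combination of scalar scores.
    Returns the fused score and the weights that produced it.
    """
    if len(scores) != len(attention_logits):
        raise ValueError("scores and attention_logits must have equal length")
    weights = softmax(attention_logits)
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused, weights


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the article.
    scores = [0.2, 0.4, 0.6, 0.8]
    logits = [0.0, 0.0, 0.0, 0.0]  # equal logits reduce to a plain average
    fused, weights = attentive_fusion(scores, logits)
    for name, w in zip(MODALITIES, weights):
        print(f"{name}: weight={w:.2f}")
    print(f"fused score: {fused:.2f}")
```

    With equal logits the fusion reduces to a plain average of the modality scores. Raising one logit shifts the fused score toward that modality, which is the sense in which such a weighting is adaptive.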


    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 18, Issue 2
    May 2022
    494 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3505207

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 February 2022
    Accepted: 01 July 2021
    Revised: 01 May 2021
    Received: 01 November 2020
    Published in TOMM Volume 18, Issue 2

    Author Tags

    1. Dance challenge
    2. virality prediction
    3. multi-modal approach

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Foundation for Innovative Research Groups through the National Natural Science Foundation of China
    • CCF-Tencent Rhino-Bird Research Fund

    Article Metrics

    • Downloads (Last 12 months): 235
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 09 Aug 2024

    Cited By

    • (2024) Bridging the Domain Gap in Scene Flow Estimation via Hierarchical Smoothness Refinement. ACM Transactions on Multimedia Computing, Communications, and Applications 20:8 (1-21). DOI: 10.1145/3661823. Online publication date: 12-Jun-2024
    • (2024) High Fidelity Makeup via 2D and 3D Identity Preservation Net. ACM Transactions on Multimedia Computing, Communications, and Applications 20:8 (1-24). DOI: 10.1145/3656475. Online publication date: 13-Jun-2024
    • (2024) RAST: Restorable Arbitrary Style Transfer. ACM Transactions on Multimedia Computing, Communications, and Applications 20:5 (1-21). DOI: 10.1145/3638770. Online publication date: 22-Jan-2024
    • (2024) Learning Offset Probability Distribution for Accurate Object Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 20:5 (1-24). DOI: 10.1145/3637214. Online publication date: 22-Jan-2024
    • (2024) HARR: Learning Discriminative and High-Quality Hash Codes for Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20:5 (1-23). DOI: 10.1145/3627162. Online publication date: 22-Jan-2024
    • (2024) Efficient Crowd Counting via Dual Knowledge Distillation. IEEE Transactions on Image Processing 33 (569-583). DOI: 10.1109/TIP.2023.3343609. Online publication date: 1-Jan-2024
    • (2024) The Influence of User Profile and Post Metadata on the Popularity of Image-Based Social Media: A Data Perspective. 2024 International Conference on Artificial Intelligence in Information and Communication (ICAIIC) (806-811). DOI: 10.1109/ICAIIC60209.2024.10463510. Online publication date: 19-Feb-2024
    • (2024) Mapping the scholarly landscape of TikTok (Douyin): A bibliometric exploration of research topics and trends. Digital Business 4:1 (100075). DOI: 10.1016/j.digbus.2024.100075. Online publication date: Jun-2024
    • (2024) Focus for Free in Density-Based Counting. International Journal of Computer Vision 132:7 (2600-2617). DOI: 10.1007/s11263-024-01990-3. Online publication date: 9-Feb-2024
    • (2023) Soft news in original videos. Adaptation to TikTok of the main Spanish online media. El Profesional de la información. DOI: 10.3145/epi.2023.mar.22. Online publication date: 5-Apr-2023
