DOI: 10.1145/3551626.3564941
research-article

ObjectMix: Data Augmentation by Copy-Pasting Objects in Videos for Action Recognition

Published: 13 December 2022

Abstract

In this paper, we propose a data augmentation method for action recognition using instance segmentation. Although many data augmentation methods have been proposed for image recognition, few of them are tailored for action recognition. Our proposed method, ObjectMix, extracts each object region from two videos using instance segmentation and combines them to create new videos. Experiments on two action recognition datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed method and show its superiority over VideoMix, a prior work.
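
The copy-paste step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes per-frame binary object masks for one video have already been produced by some instance segmentation model, that both videos share the same shape, and that the labels are mixed in proportion to the pasted area (a CutMix-style choice that the abstract does not confirm). The function name `paste_objects` and its parameters are hypothetical.

```python
import numpy as np


def paste_objects(video_a, masks_a, video_b, label_a, label_b):
    """Paste the masked object pixels of video_a onto video_b.

    video_a, video_b: float arrays of shape (T, H, W, C).
    masks_a: boolean array of shape (T, H, W), True on object pixels of
        video_a (assumed to come from an instance segmentation model).
    label_a, label_b: one-hot label vectors of equal length.
    Returns (mixed_video, mixed_label, lam).
    """
    if video_a.shape != video_b.shape:
        raise ValueError("videos must have the same shape")
    if masks_a.shape != video_a.shape[:3]:
        raise ValueError("masks must have shape (T, H, W)")

    # Broadcast the mask over the channel axis and take object pixels
    # from video_a, everything else from video_b.
    mask = masks_a[..., None]
    mixed_video = np.where(mask, video_a, video_b)

    # Assumption: weight the labels by the fraction of pasted pixels.
    lam = float(masks_a.mean())
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_video, mixed_label, lam
```

Calling `paste_objects` on two clips of shape (T, H, W, C) returns the composited clip together with a soft label; in practice the masks would be computed once per training video and cached, since running segmentation inside the training loop would likely dominate the cost.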


    Published In

    MMAsia '22: Proceedings of the 4th ACM International Conference on Multimedia in Asia
    December 2022
    296 pages
    ISBN: 9781450394789
    DOI: 10.1145/3551626

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. action recognition
    2. data augmentation
    3. instance segmentation

    Qualifiers

    • Research-article

    Funding Sources

    • JSPS

    Conference

    MMAsia '22
    MMAsia '22: ACM Multimedia Asia
    December 13 - 16, 2022
    Tokyo, Japan

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 22
    • Downloads (Last 6 weeks): 3
    Reflects downloads up to 23 Dec 2024

    Citations

    Cited By

    • (2023) Annotation Cost Minimization for Ultrasound Image Segmentation using Cross-domain Transfer Learning. IEEE Journal of Biomedical and Health Informatics, 1-11. DOI: 10.1109/JBHI.2023.3236989. Online publication date: 2023.
    • (2023) Mitigating and Evaluating Static Bias of Action Representations in the Background and the Foreground. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 19854-19866. DOI: 10.1109/ICCV51070.2023.01823. Online publication date: 1-Oct-2023.
    • (2023) TRandAugment: temporal random augmentation strategy for surgical activity recognition from videos. International Journal of Computer Assisted Radiology and Surgery 18:9, 1665-1672. DOI: 10.1007/s11548-023-02864-8. Online publication date: 22-Mar-2023.
    • (2022) LGST-Drop: label-guided structural dropout for spatial–temporal convolutional neural networks. Journal of Electronic Imaging 31:03. DOI: 10.1117/1.JEI.31.3.033036. Online publication date: 1-May-2022.
