ObjectMix: Data Augmentation by Copy-Pasting Objects in Videos for Action Recognition

Authors:

Toru Tamaki

MMAsia '22: Proceedings of the 4th ACM International Conference on Multimedia in Asia

Article No.: 26, Pages 1 - 7

https://doi.org/10.1145/3551626.3564941

Published: 13 December 2022

Abstract

In this paper, we propose a data augmentation method for action recognition using instance segmentation. Although many data augmentation methods have been proposed for image recognition, few of them are tailored for action recognition. Our proposed method, ObjectMix, extracts each object region from two videos using instance segmentation and combines them to create new videos. Experiments on two action recognition datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed method and show its superiority over VideoMix, a prior work.
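
The copy-paste step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `object_mix`, the array layout `(T, H, W, C)`, and the assumption that per-frame object masks are already available from some instance segmentation model are all hypothetical. The abstract also does not say how labels are combined, so the area-proportional label mixing below is borrowed from CutMix-style methods and is an assumption.

```python
import numpy as np


def object_mix(video_a, mask_a, video_b, label_a, label_b):
    """Paste the masked object pixels of video_a onto video_b.

    video_a, video_b: float arrays of shape (T, H, W, C).
    mask_a: boolean array of shape (T, H, W), True where an object was
        segmented in video_a (assumed to be precomputed elsewhere).
    label_a, label_b: one-hot label vectors of equal length.
    """
    if video_a.shape != video_b.shape:
        raise ValueError("videos must share the same shape")
    if mask_a.shape != video_a.shape[:3]:
        raise ValueError("mask must have shape (T, H, W)")

    # Add a channel axis so the mask broadcasts over C.
    m = mask_a[..., None]
    mixed = np.where(m, video_a, video_b)

    # Assumption: mix labels by the fraction of pasted pixels (CutMix-style).
    lam = float(mask_a.mean())
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label, lam


# Toy usage with random "videos" and a fixed square object mask.
rng = np.random.default_rng(0)
a = rng.random((4, 8, 8, 3))
b = rng.random((4, 8, 8, 3))
mask = np.zeros((4, 8, 8), dtype=bool)
mask[:, 2:6, 2:6] = True  # 16 of 64 pixels in every frame

mixed, label, lam = object_mix(
    a, mask, b, np.array([1.0, 0.0]), np.array([0.0, 1.0])
)
```

In this toy case a quarter of each frame comes from the first video, so the mixed label puts weight 0.25 on its class. A real pipeline would obtain the masks from a segmentation model and would likely handle videos of different sizes or lengths, which this sketch does not.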

Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision tasks → Activity recognition and understanding

Published In

MMAsia '22: Proceedings of the 4th ACM International Conference on Multimedia in Asia

December 2022

296 pages

ISBN: 9781450394789

DOI: 10.1145/3551626

Conference Chair: Shuqiang Jiang (CAS)

General Chairs: Kiyoharu Aizawa (The University of Tokyo), Phoebe Chen (La Trobe), Keiji Yanai (The University of Electro-Communications)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

JSPS

Conference

MMAsia '22

Sponsor:

SIGMM

MMAsia '22: ACM Multimedia Asia

December 13 - 16, 2022

Tokyo, Japan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

Cited By

P. Monkam, S. Jin, and W. Lu. 2023. Annotation Cost Minimization for Ultrasound Image Segmentation using Cross-domain Transfer Learning. IEEE Journal of Biomedical and Health Informatics (2023), 1--11. Online publication date: 2023. https://doi.org/10.1109/JBHI.2023.3236989

H. Li, Y. Liu, H. Zhang, and B. Li. 2023. Mitigating and Evaluating Static Bias of Action Representations in the Background and the Foreground. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 19854--19866. Online publication date: 1-Oct-2023. https://doi.org/10.1109/ICCV51070.2023.01823

S. Ramesh, D. Dall’Alba, C. Gonzalez, T. Yu, P. Mascagni, D. Mutter, J. Marescaux, P. Fiorini, and N. Padoy. 2023. TRandAugment: temporal random augmentation strategy for surgical activity recognition from videos. International Journal of Computer Assisted Radiology and Surgery 18, 9 (2023), 1665--1672. Online publication date: 22-Mar-2023. https://doi.org/10.1007/s11548-023-02864-8

H. Cui, R. Huang, R. Zhang, and C. Huang. 2022. LGST-Drop: label-guided structural dropout for spatial–temporal convolutional neural networks. Journal of Electronic Imaging 31, 3 (2022). Online publication date: 1-May-2022. https://doi.org/10.1117/1.JEI.31.3.033036
