DOI: 10.1145/3552485.3554940

Learning Sequential Transformation Information of Ingredients for Fine-Grained Cooking Activity Recognition

Published: 10 October 2022

Abstract

The goal of our research is to recognize fine-grained cooking activities (e.g., dicing or mincing within cutting) in egocentric videos from the sequential transformation of the ingredients processed by the camera wearer. These activities are distinguished by the state of the ingredient after processing, yet the same cooking utensils and similar motions are often used across them, which makes their recognition a challenging task in computer vision and multimedia analysis. To tackle this problem, we need to perceive the sequential state transformation of the ingredient precisely. To this end, we propose a new GAN-based network with the following characteristics: 1) as preprocessing, we crop each image around the ingredient to remove environmental information; 2) the generator produces intermediate images from the past and future images to capture sequential information; 3) an adversarial network is employed as a discriminator to classify whether an input image is generated or real; and 4) a temporally coherent network checks the temporal smoothness of the input images and predicts cooking activities by comparing the original image sequence with the generated one. As a first step toward investigating the effectiveness of the proposed method, we focus on cutting activities. Experimental results on our originally prepared dataset demonstrate the effectiveness of the proposed method.
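The abstract describes a generator that synthesizes an intermediate frame of the cropped-ingredient sequence from its past and future frames, paired with an adversarial discriminator that judges real versus generated frames. The page does not give the authors' exact architecture, so the following is a minimal PyTorch sketch of that generator/discriminator pairing under our own assumptions; all layer shapes, module names, and the L1 reconstruction term are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateFrameGenerator(nn.Module):
    """Predicts the middle frame of a triplet from its past and future frames.

    Hypothetical architecture: a small conv stack over the channel-wise
    concatenation of the two context frames.
    """
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, channels, kernel_size=3, padding=1),
            nn.Tanh(),  # assumes input crops normalized to [-1, 1]
        )

    def forward(self, past: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
        # Concatenate past and future frames along the channel axis.
        return self.net(torch.cat([past, future], dim=1))

class FrameDiscriminator(nn.Module):
    """Outputs a single real/generated logit for one frame."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)

def train_step(gen, disc, opt_g, opt_d, past, middle, future):
    """One adversarial update on a (past, middle, future) frame triplet."""
    bce = nn.BCEWithLogitsLoss()
    real_labels = torch.ones(middle.size(0), 1)
    fake_labels = torch.zeros(middle.size(0), 1)
    fake = gen(past, future)

    # Discriminator: push real middle frames toward 1, generated ones toward 0.
    opt_d.zero_grad()
    d_loss = bce(disc(middle), real_labels) + bce(disc(fake.detach()), fake_labels)
    d_loss.backward()
    opt_d.step()

    # Generator: fool the discriminator while staying close to the true frame
    # (the L1 term is our assumption, a common stabilizer for image synthesis).
    opt_g.zero_grad()
    g_loss = bce(disc(fake), real_labels) + F.l1_loss(fake, middle)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

if __name__ == "__main__":
    gen, disc = IntermediateFrameGenerator(), FrameDiscriminator()
    opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
    # Dummy batch of 128x128 crops around the ingredient (crop size assumed).
    past, middle, future = (torch.randn(4, 3, 128, 128) for _ in range(3))
    print(train_step(gen, disc, opt_g, opt_d, past, middle, future))
```

The paper's fourth component, the temporally coherent network that compares the original and generated sequences to predict the activity, would consume the outputs of this pairing; its design is not specified on this page, so it is omitted here.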



      Published In

      CEA++ '22: Proceedings of the 1st International Workshop on Multimedia for Cooking, Eating, and related APPlications
      October 2022
      66 pages
      ISBN:9781450395038
      DOI:10.1145/3552485


      Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. cooking activities recognition
      2. egocentric video analysis
      3. generative adversarial networks

      Qualifiers

      • Research-article

      Conference

      MM '22

      Acceptance Rates

      Overall Acceptance Rate 20 of 33 submissions, 61%


