DOI: 10.1145/3552485.3554940

Learning Sequential Transformation Information of Ingredients for Fine-Grained Cooking Activity Recognition

Published: 10 October 2022

Abstract

The goal of our research is to recognize fine-grained cooking activities (e.g., dicing or mincing within cutting) in egocentric videos from the sequential transformation of the ingredients processed by the camera wearer. These activities are distinguished by the state of the ingredient after processing, yet the same cooking utensils and similar motions are often used across them, which makes their recognition a challenging task in computer vision and multimedia analysis. To tackle this problem, we need to perceive the sequential state transformation of the ingredient precisely. To this end, we propose a new GAN-based network with the following characteristics: 1) as preprocessing, we crop each image around the ingredient to remove environmental information; 2) the generator produces intermediate images from the past and future images to capture sequential information; 3) an adversarial network is employed as a discriminator to classify whether an input image is generated or real; and 4) a temporally coherent network checks the temporal smoothness of the input images and predicts cooking activities by comparing the original image sequence with the generated one. As a first step toward investigating the effectiveness of the proposed method, we focus on cutting activities. Experimental results on our originally prepared dataset demonstrate the effectiveness of the proposed method.
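The abstract describes a generator that synthesizes an intermediate frame of the cropped-ingredient sequence from its past and future frames, paired with an adversarial discriminator that judges real versus generated frames. The page does not give the authors' exact architecture, so the following is a minimal PyTorch sketch of that generator/discriminator pairing under our own assumptions; all layer shapes, module names, and the L1 reconstruction term are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateFrameGenerator(nn.Module):
    """Predicts the middle frame of a triplet from its past and future frames.

    Hypothetical architecture: a small conv stack over the channel-wise
    concatenation of the two context frames.
    """
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, channels, kernel_size=3, padding=1),
            nn.Tanh(),  # assumes input crops normalized to [-1, 1]
        )

    def forward(self, past: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
        # Concatenate past and future frames along the channel axis.
        return self.net(torch.cat([past, future], dim=1))

class FrameDiscriminator(nn.Module):
    """Outputs a single real/generated logit for one frame."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)

def train_step(gen, disc, opt_g, opt_d, past, middle, future):
    """One adversarial update on a (past, middle, future) frame triplet."""
    bce = nn.BCEWithLogitsLoss()
    real_labels = torch.ones(middle.size(0), 1)
    fake_labels = torch.zeros(middle.size(0), 1)
    fake = gen(past, future)

    # Discriminator: push real middle frames toward 1, generated ones toward 0.
    opt_d.zero_grad()
    d_loss = bce(disc(middle), real_labels) + bce(disc(fake.detach()), fake_labels)
    d_loss.backward()
    opt_d.step()

    # Generator: fool the discriminator while staying close to the true frame
    # (the L1 term is our assumption, a common stabilizer for image synthesis).
    opt_g.zero_grad()
    g_loss = bce(disc(fake), real_labels) + F.l1_loss(fake, middle)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

if __name__ == "__main__":
    gen, disc = IntermediateFrameGenerator(), FrameDiscriminator()
    opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
    # Dummy batch of 128x128 crops around the ingredient (crop size assumed).
    past, middle, future = (torch.randn(4, 3, 128, 128) for _ in range(3))
    print(train_step(gen, disc, opt_g, opt_d, past, middle, future))
```

The paper's fourth component, the temporally coherent network that compares the original and generated sequences to predict the activity, would consume the outputs of this pairing; its design is not specified on this page, so it is omitted here.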



      Published In

      CEA++ '22: Proceedings of the 1st International Workshop on Multimedia for Cooking, Eating, and related APPlications
      October 2022
      66 pages
      ISBN:9781450395038
      DOI:10.1145/3552485


      Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. cooking activities recognition
      2. egocentric video analysis
      3. generative adversarial networks

      Qualifiers

      • Research-article

      Conference

      MM '22

      Acceptance Rates

      Overall Acceptance Rate 20 of 33 submissions, 61%


