Unsupervised method for video action segmentation through spatio-temporal and positional-encoded embeddings

Authors: Guilherme de A. P. Marques, Antonio José G. Busson, Alan Lívio V. Guedes, Julio Cesar Duarte, Sérgio Colcher

MMSys '22: Proceedings of the 13th ACM Multimedia Systems Conference

Pages 136 - 149

https://doi.org/10.1145/3524273.3528187

Published: 05 August 2022

Abstract

Action segmentation consists of temporally segmenting a video and labeling each segmented interval with a specific action label. In this work, we propose a novel action segmentation method that requires no prior video analysis and no annotated data. Our method involves extracting spatio-temporal features from videos using a pre-trained deep network. Data is then transformed using a positional encoder, and finally a clustering algorithm is applied, where each produced cluster presumably corresponds to a different single and distinguishable action. In experiments, we show that our method produces competitive results on the Breakfast and Inria Instructional Videos dataset benchmarks.
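The three stages named in the abstract (feature extraction, positional encoding, clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: random synthetic vectors stand in for the features of a pre-trained deep network, the positional encoder is the common sinusoidal form, the encoding is added to the features, and clustering is a plain k-means with a deterministic farthest-point initialization. The function names are hypothetical.

```python
import numpy as np


def sinusoidal_positional_encoding(num_positions, dim):
    """Sinusoidal encoding (assumed form): sin on even columns, cos on odd."""
    positions = np.arange(num_positions)[:, None]
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div[: dim // 2])
    return pe


def kmeans_farthest_init(x, k, num_iters=20):
    """Plain k-means (a stand-in clustering step) with farthest-point init."""
    centroids = [x[0]]
    for _ in range(1, k):
        current = np.array(centroids)
        # Distance from every point to its nearest already-chosen centroid.
        d = np.linalg.norm(x[:, None, :] - current[None, :, :], axis=2).min(axis=1)
        centroids.append(x[np.argmax(d)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(num_iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels


def segment_actions(features, num_actions, position_weight=1.0):
    """Add the positional encoding to per-clip features, then cluster."""
    pe = sinusoidal_positional_encoding(len(features), features.shape[1])
    return kmeans_farthest_init(features + position_weight * pe, num_actions)


# Synthetic stand-in for deep features: 3 consecutive "actions" of 40 clips each.
rng = np.random.default_rng(0)
block_means = np.repeat(np.array([0.0, 5.0, 10.0]), 40)[:, None]
features = block_means + rng.normal(scale=0.1, size=(120, 16))

labels = segment_actions(features, num_actions=3)
print(labels)
```

In this toy setting each contiguous block of clips receives a single cluster label, which is the behaviour the abstract describes: one cluster per distinguishable action. The `position_weight` parameter is a hypothetical knob showing how temporal position could be traded off against appearance.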

Cited By

Marques G, Boaro J, Busson A, Guedes A, Duarte J, Colcher S. (2024). Action Segmentation through Self-Supervised Video Features and Positional-Encoded Embeddings. ACM Transactions on Multimedia Computing, Communications, and Applications 20:9 (1-23). DOI: 10.1145/3649465. Online publication date: 24-Feb-2024
https://dl.acm.org/doi/10.1145/3649465

Index Terms

Unsupervised method for video action segmentation through spatio-temporal and positional-encoded embeddings
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Scene understanding
2. Information systems
  1. Information systems applications
    1. Multimedia information systems

Published In

MMSys '22: Proceedings of the 13th ACM Multimedia Systems Conference

June 2022

432 pages

ISBN:9781450392839

DOI:10.1145/3524273

General Chairs: Niall Murray (Technological University of the Shannon: Midlands Midwest), Gwendal Simon (Synamedia), Mylene Farias (University of Brasilia)

Program Chairs: Irene Viola (Centrum Wiskunde & Informatica), Mario Montagud (i2CAT Foundation & University of Valencia)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Artifacts Available / v1.1

Qualifiers

Research-article

Conference

MMSys '22

Sponsor:

SIGMM

MMSys '22: 13th ACM Multimedia Systems Conference

June 14 - 17, 2022

Athlone, Ireland

Acceptance Rates

Overall Acceptance Rate 176 of 530 submissions, 33%

Bibliometrics

Total Citations: 1
Total Downloads: 168
Downloads (Last 12 months): 36
Downloads (Last 6 weeks): 1

Reflects downloads up to 01 Jan 2025
