Action Anticipation Using Pairwise Human-Object Interactions and Transformers

Published: 01 January 2021

Abstract

The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only the visual features of the human and the object. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that the cross-correlation-based frame representation is better suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. Therefore, we propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt) model that combines the evidence across spatio-temporal, motion, and HO representations. We demonstrate the performance of MM-Transformer on procedural datasets such as 50 Salads and Breakfast, and on an unscripted dataset, EPIC-KITCHENS55. Finally, we demonstrate that the combination of the human-object representation and the MM-Transformer is effective even for long-term anticipation.
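
The abstract describes two computational steps: cross-correlation (second-order) pooling of pairwise human-object features within a frame, followed by a transformer that aggregates the frame-wise representations over the observation period. The PyTorch-style sketch below illustrates those two steps only; it is not the authors' implementation (see the linked repository for that), and the module names, feature dimensions, mean-pooling over pairs, and classification head are illustrative assumptions.

import torch
import torch.nn as nn


class HOFrameRepresentation(nn.Module):
    # Pools a variable number of human-object (HO) pairs in one frame into a
    # single fixed-size vector using cross-correlation (second-order) statistics.
    def __init__(self, feat_dim=256, out_dim=512):
        super().__init__()
        # Assumed design: flatten the averaged correlation matrix and project it.
        self.proj = nn.Linear(feat_dim * feat_dim, out_dim)

    def forward(self, human_feats, object_feats):
        # human_feats, object_feats: (num_pairs, feat_dim) visual features for one frame.
        corr = torch.einsum('pi,pj->pij', human_feats, object_feats)  # (P, D, D) outer products
        corr = corr.mean(dim=0)                                       # average over HO pairs -> (D, D)
        return self.proj(corr.flatten())                              # fixed-size frame representation


class TemporalTransformer(nn.Module):
    # Aggregates frame-wise representations over the observation period and
    # predicts the anticipated action (the class count here is arbitrary).
    def __init__(self, dim=512, num_layers=2, num_heads=8, num_classes=48):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, dim)
        h = self.encoder(frame_feats)
        return self.classifier(h.mean(dim=1))  # logits for the anticipated action


# Toy usage: one frame with 3 HO pairs, repeated over an 8-frame observation window.
frame_model, temporal_model = HOFrameRepresentation(), TemporalTransformer()
frame_vec = frame_model(torch.randn(3, 256), torch.randn(3, 256))  # (512,)
logits = temporal_model(frame_vec.expand(1, 8, -1))                # (1, 48)

The sketch omits the multi-modal fusion described in the abstract; the proposed MM-Transformer additionally combines spatio-temporal and motion representations with the HO representation.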

Published In

IEEE Transactions on Image Processing, Volume 30, 2021 (5053 pages)

Publisher

IEEE Press

Qualifiers

  • Research-article

Cited By

  • (2024) Multi-Label Action Anticipation for Real-World Videos With Scene Understanding. IEEE Transactions on Image Processing, vol. 33, pp. 3242–3255. DOI: 10.1109/TIP.2024.3391692. Online publication date: 25-Apr-2024.
  • (2024) GLPanoDepth: Global-to-Local Panoramic Depth Estimation. IEEE Transactions on Image Processing, vol. 33, pp. 2936–2949. DOI: 10.1109/TIP.2024.3386403. Online publication date: 15-Apr-2024.
  • (2023) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 20, no. 5, pp. 1–21. DOI: 10.1145/3633333. Online publication date: 4-Dec-2023.
  • (2023) Video Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 12922–12943. DOI: 10.1109/TPAMI.2023.3243465. Online publication date: 1-Nov-2023.
  • (2023) Point-Based Learnable Query Generator for Human–Object Interaction Detection. IEEE Transactions on Image Processing, vol. 32, pp. 6469–6484. DOI: 10.1109/TIP.2023.3334100. Online publication date: 1-Jan-2023.
  • (2023) Cognition Guided Human-Object Relationship Detection. IEEE Transactions on Image Processing, vol. 32, pp. 2468–2480. DOI: 10.1109/TIP.2023.3270040. Online publication date: 1-Jan-2023.
  • (2023) Egocentric Early Action Prediction via Multimodal Transformer-Based Dual Action Prediction. IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 9, pp. 4472–4483. DOI: 10.1109/TCSVT.2023.3248271. Online publication date: 1-Sep-2023.
  • (2022) Deep Virtual-to-Real Distillation for Pedestrian Crossing Prediction. 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), pp. 1586–1592. DOI: 10.1109/ITSC55140.2022.9921771. Online publication date: 8-Oct-2022.
