Action Anticipation Using Pairwise Human-Object Interactions and Transformers

Published: 01 January 2021

Abstract

The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only the visual features of the human and the object. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that the cross-correlation-based frame representation is better suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. Therefore, we propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt) model that combines the evidence across spatio-temporal, motion, and HO representations. We demonstrate the performance of MM-Transformer on procedural datasets such as 50 Salads and Breakfast, and on an unscripted dataset, EPIC-KITCHENS55. Finally, we demonstrate that the combination of the human-object representation and the MM-Transformer is effective even for long-term anticipation.
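
The abstract describes two computational steps: cross-correlation (second-order) pooling of pairwise human-object features within a frame, followed by a transformer that aggregates the frame-wise representations over the observation period. The PyTorch-style sketch below illustrates those two steps only; it is not the authors' implementation (see the linked repository for that), and the module names, feature dimensions, mean-pooling over pairs, and classification head are illustrative assumptions.

import torch
import torch.nn as nn


class HOFrameRepresentation(nn.Module):
    # Pools a variable number of human-object (HO) pairs in one frame into a
    # single fixed-size vector using cross-correlation (second-order) statistics.
    def __init__(self, feat_dim=256, out_dim=512):
        super().__init__()
        # Assumed design: flatten the averaged correlation matrix and project it.
        self.proj = nn.Linear(feat_dim * feat_dim, out_dim)

    def forward(self, human_feats, object_feats):
        # human_feats, object_feats: (num_pairs, feat_dim) visual features for one frame.
        corr = torch.einsum('pi,pj->pij', human_feats, object_feats)  # (P, D, D) outer products
        corr = corr.mean(dim=0)                                       # average over HO pairs -> (D, D)
        return self.proj(corr.flatten())                              # fixed-size frame representation


class TemporalTransformer(nn.Module):
    # Aggregates frame-wise representations over the observation period and
    # predicts the anticipated action (the class count here is arbitrary).
    def __init__(self, dim=512, num_layers=2, num_heads=8, num_classes=48):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, dim)
        h = self.encoder(frame_feats)
        return self.classifier(h.mean(dim=1))  # logits for the anticipated action


# Toy usage: one frame with 3 HO pairs, repeated over an 8-frame observation window.
frame_model, temporal_model = HOFrameRepresentation(), TemporalTransformer()
frame_vec = frame_model(torch.randn(3, 256), torch.randn(3, 256))  # (512,)
logits = temporal_model(frame_vec.expand(1, 8, -1))                # (1, 48)

The sketch omits the multi-modal fusion described in the abstract; the proposed MM-Transformer additionally combines spatio-temporal and motion representations with the HO representation.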

Published In

IEEE Transactions on Image Processing, Volume 30, 2021 (5053 pages)

Publisher

IEEE Press

Qualifiers

  • Research-article

Cited By

  • (2024) Multi-Label Action Anticipation for Real-World Videos With Scene Understanding. IEEE Transactions on Image Processing, vol. 33, pp. 3242–3255. DOI: 10.1109/TIP.2024.3391692. Online publication date: 25-Apr-2024.
  • (2024) GLPanoDepth: Global-to-Local Panoramic Depth Estimation. IEEE Transactions on Image Processing, vol. 33, pp. 2936–2949. DOI: 10.1109/TIP.2024.3386403. Online publication date: 15-Apr-2024.
  • (2023) Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing. ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 20, no. 5, pp. 1–21. DOI: 10.1145/3633333. Online publication date: 4-Dec-2023.
  • (2023) Video Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 11, pp. 12922–12943. DOI: 10.1109/TPAMI.2023.3243465. Online publication date: 1-Nov-2023.
  • (2023) Point-Based Learnable Query Generator for Human–Object Interaction Detection. IEEE Transactions on Image Processing, vol. 32, pp. 6469–6484. DOI: 10.1109/TIP.2023.3334100. Online publication date: 1-Jan-2023.
  • (2023) Cognition Guided Human-Object Relationship Detection. IEEE Transactions on Image Processing, vol. 32, pp. 2468–2480. DOI: 10.1109/TIP.2023.3270040. Online publication date: 1-Jan-2023.
  • (2023) Egocentric Early Action Prediction via Multimodal Transformer-Based Dual Action Prediction. IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 9, pp. 4472–4483. DOI: 10.1109/TCSVT.2023.3248271. Online publication date: 1-Sep-2023.
  • (2022) Deep Virtual-to-Real Distillation for Pedestrian Crossing Prediction. 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), pp. 1586–1592. DOI: 10.1109/ITSC55140.2022.9921771. Online publication date: 8-Oct-2022.
