
Rethinking Matching-Based Few-Shot Action Recognition

  • Conference paper
  • In: Image Analysis (SCIA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13885)


Abstract

Few-shot action recognition, i.e. recognizing new action classes given only a few examples, benefits from incorporating temporal information. Prior work either encodes such information in the representation itself and learns classifiers at test time, or obtains frame-level features and performs pairwise temporal matching. We first evaluate a number of matching-based approaches using features from spatio-temporal backbones, a comparison missing from the literature, and show that the gap in performance between simple baselines and more complicated methods is significantly reduced. Inspired by this, we propose Chamfer++, a non-temporal matching function that achieves state-of-the-art results in few-shot action recognition. We show that, when starting from temporal features, our parameter-free and interpretable approach can outperform all other matching-based and classifier methods for one-shot action recognition on three common datasets without using temporal information in the matching stage.

Project page: https://jbertrand89.github.io/matching-based-fsar


Notes

  1. Prototypical networks can be seen as an extension of matching-based methods [21, 27, 36]; hence we group them with the matching-based family.

  2. In prototypical networks [21, 27, 36], the pairwise clip-to-clip similarities are used as weights to compute a class prototype specific to the query example. The class probabilities are computed from the distance between a query and its prototype.

  3. https://github.com/xianyongqin/few-shot-video-classification.

  4. https://github.com/tobyperrett/trx.

  5. Note that although TSL uses randomly-sampled clips, we found that its performance is usually better when switching to uniformly-sampled clips.

References

  1. Bishay, M., Zoumpourlis, G., Patras, I.: TARN: temporal attentive relation network for few-shot and zero-shot action recognition. In: BMVC (2019)

  2. Cao, K., Brbić, M., Leskovec, J.: Concept learners for few-shot learning. In: ICLR (2021)

  3. Cao, K., Ji, J., Cao, Z., Chang, C.-Y., Niebles, J.C.: Few-shot video classification via temporal alignment. In: CVPR (2020)

  4. Chang, C.-Y., Huang, D.-A., Sui, Y., Fei-Fei, L., Niebles, J.C.: D3TW: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In: CVPR (2019)

  5. Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C.F., Huang, J.-B.: A closer look at few-shot classification. In: ICLR (2019)

  6. Doersch, C., Gupta, A., Zisserman, A.: CrossTransformers: spatially-aware few-shot transfer. In: NeurIPS (2020)

  7. Dvornik, M., Hadji, I., Derpanis, K.G., Garg, A., Jepson, A.: Drop-DTW: aligning common signal between sequences while dropping outliers. In: NeurIPS (2021)

  8. Dwivedi, S.K., Gupta, V., Mitra, R., Ahmed, S., Jain, A.: ProtoGAN: towards few shot learning for action recognition. In: ICCVW (2019)

  9. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)

  10. Finn, C., Xu, K., Levine, S.: Probabilistic model-agnostic meta-learning. In: NeurIPS (2018)

  11. Gidaris, S., Komodakis, N.: Dynamic few-shot visual learning without forgetting. In: CVPR (2018)

  12. Goyal, R., et al.: The “something something” video database for learning and evaluating visual common sense. In: ICCV (2017)

  13. Grant, E., Finn, C., Levine, S., Darrell, T., Griffiths, T.: Recasting gradient-based meta-learning as hierarchical Bayes. In: ICLR (2018)

  14. Huang, D.-A., et al.: What makes a video a video: analyzing temporal information in video understanding models and datasets. In: CVPR (2018)

  15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)

  16. Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., Kompatsiaris, I.: ViSiL: fine-grained spatio-temporal video similarity learning. In: ICCV (2019)

  17. Li, S., et al.: TA2N: two-stage action alignment network for few-shot action recognition. In: AAAI (2022)

  18. Lifchitz, Y., Avrithis, Y., Picard, S., Bursuc, A.: Dense classification and implanting for few-shot learning. In: CVPR (2019)

  19. Huang, Y., Yang, L., Sato, Y.: Compound prototype matching for few-shot action recognition. In: ECCV (2022)

  20. Müller, M.: Dynamic time warping. In: Information Retrieval for Music and Motion. Springer (2007)

  21. Perrett, T., Masullo, A., Burghardt, T., Mirmehdi, M., Damen, D.: Temporal-relational cross transformers for few-shot action recognition. In: CVPR (2021)

  22. Rusu, A.A., et al.: Meta-learning with latent embedding optimization. In: ICLR (2019)

  23. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)

  24. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human action classes from videos in the wild. Technical report CRCV-TR-12-01 (2012)

  25. Su, B., Wen, J.-R.: Temporal alignment prediction for supervised representation learning and few-shot sequence classification. In: ICLR (2022)

  26. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P., Hospedales, T.: Learning to compare: relation network for few-shot learning. In: CVPR (2018)

  27. Thatipelli, A., Narayan, S., Khan, S., Anwer, R.M., Khan, F.S., Ghanem, B.: Spatio-temporal relation modeling for few-shot action recognition. In: CVPR (2022)

  28. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR (2018)

  29. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. In: NeurIPS (2016)

  30. Wang, X., et al.: Hybrid relation guided set matching for few-shot action recognition. In: CVPR (2022)

  31. Wang, Y., Chao, W.-L., Weinberger, K.Q., van der Maaten, L.: SimpleShot: revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623 (2019)

  32. Wu, J., Zhang, T., Zhang, Z., Wu, F., Zhang, Y.: Motion-modulated temporal fragment alignment network for few-shot action recognition. In: CVPR (2022)

  33. Xian, Y., Korbar, B., Douze, M., Torresani, L., Schiele, B., Akata, Z.: Generalized few-shot video classification with video retrieval and feature generation. IEEE TPAMI (2021)

  34. Zhang, H., Zhang, L., Qi, X., Li, H., Torr, P.H.S., Koniusz, P.: Few-shot action recognition with permutation-invariant attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 525–542. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_31

  35. Zhu, L., Yang, Y.: Compound memory networks for few-shot video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 782–797. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_46

  36. Zhu, X., Toisoul, A., Pérez-Rúa, J.-M., Zhang, L., Martinez, B., Xiang, T.: Few-shot action recognition with prototype-centered attentive learning. In: BMVC (2021)

  37. Zhu, Z., Wang, L., Guo, S., Wu, G.: A closer look at few-shot video classification: a new baseline and benchmark. In: BMVC (2021)


Acknowledgements

This work was supported by Naver Labs Europe, by Junior Star GACR GM 21-28830M, and by student grant SGS23/173/OHK3/3T/13. The authors would like to sincerely thank Toby Perrett and Dima Damen for sharing their early code and supporting us, Diane Larlus for insightful conversations, feedback, and support, and Zakaria Laskar, Monish Keswani, and Assia Benbihi for their feedback.

Author information


Correspondence to Juliette Bertrand or Yannis Kalantidis.


A Appendix

In this appendix, we present more formally the matching functions used as baselines in our study (Sect. A.1), as well as additional experiments that study the impact of different hyper-parameters and report results for different task setups (Sect. A.2).

A.1 Baseline Matching Functions

In this section, we describe the different matching functions used as baselines in our study, illustrated in Fig. 6. Some matching functions use temporal information by leveraging the absolute or relative position of the pairwise similarities \(m_{ij}\); we call these functions temporal, and the others non-temporal.

Fig. 6. Matching functions on the temporal similarity matrix M. We show how each method estimates a scalar video-to-video similarity given the input pairwise similarity matrix. The functions are classified as (a) temporal or (b) non-temporal, depending on whether they use the temporal position of the features.

Temporal Matching Functions. We provide a list of the temporal matching functions implemented in this study as baselines. Some of them were already introduced in prior work.

Diagonal (Diag) is used as a baseline in prior work [3]. It is given by \(s(M)= \displaystyle \nicefrac {\sum _{i} m_{ii}}{n}\). It assumes temporally aligned video pairs.

OTAM  [3] uses and extends Dynamic Time Warping [20] to find an alignment path on M over which similarities are averaged to produce the video-to-video similarity. A differentiable variant is used for training.

Flatten+FC (Linear) is a simple baseline we use to learn temporal matching by flattening M and feeding it to a Fully Connected (FC) layer without bias and with a single scalar output. The video-to-video similarity is therefore given by \(s(M) = \sum _{ij} w_{ij} m_{ij}\), where the \(w_{ij}\) are \(n^2\) learnable parameters.
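To make the two simplest baselines concrete, below is a minimal PyTorch sketch of Diag and Flatten+FC operating on an \(n \times n\) temporal similarity matrix M. The function and class names are ours, and the code is an illustrative re-implementation under these assumptions, not the authors' released one.

```python
import torch
import torch.nn as nn


def diag_similarity(M: torch.Tensor) -> torch.Tensor:
    """Diag baseline: average the diagonal of the n x n similarity matrix,
    i.e. assume the two videos are temporally aligned."""
    return torch.diagonal(M, dim1=-2, dim2=-1).mean(dim=-1)


class FlattenFC(nn.Module):
    """Flatten+FC (Linear) baseline: s(M) = sum_ij w_ij * m_ij,
    with n^2 learnable weights and no bias."""

    def __init__(self, n_clips: int):
        super().__init__()
        self.fc = nn.Linear(n_clips * n_clips, 1, bias=False)

    def forward(self, M: torch.Tensor) -> torch.Tensor:
        return self.fc(M.flatten(start_dim=-2)).squeeze(-1)


if __name__ == "__main__":
    M = torch.rand(8, 8)               # toy pairwise clip similarities
    print(diag_similarity(M))          # hand-crafted video-to-video similarity
    print(FlattenFC(n_clips=8)(M))     # learned linear matching
```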

Table 4. Impact of learning a feature projection on performance of matching-based methods. \(^\dagger \) denotes hand-crafted matching methods, i.e. no training is performed for the cases where a feature projection is not learned.

ViSiL  [16] is an approach originally introduced for the task of video retrieval; we apply it to few-shot action recognition for the first time. A small Fully Convolutional Network (FCN) is applied on M. Its output is a filtered temporal similarity matrix, on which the Chamfer similarity is applied. The filtering operates over the small temporal context captured by the receptive field of this network.
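As a rough illustration of this pipeline, the sketch below filters M with a small convolutional network and applies the Chamfer similarity (max over the support axis, mean over the query axis) to the result. The two-layer network is a placeholder of our own; the actual architecture and hyper-parameters follow [16].

```python
import torch
import torch.nn as nn


def chamfer_similarity(M: torch.Tensor) -> torch.Tensor:
    """Chamfer similarity: for each query clip take its best-matching support
    clip (max over columns), then average over query clips (mean over rows)."""
    return M.max(dim=-1).values.mean(dim=-1)


class VisilLikeMatching(nn.Module):
    """Filter the similarity matrix with a small fully convolutional network,
    then apply the Chamfer similarity on the filtered matrix."""

    def __init__(self):
        super().__init__()
        # Placeholder FCN; the real one follows the ViSiL architecture [16].
        self.fcn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, M: torch.Tensor) -> torch.Tensor:
        filtered = self.fcn(M[None, None]).squeeze(0).squeeze(0)
        return chamfer_similarity(filtered)
```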

Non-temporal Matching Functions. We provide a list of the non-temporal matching functions that were implemented in this study as baselines.

Mean is used as a baseline in prior work [3]. It is given by \(s(M)= \displaystyle \nicefrac {\sum _{ij} m_{ij}}{n^2}\). It assumes that all clip pairs contribute equally to the similarity score.

Max is used as a baseline in prior work [3]. It is given by \(s(M)= \max _{ij} m_{ij}\). It assumes that the single best-matching clip pair is enough to recognize the action.
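For completeness, the corresponding one-line implementations of the two non-temporal baselines, under the same conventions as the sketches above, could look as follows.

```python
import torch


def mean_similarity(M: torch.Tensor) -> torch.Tensor:
    """Mean baseline: all clip pairs contribute equally."""
    return M.mean(dim=(-2, -1))


def max_similarity(M: torch.Tensor) -> torch.Tensor:
    """Max baseline: only the single best-matching clip pair counts."""
    return M.flatten(start_dim=-2).max(dim=-1).values
```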

A.2 Additional Ablations and Impact of Hyper-parameters

In this section, we present additional ablations to evaluate the impact of the feature projection head, the ordering of the tuples, and the number of examples per class used in the support set. We also report the impact of using the different variants for the Kinetics-100 and UCF-101 datasets.

Table 5. Impact of the dimension size of the feature projection head for Chamfer++ using ordered-tuples and \(l=3\).
Table 6. Chamfer++ variants on the three datasets: UCF-101, Kinetics-100 and SS-v2.

The Impact of the Projection Layer for matching methods is evaluated in Table 4. Including and learning a projection layer consistently improves performance across all setups and methods. Although the backbone is trained with TSL on the same meta-train set, the projection layer allows the features, and hence the values of the temporal similarity matrix, to better adapt to each matching process.
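As an illustration of where the projection head sits in this pipeline, the sketch below projects frozen clip features to D dimensions and builds the temporal similarity matrix from cosine similarities before matching. The module name, the choice of cosine similarity, and the default dimension are our assumptions for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectedMatching(nn.Module):
    """Project frozen clip features to D dimensions, then build the pairwise
    temporal similarity matrix M used by the matching functions."""

    def __init__(self, feat_dim: int, proj_dim: int = 1152):
        super().__init__()
        self.proj = nn.Linear(feat_dim, proj_dim)

    def similarity_matrix(self, query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
        # query:   (n, feat_dim) clip features of the query video
        # support: (n, feat_dim) clip features of a support video
        q = F.normalize(self.proj(query), dim=-1)
        s = F.normalize(self.proj(support), dim=-1)
        return q @ s.t()   # (n, n) matrix of pairwise clip similarities
```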

Projection Head Dimension. To match the setup of [21], we set the projection dimension to \(D=1152\). This section evaluates the effect of using different values for D. The results are reported in Table 5. A value of at least \(D=1024\) appears sufficient and could be used for future experiments.

Impact of Using Different Variants. We report the accuracy of the different Chamfer++ variants on the Kinetics-100 and UCF-101 datasets in Table 6. As on SS-v2, both variants improve performance over the vanilla approach.

Impact of Ordering the Clip-Tuples. In this section, we evaluate the impact of using ordered clip feature tuples \(\textbf{t}^{l}\) versus using all the clip feature tuples \(\textbf{t}_{all}^{l}\); the comparison is presented in Table 7. On the SS-v2 dataset, using ordered clip feature tuples boosts the accuracy. On the Kinetics-100 and UCF-101 datasets, ordered tuples do not provide a boost and can even slightly harm performance. Since the number of tuples is significantly lower when they are ordered, using ordered clip feature tuples is preferable.
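To clarify the difference in tuple counts, the snippet below enumerates clip-index tuples of length l over n clips. Treating \(\textbf{t}_{all}^{l}\) as all orderings of l distinct clips is our reading for illustration purposes; the paper's exact definition may differ.

```python
from itertools import combinations, permutations


def ordered_tuple_indices(n_clips: int, l: int):
    """Ordered tuples: clip indices kept in temporal order (i1 < ... < il)."""
    return list(combinations(range(n_clips), l))


def all_tuple_indices(n_clips: int, l: int):
    """All tuples: every ordering of l distinct clips (our assumption)."""
    return list(permutations(range(n_clips), l))


# With n=8 clips and l=3: 56 ordered tuples vs 336 in total,
# which is why ordered tuples are the cheaper choice.
print(len(ordered_tuple_indices(8, 3)), len(all_tuple_indices(8, 3)))
```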

Table 7. Impact of using ordered clip feature tuples vs all the features for different values of l.

Impact of the Number of Examples per Class. The impact of k is shown in Fig. 7 by measuring performance for an increasing number of support examples per class while keeping the number of classes fixed, \(C_{t}=C_{f}=5\). We observe that TSL and TRX perform worse in the low-shot regime, while their performance increases faster with the number of shots. In the low-shot regime, Chamfer-QS++ outperforms the other methods and retains part of its advantage as the number of shots increases.
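For reference, here is a minimal sketch of how a 5-way k-shot task could be assembled and scored with any of the video-to-video similarities above. The episode-sampling details and the average-over-support decision rule are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import random


def sample_episode(videos_by_class, n_way=5, k_shot=1, n_query=1):
    """Sample a C-way k-shot episode from {class_name: [clip-feature tensors]}:
    k support videos and n_query query videos per class."""
    classes = random.sample(sorted(videos_by_class), n_way)
    support, queries = {}, []
    for label, c in enumerate(classes):
        vids = random.sample(videos_by_class[c], k_shot + n_query)
        support[label] = vids[:k_shot]
        queries += [(v, label) for v in vids[k_shot:]]
    return support, queries


def classify(query, support, video_similarity):
    """Predict the class whose support videos are, on average, most similar
    to the query under the given video-to-video similarity function."""
    scores = {label: sum(video_similarity(query, s).item() for s in vids) / len(vids)
              for label, vids in support.items()}
    return max(scores, key=scores.get)
```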

Fig. 7. Evolution of the accuracy with the number of examples per class in the support set, for the three datasets.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bertrand, J., Kalantidis, Y., Tolias, G. (2023). Rethinking Matching-Based Few-Shot Action Recognition. In: Gade, R., Felsberg, M., Kämäräinen, JK. (eds) Image Analysis. SCIA 2023. Lecture Notes in Computer Science, vol 13885. Springer, Cham. https://doi.org/10.1007/978-3-031-31435-3_15


  • DOI: https://doi.org/10.1007/978-3-031-31435-3_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31434-6

  • Online ISBN: 978-3-031-31435-3

  • eBook Packages: Computer Science (R0)
