Learning cricket strokes from spatial and motion visual word sequences

Multimedia Tools and Applications

Abstract

Recognizing actions in Cricket telecast videos is challenging, mainly because of rapid camera motion, camera switching, and variations in background/foreground, scale, position and viewpoint. Our work deals with the task of trimmed Cricket stroke classification. We used the Cricket Highlights dataset of Gupta and Balan (2020) and manually labeled the 562 trimmed strokes into 5 categories based on the direction of stroke play. These categories are independent of the batsman's pose orientation (or handedness) and are useful in determining the outcome of a Cricket stroke. Models trained on our proposed categories can have applications in building player profiles, automated extraction of direction-dependent strokes, and highlights generation. Gated Recurrent Unit (GRU) based models were trained on sequences of spatial and motion visual words, obtained by hard assignment (HA) and soft assignment (SA). An extensive set of experiments was carried out on frame-level dense optical flow grid (OF Grid) features, histogram of oriented optical flow (HOOF) features, and features extracted by pretrained 2D ResNet and pretrained 3D ResNet models. Training on visual word sequences gives better results than training on raw feature sequences. Moreover, for OF Grid features, the soft assignment based word sequences perform better than the hard assignment based sequences. We present strong baseline results for this new dataset, with a best accuracy of 81.13% on the test set, using soft assignment on optical flow based grid features. We compare our results with Transformer and 2-stream GRU models trained on HA/SA visual words, and with 3D convolutional models (C3D/I3D) trained on raw frame sequences.

Code Availability

Our implementation is available on GitHub (see Note 4).

Notes

  1. The details of Cricket and its related terminology can be found at https://www.cs.purdue.edu/homes/hosking/cricket/explanation.htm and https://www.youtube.com/watch?v=g-beFHld19c : Last Accessed 13 September, 2021.

  2. Note that the five categories are not the usual ‘types’ of Cricket strokes, such as “Cover Drive”, “Pull shot”, “Sweep shot” etc., which depend on the sequence of batsman poses. Instead, they are only a coarse-grained representation based on the direction of the stroke. E.g., all the “Cover drives”, “Long-Off drives” and lofted strokes hit in these directions by a right-handed batsman will belong to category 3, while for a left-handed batsman, the same category will contain “Mid-Wicket” and “Long-On drives”.

  3. https://docs.opencv.org/3.2.0/d7/d8b/tutorial_py_lucas_kanade.html: Last Accessed : 2020-12-29

  4. https://github.com/arpane4c5/StrokeAttention

  5. C3D weights (pretrained on Sports1M [31]) were available at http://imagelab.ing.unimore.it/files/c3d_pytorch/c3d.pickle: Last Accessed 11 January, 2021.

References

  1. Bradski G (2000) The OpenCV library. Dr. Dobb’s Journal of Software Tools

  2. Cai Z, Neher H, Vats K, Clausi D A, Zelek J S (2018) Temporal hockey action recognition via pose and optical flows. arXiv:1812.09533

  3. Carreira J, Zisserman A (2017) Quo Vadis, action recognition? A new model and the kinetics dataset. arXiv:1705.07750

  4. Cho K, van Merrienboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv:1409.1259

  5. Cioppa A, Deliege A, Giancola S, Ghanem B, Droogenbroeck M V, Gade R, Moeslund T B (2020) A context-aware loss function for action spotting in soccer videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)

  6. Chung J, Gülçehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555

  7. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol 1, pp 886–893

  8. Deliege A, Cioppa A, Giancola S, Seikavandi M J, Dueholm J V, Nasrollahi K, Ghanem B, Moeslund T B, Van Droogenbroeck M (2021) Soccernet-v2: a dataset and benchmarks for holistic understanding of broadcast soccer videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops, pp 4508–4519

  9. Digital Gaming Technology (DGT). http://www.digitalgametechnology.com/index.php/products/electronic-boards. Accessed 15 Sept 2021

  10. Donahue J, Hendricks L A, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, Darrell T (2017) Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 39(4):677–691. https://doi.org/10.1109/TPAMI.2016.2599174

  11. D’Orazio T, Leo M (2010) A review of vision-based systems for soccer video analysis. Pattern Recogn 43(8):2911–2926. https://doi.org/10.1016/j.patcog.2010.03.009

  12. Farnebäck G (2003) Two-frame motion estimation based on polynomial expansion. In: Proceedings of the 13th Scandinavian conference on image analysis. SCIA’03. Springer, Berlin, pp 363–370

  13. Faulkner H, Dick A (2017) Tenniset: a dataset for dense fine-grained event recognition, localisation and description. In: 2017 International conference on digital image computing: techniques and applications (DICTA). IEEE, pp 1–8

  14. Foysal M F A, Islam M S, Karim A, Neehal N (2019) Shot-net: a convolutional neural network for classifying different cricket shots. In: Santosh K C, Hegadi R S (eds) Recent trends in image processing and pattern recognition. Springer, Singapore, pp 111–120

  15. Giancola S, Amine M, Dghaily T, Ghanem B (2018) SoccerNet: a scalable dataset for action spotting in soccer videos. arXiv:1804.04527

  16. Gourgari S, Goudelis G, Karpouzis K, Kollias S (2013) Thetis: three dimensional tennis shots a human action dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) workshops

  17. GRU module in torch.nn. https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU. Accessed 28 Dec 2020

  18. Gupta A, Karel A, Muthiah S B (2021) Cricket stroke recognition using hard and soft assignment based bag of visual words. In: Singh S K, Roy P, Raman B, Nagabhushan P (eds) Computer vision and image processing. Springer, Singapore, pp 231–242

  19. Gupta A, Karel A, Sakthi Balan M (2020) Discovering cricket stroke classes in trimmed telecast videos. In: Nain N, Vipparthi S K, Raman B (eds) Computer vision and image processing. Springer, Singapore, pp 509–520

  20. Gupta A, Muthiah S B (2018) Temporal cricket stroke localization from untrimmed highlight videos. In: Proceedings of the 11th Indian conference on computer vision, graphics and image processing. ICVGIP 2018. Association for Computing Machinery, New York

  21. Gupta A, Muthiah S B (2020) Viewpoint constrained and unconstrained Cricket stroke localization from untrimmed videos. Image Vis Comput 100:103944. https://doi.org/10.1016/j.imavis.2020.103944

  22. Harikrishna N, Satheesh S, Sriram S D, Easwarakumar K S (2011) Temporal classification of events in cricket videos. In: 2011 National conference on communications (NCC), pp 1–5

  23. He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. arXiv:1512.03385

  24. Heilbron F C, Escorcia V, Ghanem B, Niebles J C (2015) ActivityNet: a large-scale video benchmark for human activity understanding. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit 07:961–970. https://doi.org/10.1109/CVPR.2015.7298698

  25. Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. Image Vis Comput 60:4–21. https://doi.org/10.1016/j.imavis.2017.01.010

  26. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

  27. Hui T-W, Tang X, Loy C C (2018) LiteFlowNet: a lightweight convolutional neural network for optical flow estimation. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 8981–8989. http://mmlab.ie.cuhk.edu.hk/projects/LiteFlowNet/

  28. Ibrahim M S, Muralidharan S, Deng Z, Vahdat A, Mori G (2016) A hierarchical deep temporal model for group activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

  29. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231. https://doi.org/10.1109/TPAMI.2012.59

  30. Junejo I N, Dexter E, Laptev I, Pérez P (2011) View-independent action recognition from temporal self-similarities. IEEE Trans Pattern Anal Mach Intell 33(1):172–185. https://doi.org/10.1109/TPAMI.2010.68

  31. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: 2014 IEEE Conference on computer vision and pattern recognition (CVPR), pp 1725–1732. https://doi.org/10.1109/CVPR.2014.223

  32. Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M, Zisserman A (2017) The kinetics human action video dataset. arXiv:1705.06950

  33. Kingma D, Ba J (2014) Adam: a method for stochastic optimization, pp 1–15, arXiv:1412.6980

  34. Kolekar M H, Palaniappan K, Sengupta S (2008) Semantic event detection and classification in cricket video sequence. In: 2008 Sixth Indian conference on computer vision, graphics & image processing, pp 382–389

  35. Kolekar M H (2011) Bayesian belief network based broadcast sports video indexing. Multimed Tools Appl 54(1):27–54. https://doi.org/10.1007/s11042-010-0544-9

  36. Kolekar M H, Sengupta S (2010) Semantic concept mining in cricket videos for automated highlight generation. Multimed Tools Appl 47(3):545–579. https://doi.org/10.1007/s11042-009-0337-1

  37. Krizhevsky A, Sutskever I, Hinton G E (2012) ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges C J C, Bottou L, Weinberger K Q (eds) Advances in neural information processing systems 25. Curran Associates, Inc., pp 1097–1105

  38. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. Proc IEEE Int Conf Comput Vision, pp 2556–2563. https://doi.org/10.1109/ICCV.2011.6126543

  39. Kulkarni K M, Shenoy S (2021) Table tennis stroke recognition using two-dimensional human pose estimation. arXiv:2104.09907

  40. Kumar A, Garg J, Mukerjee A (2014) Cricket activity detection. In: International image processing, applications and systems conference, IPAS 2014, pp 1–6. https://doi.org/10.1109/IPAS.2014.7043264

  41. Language Modeling with nn.Transformer and TorchText. https://pytorch.org/tutorials/beginner/transformer_tutorial.html. Accessed 08 Aug 2021

  42. Lazarescu M, Venkatesh S, West G (2002) On the automatic indexing of cricket using camera motion parameters. In: Proceedings. IEEE international conference on multimedia and expo, vol 1, pp 809–812

  43. Liu H, Tang H, Xiao W, Guo Z, Tian L, Gao Y (2016) Sequential bag-of-words model for human action classification. CAAI Trans Intell Technol 1(2):125–136. https://doi.org/10.1016/j.trit.2016.10.001

  44. Liu J, Carr P, Collins R T, Liu Y (2013) Tracking sports players with context-conditioned motion models. In: 2013 IEEE Conference on computer vision and pattern recognition, pp 1830–1837

  45. Lu W-L, Ting J, Little J J, Murphy K P (2013) Learning to track and identify players from broadcast sports videos. IEEE Trans Pattern Anal Mach Intell 35(7):1704–1716. https://doi.org/10.1109/TPAMI.2012.242

  46. Lucas B D, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th international joint conference on artificial intelligence - volume 2. IJCAI’81. Morgan Kaufmann Publishers Inc., San Francisco, pp 674–679

  47. Moeslund T, Thomas G, Hilton A, Little J, Merler M, Gade R CVSports — 7th International workshop on computer vision in sports (CVsports) at CVPR 2021. http://www.vap.aau.dk/cvsports/. Accessed 15 Sept 2021

  48. Moodley T, van der Haar D (2020) Casrm: cricket automation and stroke recognition model using openpose. In: Duffy V G (ed) Digital human modeling and applications in health, safety, ergonomics and risk management. Posture, motion and health. Springer International Publishing, Cham, pp 67–78

  49. Moodley T, van der Haar D (2020) Cricket stroke recognition using computer vision methods. In: Kim K J, Kim H-Y (eds) Information science and applications. Springer, Singapore, pp 171–181

  50. Najafzadeh N, Fotouhi M, Kasaei S (2015) Multiple soccer players tracking. In: 2015 The international symposium on artificial intelligence and signal processing (AISP), pp 310–315

  51. Peng X, Wang L, Wang X, Qiao Y (2014) Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. arXiv:1405.4506

  52. Piergiovanni A J, Ryoo M S (2018) Fine-grained activity recognition in baseball videos. In: The IEEE conference on computer vision and pattern recognition (CVPR) workshops

  53. Pramod Sankar K, Pandey S, Jawahar C V (2006) Text driven temporal segmentation of cricket videos. In: Proceedings of the 5th Indian conference on computer vision, graphics and image processing. ICVGIP’06. Springer, Berlin, pp 433–444

  54. Quiroga J, Carrillo H, Maldonado E, Ruiz J, Zapata L M (2020) As seen on tv: automatic basketball video production using gaussian-based actionness and game states recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops

  55. Ramanathan V, Huang J, Abu-El-Haija S, Gorban A N, Murphy K, Fei-Fei L (2015) Detecting events and key actors in multi-person videos. arXiv:1511.02917

  56. Ravinder M, Venugopal T (2016) Content-based cricket video shot classification using bag-of-visual-features. In: Dash S S, Bhaskar M A, Panigrahi B K, Das S (eds) Artificial intelligence and evolutionary computations in engineering systems. Springer, New Delhi, pp 599–606

  57. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg A C, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252. https://doi.org/10.1007/s11263-015-0816-y

  58. Semwal A, Mishra D, Raj V, Sharma J, Mittal A (2018) Cricket shot detection from videos. In: 2018 9th International conference on computing, communication and networking technologies (ICCCNT), pp 1–6

  59. Sharma R A, Sankar K P, Jawahar C V (2015) Fine-grain annotation of cricket videos. arXiv:1511.07607

  60. Shih H (2018) A survey of content-aware video analysis for sports. IEEE Trans Circ Syst Video Technol 28(5):1212–1231. https://doi.org/10.1109/TCSVT.2017.2655624

  61. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. arXiv:1406.2199

  62. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proceedings Ninth IEEE international conference on computer vision, vol 2, pp 1470–1477. https://doi.org/10.1109/ICCV.2003.1238663

  63. Soomro K, Zamir A R, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402

  64. Sutskever I, Vinyals O, Le Q V (2014) Sequence to sequence learning with neural networks. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger K Q (eds) Advances in neural information processing systems, vol 27. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf. Accessed 15 Sept 2021

  65. Teachabarikiti K, Chalidabhongse T H, Thammano A (2010) Players tracking and ball detection for an automatic tennis video annotation. In: 2010 11th International conference on control automation robotics vision, pp 2461–2494

  66. Thomas G, Gade R, Moeslund T B, Carr P, Hilton A (2017) Computer vision for sports: current applications and research topics. Comput Vis Image Underst 159:3–18. https://doi.org/10.1016/j.cviu.2017.04.011

  67. Trace Bot. https://traceup.com/soccer/how-it-works. Accessed 15 Sept 2021

  68. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: The IEEE international conference on computer vision (ICCV)

  69. van Gemert J C, Veenman C J, Smeulders A W M, Geusebroek J (2010) Visual word ambiguity. IEEE Trans Pattern Anal Mach Intell 32(7):1271–1283

  70. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg U V, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Accessed 15 Sept 2021

  71. Veo — Sports Camera. https://event.veo.co. Accessed 15 Sept 2021

  72. Yan X, Lou Z, Hu S, Ye Y (2020) Multi-task information bottleneck co-clustering for unsupervised cross-view human action categorization. ACM Trans Knowl Discov Data 14(2). https://doi.org/10.1145/3375394

  73. Yao A, Uebersax D, Gall J, Van Gool L (2010) Tracking People in broadcast sports. In: Goesele M, Roth S, Kuijper A, Schiele B, Schindler K (eds) Pattern recognition. Springer, Berlin, pp 151–161

  74. Zhu G, Xu C, Huang Q, Gao W (2006) Automatic multi-player detection and tracking in broadcast sports video using support vector machine and particle filter. In: 2006 IEEE International conference on multimedia and expo, pp 1629–1632

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Arpan Gupta.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A.1 Sampling the stroke clips

Clips were sampled from the Cricket strokes using our custom data loader, which extends the VisionDataset class in Torchvision 0.4.0. The VideoClips class was modified for our dataset so that it generates clip meta-data from the stroke information. The modification allows pre-extracted clip features, instead of raw frames, to be sampled using the clip meta-data. Figure 2b illustrates the distribution of the sampled clips (not the Cricket strokes). The number of samples generated for each category follows a similar distribution when different temporal sequence sizes are considered. In order to compensate for the skewed distribution and better train the GRU models, we use the WeightedRandomSampler class available in PyTorch.
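The class balancing described above can be sketched as follows. This is a minimal illustration and not the authors' data loader: the label counts are made up, the inverse-frequency weighting is an assumption (the text does not state how the weights were computed), and the standard-library `random.choices` stands in for PyTorch's `WeightedRandomSampler` so that the sketch has no dependencies.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one sampling weight per clip: 1 / (number of clips in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Hypothetical, skewed clip labels for the 5 stroke categories.
labels = [0] * 50 + [1] * 200 + [2] * 400 + [3] * 250 + [4] * 100
weights = inverse_frequency_weights(labels)

# With PyTorch available, these weights would be passed as
#   torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights),
#                                          replacement=True)
# Here random.choices draws clip indices with the same per-sample probabilities.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=10000)
drawn_counts = Counter(labels[i] for i in drawn)

for category in sorted(drawn_counts):
    print(category, drawn_counts[category])
```

Because every category's weights sum to the same total, each category is drawn about equally often even though category 2 has eight times as many clips as category 0.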

A.2 Finetuning C3D model

The C3D finetuning was performed on our dataset by sampling clips of 16 contiguous RGB frames, using a step size of 4. The pre-trained C3D model architecture is the same as that used by Tran et al. [68] (see Note 5). The FC layers and the Conv5b layer were finetuned using SGD with a learning rate of 0.001, which was decreased by a factor of 10 after 15 epochs. Each epoch was executed for 150 iterations. The progression of the loss and accuracy values is shown in Fig. 8.

Fig. 8 Accuracy and loss values while finetuning a C3D model for 30 epochs. Each epoch was executed for 150 iterations
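The clip sampling and learning-rate schedule used for finetuning can be illustrated with a short sketch. The function names and the example frame range are hypothetical. The sketch assumes that the "step size of 4" is the stride between consecutive clip start frames and that clips must lie fully inside the stroke; it is not taken from the released implementation.

```python
def clip_starts(first_frame, last_frame, clip_len=16, step=4):
    """Start indices of contiguous clips that fit inside [first_frame, last_frame]."""
    last_start = last_frame - clip_len + 1
    return list(range(first_frame, last_start + 1, step))


def step_lr(epoch, base_lr=0.001, drop_every=15, factor=10.0):
    """Learning rate for a 0-based epoch index under a step decay."""
    return base_lr / (factor ** (epoch // drop_every))


# A hypothetical stroke spanning frames 100..130 yields four 16-frame clips.
print(clip_starts(100, 130))  # [100, 104, 108, 112]

# Epochs 0-14 use 0.001; epochs 15-29 use a rate ten times smaller.
print([step_lr(e) for e in (0, 14, 15, 29)])
```

In PyTorch, the same schedule over 30 epochs corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=15` and `gamma=0.1`.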

A.3 Training 2 Stream GRU model

Multiple combinations of extracted feature pairs were used for training a 2 stream GRU model with late fusion [31]. We experimented with combinations of OF Grid features with HOOF features, 2D CNN extracted features, and HOG features (similar to Simonyan et al. [61]), but they performed worse than the model trained on OF Grid 20 features. The best performing combination, OF Grid 20 with HOG, is shown in Fig. 9. We used soft assignment with C = 1000 for both streams. The feature sizes for OF Grid and HOG were 576 and 3600, respectively, and the hidden size and number of layers were the same as those of the single stream model.

Fig. 9 Accuracy and loss values while training 2 Stream GRU model for 30 epochs

We chose the model trained on a sequence length of 24, which performed the best on the validation set, for reporting the test set accuracy. The validation set accuracy values over a range of sequence lengths are shown in Fig. 9c, where each point represents a separate GRU model trained from scratch. Note that the model trained on a sequence length of 34 did not converge and its validation accuracy did not improve, which may be due to the training being stuck at a local minimum. Since all the models were trained with the same random seed, this anomaly can most likely be resolved by using a different random seed, which would generate a different order of training samples.
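To make the distinction between hard and soft assignment concrete, the sketch below maps one feature vector to a small codebook. It is an illustration under stated assumptions, not the authors' implementation: the soft weights use a Gaussian-kernel form in the spirit of van Gemert et al. [69], the smoothing parameter `beta` is hypothetical, and the toy codebook has 3 words instead of the C = 1000 used in the experiments.

```python
import math


def squared_distances(feature, codebook):
    """Squared Euclidean distance from the feature to every codeword."""
    return [sum((f - c) ** 2 for f, c in zip(feature, word)) for word in codebook]


def hard_assign(feature, codebook):
    """Hard assignment: index of the single nearest codeword."""
    d = squared_distances(feature, codebook)
    return min(range(len(d)), key=d.__getitem__)


def soft_assign(feature, codebook, beta=1.0):
    """Soft assignment: membership weights over all codewords, summing to 1.

    Assumed form: weight_k proportional to exp(-beta * squared_distance_k).
    Subtracting the minimum distance keeps the exponentials numerically stable.
    """
    d = squared_distances(feature, codebook)
    d_min = min(d)
    w = [math.exp(-beta * (x - d_min)) for x in d]
    total = sum(w)
    return [x / total for x in w]


# Toy 2-D codebook and feature (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
feature = [0.9, 0.1]

nearest = hard_assign(feature, codebook)
memberships = soft_assign(feature, codebook, beta=2.0)
print(nearest)
print([round(m, 3) for m in memberships])
```

Hard assignment turns each frame-level feature into a single word index, whereas soft assignment yields a C-dimensional vector per time step, so a sequence model receives graded information about nearby codewords.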

About this article

Cite this article

Gupta, A., Muthiah, S.B. Learning cricket strokes from spatial and motion visual word sequences. Multimed Tools Appl 82, 1237–1259 (2023). https://doi.org/10.1007/s11042-022-13307-y

  • DOI: https://doi.org/10.1007/s11042-022-13307-y
