Abstract
There are a number of challenges involved in recognizing actions from Cricket telecast videos, mainly due to the rapid camera motion, camera switching, and variations in background/foreground, scale, position and viewpoint. Our work deals with the task of trimmed Cricket stroke classification. We used the Cricket Highlights dataset of Gupta and Balan (2020) and manually labeled the 562 trimmed strokes into 5 categories based on the direction of stroke play. These categories are independent of the batsman's pose orientation (or handedness) and are useful in determining the outcome of a Cricket stroke. Models trained on our proposed categories can have applications in building player profiles, automated extraction of direction-dependent strokes, and highlights generation. Gated Recurrent Unit (GRU) based models were trained on sequences of spatial and motion visual words, obtained by hard assignment (HA) and soft assignment (SA). An extensive set of experiments was carried out on frame-level dense optical flow grid (OF Grid) features, histograms of oriented optical flow (HOOF), and features extracted with pretrained 2D ResNet and 3D ResNet models. Training on visual word sequences gives better results than training on raw feature sequences. Moreover, the soft assignment based word sequences perform better than the hard assignment based sequences of OF Grid features. We present strong baseline results for this new dataset, with a best test set accuracy of 81.13%, obtained using soft assignment on optical flow based grid features. We compare our results with Transformer and 2-stream GRU models trained on HA/SA visual words, and with 3D convolutional models (C3D/I3D) trained on raw frame sequences.
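As an illustrative sketch of the visual word pipeline summarized above (this is not the published implementation; the codebook size, kernel bandwidth and toy data shapes below are assumptions), the frame-level features of a trimmed stroke can be mapped to hard- and soft-assignment word sequences as follows, with the soft assignment following a kernel codebook formulation in the spirit of van Gemert et al.:

```python
# Sketch: frame-level clip features -> HA / SA visual word sequences for a GRU.
# Codebook size, sigma and the toy data shapes are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_features, num_words=50, seed=0):
    """Cluster pooled frame-level features (N x D) into a visual word codebook."""
    kmeans = KMeans(n_clusters=num_words, random_state=seed, n_init=10)
    kmeans.fit(train_features)
    return kmeans.cluster_centers_                        # (C x D)

def hard_assign(clip_features, centers):
    """Hard assignment: index of the nearest codebook word for each frame."""
    dists = np.linalg.norm(clip_features[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)                           # (T,)

def soft_assign(clip_features, centers, sigma=1.0):
    """Kernel codebook soft assignment: one normalized weight vector per frame."""
    dists = np.linalg.norm(clip_features[:, None, :] - centers[None, :, :], axis=2)
    scores = -(dists ** 2) / (2.0 * sigma ** 2)
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)   # (T x C)

# Toy usage: 576-dim OF Grid-like features, a 50-word codebook, a 24-frame clip.
rng = np.random.default_rng(0)
codebook = build_codebook(rng.standard_normal((2000, 576)))
clip = rng.standard_normal((24, 576))
ha_seq = hard_assign(clip, codebook)   # word-index sequence for the GRU
sa_seq = soft_assign(clip, codebook)   # soft-assignment sequence for the GRU
```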
Code Availability
Our implementation is available on GitHub.
Notes
The details of Cricket and its related terminology can be found at https://www.cs.purdue.edu/homes/hosking/cricket/explanation.htm and https://www.youtube.com/watch?v=g-beFHld19c (last accessed 13 September 2021).
It is to be noted that the five categories are not the usual ‘types’ of Cricket strokes, such as “Cover Drive”, “Pull shot”, “Sweep shot”, etc., which depend on the sequence of batsman poses. Instead, they are only a coarse-grained representation based on the direction of the stroke. E.g., all the “Cover drives”, “Long-Off drives” and lofted strokes hit in these directions by a right-handed batsman will belong to category 3, while for a left-handed batsman the same category will contain “Mid-Wicket” and “Long-On drives”.
https://docs.opencv.org/3.2.0/d7/d8b/tutorial_py_lucas_kanade.html (last accessed 29 December 2020).
C3D weights (pretrained on Sports1M [31]) were available at http://imagelab.ing.unimore.it/files/c3d_pytorch/c3d.pickle (last accessed 11 January 2021).
References
Bradski G (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools
Cai Z, Neher H, Vats K, Clausi D A, Zelek J S (2018) Temporal hockey action recognition via pose and optical flows. arXiv:1812.09533
Carreira J, Zisserman A (2017) Quo Vadis, action recognition? A new model and the kinetics dataset. arXiv:1705.07750
Cho K, van Merrienboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv:1409.1259
Cioppa A, Deliege A, Giancola S, Ghanem B, Droogenbroeck M V, Gade R, Moeslund T B (2020) A context-aware loss function for action spotting in soccer videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Chung J, Gülçehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol 1, pp 886–893
Deliege A, Cioppa A, Giancola S, Seikavandi M J, Dueholm J V, Nasrollahi K, Ghanem B, Moeslund T B, Van Droogenbroeck M (2021) Soccernet-v2: a dataset and benchmarks for holistic understanding of broadcast soccer videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops, pp 4508–4519
Digital Gaming Technology (DGT). http://www.digitalgametechnology.com/index.php/products/electronic-boards. Accessed 15 Sept 2021
Donahue J, Hendricks L A, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, Darrell T (2017) Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 39(4):677–691. https://doi.org/10.1109/TPAMI.2016.2599174
D’Orazio T, Leo M (2010) A review of vision-based systems for soccer video analysis. Pattern Recogn 43(8):2911–2926. https://doi.org/10.1016/j.patcog.2010.03.009
Farnebäck G (2003) Two-frame motion estimation based on polynomial expansion. In: Proceedings of the 13th Scandinavian conference on image analysis. SCIA’03. Springer, Berlin, pp 363– 370
Faulkner H, Dick A (2017) Tenniset: a dataset for dense fine-grained event recognition, localisation and description. In: 2017 International conference on digital image computing: techniques and applications (DICTA). IEEE, pp 1–8
Foysal M F A, Islam M S, Karim A, Neehal N (2019) Shot-net: a convolutional neural network for classifying different cricket shots. In: Santosh K C, Hegadi R S (eds) Recent trends in image processing and pattern recognition. Springer, Singapore, pp 111–120
Giancola S, Amine M, Dghaily T, Ghanem B (2018) SoccerNet: a scalable dataset for action spotting in soccer videos. arXiv:1804.04527
Gourgari S, Goudelis G, Karpouzis K, Kollias S (2013) Thetis: three dimensional tennis shots a human action dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) workshops
GRU module in torch.nn. https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU. Accessed 28 Dec 2020
Gupta A, Karel A, Muthiah S B (2021) Cricket stroke recognition using hard and soft assignment based bag of visual words. In: Singh S K, Roy P, Raman B, Nagabhushan P (eds) Computer vision and image processing. Springer, Singapore, pp 231–242
Gupta A, Karel A, Sakthi Balan M (2020) Discovering cricket stroke classes in trimmed telecast videos. In: Nain N, Vipparthi S K, Raman B (eds) Computer vision and image processing. Springer, Singapore, pp 509–520
Gupta A, Muthiah S B (2018) Temporal cricket stroke localization from untrimmed highlight videos. In: Proceedings of the 11th Indian conference on computer vision, graphics and image processing. ICVGIP 2018. Association for Computing Machinery, New York
Gupta A, Muthiah S B (2020) Viewpoint constrained and unconstrained Cricket stroke localization from untrimmed videos. Image Vis Comput 100:103944. https://doi.org/10.1016/j.imavis.2020.103944
Harikrishna N, Satheesh S, Sriram S D, Easwarakumar K S (2011) Temporal classification of events in cricket videos. In: 2011 National conference on communications (NCC), pp 1–5
He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. arXiv:1512.03385
Heilbron F C, Escorcia V, Ghanem B, Niebles J C (2015) ActivityNet: a large-scale video benchmark for human activity understanding. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit 07:961–970. https://doi.org/10.1109/CVPR.2015.7298698
Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. Image Vis Comput 60:4–21. https://doi.org/10.1016/j.imavis.2017.01.010
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Hui T-W, Tang X, Loy C C (2018) LiteFlowNet: a lightweight convolutional neural network for optical flow estimation. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 8981–8989. http://mmlab.ie.cuhk.edu.hk/projects/LiteFlowNet/
Ibrahim M S, Muralidharan S, Deng Z, Vahdat A, Mori G (2016) A hierarchical deep temporal model for group activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231. https://doi.org/10.1109/TPAMI.2012.59
Junejo I N, Dexter E, Laptev I, Pérez P (2011) View-independent action recognition from temporal self-similarities. IEEE Trans Pattern Anal Mach Intell 33(1):172–185. https://doi.org/10.1109/TPAMI.2010.68
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: 2014 IEEE Conference on computer vision and pattern recognition (CVPR), pp 1725–1732. https://doi.org/10.1109/CVPR.2014.223
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M, Zisserman A (2017) The kinetics human action video dataset. arXiv:1705.06950
Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
Kolekar M H, Palaniappan K, Sengupta S (2008) Semantic event detection and classification in cricket video sequence. In: 2008 Sixth Indian conference on computer vision, graphics & image processing, pp 382–389
Kolekar M H (2011) Bayesian belief network based broadcast sports video indexing. Multimed Tools Appl 54(1):27–54. https://doi.org/10.1007/s11042-010-0544-9
Kolekar M H, Sengupta S (2010) Semantic concept mining in cricket videos for automated highlight generation. Multimed Tools Applic 47(3):545–579. https://doi.org/10.1007/s11042-009-0337-1
Krizhevsky A, Sutskever I, Hinton G E (2012) ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges C J C, Bottou L, Weinberger K Q (eds) Advances in neural information processing systems 25. Curran Associates, Inc., pp 1097–1105
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: Proc IEEE Int Conf Comput Vision, pp 2556–2563. https://doi.org/10.1109/ICCV.2011.6126543
Kulkarni K M, Shenoy S (2021) Table tennis stroke recognition using two-dimensional human pose estimation. arXiv:2104.09907
Kumar A, Garg J, Mukerjee A (2014) Cricket activity detection. In: International image processing, applications and systems conference, IPAS 2014, pp 1–6. https://doi.org/10.1109/IPAS.2014.7043264
Language Modeling with nn.Transformer and TorchText. https://pytorch.org/tutorials/beginner/transformer_tutorial.html. Accessed 08 Aug 2021
Lazarescu M, Venkatesh S, West G (2002) On the automatic indexing of cricket using camera motion parameters. In: Proceedings of the IEEE international conference on multimedia and expo, vol 1, pp 809–812
Liu H, Tang H, Xiao W, Guo Z, Tian L, Gao Y (2016) Sequential bag-of-words model for human action classification. CAAI Trans Intell Technol 1(2):125–136. https://doi.org/10.1016/j.trit.2016.10.001
Liu J, Carr P, Collins R T, Liu Y (2013) Tracking sports players with context-conditioned motion models. In: 2013 IEEE Conference on computer vision and pattern recognition, pp 1830–1837
Lu W-L, Ting J, Little J J, Murphy K P (2013) Learning to track and identify players from broadcast sports videos. IEEE Trans Pattern Anal Mach Intell 35(7):1704–1716. https://doi.org/10.1109/TPAMI.2012.242
Lucas B D, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th international joint conference on artificial intelligence - volume 2. IJCAI’81. Morgan Kaufmann Publishers Inc., San Francisco, pp 674–679
Moeslund T, Thomas G, Hilton A, Little J, Merler M, Gade R CVSports — 7th International workshop on computer vision in sports (CVsports) at CVPR 2021. http://www.vap.aau.dk/cvsports/. Accessed 15 Sept 2021
Moodley T, van der Haar D (2020) Casrm: cricket automation and stroke recognition model using openpose. In: Duffy V G (ed) Digital human modeling and applications in health, safety, ergonomics and risk management. Posture, motion and health. Springer International Publishing, Cham, pp 67–78
Moodley T, van der Haar D (2020) Cricket stroke recognition using computer vision methods. In: Kim K J, Kim H-Y (eds) Information science and applications. Springer, Singapore, pp 171–181
Najafzadeh N, Fotouhi M, Kasaei S (2015) Multiple soccer players tracking. In: 2015 The international symposium on artificial intelligence and signal processing (AISP), pp 310–315
Peng X, Wang L, Wang X, Qiao Y (2014) Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. arXiv:1405.4506
Piergiovanni AJ, Ryoo M S (2018) Fine-grained activity recognition in baseball videos. In: The IEEE Conference on computer vision and pattern recognition (CVPR) workshops
Pramod Sankar K, Pandey S, Jawahar C V (2006) Text driven temporal segmentation of cricket videos. In: Proceedings of the 5th Indian conference on computer vision, graphics and image processing. ICVGIP’06. Springer, Berlin, pp 433–444
Quiroga J, Carrillo H, Maldonado E, Ruiz J, Zapata L M (2020) As seen on tv: automatic basketball video production using gaussian-based actionness and game states recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops
Ramanathan V, Huang J, Abu-El-Haija S, Gorban A N, Murphy K, Fei-Fei L (2015) Detecting events and key actors in multi-person videos. arXiv:1511.02917
Ravinder M, Venugopal T (2016) Content-based cricket video shot classification using bag-of-visual-features. In: Dash S S, Bhaskar M A, Panigrahi B K, Das S (eds) Artificial intelligence and evolutionary computations in engineering systems. Springer, New Delhi, pp 599–606
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg A C, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252. https://doi.org/10.1007/s11263-015-0816-y
Semwal A, Mishra D, Raj V, Sharma J, Mittal A (2018) Cricket shot detection from videos. In: 2018 9th International conference on computing, communication and networking technologies (ICCCNT), pp 1–6
Sharma R A, Sankar K P, Jawahar C V (2015) Fine-grain annotation of cricket videos. arXiv:1511.07607
Shih H (2018) A survey of content-aware video analysis for sports. IEEE Trans Circ Syst Video Technol 28(5):1212–1231. https://doi.org/10.1109/TCSVT.2017.2655624
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. arXiv:1406.2199
Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proceedings Ninth IEEE international conference on computer vision, vol 2, pp 1470–1477. https://doi.org/10.1109/ICCV.2003.1238663
Soomro K, Zamir A R, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402
Sutskever I, Vinyals O, Le Q V (2014) Sequence to sequence learning with neural networks. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger K Q (eds) Advances in neural information processing systems, vol 27. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf. Accessed 15 Sept 2021
Teachabarikiti K, Chalidabhongse T H, Thammano A (2010) Players tracking and ball detection for an automatic tennis video annotation. In: 2010 11th International conference on control automation robotics vision, pp 2461–2494
Thomas G, Gade R, Moeslund T B, Carr P, Hilton A (2017) Computer vision for sports: current applications and research topics. Comput Vis Image Underst 159:3–18. https://doi.org/10.1016/j.cviu.2017.04.011
Trace Bot. https://traceup.com/soccer/how-it-works. Accessed 15 Sept 2021
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: The IEEE international conference on computer vision (ICCV)
van Gemert J C, Veenman C J, Smeulders A W M, Geusebroek J (2010) Visual word ambiguity. IEEE Trans Pattern Anal Mach Intell 32 (7):1271–1283
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg U V, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Accessed 15 Sept 2021
Veo — Sports Camera. https://event.veo.co. Accessed 15 Sept 2021
Yan X, Lou Z, Hu S, Ye Y (2020) Multi-task information bottleneck co-clustering for unsupervised cross-view human action categorization. ACM Trans Knowl Discov Data 14(2). https://doi.org/10.1145/3375394
Yao A, Uebersax D, Gall J, Van Gool L (2010) Tracking People in broadcast sports. In: Goesele M, Roth S, Kuijper A, Schiele B, Schindler K (eds) Pattern recognition. Springer, Berlin, pp 151–161
Zhu G, Xu C, Huang Q, Gao W (2006) Automatic multi-player detection and tracking in broadcast sports video using support vector machine and particle filter. In: 2006 IEEE International conference on multimedia and expo, pp 1629–1632
Funding
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
A.1 Sampling the stroke clips
The sampling of clips from Cricket strokes was performed using our custom data loader, built by extending the VisionDataset class in Torchvision 0.4.0. The VideoClips class was modified for our dataset so that it generated clip meta-data using the stroke information. The modification allowed sampling of pre-extracted clip features, instead of raw frames, using the clip meta-data. Figure 2b illustrates the distribution of the sampled clips (not of the Cricket strokes); the number of samples generated for each category remains similar to this distribution when different temporal sequence sizes are considered. In order to compensate for the skewed distribution and better train the GRU models, we used the WeightedRandomSampler class available in PyTorch.
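A minimal sketch of this setup is given below (the feature and label layout, windowing step and batch size are illustrative assumptions, not the exact loader released with the paper): pre-extracted frame-level features of each stroke are windowed into fixed-length clips, and WeightedRandomSampler draws clips with inverse class-frequency weights.

```python
# Sketch (assumed data layout, not the released loader): serve pre-extracted
# clip features and balance the skewed category distribution while sampling.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler

class StrokeClipFeatures(Dataset):
    """Slides a fixed-length window over each stroke's frame-level features."""
    def __init__(self, stroke_feats, stroke_labels, seq_len=24, step=4):
        self.samples = []
        for feats, label in zip(stroke_feats, stroke_labels):
            for start in range(0, len(feats) - seq_len + 1, step):
                self.samples.append((feats[start:start + seq_len], label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        clip, label = self.samples[idx]
        return torch.as_tensor(clip, dtype=torch.float32), label

def make_balanced_loader(dataset, batch_size=32):
    """Weight each clip by the inverse frequency of its stroke category."""
    labels = np.array([label for _, label in dataset.samples])
    class_counts = np.bincount(labels)
    clip_weights = 1.0 / class_counts[labels]
    sampler = WeightedRandomSampler(torch.as_tensor(clip_weights, dtype=torch.double),
                                    num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# Toy usage: three strokes of random 576-dim OF Grid-like features with labels.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((n, 576)) for n in (40, 60, 30)]
loader = make_balanced_loader(StrokeClipFeatures(feats, [0, 2, 4]))
```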
A.2 Finetuning the C3D model
The C3D finetuning was performed on our dataset by sampling clips of 16 contiguous RGB frames, using a step size of 4. The pre-trained C3D model architecture is the same as that used by Tran et al. [68]. The FC layers and the Conv5b layer were finetuned using SGD with a learning rate of 0.001, decreased by a factor of 10 after 15 epochs. Training was run for 150 iterations. The progression of the loss and accuracy values is shown in Fig. 8.
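The configuration described above can be sketched as follows. This is a hedged illustration, not the exact training script: the layer names (conv5b, fc6–fc8), the stub module and the momentum value are assumptions based on the common PyTorch port of C3D.

```python
# Sketch of the C3D finetuning setup: only conv5b and the FC layers are updated,
# with SGD at LR 0.001 decayed by 10x after 15 epochs. Layer names, momentum and
# the stub model are assumptions based on the common PyTorch C3D port.
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

class C3DStub(nn.Module):
    """Stand-in exposing the parameter names the real C3D port uses."""
    def __init__(self):
        super().__init__()
        self.conv5a = nn.Conv3d(512, 512, 3, padding=1)
        self.conv5b = nn.Conv3d(512, 512, 3, padding=1)
        self.fc6 = nn.Linear(8192, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, 487)

def setup_finetuning(c3d_model, num_classes=5):
    c3d_model.fc8 = nn.Linear(4096, num_classes)   # 5 stroke categories
    trainable = []
    for name, param in c3d_model.named_parameters():
        if name.startswith(('conv5b', 'fc')):      # finetune conv5b + FC layers
            param.requires_grad = True
            trainable.append(param)
        else:
            param.requires_grad = False            # keep earlier layers frozen
    optimizer = SGD(trainable, lr=0.001, momentum=0.9)
    scheduler = StepLR(optimizer, step_size=15, gamma=0.1)  # LR / 10 after 15 epochs
    return optimizer, scheduler

optimizer, scheduler = setup_finetuning(C3DStub())
# per epoch: forward (batch, 3, 16, H, W) clips, cross-entropy loss,
# optimizer.step() per batch, then scheduler.step() once per epoch
```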
A.3 Training the 2-stream GRU model
Multiple combinations of extracted feature pairs were used for training a 2-stream GRU model with late fusion [31]. We experimented with combinations of OF Grid features with HOOF features, 2D CNN extracted features, and HOG features (similar to Simonyan et al. [61]), but they performed worse than the single-stream model trained on OF Grid 20. The best performing combination, OF Grid 20 with HOG, is shown in Fig. 9. We used soft assignment with C = 1000 for both streams. The feature sizes for OF Grid and HOG were 576 and 3600, respectively, and the hidden size and number of layers were the same as those of the single-stream model.
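As a point of reference, a minimal sketch of such a two-stream GRU is shown below; the fusion rule (averaging per-stream class scores), the hidden size and the toy input shapes are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of a two-stream GRU with late fusion over soft-assignment word sequences.
# Hidden size, number of layers and the score-averaging fusion are assumptions.
import torch
import torch.nn as nn

class StreamGRU(nn.Module):
    def __init__(self, input_size, hidden_size=256, num_layers=2, num_classes=5):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                  # x: (batch, seq_len, input_size)
        _, h_n = self.gru(x)
        return self.fc(h_n[-1])            # class scores from the last layer's state

class TwoStreamGRU(nn.Module):
    """One GRU per feature stream; per-stream class scores are averaged."""
    def __init__(self, size_a, size_b, num_classes=5):
        super().__init__()
        self.stream_a = StreamGRU(size_a, num_classes=num_classes)
        self.stream_b = StreamGRU(size_b, num_classes=num_classes)

    def forward(self, x_a, x_b):
        return 0.5 * (self.stream_a(x_a) + self.stream_b(x_b))

# Toy usage: soft-assignment word sequences (C = 1000) for both streams.
model = TwoStreamGRU(size_a=1000, size_b=1000)
scores = model(torch.randn(8, 24, 1000), torch.randn(8, 24, 1000))  # (8, 5)
```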
For reporting the test set accuracy, we chose the model trained on a sequence length of 24, which performed the best on the validation set. The validation accuracy values over a range of sequence lengths are shown in Fig. 9c, where each point represents a separate GRU model trained from scratch. It is to be noted that the model trained on a sequence length of 34 did not converge and its validation accuracy did not improve, which may occur due to being stuck at a local minimum. Since all the models were trained with the same random seed, this anomaly can most likely be resolved by using a different random seed, which would generate a different ordering of the training samples.
Cite this article
Gupta, A., Muthiah, S.B. Learning cricket strokes from spatial and motion visual word sequences. Multimed Tools Appl 82, 1237–1259 (2023). https://doi.org/10.1007/s11042-022-13307-y