Recognizing actions in Cricket telecast videos is challenging, mainly because of rapid camera motion, camera switching, and variations in background/foreground, scale, position and viewpoint. Our work deals with the task of trimmed Cricket stroke classification. We used the Cricket Highlights dataset of Gupta and Balan (2020) and manually labeled the 562 trimmed strokes into 5 categories based on the direction of stroke play. These categories are independent of the batsman's pose orientation (or handedness) and are useful in determining the outcome of a Cricket stroke. Models trained on our proposed categories can have applications in building player profiles, automated extraction of direction-dependent strokes, and highlights generation. Gated Recurrent Unit (GRU) based models were trained on sequences of spatial and motion visual words, obtained by hard assignment (HA) and soft assignment (SA). An extensive set of experiments was carried out on frame-level dense optical flow grid (OF Grid) features, histogram of oriented optical flow (HOOF) features, and features extracted by pretrained 2D ResNet and pretrained 3D ResNet models. Training on visual word sequences gives better results than training on raw feature sequences. Moreover, for OF Grid features, the soft assignment based word sequences perform better than the hard assignment based sequences. We present strong baseline results for this new dataset, with the best accuracy of 81.13% on the test set, using soft assignment on optical flow based grid features. We compare our results with Transformer and 2-stream GRU models trained on HA/SA visual words, and 3D convolutional models (C3D/I3D) trained on raw frame sequences.

Code Availability
Our implementation is available on GitHub (Footnote 4).
The details of Cricket and its related terminology can be found at https://www.cs.purdue.edu/homes/hosking/cricket/explanation.htm and https://www.youtube.com/watch?v=g-beFHld19c : Last Accessed 13 September, 2021.
Note that the five categories are not the usual ‘types’ of Cricket strokes, such as “Cover Drive”, “Pull shot”, “Sweep shot” etc., which depend on the sequence of batsman poses. Instead, they are only a coarse-grained representation based on the direction of the stroke. For example, all the “Cover drives”, “Long-Off drives” and lofted strokes hit in these directions by a right-handed batsman will belong to category 3, while for a left-handed batsman, the same category will contain “Mid-Wicket” and “Long-On drives”.
https://docs.opencv.org/3.2.0/d7/d8b/tutorial_py_lucas_kanade.html : Last Accessed 29 December, 2020.
C3D weights (pretrained on Sports1M [31]) were available at http://imagelab.ing.unimore.it/files/c3d_pytorch/c3d.pickle : Last Accessed 11 January, 2021.
A.1 Sampling the stroke clips
Clips were sampled from the Cricket strokes using a custom data loader that extends the VisionDataset class of Torchvision 0.4.0. We modified the VideoClips class for our dataset so that it generates clip meta-data from the stroke information. The modification also allows pre-extracted clip features, instead of raw frames, to be sampled using the clip meta-data. Figure 2b illustrates the distribution of the sampled clips (not of the Cricket strokes). The number of samples generated for each category follows a similar distribution when different temporal sequence sizes are considered. To compensate for the skewed distribution and better train the GRU models, we use the WeightedRandomSampler class available in PyTorch.
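As an illustration of how weighted sampling can offset a skewed clip distribution, the following minimal sketch computes inverse-frequency weights per clip. It uses only the Python standard library: `random.choices` stands in for the sampler, and the label counts and the helper name `inverse_frequency_weights` are hypothetical, not taken from our implementation.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one sampling weight per clip, equal to 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Hypothetical, skewed clip labels for five categories (illustration only).
labels = [0] * 10 + [1] * 40 + [2] * 200 + [3] * 100 + [4] * 50
weights = inverse_frequency_weights(labels)

# Stand-in for the sampler: draw clip indices with replacement,
# with probability proportional to the per-clip weights.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=5000)
drawn_counts = Counter(labels[i] for i in drawn)
```

With these weights every class carries the same total probability mass, so the drawn clips are approximately balanced across categories. In PyTorch, the same per-clip weight list could be passed as `WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)` and supplied to the `DataLoader` through its `sampler` argument.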
A.2 Finetuning the C3D model
The C3D model was finetuned on our dataset by sampling clips of 16 contiguous RGB frames, using a step size of 4. The pre-trained C3D model architecture is the same as that used by Tran et al. [68] (Footnote 5). The FC layers and the Conv5b layer were finetuned using SGD with a learning rate of 0.001, which was decreased by a factor of 10 after 15 epochs. Each epoch was executed for 150 iterations. The progression of the loss and accuracy values is shown in Fig. 8.
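The clip windows and the learning-rate schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, clips that would run past the end of a stroke are dropped, and the learning rate is assumed to drop only once (the text does not say whether the decay repeats).

```python
def clip_starts(num_frames, clip_len=16, step=4):
    """Start indices of contiguous clips of `clip_len` frames, taken every `step` frames.

    Assumption: only clips that fit entirely inside the stroke are kept.
    """
    return list(range(0, num_frames - clip_len + 1, step))


def step_lr(epoch, base_lr=0.001, drop_epoch=15, factor=10.0):
    """Learning rate divided by `factor` once `drop_epoch` epochs have completed.

    Assumption: epochs are counted from 0 and the drop happens a single time.
    """
    return base_lr / factor if epoch >= drop_epoch else base_lr
```

For a stroke of 32 frames, `clip_starts(32)` yields the five windows starting at frames 0, 4, 8, 12 and 16; a stroke shorter than 16 frames yields no clips.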
A.3 Training the 2-stream GRU model
Multiple combinations of extracted feature pairs were used for training a 2-stream GRU model with late fusion [31]. We experimented with combinations of OF Grid features with HOOF features, 2D CNN extracted features, and HOG features (similar to Simonyan et al. [61]), but they performed worse than the single-stream model trained on OF Grid 20. The best performing combination, OF Grid 20 with HOG, is shown in Fig. 9. We used soft assignment with C = 1000 for both streams. The feature sizes for OF Grid and HOG were 576 and 3600, respectively, and the hidden size and number of layers were the same as those of the single-stream model.
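One possible form of late fusion is sketched below: the outputs of the two streams are concatenated and passed through a single linear classification layer. This is an assumption made for illustration, since the exact fusion layer is not detailed in this section; the function names, the toy vectors and the weights are hypothetical, and the two input vectors stand for the final hidden states of the two GRU streams.

```python
def late_fusion_logits(h_a, h_b, weight, bias):
    """Concatenate two stream outputs and apply one linear layer.

    h_a, h_b : final hidden vectors of the two streams (lists of floats)
    weight   : one row per class, each of length len(h_a) + len(h_b)
    bias     : one value per class
    """
    fused = list(h_a) + list(h_b)
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weight, bias)]


def predict(h_a, h_b, weight, bias):
    """Index of the class with the highest fused logit."""
    logits = late_fusion_logits(h_a, h_b, weight, bias)
    return max(range(len(logits)), key=logits.__getitem__)
```

The streams stay independent until the final layer, so each GRU can consume a different feature type and feature size.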
We chose the model trained on a sequence length of 24, which performed best on the validation set, for reporting the test set accuracy. The validation set accuracy over a range of sequence lengths is shown in Fig. 9c, where each point represents a separate GRU model trained from scratch. Note that the model trained on a sequence length of 34 did not converge and its validation accuracy did not improve, which may be due to the optimization getting stuck at a local minimum. Since all the models were trained with the same random seed, this anomaly can most likely be resolved by using a different random seed, which would generate a different order of training samples.
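The dependence of the sample order on the random seed can be illustrated with a small sketch. It uses Python's standard `random` module as a stand-in for the framework's random number generator, and the helper name `epoch_order` is hypothetical.

```python
import random


def epoch_order(num_samples, seed):
    """Return a shuffled order of sample indices that is fully determined by `seed`."""
    rng = random.Random(seed)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


same_a = epoch_order(100, seed=0)
same_b = epoch_order(100, seed=0)
other = epoch_order(100, seed=1)
```

Repeating a run with the same seed reproduces the same order, whereas changing the seed presents the same samples in a different order, which changes the sequence of gradient updates.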
