
1 Introduction

The widespread deployment of surveillance cameras has created an explosive demand for recognizing objects, and especially human actions, in massive video collections. Accurate human action recognition has found a variety of applications in daily life, such as intelligent video surveillance, smart homes and somatic gaming. However, the task is challenging because of obstacles such as background clutter, scale variation, object occlusion and viewpoint shifts. Many efforts have been made in this area over the last decade, but most of them focus on designing classifiers on top of handcrafted feature extraction algorithms, which makes it difficult to achieve both accuracy and robustness. Schuldt et al. [18] constructed local space-time feature representations of videos and used a support vector machine (SVM) for action recognition. Scovanner et al. [19] introduced the 3-dimensional (3D) SIFT descriptor to action recognition and improved the performance considerably. To achieve better results, Wong and Cipolla [26] took the global information encoded in videos into consideration: they studied the organization of pixels in video sequences and proposed a detector based on a set of interest points. In addition, combinations of effective feature descriptors have been widely used, such as HOG (Histogram of Oriented Gradients) + HOF (Histogram of Optical Flow) + MBH (Motion Boundary Histogram) [22] and DT (Dense Trajectories) + BOF (Bag of Features) [24].

Convolutional Neural Networks (CNNs) have shown their advantages in computer vision. Many tasks, such as image segmentation [16] and object tracking [15], have been handled successfully by CNNs, which learn a hierarchy of features directly from the data. For human action recognition, spatial-temporal features have to be constructed. Ji et al. [6] developed a novel 3D CNN architecture that fuses useful spatial-temporal features; however, their model takes gradient and optical flow features as the network input. Karpathy et al. [8] and other researchers [7, 11, 25] proposed further architectures for action recognition. These complex architectures need a considerable amount of training data and tend to overfit when the training set is small.

Many useful training strategies have been proposed to improve the results of deep learning models, and data augmentation is a representative one. To prevent overfitting, Krizhevsky et al. [9] applied translations, horizontal reflections and RGB intensity alterations to the training images to generate more samples. Jung et al. [7] added image rotation to obtain 14 times more training data. To further reduce overfitting and improve robustness, Molchanov et al. [11] introduced two more data augmentation methods: spatial elastic deformation and pixel drop-out.

In this paper, we study human action recognition on the KTH dataset [18] and the UCF Sports dataset [17]. We build our model on 3D convolutional neural networks. To reduce overfitting, we apply data augmentation to the input video volumes. To further improve the performance, we incorporate the One-versus-One (OvO) algorithm into our model, which then reaches a validation accuracy of 94.0% on the KTH dataset and 95.6% on the UCF Sports dataset.

2 Methodology

This section is organized as follows. Section 2.1 briefly introduces the datasets used in our work. Section 2.2 provides the background needed for our model. Section 2.3 describes the data preprocessing adopted in our model. Section 2.4 presents the 3DCNN architecture of our model, and Sect. 2.5 gives the training details.

2.1 Datasets

We build and evaluate our model on the KTH dataset and the UCF Sports dataset. The KTH action database contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping), each performed by 25 people in four scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. There are 100 videos for each action except hand clapping, which has only 99. All videos are shot by a static camera at a frame rate of 25 fps and have a length of four seconds. To reduce the computational cost, all videos are down-sampled to a resolution of \(160 \times 120\) pixels.

The UCF Sports database consists of ten human action types (diving, golf swing, kicking, lifting, riding horse, running, skateboarding, swing-bench, swing-side and walking). The video sequences are collected from a variety of sports scenes on broadcast television channels. There are 150 videos in total, with a resolution of \(720 \times 480\) pixels and unconstrained background environments. Example frames from the two datasets are shown in Figs. 1 and 2.

Fig. 1. The KTH dataset: example video sequences with different people in four different scenarios.

Fig. 2. The UCF Sports dataset: different sports scenes.

2.2 Background

3D Convolution. In a typical 2D CNN, the convolution operations are applied only to the spatial dimensions and therefore cannot capture the temporal features useful for action recognition. In contrast, 3D convolutions extend the convolution operations to the temporal dimension. By convolving 3D kernels over the given spatial-temporal video volumes, the network can obtain the temporal dynamics encoded in several adjacent frames, which are needed for action recognition.

Before the 3D convolutions, we first extract a few contiguous frames from the original video sequence and stack them to form a video volume of size \(w \times h \times d\), where w, h and d denote the width, height and depth (temporal length), respectively. We then apply 3D convolutional kernels of size \(w^\prime \times h^\prime \times d^\prime \) across the volume to obtain a number of feature maps. The output value in a feature map corresponding to the input position (x, y, z) is computed as:

$$\begin{aligned} v_{xyz}=f\left( \sum _{i=0}^{w^\prime -1}\sum _{j=0}^{h^\prime -1}\sum _{k=0}^{d^\prime -1}w_{ijk}\,k_{(x+i)(y+j)(z+k)}+b\right) \end{aligned}$$
(1)

where f denotes the activation function, \(w_{ijk}\) denotes the weight value of the 3D kernel with index (i, j, k), \(k_{(x+i)(y+j)(z+k)}\) denotes the input value at position \((x + i, y + j, z + k)\) and b denotes the bias value.
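
As an illustration of Eq. (1), the following minimal Python sketch evaluates the formula at every position where the kernel fits entirely inside the volume. It is not the implementation used in our experiments: the function names, the nested-list layout indexed as (width, height, depth), the choice of ReLU for f and the toy sizes are assumptions made for this example only.

```python
def relu(value):
    """Example activation f; any other activation could be passed instead."""
    return max(0.0, value)


def conv3d_valid(volume, kernel, bias=0.0, activation=relu):
    """Evaluate Eq. (1) wherever the kernel fits inside the volume (no padding).

    volume[x][y][z] and kernel[i][j][k] are nested lists indexed as
    (width, height, depth). This layout is an assumption of the sketch.
    """
    w, h, d = len(volume), len(volume[0]), len(volume[0][0])
    kw, kh, kd = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for x in range(w - kw + 1):
        plane = []
        for y in range(h - kh + 1):
            row = []
            for z in range(d - kd + 1):
                total = bias
                # Triple sum over the kernel indices (i, j, k) of Eq. (1).
                for i in range(kw):
                    for j in range(kh):
                        for k in range(kd):
                            total += kernel[i][j][k] * volume[x + i][y + j][z + k]
                row.append(activation(total))
            plane.append(row)
        output.append(plane)
    return output


# Toy example: a 4 x 4 x 4 volume of ones and a 2 x 2 x 3 kernel of ones.
volume = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
kernel = [[[1.0] * 3 for _ in range(2)] for _ in range(2)]
feature_map = conv3d_valid(volume, kernel, bias=0.5)
```

In this toy case the kernel covers 12 input values, so every output equals 12.5, and the feature map has size \(3 \times 3 \times 2\) because no padding is applied.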

One-versus-One Algorithm. The OvO (One-versus-One) algorithm is one of the classic multiclass classification algorithms in machine learning [1]. Its core idea is to reduce a multiclass classification problem to multiple binary classification problems. For a problem with N classes, we train \(\frac{N\times (N-1)}{2}\) binary classifiers, each of which receives only the samples of one pair of classes from the original training dataset. At the training stage, each classifier learns to distinguish between its two classes. At the prediction stage, we feed the test data to all classifiers to get \(\frac{N\times (N-1)}{2}\) results for each sample. We then apply majority voting: the class that receives the most votes is selected as the predicted class.

In our sub-data learning work, instead of directly designing a multiclass classifier, we follow the OvO algorithm and train 15 binary classifiers on the KTH dataset (N = 6) and 45 binary classifiers on the UCF Sports dataset (N = 10). Finally, we sum up all the validation results to produce the overall classification result.
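
The pair enumeration and the majority vote can be sketched in a few lines of Python. This is a hedged illustration rather than our actual code: `class_pairs`, `ovo_predict` and `make_toy_classifier` are hypothetical names, the toy classifiers merely stand in for trained binary 3DCNNs, and ties are broken by vote order because the tie-breaking rule is not specified here.

```python
from collections import Counter
from itertools import combinations

KTH_CLASSES = ["boxing", "handclapping", "handwaving", "jogging", "running", "walking"]
UCF_CLASSES = ["diving", "golf swing", "kicking", "lifting", "riding horse",
               "running", "skateboarding", "swing-bench", "swing-side", "walking"]


def class_pairs(classes):
    """All unordered class pairs: N * (N - 1) / 2 of them."""
    return list(combinations(classes, 2))


def ovo_predict(sample, classifiers):
    """Majority vote over binary classifiers.

    `classifiers` maps a class pair to a callable that returns the winning
    class of that pair for `sample`. Ties go to the class that received its
    first vote earliest (an assumption of this sketch).
    """
    votes = Counter(classify(sample) for classify in classifiers.values())
    return votes.most_common(1)[0][0]


def make_toy_classifier(pair, favourite="running"):
    """Hypothetical stand-in for a trained binary 3DCNN."""
    def classify(sample):
        return favourite if favourite in pair else pair[0]
    return classify


toy_classifiers = {pair: make_toy_classifier(pair) for pair in class_pairs(KTH_CLASSES)}
prediction = ovo_predict(None, toy_classifiers)
```

With six classes, every class takes part in five pairwise contests, so a class that wins all of its contests collects five votes and cannot be outvoted.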

2.3 Data Preprocessing

Target Area Segmentation. To lower the computational cost and improve precision, we do not feed our convolutional networks with volumes made up of the original video frames. Part of the video sequences contains scale variation, and in most frames the person occupies only a small percentage of the whole area. We therefore adopt a human detector based on the HOG-SVM algorithm described in detail in [2]. After obtaining the detection results, we crop from the original frames the area centered on the moving person and resize it to \(40 \times 60\) pixels, as shown in Fig. 3. The temporal length of the volumes should be large enough to capture a complete human motion and, at the same time, as small as possible to reduce computation. Based on our observation, a length of 16 frames is well suited to this purpose. The extracted target area thus has a size of \(40 \times 60 \times 16\).

Fig. 3. Illustration of the target area segmentation method.

Data Augmentation. Note that the KTH dataset contains only 599 video clips and the UCF Sports dataset only 150. Such sample sizes are far from enough to prevent overfitting during training. As mentioned in [9], applying label-preserving transformations to the original dataset is the easiest way to reduce overfitting. Motivated by this work, we use two data augmentation methods to enlarge the training data while keeping the test data unchanged.

In our first method of data augmentation, we extract four \(35\,\times \,55\)-pixel patches from the four corners of the \(40\,\times \,60\)-pixel target area produced by the target area segmentation step, and then translate each of them in the diagonal direction (\(\pm 1\) pixel along the x axis and \(\pm 1\) pixel along the y axis). We also extract the center patch of size \(35\,\times \,55\) pixels from the target area. Together, the four corner patches, the four translated patches and the center patch increase the number of training patches by a factor of 9.

Our second method reverses the output of the first augmentation method along the temporal dimension, which doubles the amount of training data. For example, from an original action in which a man runs from right to left we create a new action in which the man moves from left to right. In total, the two methods increase the sample size by a factor of 18.
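
The two augmentation methods can be sketched as follows. This is a minimal illustration under stated assumptions, not our exact implementation: a volume is stored as a list of frames, each frame as a list of pixel rows with 60 rows and 40 columns, and each corner patch is shifted by one pixel toward the center of the target area so that the translated patch stays inside it. The helper names `crop` and `augment` are hypothetical.

```python
def crop(volume, top, left, height, width):
    """Cut the same spatial window out of every frame of the volume."""
    return [[row[left:left + width] for row in frame[top:top + height]]
            for frame in volume]


def augment(volume, patch_h=55, patch_w=35, shift=1):
    """Return 18 training samples for one target-area volume.

    9 spatial patches (4 corners, 4 diagonally shifted corners, 1 center),
    each also reversed along the temporal axis.
    """
    full_h, full_w = len(volume[0]), len(volume[0][0])
    max_top, max_left = full_h - patch_h, full_w - patch_w
    offsets = []
    for top in (0, max_top):
        for left in (0, max_left):
            offsets.append((top, left))                 # corner patch
            step_y = shift if top == 0 else -shift      # move toward the center
            step_x = shift if left == 0 else -shift
            offsets.append((top + step_y, left + step_x))
    offsets.append((max_top // 2, max_left // 2))       # center patch
    patches = [crop(volume, t, l, patch_h, patch_w) for t, l in offsets]
    reversed_patches = [patch[::-1] for patch in patches]  # temporal reversal
    return patches + reversed_patches


# Toy volume: 16 frames of 60 rows x 40 columns; each value encodes (t, y, x).
volume = [[[t * 10000 + y * 100 + x for x in range(40)] for y in range(60)]
          for t in range(16)]
samples = augment(volume)
```

Each returned sample keeps the temporal length of 16 and has a spatial size of \(35 \times 55\) pixels, which is the input size of the network.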

Note that we apply the data augmentation scheme only to the training data. Without data augmentation, our model suffers severely from overfitting in the experiments. Because the augmented patches are smaller than the target area, the input size of the network is reduced accordingly. Thus, at validation time we also extract 5 patches (the four corners and the center) from the original test data and feed them to our model.

2.4 Our Model Architecture

As mentioned above, our model combines 3D convolutional neural networks (3DCNN) with the One-versus-One (OvO) algorithm. We therefore divide the dataset into sub-groups, one per pair of actions, and on each sub-group we train a 3DCNN that learns to distinguish between the corresponding two actions. In the following, we describe in detail the 3DCNN architecture, which effectively captures the temporal and spatial features useful for classification.

As depicted in Fig. 4, the proposed 3DCNN architecture contains six layers, of which the first five are 3D convolutional or max-pooling layers and the last one is a fully connected layer. The input consists of volumes of size \(35\,\times \,55\,\times \,16\), corresponding to a picture size of \(35\,\times \,55\) and a temporal length of 16. First, we apply a 3D convolution with a kernel size of \(5 \times 5 \times 3\) (\(5 \times 5\) in the spatial dimensions and 3 in the temporal dimension) to the input volume. Since one kernel produces only one feature map, we apply 16 different kernels to increase the number of feature maps. To prevent the output size of the first layer from decreasing too sharply, we pad the borders with zeros so that the convolution preserves the input size. A \(2\,\times \,2\,\times \,1\) 3D max pooling operation on each feature map then yields 16 feature maps of reduced size \(17\,\times \,27 \times 16\) in the layer S2. Subsequently, we perform a 3D convolution on the feature maps of S2 with 32 kernels of size \(6\,\times \,7\,\times \,3\), leading to 32 feature maps in C3. After that, we apply \(3\,\times \,3\,\times \,1\) max pooling to the output of C3. At this point the spatial size of the output is small (\(4\,\times \,7\)), so at the next layer we apply only 2D convolutions: C4 is obtained by applying 2D convolutions with 64 kernels of size \(4\,\times \,7\). After all the convolution and subsampling operations, we flatten the feature maps of C4 and concatenate them into one 896D feature vector that encodes the motion information. Next, the fully connected layer FC with 1024 nodes is connected to every unit of the feature vector. Finally, we set the number of outputs to 2, the number of action classes that each binary classifier has to distinguish, and a softmax function turns the two output values into the probabilities of the two action hypotheses.
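
The layer sizes quoted above can be verified with simple shape arithmetic. The sketch below is only a consistency check: it assumes that the second convolution is unpadded with stride 1, that max pooling uses non-overlapping windows and drops incomplete ones, and that the 2D convolution is applied to each temporal slice separately. The variable names describe the operations and are not the layer labels of Fig. 4.

```python
def conv_valid(shape, kernel):
    """Output size of an unpadded convolution with stride 1."""
    return tuple(s - k + 1 for s, k in zip(shape, kernel))


def max_pool(shape, window):
    """Output size of non-overlapping max pooling; incomplete windows are dropped."""
    return tuple(s // w for s, w in zip(shape, window))


input_shape = (35, 55, 16)                # width x height x temporal length
conv1 = input_shape                       # zero-padded 5x5x3 convolution keeps the size (16 maps)
pool1 = max_pool(conv1, (2, 2, 1))        # (17, 27, 16)
conv2 = conv_valid(pool1, (6, 7, 3))      # (12, 21, 14) with 32 maps
pool2 = max_pool(conv2, (3, 3, 1))        # (4, 7, 14)
conv3 = conv_valid(pool2, (4, 7, 1))      # (1, 1, 14) with 64 maps
flattened = 64 * conv3[0] * conv3[1] * conv3[2]   # length of the feature vector
```

Under these assumptions, the 64 feature maps of size \(1 \times 1 \times 14\) give exactly the 896D feature vector.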

Fig. 4. The proposed 3DCNN structure in our model. It contains 3 convolutional layers, 2 max-pooling layers and 1 fully-connected layer.

2.5 Details of Training

To train the network, we minimize the average cross-entropy loss:

$$\begin{aligned} l=- \frac{1}{N}\sum _{i=1}^{N}P(x^i)\cdot \log (Q(x^i)), \end{aligned}$$
(2)

where N is the total number of samples, \(x^i\) denotes the ith sample of the dataset, and P and Q denote the inherent (true) probability distribution and the probability distribution predicted by the model, respectively.
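
A minimal Python sketch of the loss in Eq. (2) is given below. It treats \(P(x^i)\) and \(Q(x^i)\) as per-class probability lists and their product as a dot product over the classes. The function name and the small constant `eps`, which guards against \(\log 0\), are assumptions of this example.

```python
import math


def average_cross_entropy(true_dists, predicted_dists, eps=1e-12):
    """Average cross-entropy over N samples.

    true_dists[i] and predicted_dists[i] are the class distributions P(x^i)
    and Q(x^i). `eps` is a numerical guard added for this sketch.
    """
    total = 0.0
    for p, q in zip(true_dists, predicted_dists):
        total += sum(p_c * math.log(max(q_c, eps)) for p_c, q_c in zip(p, q))
    return -total / len(true_dists)


# Two samples of a binary classifier with one-hot targets.
targets = [[1.0, 0.0], [0.0, 1.0]]
uninformed = [[0.5, 0.5], [0.5, 0.5]]
confident = [[0.9, 0.1], [0.2, 0.8]]
loss_uninformed = average_cross_entropy(targets, uninformed)
loss_confident = average_cross_entropy(targets, confident)
```

A classifier that always outputs 0.5 for both classes has a loss of \(\log 2\), and the loss decreases as the predicted probability of the correct class grows.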

In our experiments, the weights of each layer are initialized from a truncated normal distribution centered at 0 with standard deviation \( std=\sqrt{\frac{2}{n}} \), where n denotes the number of input or output connections of the layer. Following [4], we use the ReLU activation function and set the biases of all layers to 0. During training, we apply drop-out [21] with probability 0.5 after several layers and L2 regularization on the weights to counter overfitting. To accelerate training, we also apply batch normalization [5] to the response of each layer.
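
The initialization can be sketched as follows. The truncation bound of two standard deviations and the use of resampling are assumptions of this example, since the exact truncation rule is not stated above. The value n = 75 corresponds to a \(5 \times 5 \times 3\) kernel on a single-channel input and is used only for illustration.

```python
import math
import random


def he_truncated_normal(n_connections, count, rng, bound=2.0):
    """Draw `count` weights from N(0, std^2) with std = sqrt(2 / n).

    Values farther than `bound` standard deviations from 0 are redrawn.
    The bound of 2 is an assumption of this sketch.
    """
    std = math.sqrt(2.0 / n_connections)
    weights = []
    while len(weights) < count:
        value = rng.gauss(0.0, std)
        if abs(value) <= bound * std:
            weights.append(value)
    return weights


rng = random.Random(0)                       # fixed seed for reproducibility
kernel_weights = he_truncated_normal(75, 75, rng)
kernel_bias = 0.0                            # biases are set to 0
```

Note that truncation makes the empirical standard deviation of the drawn weights slightly smaller than \(\sqrt{2/n}\).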

3 Experimental Results

To evaluate the effectiveness of our model, we conduct experiments on two benchmark datasets: the KTH dataset and the UCF Sports dataset. We first read the original video sequences frame by frame and down-sample them to a resolution of \(40 \times 60\) pixels to limit the memory and computation overhead. We then extract the \(35 \times 55\)-pixel target areas according to the method described in the Target Area Segmentation part of Sect. 2.3. Next, for the train-test split, we randomly hold out 10% of the resulting data as the validation part on the KTH dataset and 33% on the UCF Sports dataset. After data augmentation, the training data of the KTH dataset contains 9702 video volumes of size \(35 \times 55 \times 16\), and the test data contains 600 video volumes of the same size. For the UCF Sports dataset, the corresponding numbers are 1800 and 250, respectively.
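
The training-set sizes can be related to the augmentation factor with a short calculation. The sketch below rests on two assumptions made only for illustration: each video contributes one 16-frame volume before augmentation, and the held-out share is rounded to the nearest whole video. The function name is hypothetical, and the test-set sizes are not covered by this calculation.

```python
AUGMENTATION_FACTOR = 18   # 9 spatial patches x 2 temporal directions


def training_volumes(num_videos, held_out_percent):
    """Number of augmented training volumes under the assumptions stated above."""
    held_out = (num_videos * held_out_percent + 50) // 100   # round to nearest video
    return (num_videos - held_out) * AUGMENTATION_FACTOR


kth_training = training_volumes(599, 10)   # 539 training videos x 18
ucf_training = training_volumes(150, 33)   # 100 training videos x 18
```

Under these assumptions the calculation gives 9702 training volumes for the KTH dataset and 1800 for the UCF Sports dataset.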

According to the OvO algorithm, we generate 15 binary classifiers for the KTH dataset and 45 binary classifiers for the UCF Sports dataset. Each of these classifiers uses the proposed 3DCNN architecture and is fed with the sub-data of the corresponding pair of classes. After several rounds of training and a final fine-tuning process, the classification results of the binary classifiers on the two datasets are reported in Tables 1 and 2. By summing up all these results and applying the majority voting strategy on the validation data, we obtain the overall validation accuracy: 94.0% on the KTH dataset and 95.6% on the UCF Sports dataset. The confusion matrices are shown in Figs. 5 and 6, and our work is compared with peer work in Table 3.

Table 1. The results of the 15 binary classifiers on the KTH dataset. A, B, C, D, E and F stand for boxing, handclapping, handwaving, jogging, running and walking, respectively. The decimal numbers are the corresponding accuracies.
Table 2. The results of the 45 binary classifiers on the UCF Sports dataset. A, B, C, D, E, F, G, H, I and J stand for diving, golf swing, kicking, lifting, riding horse, running, skateboarding, swing-bench, swing-side and walking, respectively. The decimal numbers are the corresponding accuracies.
Fig. 5. Confusion matrix for the KTH dataset. Average performance 94.0%.

Fig. 6. Confusion matrix for the UCF Sports dataset. Average performance 95.6%.

Table 3. Comparison of our work with peer work on the KTH and UCF Sports datasets.

4 Conclusions

In this paper, we focus on the action recognition problem on the KTH and UCF Sports datasets. Rather than relying on handcrafted features as most researchers do, we develop 3D convolutional neural networks (3DCNN) that automatically capture the useful spatial-temporal features. To boost the performance of our model, we use a sub-data learning method that incorporates the One-versus-One (OvO) algorithm into our 3DCNN architecture. We achieve a correct classification rate of 94.0% on the KTH dataset and 95.6% on the UCF Sports dataset, which is competitive with peer work.