1 Introduction

The number of closed-circuit television surveillance cameras deployed in urban environments was estimated to have surpassed 1 billion in 2021 (Bahrami et al. 2021). Contemporary automated crowd surveillance requires both privacy preservation and automation, and deep learning has become the method of choice for both (Sharif 2021; Zhao et al. 2019; Kassani et al. 2019). What constitutes an anomaly greatly depends on the context. For instance, people running out of a bank would typically be considered an anomaly, whereas a group of running people would be considered normal during a marathon. Furthermore, anomalies are few and far between, which means a lack of labeled data is a further challenge. Thus, the detection of anomalies is typically treated as an unsupervised rather than supervised learning problem: a model (e.g., an autoencoder) is trained on a video frame sequence to capture what is essentially normal activity, and any divergence is viewed as an anomaly (Rodrigues et al. 2019).

Before the advent of deep learning models, researchers typically focused on the extraction of handcrafted spatiotemporal features and traditional image processing techniques (Ihaddadene et al. 2008; Fontugne et al. 2008; Sharif and Djeraba 2009a). Although deep learning has made feature extraction by and large superfluous, there are remaining challenges related to anomaly inconsistency, the cost of obtaining labeled data, and the need for robustness in the face of multiple view angles and varying illumination and weather conditions (Jaouedi et al. 2020; Doshi and Yilmaz 2020b; Sindagi et al. 2020; Rezaee et al. 2021; Yuan et al. 2022).

Several surveys regarding crowd anomaly detection methods already exist. For example, Borja-Borja et al. (2018) presented a short review of deep learning methods aimed at understanding group and crowd behaviors. Similarly, Afiq et al. (2019) offered a review of algorithms published between 2013 and 2018. Khan et al. (2020) summarized seminal research works on crowd management published between 2010 and 2020. Suarez and Jr. (2020) compiled a survey of deep learning solutions for anomaly detection in surveillance videos, considering articles published between 2016 and 2020. Braham et al. (2021) performed a comparative study of crowd analysis algorithms, taking into account some previous reviews and various studies published between 2017 and 2020. Elbishlawi et al. (2020) assembled deep learning-based methods for crowd scene analysis methods that were published up to the time of writing (until 2020). Mohammadi et al. (2021) conducted an in-depth literature survey into deep learning-based anomaly detection methods for both images and video (the studies were mainly published between 2010 and 2020). Mu et al. (2021b) focused their review on deep learning methods published until 2020. Sharma et al. (2021) contrasted the deep learning literature published between 2017 and 2020 with earlier research published between 2011 and 2017 by studying 93 research articles from reputed databases published between 2011 and 2020. Sanchez et al. (2020) reviewed articles on deep learning-based models published between 2011 and 2019. Yuan et al. (2022) chronicled the evolution of human behavior-recognition methods, starting from standard manual feature extraction to deep learning methods. Rezaee et al. (2021) focused their survey on methods for secure distributed video analytics published between 2016 and 2020. Currently, deep learning algorithms are the standard approach for automatic crowd anomaly detection, with recent progress including the transition from two- to three-dimensional convolutional neural networks, the introduction of the attention mechanisms (Zhang et al. 2021b) now used in transformers (Vaswani et al. 2017; Dosovitskiy et al. 2020; Touvron et al. 2021b), and excursions into the application of quantum computing-based machine learning (Schuld et al. 2014; Tang 2018; Blekos and Kosmopoulos 2021).

While the aforementioned surveys are comprehensive and offer helpful insights into open problems, our survey aims to be a good starting point for researchers entering the field of crowd anomaly detection. It provides insight into deep learning-based crowd anomaly detection methods published mainly in mainstream English-language conferences and journals between 2020 and 2022. It presents techniques for feature selection, performance metrics, benchmark datasets, and target methods in a clear and concise manner. It also updates similar surveys to reflect the fast evolution of the area between 2020 and 2022, both qualitatively and quantitatively. Explicitly, our survey offers a more comprehensive and systematic discussion of contemporary methods and materials, with a sharper focus on these specific years, than other existing similar surveys. It thereby augments the knowledge of the anomaly detection research community, and our findings can also be adopted by newcomers to obtain an overall comprehension of recent progress in the field. Our contribution can be summarized as follows:

  • A summary, taxonomy, and comparative study of the performance indicators of contemporary deep learning models;

  • A collation of crowd datasets to gauge their popularity and relevance;

  • A rigorous statistical analysis of data taken from experimental setups to assess the impact of the architectural heterogeneities of pre-trained two-dimensional convolutional neural networks.

The rest of the survey is organized as follows: Sect. 2 defines the crowd anomaly detection task. Section 3 summarizes several seminal works up to the period where our review begins. Section 4 describes the strategy adopted to search for and store previous works. Section 5 discusses common crowd features and their extraction methods. Section 6 compares the performance metrics of various crowd anomaly detection models. Section 7 briefly summarizes crowd datasets. Section 8 compiles a succinct survey of deep crowd anomaly detection methods. Section 9 illustrates the architectural impacts of pre-trained two-dimensional convolutional neural network models on the performance of crowd anomaly detection methods. Section 10 identifies current research challenges and future prospects. Finally, Sect. 11 concludes the paper.

In addition, Table 4 of Section 12, Table 5 of Section 13, and Table 6 of Section 14 provide tabular synopses of the acronyms used, various crowd datasets, and deep anomaly detection methods, respectively.

2 Definition of crowd anomaly detection

The people gathered in a defined area are commonly called a crowd. According to the merriam-webster.com dictionary, an anomaly can be defined as either an irregularity (i.e., a deviation from the common rule) or something anomalous (i.e., something different, abnormal, peculiar, or not easily classified). Anomalies can sometimes be characterized as outliers; in statistics, outliers are extreme data points within a dataset. Detection is the process or act of identifying or sensing something. For public security and safety, it is often necessary to detect various anomalies from crowd surveillance data. Crowd anomaly detection refers to the recognition of irregularities or aberrations, or the discovery of patterns inconsistent with usual activity/behavior, in crowd surveillance images/videos. Explicitly, crowd anomaly detection is the procedure of understanding the all-inclusive characteristics (e.g., density, flow, demographic information, etc.) of a crowd video (usually captured by a surveillance camera) by employing numerous concepts and techniques from image processing, computer vision, and machine learning to identify anomalies in a crowd of people. A common goal of crowd anomaly detection is to identify patterns in the crowd video that help differentiate anomalous activity/behavior from normal activity/behavior. If a crowd anomaly is detected, the relevant authorities are notified immediately, for example by turning on lights or sounding an alarm. Typically, crowd anomaly detection is linked with crowd statistics, tracking, and crowd scene understanding. Figure 1 depicts the general structure of crowd anomaly detection.

Fig. 1 A general structure of crowd anomaly detection

  • Crowd Statistics \(\Rightarrow\) Examples of crowd statistics include crowd density estimation and crowd counting. Both involve analyzing patterns and trends in quantitative data. The number of people per unit area can be used to estimate crowd density by applying different computer vision and deep learning techniques (e.g., background subtraction, foreground estimation, optical flow, CNNs, etc.). The key aim of crowd statistics is to estimate the crowd density of a given area and thereby count the number of people present. For example, Chen et al. (2020b) used a deep convolutional neural network to estimate the crowd density map and then regressed the crowd count from the density map. The crowd congestion level in the scene is highly relevant to understanding anomalies in the crowd (Ekanayake et al. 2023). Basically, an extraordinarily large number of people in the crowd can be deemed an abnormal event. Crowd counting can be defined as estimating the number of people present in a single picture (Wan et al. 2021). If the crowd count exceeds a specified threshold value, it is considered an anomaly (Bhat et al. 2020); a minimal thresholding sketch is given after this list.

  • Tracking \(\Rightarrow\) In video surveillance applications, it is usually desirable to localize anomalies in both space and time. Object (or person) tracking is the process of determining the location of moving objects over time, or the changes in their features over time. Objects can be tracked singly or multiply, in either an online or offline manner. Various computer vision and machine learning techniques are available for object detection and tracking (e.g., graph-based tracking, Kalman filters, particle filters, etc.). Once an individual (object or person) is tracked, its behavior/activity can usually be analyzed to determine whether it is usual or unusual. Tracking information helps to identify specific individuals or groups of people behaving exceptionally, including loitering, running, and moving against the main flow of a crowd.

  • Crowd Scene Understanding \(\Rightarrow\) This involves detecting and analyzing spatial and temporal crowd events. Miscellaneous techniques (e.g., machine learning models, rule-based algorithms, etc.) are available for crowd scene understanding. Typically, crowd scene understanding involves comparing the activity/behavior of an individual with a preset standard activity/behavior model. Intelligent surveillance cameras or sensors can help in analyzing crowd flow (e.g., the direction and speed of movement of people), understanding crowd activity/behavior, determining whether a crowd is structured or unstructured, and detecting anomalies. Crowd scene understanding helps to avoid potentially catastrophic circumstances at various public activities, including concerts, political rallies, sports, and religious events (e.g., Hajj and Umrah; Aldayri and Albattah 2022). Besides, by understanding the crowd scene it is possible to detect anomalies in crowd videos, including abuse, arrest, arson, assault, hazard, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and vandalism.
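As a minimal illustration of the count-based anomaly rule mentioned in the crowd statistics item above, the following sketch sums a density map (the typical output of a counting CNN such as the one used by Chen et al. 2020b) and flags a frame when the estimated count exceeds a scene-specific threshold. The density map and the threshold here are illustrative stand-ins, not outputs or settings from the cited works.

```python
import numpy as np

def count_from_density_map(density_map: np.ndarray) -> float:
    # A crowd density map integrates (sums) to the estimated number of people.
    return float(density_map.sum())

def is_count_anomaly(density_map: np.ndarray, threshold: float) -> bool:
    # Flag the frame as anomalous when the estimated count exceeds
    # a scene-specific threshold (a Bhat et al. 2020-style rule).
    return count_from_density_map(density_map) > threshold

# Random stand-in for a counting CNN's density-map output (sums to roughly 48).
fake_density = np.random.rand(120, 160) * 0.005
print(is_count_anomaly(fake_density, threshold=40.0))  # True for this stand-in
```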

3 Early studies

Detecting the abnormal activities or anomalies of a crowd in real-world surveillance videos is considered an important yet challenging task because prior knowledge about anomalies is usually very limited or unavailable. Moreover, in real-life situations, both abnormal events and behaviors may take place at once in the crowd, and they both need to be detected. Fundamentally, the crowd anomaly detection problem is a binary classification task where each frame in the video obtains an anomaly score for differentiating whether the frame belongs to a normal or abnormal class (e.g., see Fig. 2).

Fig. 2 Theoretical view of normal–abnormal frame classification using the (Minnesota 2022) dataset and DenseNet121 (Huang et al. 2017; Wakili et al. 2022)

No generic definition currently exists for abnormal events, which are usually dependent on the scene under consideration. For example, a car passing on the road is a normal activity, but it becomes abnormal when it passes through the pedestrian lane. A typical approach for addressing such a scene's context dependency is to consider a scene's rare or unforeseen events as abnormal. Nevertheless, this may result in classifying unseen normal activities as abnormal. In general, it may not be possible to know all the normal and abnormal activities during training; it is only possible to have access to subsets of normal and abnormal activities. The lack of a generic definition and the insufficiency of data make it extremely hard for any learning algorithm to understand and capture the nature of an anomaly. The related literature reports some success stories along with several convincing studies, mostly conducted under constrained conditions. Under uncontrolled scenarios, the task of crowd anomaly detection is still challenging for the research community. Broadly, early research can be classified into a deep learning paradigm and a miscellaneous paradigm.

3.1 Deep learning paradigm

Remarkable progress has been achieved in object-level information and deep learning models for tracking, classification, and clustering, which have been applied to recognizing abnormal events in video scenes. Currently, deep learning (e.g., CNN LeCun et al. 1989, AE Vincent 2011, GAN Goodfellow et al. 2014, LSTM Hochreiter and Schmidhuber 1997, U-Net Ronneberger et al. 2015, YOLO Redmon et al. 2016, etc.)-based models are utilized as first-hand alternatives for crowd anomaly detection setups. The primary strength of deep learning models is automatic salient feature extraction; however, they encounter many challenges at the application level, such as when anomalies exhibit inconsistent abnormal behavior and when the high cost of labeling data makes it difficult to obtain the large-scale labeled data required for training. Furthermore, crowd anomaly detection algorithms should work reliably and robustly under a wide range of changing circumstances, including varying levels of illumination, multiple view angles, and changing seasons and weather, and a single algorithm may not be the best option for all usage cases. Various methods have been proposed in the literature to improve efficiency, robustness against pixel occlusion, generalizability, computational complexity, and execution time (Rezaee et al. 2021). Accordingly, a number of surveys (e.g., Borja-Borja et al. 2018; Afiq et al. 2019; Khan et al. 2020; Suarez and Jr. 2020; Braham et al. 2021; Elbishlawi et al. 2020; Mohammadi et al. 2021; Mu et al. 2021b; Sharma et al. 2021; Sanchez et al. 2020; Yuan et al. 2022; Rezaee et al. 2021) on crowd anomaly detection methods cover taxonomy, anomaly detection, crowd emotion, datasets, opportunities, and prospects. Borja-Borja et al. (2018) briefly assessed deep learning methods for understanding group and crowd activities. Afiq et al. (2019) comprehensively described four major methods for detecting abnormal behavior in crowd scenes, namely the Gaussian mixture model, hidden Markov model, optical flow technique, and spatiotemporal technique. Khan et al. (2020) investigated miscellaneous aspects of the existing solutions for crowd analysis. Suarez and Jr. (2020) categorized the advances in anomaly detection for videos into reconstruction error, predicting future frames, using classification, and using scoring. Braham et al. (2021) concluded that group analysis-related tasks are not commonly explored using deep learning methods, despite their widespread use in crowd analysis. They also commented that, due to the non-existence of applicable datasets, upstream massive crowd analysis for motion tracking and/or anomaly detection is not well explored in the deep learning literature. Elbishlawi et al. (2020) categorized the deep learning-based methods for crowd scene analysis into crowd counting and crowd action recognition. Mohammadi et al. (2021) provided a precise definition of the anomaly detection concept with a thorough categorization of the existing methods and explained the open challenges for this research topic. Mu et al. (2021b) discussed the definition of and related surveys for abnormal event detection in intelligent video surveillance. They gave an overview of representative approaches covering feature extraction, event modeling, event detection, and datasets with evaluation metrics. Sharma et al. (2021) hinted that the video processing literature was enriched by pivotal research in AlexNet and ImageNet during the years 2017–2020. Sanchez et al. (2020) discussed that both public and private sectors are demanding accurate solutions for monitoring crowd behaviors and that there is still large room for improvement. They proposed a taxonomic organization of the previously existing works by taking into account the different stages that comprise the crowd behavior analysis problem. Yuan et al. (2022) mainly outlined the various methods for human behavior recognition, covering both traditional and newly emerging algorithms and tracing the evolution from initial single-modality data to multi-modal data, and from manual feature methods to end-to-end deep learning methods. Rezaee et al. (2021) studied various aspects of crowd anomaly detection methods. Considering individual tracking, classification using handcrafted features, classification using deep learning, and hybrid models, they concluded that deep learning as well as hybrid models have more satisfactory performance characteristics for identifying and predicting crowd anomaly behavior.

3.2 Miscellaneous paradigm

In spite of the successes of deep learning models, some researchers predominantly focused on the use of dissimilar handcrafted spatiotemporal features and image processing procedures. For example, the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Gaussian Mixture Hidden Markov Model (GM-HMM), One-Class Support Vector Machine (OSVM), and Bag-of-Words (BoW) techniques were commonly applied to model normal behavior patterns for detecting abnormal patterns (Afiq et al. 2019). The Kanade–Lucas–Tomasi feature tracker (KLT), Multi-Observation HMM, k-means clustering, Principal Component Analysis Histogram of Oriented Gradient (PCA-HOG), and Histogram of Oriented Optical Flow (HOFH) techniques were also applied to detect abnormal behaviors in crowded scenes (Zhou et al. 2015; Wang et al. 2015, 2016). In the approaches based on spatiotemporal volumes and cuboids (Chen et al. 2016a), as well as spatiotemporal textures (STT) (Wang and Xu 2016), trajectories (Tran et al. 2014), interest points (STIP) (Cheng et al. 2015), descriptors (STD) (Lung et al. 2015), and texture maps (STTM) (Lloyd et al. 2017), features from the spatial and temporal dimensions were integrated for crowd tracking and anomaly detection. The GMM techniques were trained with the expectation–maximization (EM) algorithm, which computed the GMM parameters fairly well but at a high computational cost. In the HMM approach, the computing process was based on the initial probability of hidden states and the transition and observation matrices (Afiq et al. 2019). Optical flow was used to obtain crowd features alongside important information regarding various behaviors or activities. Both optical flow and spatiotemporal techniques played a major role in abnormal event detection frameworks.
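To make the pre-deep-learning pipeline above concrete, the sketch below fits a GMM with the EM algorithm (via scikit-learn) on feature vectors computed from normal frames and scores test frames by their log-likelihood under that model; frames with low likelihood are flagged as abnormal. The feature vectors and the threshold are synthetic placeholders rather than the handcrafted descriptors of the cited works.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder handcrafted features (e.g., HOG/HOF-style descriptors) per frame.
normal_features = rng.normal(loc=0.0, scale=1.0, size=(500, 16))
test_features = np.vstack([rng.normal(0.0, 1.0, size=(10, 16)),   # normal-like frames
                           rng.normal(6.0, 1.0, size=(5, 16))])   # abnormal-like frames

# EM fits the mixture parameters on normal data only.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(normal_features)

# Low log-likelihood under the normality model is treated as an anomaly.
log_likelihood = gmm.score_samples(test_features)
threshold = np.percentile(gmm.score_samples(normal_features), 1)  # illustrative cut-off
is_abnormal = log_likelihood < threshold
print(is_abnormal)
```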

With the advent of new deep learning models, crowd anomaly detection prototypes have shifted from traditional 2DCNNs to 3DCNNs, and from standard deep learning to more intelligent attention-, transformer-, or even quantum computing-based learning. Most of the existing surveys present widely recognized deep learning methods in a manner that effectively highlights current issues. However, our survey is more comprehensive and more focused on the specific years of interest than other existing similar surveys.

4 Strategy to collect previous works

To collect previous works, we adopted the strategy detailed in the following two subsections. Initially, we collected more than 750 relevant articles published between 2020 and 2022. After screening them based on factors including relevance, publication venue, general significance, and quality, we retained about 380 articles concerning deep learning-based crowd anomaly detection techniques and datasets. The vast majority of those articles were published in mainstream English-language conferences and journals between 2020 and 2022. Masters' theses, doctoral theses, and unpublished works were excluded, whereas several arxiv.org articles were considered as a form of publication. The first author of this article invested more than a month of effort in this arduous collection task.

4.1 Generation of keywords

We used miscellaneous search keywords generated by concatenating strings from the following five sets.

  1. {anomaly, event, behavior, activity, abnormality, action, episode}+{detect, recognit, track, analyz};

  2. {video, deep, image, data, traffic, learning, supervised, gravity, gradient, online, motion, surveillance}+{anomaly};

  3. {anomaly, abnormal, aberrant, unusual, video, crowd}+{event, behavior, activit, action};

  4. {crowd, mob, mass, people, pedestrians, video}+{surveillance, safety, security, monitor, abnormal};

  5. {attention, deep, crowd, vision, video}+{transformer};

where the plus symbol (“+”) denotes string concatenation. Using these sets, we can generate \(7\times 4+12\times 1+6\times 4+6\times 5+5\times 1=99\) keywords, indexed from \(nkw= 1\) to \(nkw= 99\), where nkw represents the number of keywords. Explicitly, the list of generated keywords (genkw) consists of: genkw[1] \(=\) anomaly detect, genkw[2] \(=\) anomaly recognit, genkw[3] \(=\) anomaly track, genkw[4] \(=\) anomaly analyz, genkw[5] \(=\) event detect, \(\cdots\), genkw[98] \(=\) vision transformer, genkw[99] \(=\) video transformer.
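For reproducibility, the 99 keywords can be generated mechanically; the following sketch concatenates the five sets exactly as listed above, so that genkw[1] corresponds to "anomaly detect" and genkw[99] to "video transformer".

```python
keyword_sets = [
    (["anomaly", "event", "behavior", "activity", "abnormality", "action", "episode"],
     ["detect", "recognit", "track", "analyz"]),
    (["video", "deep", "image", "data", "traffic", "learning", "supervised",
      "gravity", "gradient", "online", "motion", "surveillance"],
     ["anomaly"]),
    (["anomaly", "abnormal", "aberrant", "unusual", "video", "crowd"],
     ["event", "behavior", "activit", "action"]),
    (["crowd", "mob", "mass", "people", "pedestrians", "video"],
     ["surveillance", "safety", "security", "monitor", "abnormal"]),
    (["attention", "deep", "crowd", "vision", "video"],
     ["transformer"]),
]

# "+" denotes string concatenation (with a separating space).
genkw = [f"{left} {right}" for lefts, rights in keyword_sets
         for left in lefts for right in rights]

assert len(genkw) == 7*4 + 12*1 + 6*4 + 6*5 + 5*1 == 99
print(genkw[0], "...", genkw[-1])   # anomaly detect ... video transformer
```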

4.2 Search engines and documents collection

We used search engines (seng) to search for articles online, where nse denotes the number of search engines. We mainly considered \(nse=3\); explicitly, we employed: seng[1] = dblp.org, seng[2] = ieeexplore.ieee.org, and seng[3] = scholar.google.com. We searched articles first on dblp.org because it is known as “the premier open bibliographic data base, search engine, and knowledge graph on computer science publications” (GmbH 2024). We created a directory named storeDir to store every downloaded article in the portable document format (*.pdf). Inside storeDir, we created a LaTeX.bib file, which contained the BibTeX information of all downloaded articles. The LaTeX.bib file helped us organize our references and create a bibliography. Algorithm 1 illustrates the whole nonautomated search-and-store procedure step by step.

Algorithm 1 Strategy to search for and store works published between 2020 and 2022
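Since Algorithm 1 is a nonautomated procedure, the following sketch only outlines its control flow; the search, download_pdf, and fetch_bibtex helpers are hypothetical stubs standing in for the manual querying, downloading, and BibTeX-copying steps, not part of the original algorithm.

```python
import os

SEARCH_ENGINES = ["dblp.org", "ieeexplore.ieee.org", "scholar.google.com"]
STORE_DIR = "storeDir"
BIB_FILE = os.path.join(STORE_DIR, "LaTeX.bib")

# Stubs standing in for the manual search/download/BibTeX-copy steps of Algorithm 1.
def search(engine, keyword, years):
    return []          # would return a list of {"id": ..., "url": ...} hits

def download_pdf(hit, pdf_path):
    pass               # would save the article as a *.pdf inside storeDir

def fetch_bibtex(hit):
    return ""          # would return the BibTeX entry of the article

def collect(genkw, years=(2020, 2021, 2022)):
    os.makedirs(STORE_DIR, exist_ok=True)
    with open(BIB_FILE, "a", encoding="utf-8") as bib:
        for engine in SEARCH_ENGINES:          # dblp first, then IEEE Xplore, then Scholar
            for keyword in genkw:
                for hit in search(engine, keyword, years):
                    pdf_path = os.path.join(STORE_DIR, f"{hit['id']}.pdf")
                    if not os.path.exists(pdf_path):   # avoid storing duplicates
                        download_pdf(hit, pdf_path)
                        bib.write(fetch_bibtex(hit) + "\n")
```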

5 Selection of predictive features

Features are used for learning. They can also be the output of learning, derived from datasets. Feature extraction is the task of deriving relevant information from raw data. Extracting informative, discriminating, and independent features is a crucial element of effective anomaly detection algorithms. The feature extraction quality directly influences the algorithm's detection accuracy (Yuan et al. 2022). Thus, feature extraction methods play a considerable role in crowd anomaly detection tasks. The t-SNE (t-distributed Stochastic Neighbor Embedding) method is usually used as a dimensionality reduction tool for the visualization of high-dimensional feature distributions. It is widely applied in abnormal event detection tasks to visualize the distributions of learned features (Shin et al. 2020; Tsai et al. 2022).
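As an illustration of the t-SNE usage mentioned above, the sketch below projects hypothetical high-dimensional frame features onto two dimensions with scikit-learn and colors them by a normal/abnormal label. The features are random stand-ins rather than representations learned by any of the cited models.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, size=(200, 64))    # stand-in for normal-frame features
abnormal = rng.normal(3.0, 1.0, size=(40, 64))   # stand-in for abnormal-frame features
features = np.vstack([normal, abnormal])
labels = np.array([0] * 200 + [1] * 40)

# Project to 2-D for visualization of the feature distribution.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

plt.scatter(embedded[labels == 0, 0], embedded[labels == 0, 1], s=8, label="normal")
plt.scatter(embedded[labels == 1, 0], embedded[labels == 1, 1], s=8, label="abnormal")
plt.legend()
plt.title("t-SNE of frame features")
plt.show()
```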

5.1 Popular crowd features

Crowd features are composed of a set of metrics that specify the dynamics, topological anatomy, and emotional state of the crowd. These metrics can be tracked over time by computing them at the individual and/or crowd level.

  • Motion Pattern \(\Rightarrow\) The motion pattern is defined via the optical flow vector, which captures the apparent motion of brightness patterns in the images (Afiq et al. 2019).

  • Trajectory \(\Rightarrow\) Trajectory features are very informative (Zhou and Maskell 2017). The global motion pattern in a scene can be utilized to describe a trajectory (Lin et al. 2017). In general, the KLT (Kanade–Lucas–Tomasi feature tracker) is utilized to obtain the trajectories (Hariyono and Jo 2017).

  • Velocity \(\Rightarrow\) This measures the average speed at which individuals (when the approach is bottom-up) or crowds (in top-down approaches) are moving (Zhang et al. 2018).

  • Direction \(\Rightarrow\) At the macroscopic level, it determines the number of main directions of movement influencing the crowd (Musse et al. 2007). The direction followed by each individual may also be extracted in microscopic approaches.

  • Density \(\Rightarrow\) This quantifies the proximity of individuals in the crowd, determining how dense the crowd is. At the macroscopic level, the objective is to perform density estimation rather than precise people counting, as clutter and severe occlusions make individual counting difficult in very dense crowds.

  • Collectiveness \(\Rightarrow\) Individuals tend to follow the behaviors of others. This feature measures the degree to which individuals act as a unit in collective motions (Zhou et al. 2013; Zhang et al. 2016a; Afiq et al. 2019). When part of a crowd, instead of behaving independently, individuals tend to follow the behaviors of others and move in the same direction as their neighbors (Sanchez et al. 2020).

  • Stability \(\Rightarrow\) Crowd stability describes group tendencies, such as maintaining topological structures over time (Afiq et al. 2019). It is determined by the behavior of the group's members, who maintain a consistent distance from their nearest neighbors.

  • Uniformity \(\Rightarrow\) Group uniformity is calculated based on members' distances and evenly distributed locations in space. In contrast, non-uniform groups tend to scatter in different directions (Afiq et al. 2019).

  • Conflict \(\Rightarrow\) The conflict feature is used to illustrate the interaction or friction between people as they approach each other (Afiq et al. 2019).

  • Valence \(\Rightarrow\) This aims to measure the positive and negative effects of the crowd. According to the psychology literature, it is usually presented on a \([-1,~1]\) continuous scale, ranging from unpleasantness to pleasantness (Sanchez et al. 2020).

  • Arousal \(\Rightarrow\) This feature helps to monitor the tranquility and excitement of the crowd. It is also presented on a \([-1,~1]\) continuous scale, ranging from passive to active (Sanchez et al. 2020).

5.2 Manual feature extraction

Conventional handcrafted feature extraction methods were widely applied in the early development of human action recognition. Manual-feature-extraction-based methods can be categorized into several groups, including spatiotemporal interest points (Laptev 2005; Scovanner et al. 2007), template matching (Yilmaz and Shah 2005; Kläser et al. 2008), trajectories (Huang et al. 2020), and depth-sequences (Yang et al. 2012). Nevertheless, due to changing illumination, viewing angles, and occlusion, conventional manual feature methods are no longer applicable in complex scenes (Jaouedi et al. 2020; Yuan et al. 2022).

5.3 Automatic feature extraction

Deep learning can automatically extract feature vectors through iterative learning and then perform classification. For example, due to the grid-like nature of images, CNNs automatically extract salient features at different levels of abstraction. A CNN learns the feature extractor during the training process instead of requiring it to be implemented manually. Usually, a CNN consists of convolutional, sub-sampling (or pooling), and fully connected layers. The convolutional operations in the lower layers are utilized for analyzing the links between neighboring patches and learning low-level local features (He et al. 2017). The sub-sampling layers are non-linear down-sampling operations that compute average or maximum values for each input image patch or feature map. They thereby improve translation robustness while reducing the number of network parameters (Afiq et al. 2019). Eventually, the outcome of the fully connected layer includes the learned feature, which can be applied to classification or detection tasks (Li et al. 2018).
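To make the layer roles described above concrete, the following minimal PyTorch sketch stacks convolutional, pooling, and fully connected layers; the flattened output of the convolutional stack is the automatically extracted feature vector that a downstream classifier or anomaly scorer would consume. The layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # low-level local features
            nn.ReLU(),
            nn.MaxPool2d(2),                             # sub-sampling for translation robustness
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(32 * 4 * 4, num_classes)     # learned feature -> class scores

    def forward(self, x):
        feat = self.features(x).flatten(1)   # automatically extracted feature vector
        return self.fc(feat), feat

model = TinyCNN()
logits, feat = model(torch.randn(1, 3, 64, 64))   # one RGB frame crop
print(logits.shape, feat.shape)                   # torch.Size([1, 2]) torch.Size([1, 512])
```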

5.4 Popular feature selection methods

Finding the most predictive features for a fixed deep learning model is important. Many methods exist to select features for supervised learning. The following methods can be adopted to select predictive features.

  • Filter Method \(\Rightarrow\) This category uses statistical tests that select features based on their distributions. Filter methods are much faster compared with wrapper methods because they do not involve training the models. Although these methods are computationally very fast, the p-values of statistical tests tend to be very small for big datasets. This indicates significant, yet tiny, differences in distributions. For this reason, they are not widely used in practice.

  • Wrapper Method \(\Rightarrow\) These methods apply the concept of greedy algorithms that will try every possible feature combination based on a step forward, step backward, or exhaustive search. For each feature combination, these methods will train a machine learning model, usually with cross-validation, and determine its performance. Thus, wrapper methods are very computationally expensive, and often, impossible to carry out.

  • Embedded Method \(\Rightarrow\) These methods train a single machine learning model and select features based on the feature importance returned by that model. They tend to work very well in practice and are faster to compute. On the downside, we cannot derive feature importance values from all machine learning models (e.g., nearest neighbors). Furthermore, decision tree-based algorithms may not perform well in very big feature spaces and, thus, the importance values might be unreliable. In brief, embedded methods are not suitable for every scenario or every machine learning model.

  • Shuffling Method \(\Rightarrow\) The feature shuffling method assigns importance to a feature based on the decrease in a model performance score when the values of that single feature are randomly shuffled. It only trains one machine learning model, so it is quick and suitable for any supervised machine learning model. If two features are correlated, then when one of them is shuffled, the model will still have access to the information through its correlated counterpart. Feature shuffling is available in popular open-source Python feature-selection libraries (a permutation-importance sketch is given after this list).

  • Thresholding Method \(\Rightarrow\) The thresholding method splits the data frame into a training and a testing set, and then it selects the features with performances above a threshold. It is fast because no machine learning model is trained, and it is also robust to outliers. It captures non-linear relationships between features and the target, which is model agnostic. Nevertheless, it requires the tuning of interval numbers for skewed variables. Rare categories will offer unreliable performance proxies or make the method impossible to compute.
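The shuffling method, in particular, maps directly onto permutation importance as implemented in scikit-learn. The sketch below trains a single model and then measures the drop in its test score when each feature is shuffled; the dataset and hyperparameters are synthetic and illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a single model once; shuffling does not require retraining.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and record the decrease in the test score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: mean score drop = {result.importances_mean[idx]:.3f}")
```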

6 Performance metrics for crowd anomaly detection models

Crowd anomaly detection falls into the category of an unbalanced binary classification task, i.e., an anomaly score (AOS) obtained by an anomaly detection model categorizes each frame in the video as either normal or abnormal. In this section, we categorize the performance metrics for crowd anomaly detection and then discuss the evaluation metrics used between 2020 and 2022 in Table 6.

6.1 Classification of performance metrics

The performance evaluation metrics for crowd anomaly detection can be roughly categorized as presented below.

  • Evaluation Metrics for Frame-level Detection \(\Rightarrow\) The motivation for crowd anomaly detection in the video is to come up with the AOS of each frame, which stipulates the probability that the frame holds abnormal events. The AOS has a range from 0 to 1 (Chong and Tay 2017). It is used to explain the degree of anomaly. It is an absolute anomaly measured together with the temporal and spatial characteristics of a video (Esquivel and Zavaleta 2022). A higher AOS value signifies a higher anomaly level (Nayak et al. 2021). Commonly, the metrics of MSE (Luo et al. 2017b; Gong et al. 2019) and PSNR (Mathieu et al. 2016; Liu et al. 2018b) are employed to calculate the AOS (Zhong et al. 2022). For example, the AOS of a frame t (\(\hbox {AOS}_t\)) can be computed by Equation (1) as Sun et al. (2021):

    $$\begin{aligned} AOS_t = \frac{e_t - \min _t\,e_t}{\max _t\,e_t}, \end{aligned}$$
    (1)

    where \(e_t\) denotes the mean reconstruction error of all the pixel values in frame t, \(\min _t\,e_t\) indicates the minimum reconstruction error among all frames in a video, and \(\max _t\,e_t\) denotes the maximum frame-level reconstruction error in a video. Furthermore, AE (Wang et al. 2018a) or OCSVM (Aziz et al. 2021) models can be utilized to obtain \(AOS_t\). If the value of \(AOS_t\) is large, frame t is categorized as containing an abnormal event. While most existing anomaly detection algorithms calculate the AOS based on the current frame, a small number of them use past, present, and future frames (Zhong et al. 2022). If any region of a frame is detected as an anomaly in line with the frame-level ground truth annotation, the detection is counted as a correct hit regardless of the locality and area of the region (Zhang et al. 2020b; Yan et al. 2020). The ACC, F-AUC, F1S, and F-EER are well-known examples of this category (a small numerical sketch of Equation (1) is given after this list).

  • Evaluation Metrics for Pixel-Level Detection \(\Rightarrow\) Although frame-level evaluation methods have been adopted by scores of researchers, they cannot determine whether the anomaly location is accurate (Zhang et al. 2020a). To emphasize the correctness of the abnormal locality, pixel-level measurements can be used. In this performance metric, the exact location of the abnormal event in the frame is marked as a mask. Anomaly detection algorithms are asked to create a comparable mask. An anomaly is said to be flawlessly recognized if the predicted mask is aligned with the ground truth. If over 40% of the ground truth events are detected as anomalies in a frame, such detection is called a right detection (Zhang et al. 2020b; Yan et al. 2020; Fan et al. 2020; Mu et al. 2021a, b; Nayak et al. 2021; Wu et al. 2021c). This criterion can be employed to evaluate the anomaly localization capability (Fan et al. 2020). P-AUC and P-EER are common examples of this category.

  • Evaluation Metrics for Event-Level Detection \(\Rightarrow\) If any position with a true anomaly is detected and localized as abnormal, such detection is granted a correct hit (Zhang et al. 2020b). On the other hand, if any normal frame is detected as an anomaly, it is counted as a false alarm (Kim and Grauman 2009; Cong et al. 2013). To reduce the noisy and incomprehensible local minima in the RGS, Hasan et al. (2016) applied the Persistence1D (Kozlov and Weinkauf 2020) algorithm to cluster local minima using a fixed temporal window of 50 frames. Plainly, local minima within 50 frames are in the same abnormal event (Yan et al. 2020). To this end, in the CUHK-Avenue (Lu et al. 2013a) dataset, the spatial stream in Yan et al. (2020) detected 36 abnormal events with 8 false alarms, while their temporal stream detected 32 abnormal events with 12 false alarms.

  • Evaluation Metrics for Computational Complexity \(\Rightarrow\) It is often important to quantify the time and space complexities of a video anomaly detector. Here, RTM is commonly used. The number of operations executed by the model can be measured in FLOPs (floating-point operations) (Wu et al. 2020c).
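As referenced in the frame-level item above, Equation (1) can be applied directly to per-frame reconstruction errors, for example those produced by an autoencoder. The sketch below uses synthetic errors as a stand-in for a real model's output, normalizes them into anomaly scores, and applies an illustrative threshold.

```python
import numpy as np

def anomaly_scores(recon_errors: np.ndarray) -> np.ndarray:
    # AOS_t = (e_t - min_t e_t) / max_t e_t, following Equation (1).
    return (recon_errors - recon_errors.min()) / recon_errors.max()

# Synthetic per-frame mean reconstruction errors (stand-in for an autoencoder's output).
errors = np.array([0.10, 0.12, 0.11, 0.45, 0.50, 0.13])
scores = anomaly_scores(errors)
abnormal_frames = np.where(scores > 0.5)[0]   # illustrative threshold
print(scores.round(3), abnormal_frames)       # frames 3 and 4 are flagged
```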

6.2 Used performance metrics 2020–2022

The following performance evaluation metrics were used between 2020 and 2022 for crowd anomaly detection tasks.

  • ACC \(\Rightarrow\) It is a very familiar metric for binary classification problems. If \(t_n\), \(t_p\), \(f_p\), and \(f_n\) represent true negative, true positive, false positive, and false negative, respectively, then ACC can be computed using Equation (2) as:

    $$\begin{aligned} ACC = \frac{t_n+t_p}{t_p + t_n + f_p + f_n}. \end{aligned}$$
    (2)

    The major stumbling block of this metric is its poor behavior in unbalanced setups (Sanchez et al. 2020). Sometimes, the term ACC is used interchangeably with percent correct classification (PCC) (Shehu et al. 2021).

  • PRS \(\Rightarrow\) It is defined as the number of correct positive results divided by the number of positive results predicted by the classifier. It can be computed using Equation (3) as:

    $$\begin{aligned} PRS = \frac{t_p}{t_p + f_p}. \end{aligned}$$
    (3)
  • RES \(\Rightarrow\) It computes the fraction of correctly classified elements that belong to the positive class. The recall is defined as the number of correct positive results divided by the number of all relevant samples. It can be computed using Equation (4) as:

    $$\begin{aligned} RES = \frac{t_p}{t_p+f_n}. \end{aligned}$$
    (4)
  • F1S \(\Rightarrow\) It is also called the F-score or F-measure. It is just the harmonic mean between precision and recall. It can be computed using Equation (5) as:

    $$\begin{aligned} F1S = \frac{2}{\frac{1}{PRS} + \frac{1}{RES}}= \frac{t_p}{t_p + \frac{f_p+f_n}{2}}. \end{aligned}$$
    (5)
  • AUC \(\Rightarrow\) True positive, true negative, false positive, and false negative rates indicate that an anomalous frame is detected as anomalous, a normal frame is detected as normal, a normal frame is detected as anomalous, and an anomalous frame is detected as normal, respectively. The ROC curve is generated by plotting the true positive rate against the false positive rate at numerous threshold settings. The AUC is one of the most widely used metrics for evaluating flows and events in crowd videos (Sharif 2017). The AUC of a classifier equals the probability that the classifier ranks a randomly chosen positive sample higher than a randomly chosen negative sample (Shehu et al. 2021). AUC = 0.00 indicates that the predictions of a model are 100% wrong. Conversely, if the predictions are 100% correct, then its AUC is just 1. AUC can be accomplished in both frame- and pixel-level evaluations. In a frame-level evaluation, an anomaly is measured at the frame level. A frame is counted as an anomaly even if an anomaly is detected for less than one pixel of the discrete frame (Sabokrou et al. 2017). In a pixel-level evaluation, the evaluation scrutinizes each pixel independently (Mu et al. 2021b). However, many authors performed both F-AUC and P-AUC for evaluating anomaly detection systems (Wu et al. 2021c).

  • EER \(\Rightarrow\) Originally, this performance metric was used in biometric systems. Currently, it is widely employed for crowd video anomaly detection. It is interpreted as the operating point at which the miss and false alarm rates are equal. It can be computed directly from the ROC curve as shown in Fig. 3. The EER is the point on the ROC curve where the false positive rate equals the miss rate (i.e., one minus the true positive rate). An EER can be computed at the frame or pixel level. Similar to AUC, many authors reported both F-EER and P-EER when evaluating anomaly detection systems (Wu et al. 2021c); a sketch computing AUC and EER from frame-level scores is given after this list. A detection algorithm with a lower EER is regarded as more accurate. EER is effective for the detection of video anomalies (Wu et al. 2020c; Nayak et al. 2021). However, EER can provide misleading results for anomaly detection (Giorno et al. 2016; Asad et al. 2021). For example, Giorno et al. (2016) identified every instance of an anomaly and also evaluated their fidelities compared with a human evaluation of anomalousness. They avoided metrics that scored anomaly detections in an event-detection manner, where labeling a single frame was considered a successful detection of adjoining frames. They said (Giorno et al. 2016): “Metrics like EER can be misleading in the anomaly detection setting. Consider the case when only 1% of the video is anomalous: the EER on an algorithm that marks all frames normal would be 1%, outperforming most modern algorithms. This extreme class imbalance is less prevalent in current standard datasets but will become an apparent problem as more realistic datasets become prevalent.”

  • EDR \(\Rightarrow\) A higher EDR indicates a better performance. The EDR can be computed by Equation (6) as (Li et al. 2014):

    $$\begin{aligned} EDR = 1 - EER. \end{aligned}$$
    (6)
  • S4 \(\Rightarrow\) The S4 score is bounded between 0 and 1, and a higher score is always better. It can be computed by Equation (7) as Doshi and Yilmaz (2020b):

    $$\begin{aligned} S4 = (F1S)(1 - \textit{Normalized root MSE}). \end{aligned}$$
    (7)
  • IoU \(\Rightarrow\) It is a measure of the magnitude of the intersection between two bounding boxes. It computes the size of the overlap between two objects divided by their combined total area. It can be computed via Equation (8) as:

    $$\begin{aligned} IoU = \frac{\textit{Area of overlap}}{\textit{Area of union}}. \end{aligned}$$
    (8)
  • APD \(\Rightarrow\) It is a combination of the average detection delay (ADD) and alarm precision. The AUC metric summarizes the true positive versus the false positive rate, whereas APD measures the area under the precision versus normalized ADD curve. A very successful algorithm having an APD value of approximately 1 must have high precision as well as low delay in its alarms (Doshi and Yilmaz 2022b, a).

  • mAP \(\Rightarrow\) It is a popular metric used to measure the performance of models for document/information retrieval and object detection tasks. However, it can also be applied to anomaly explanation tasks (Szymanowicz et al. 2022). The precision-recall (P-R) curve demonstrates the tradeoff between PRS and RES for different thresholds or classes. The P-R curve is considered as an alternative to the ROC curve (Maxwell et al. 2021). It is a plot of the precision on the y-axis, and the recall on the x-axis for various thresholds or classes (Adam et al. 2022). In the same way as the ROC curve, it is feasible to result in a summary area under the curve (AUC P-R) metric (Tharwat 2021). In the deep learning community, the AUC P-R metric is mostly referred to as mAP (mean average precision) or sometimes simply AP (average precision) (Maxwell et al. 2021). The mAP leverages IoU, PRS, and RES along with the four components of confusion matrix (i.e., \(t_n\), \(t_p\), \(f_p\), and \(f_n\)). A high mAP indicates that a model has both a low \(f_n\) and a low \(f_p\) rate. Mathematically, mAP can be formulated considering Equations (9, 10, and 11) as:

    $$\begin{aligned} AP_k&= \sum _{j=0}^{j=R_{th}-1} [RES(j)-RES(j+1)]\,PRS(j), \end{aligned}$$
    (9)
    $$\begin{aligned}&\approx \int _{RES_k=0}^{1} PRS_k(RES_k)\,d(RES_k), \end{aligned}$$
    (10)
    $$\begin{aligned} mAP&= \frac{\sum _{k=1}^{k=I_{th}} AP_k}{I_{th}}, \end{aligned}$$
    (11)

    where \(j\in \{0,1,\cdots ,R_{th}-1\}\), \(k\in \{1,2,\cdots ,I_{th}\}\), \(RES(R_{th})=0\), \(PRS(R_{th})=1\), \(0\le \,AP_k\le 1\), and \(0\le \,mAP\le 1\). The number of recall and IoU thresholds or classes are denoted by \(R_{th}\) and \(I_{th}\), respectively.

  • DPLR \(\Rightarrow\) The PLR helps to evaluate the quality of a detection algorithm. The location of abnormal events is important for pixel-level evaluation. A frame with ground truth anomalies is considered a true positive detection if at least 40% of the ground truth anomalous pixels are detected. However, this criterion suffers because, once 40% of the abnormal ground truth is overlapped, all falsely detected regions are ignored, so the system can produce as many (false) detections as it likes to cover the ground truth (Wu et al. 2021c). Nevertheless, this limitation can be addressed at the pixel level. For example, a DPLR can be considered if a minimum of 5% of the detected regions belong to the true anomalous pixels (Wu et al. 2021c). If a substantial unconnected region is detected, it is not identified as a true positive by a DPLR (Sabokrou et al. 2015; Wu et al. 2021c).

  • RBDR \(\Rightarrow\) Many anomaly detection studies reported the pixel-level AUC for some popular datasets (e.g., UCSD Ped2 Chan et al. 2008). However, Ramachandra and Jones (2020) claimed that the pixel-level AUC is a flawed evaluation metric. Thus, they introduced RBDR and TBDR to replace the commonly used pixel- and frame-level AUC metrics. The region-based detection criterion estimates the RBDR over all frames in the test set versus the number of false positive regions per frame. A true positive takes place if a ground truth-annotated region has a minimum IoU of 0.1 with a detection region. The RBDR can be computed using Equation (12) (Ramachandra and Jones 2020):

    $$\begin{aligned} {RBDR} = \frac{\textit{Number of anomalous regions detected}}{\textit{Total number of anomalous regions}}. \end{aligned}$$
    (12)
  • TBDR \(\Rightarrow\) The track-based detection criterion measures the TBDR versus the number of false positive regions per frame. A ground truth track is said to be detected if at least 10% of the ground truth regions in the track is detected. A ground truth region in a frame is said to be detected if the IOU between the ground truth region and a detected region is greater than or equal to 10%. The total number of positives is the number of ground truth-annotated tracks in the testing dataset. The TBDR can be computed using Equation (13) (Ramachandra and Jones 2020):

    $$\begin{aligned} {TBDR} = \frac{\textit{Number of anomalous tracks detected}}{\textit{Total number of anomalous tracks}}. \end{aligned}$$
    (13)
  • MISE \(\Rightarrow\) It quantifies misclassification severity, penalizing the downgrading of a high-level event to a lower level more heavily than the reverse. The MISE of a model allows anomaly event classification to be analyzed quantitatively. A lower MISE implies a better anomaly detection performance (Lin et al. 2021a). Assume that a video containing an event with level u is fed to a model. Then its misclassification severity, denoted by \(MISE_u\), can be formulated using Equation (14) as:

    $$\begin{aligned} MISE_u = \Sigma _{\forall v<u}^{\,}(u-v)\,P_{uv} + R_{if} \Sigma _{\forall v>u}^{\,}(v-u)\,P_{uv}, \end{aligned}$$
    (14)

    where \(P_{uv}\) denotes the probability that a video of level u is misclassified into level v. The relative importance factor \(R_{if}\) is always less than 1.

  • SGAP \(\Rightarrow\) The SGAP is computed by subtracting the average score of the normal from that of the anomaly (Feng et al. 2021a). A larger SGAP hints that the anomaly detection model is more capable of distinguishing anomalies from normal events (Liu et al. 2018b; Zhong et al. 2022), while also enjoying stronger robustness to noises (Zhou et al. 2020b).

  • RGS \(\Rightarrow\) The RGS can be deemed the opposite of the AOS (Liu et al. 2018b; Nayak et al. 2021). Whether a frame contains anomalies can be judged by its RGS (Hao et al. 2022). In the testing phase, a detector computes the RGS of each frame. A lower RGS value signifies a higher anomaly level (Nayak et al. 2021), where errors between a frame predicted from the normal training data and the ground truth image are substantial (Saypadith and Onoye 2021). The RGS curve drops steeply when an abnormal event occurs (Hao et al. 2022).

  • FAR \(\Rightarrow\) It usually refers to the expectancy of the false positive ratio. It can be computed by Equation (15) as:

    $$\begin{aligned} FAR = \frac{f_p}{f_p + t_n}, \end{aligned}$$
    (15)

    where \(f_p\) and \(t_n\) indicate false positive and true negative, respectively. In the video anomaly detection task, a higher AUC shows a better model performance, whereas a lower FAR on a normal video implies stronger anomaly-detection-method robustness (Sultani et al. 2018; Wan et al. 2020).

  • \(d\prime\) \(\Rightarrow\) Both decision and detection tasks involve some uncertainty. The \(d\prime\) attempts to capture how inherently decidable the decision task is, or how detectable the signal is, regardless of the observer's error-avoidance preferences (Williams 1996). In any task, it is desirable to have the \(d\prime\) as large as possible (Radu et al. 2012).

  • RTM \(\Rightarrow\) Basically, RTM depends on the input dimensions, the number of neurons and layers (Nayak et al. 2021), the experimental setup, and the code organization (Sharif et al. 2008; Sharif 2014). Preferably, RTM should be as low as possible for a desirable level of ACC.
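As referenced in the EER item above, several of the frame-level metrics can be computed from a vector of per-frame anomaly scores and ground-truth labels using scikit-learn. The sketch below reports ACC, F1S, F-AUC, FAR, and an EER read off the ROC curve on toy data; the scores and labels are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve

# Toy frame-level ground truth (1 = abnormal) and anomaly scores from a model.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.8, 0.7, 0.4, 0.9, 0.35, 0.6])
y_pred = (scores >= 0.5).astype(int)

tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))

print("ACC  :", accuracy_score(y_true, y_pred))    # Equation (2)
print("F1S  :", f1_score(y_true, y_pred))          # Equation (5)
print("F-AUC:", roc_auc_score(y_true, scores))
print("FAR  :", fp / (fp + tn))                    # Equation (15)

# EER: the point on the ROC curve where the false positive and miss rates are equal.
fpr, tpr, _ = roc_curve(y_true, scores)
eer = fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))]
print("EER  :", eer)
```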

Fig. 3 Estimation of EER from the ROC curve

6.3 Performance metrics comparison

The regression-problem-related tasks (e.g., crowd counting) mostly employ MSE and MAE. MSE addresses the robustness of the estimates, while MAE determines the accuracy of the estimates. Recently, Gao et al. (2020) and Fan et al. (2022) comprehensively reviewed the contemporary research advancements on crowd counting and density estimation and claimed that MAE and MSE are the most commonly used image-level measurements. On the other hand, classification problems use metrics of ACC, AUC, EER, F1S, and so on.

Fig. 4 Usage frequency of performance evaluation metrics between 2020 and 2022 considering Equation (16) and Table 6. The F-AUC became the leading performance evaluation metric for video anomaly detection

We have calculated the usage frequency of an item (e.g., metric, dataset, and method) by using Equation (16) as:

$$\begin{aligned} \small \textit{Usage frequency (\%)} = \frac{(\textit{Total number of papers that used the item})(100)}{\textit{Sum of papers both that used and unused the item}}. \normalsize \end{aligned}$$
(16)
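For completeness, Equation (16) amounts to a simple percentage; a minimal helper (with illustrative counts) is:

```python
def usage_frequency(papers_using_item: int, total_papers: int) -> float:
    # Equation (16): percentage of surveyed papers that used a given item
    # (metric, dataset, or method) out of all surveyed papers.
    return 100.0 * papers_using_item / total_papers

print(usage_frequency(papers_using_item=190, total_papers=380))  # 50.0 (illustrative counts)
```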

Fig. 4 depicts the usage frequency of the performance evaluation metrics considering Equation (16) and Table 6. It shows that AUC, EER, ACC, and RTM were, in that order, the most frequently applied performance indicators among the evaluation metrics for classification problems in deep learning. Moreover, the usage frequency of F-AUC indicates that it is an unparalleled performance indicator for anomalous frame localization.

6.4 Our observations

In real-surveillance-video anomaly detection, we generally wish to rapidly detect the commencement and termination frames of an abnormal event. This is because anomaly detection is a coarse-level perception task, and the clipped anomalous video segment can then be sent for additional in-depth video analysis (e.g., activity detection, object detection, etc.) (Nayak et al. 2021). Furthermore, Cai et al. (2021) recently explained that in an actual abnormal frame the anomaly phenomenon is typically confined to a small patch of the image rather than spread across the full frame. If that frame can be detected regardless, we need neither pixel-level ground truth annotations nor the determination of the anomaly by computing the MSE or PSNR of the complete frame. The standard F-AUC takes paramount advantage in these respects. Consequently, the F-AUC became the most important and meaningful performance evaluation metric for video anomaly detection.

7 Diversity of crowd datasets

Crowd activity monitoring in both private and public places is a very demanding endeavor. The everyday activities of human beings create huge amounts of crowd trajectory data in both indoor and outdoor environments (Zhao et al. 2021). The increase in crowd trajectory data availability offers new opportunities for reliability engineering analysis and assessment (Lin et al. 2021b). A dataset is a body of samples with shared attributes. Miscellaneous crowd datasets and benchmarks enable validating and comparing methods for developing smarter algorithms. However, accessing relevant images and datasets is one of the key challenges for image analysis researchers. Crowd datasets can be indoor, outdoor, or both. On the basis of both quantity and quality, they can be widely categorized as synthetic or real-world datasets.

7.1 Synthetic datasets: quantity unrestricted

Synthetic datasets are algorithmically manufactured rather than recorded by real-world events. They are supposed to mimic real-world original data in such a way that both synthetic and real-world datasets cannot be differentiated from each other—not even by human domain specialists or sophisticated computer algorithms.

7.1.1 Advantages of synthetic dataset

The synthetic dataset might be a feasible solution to overcome some existing problems in real-world datasets. The quantity of synthetic data can be scaled up far more easily than that of real-world data. For example, a billion synthetic frames can be generated with a powerful parallel machine (e.g., GPUs and CPU clusters), whereas collecting such a number of real training samples might be impossible. One of the key advantages of a synthetic dataset is anonymity, as no personal information is available: the data cannot be traced back to an original owner, so possible copyright infringements need not be considered. Explicitly, synthetic data protect authentic data privacy and confidentiality. For instance, Census.gov (2021) utilized synthetic data without personal information that mirrored real data collected via household surveys for income and program participation. In addition to privacy protection, a synthetic dataset is fully annotated (i.e., there is no need for humans to spend time manually collecting and annotating training data), fully user-controlled, unrestricted by hardware devices for data collection, multi-spectral, easy to enrich with abnormal events, and cost-effective.

7.1.2 Popular synthetic crowd datasets

Simulating pedestrian crowd movement in a virtual environment is not a novel task (Saeed et al. 2022). We found many synthetic crowd datasets in the literature. For example, during 2020 and 2022 (see our review in Table 6), researchers used the following synthetic crowd datasets to detect abnormal events in crowd scenes.

  1. 1.

    (Minnesota 2022) \(\Rightarrow\) The UMN dataset is one of the standard crowd abnormal event testing datasets from the University of Minnesota. It is a synthetic dataset composed of three different scenes (one indoor and two outdoor) (Lloyd et al. 2017; Sanchez et al. 2020). Each video was recorded at a frame rate of 30 fps and a resolution of \(480\times 640\) using a static camera (Lloyd et al. 2017). This dataset aims to accurately detect changes in crowd movement. In each video scene, an unstructured crowd is walking, and the motion pattern is completely unstructured (Sanchez et al. 2020). An anomaly is marked when everyone suddenly starts running.

  2. 2.

    SHADE (Lin et al. 2021a) \(\Rightarrow\) The SyntHetic Abnormality DatasEt (SHADE) was generated in the video game Grand Theft Auto V (GTA5) (Metacritic 2021). The videos in SHADE include those labeled as arrest, chase, fight, knockdown, run, shoot, scatter, normal type 1 (e.g., high-five and hug), and normal type 2 (e.g., people walk around normally).

7.1.3 Challenges of synthetic crowd dataset

Although a synthetic dataset possesses a long list of benefits, it also comes with limitations. While synthetic data can mimic many properties of authentic data, it cannot copy the original content exactly. Quantity is not a problem for synthetic data; however, it is arduous to generate high-quality indoor and outdoor synthetic crowd datasets. The main challenges in creating synthetic crowd datasets derive from the following facets:

  • It is extraordinarily laborious to completely encode scenes with customized and vivid stories of real-world crowd information into synthetic crowd data;

  • The movements of each individual in the real world can be influenced by factors such as physical strength, current locality, crowd density, and subjective interest (Zhao et al. 2021). Such factors limit how faithfully simulated data can reproduce the real-world movements of each individual;

  • The crowd distribution varies in an indoor environment. Such variations in synthetic indoor data should satisfy the structures and functionalities of real-world indoor data.

The aforementioned factors might make the synthetic crowd datasets unpopular compared with the real-world version.

7.2 Real-world crowd datasets: quantity restricted

Real-world crowd data are observational data originally collected from a number of sources in real-world settings, usually by using sensors (e.g., cameras). Normally, the quality of a real-world crowd dataset is higher than that of a synthetic dataset.

7.2.1 Advantages of real-world crowd datasets

While a synthetic dataset has many benefits and is highly useful in many circumstances, there is still a heavy reliance on human-annotated, real-world data. Compared with synthetic crowd datasets, real-world crowd datasets are fundamental, and they strengthen newly developed algorithms by providing more robust and trustworthy results. Because of their different nature, synthetic datasets can compromise algorithmic output quality in critical decision-making. Even when synthetic crowd datasets are satisfactory, they remain inferior to real-world crowd datasets with respect to certain properties.

7.2.2 Popular real-world crowd datasets

Many real-world crowd datasets have been proposed to detect anomalies in real-world video scenes. For example, between 2020 and 2022 (considering our succinct review in Table 6), researchers used the following real-world crowd datasets to detect abnormal events in crowd scenes.

  • LIVE (Sheikh et al. 2006) \(\Rightarrow\) A dataset of real-world-scenario video sequences. The sequences were captured in daylight and at night with different crowd levels. The sequences contain both normal and abnormal events with ground truths (ur Rehman et al. 2021). Both normal and abnormal events occur at a footpath, grocery shop, subway station, police booth, petrol pump, etc. The Video Quality Experts Group studied the video quality (VQEG 2021).

  • Subway (Adam et al. 2008) \(\Rightarrow\) Two cameras recorded an underground train station. One camera pointed toward the entrance platform and the second camera observed the exit platform. Thus, this dataset contains two long videos of subway entrance and exit gate scenes. Both video sequences are annotated at a frame level and have similar types of anomalies, including wrong-direction walking, loitering, and avoiding payment (Hong Kong 2022; Lu et al. 2013a; Asad et al. 2021).

  • UCSD (Chan et al. 2008) \(\Rightarrow\) The UCSD (University of California San Diego) anomaly detection dataset was obtained from a stationary camera mounted at an elevation, which overlooked pedestrian walkways. The dataset is split into two subsets named Pedestrian 1 (Ped1) and Pedestrian 2 (Ped2). In Ped1, there is an acute angle between the camera view and the sidewalk, and the camera height is lower than in Ped2 (Hao et al. 2022). The ground truth annotation includes a binary flag per frame, indicating whether an anomaly is present at that frame. Abnormal events mainly consist of two categories: the movement of non-pedestrian entities and anomalous pedestrian motions (Zhou et al. 2020a). Abnormal events include bikers, skaters, carts, wheelchairs, and people walking off the walkway.

  • ImageNet (Deng et al. 2009) \(\Rightarrow\) It consists of over 15 million high-resolution labeled images belonging to approximately 22000 categories. It also includes variable-resolution images. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. The ImageNet project is a large visual database designed for use in visual-object-recognition software research. It has been running an annual software contest named “ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)” since 2010. The ILSVRC uses a subset of ImageNet.

  • PETS2009 (Ferryman and Shahrokni 2009) \(\Rightarrow\) Its functions consist of: (i) crowd count and density estimation, (ii) the tracking of individual(s) within a crowd, and (iii) the detection of separate flows and specific crowd events in a real-world environment. The dataset scenarios were filmed from multiple cameras and involve multiple actors. It also contains crowd abnormal events (Ilyas et al. 2021). Initially, there is a group of people normally walking around. As the anomaly begins, people start to disperse and run away. However, the lack of some other realistic scenarios, including fighting, fear, and abnormal objects, is a deficiency for this dataset (Rabiee et al. 2016).

  • UCF-Web (Mehran et al. 2009) \(\Rightarrow\) It was proposed by Mehran et al. (2009) as a harder version of the Minnesota (2022) dataset, with denser crowds (Sanchez et al. 2020). The anomalies are clash and escape panic scenarios (Mu et al. 2021c).

  • BEHAVE (Blunsden and Fisher 2010) \(\Rightarrow\) In this dataset, various activities were simulated and captured at a rate of 25 fps. An annotation file contains the starting and ending frame numbers and a class label that describes the event within the frames (Deepak et al. 2021b). Some of these events are grouping, walking together, chasing, and fighting. The anomalies are mainly caused by fighting.

  • HockeyFight (Nievas et al. 2011) \(\Rightarrow\) Nievas et al. (2011) collected 1000 action clips from National Hockey League (NHL) games in North America. Each clip is manually labeled as either “fight” or “non-fight”. This dataset is used for fight detection (Almazroey and Jarraya 2020).

  • UCF101 (Soomro et al. 2012) \(\Rightarrow\) This dataset consists of 13320 videos collected from youtube.com, including a variety of action forms. The format of the video is unified during the dataset’s construction (Hu 2020).

  • Mall (Chen et al. 2012) \(\Rightarrow\) This dataset was collected from a publicly accessible webcam for crowd-counting and -profiling research. Chen et al. (2012) exhaustively annotated the data by labeling the head position of every pedestrian in all dataset frames.

  • ViF (Hassner et al. 2012) \(\Rightarrow\) The ViF (Violent Flows) is a dataset of real-world videos downloaded from the web consisting of crowd violence, along with standard benchmark protocols designed to test both violent/non-violent classification and violence outbreak detection. The average length of the video clips is 3.6 s. The types of violent behaviors are related to the fighting events in the video clips (Rabiee et al. 2016).

  • CUHK-Avenue (Lu et al. 2013a) \(\Rightarrow\) This dataset’s videos were captured in the CUHK (Chinese University of Hong Kong) Campus Avenue (Hong Kong 2022). The 16 training videos capture normal situations. The 21 testing videos include both normal and abnormal events. The latter are marked in rectangles. The abnormal events are running, walking in opposite directions, throwing objects, and loitering (Zhou et al. 2020a; Sun et al. 2021; Hao et al. 2022). The dataset contains a slight camera shake (e.g., in testing video 2 at frames 1051–1100) and a few outliers in the training data. Furthermore, normal patterns seldom appear in the training data (Hong Kong 2022).

  • CUHK-Crowd (Shao et al. 2014) \(\Rightarrow\) It includes crowd videos with various densities and perspective scales. Videos were collected from gettyimages.com and pond5.com, and included various environments, such as streets, shopping malls, airports, and parks.

  • GBA2016 (Martinez et al. 2016) \(\Rightarrow\) These videos were recorded on different days with a GoPro HERO4 camera at 50 fps at the Polytechnics School of the University of Alcala. The camera was located at a considerable height to reduce occlusions. The dataset includes various individuals performing actions, e.g., walking, running, sitting down, and falling.

  • ShanghaiTech (Zhang et al. 2016b) \(\Rightarrow\) It consists of two parts, namely PartA and PartB. The images of PartA were taken from the Internet, whereas the images of PartB were collected from a metropolitan street in the city of Shanghai. The training and evaluation sets are defined by the authors (Zhang et al. 2016b). The training and testing phases are very biased in this dataset because the images are of various density levels and are not uniform (Khan et al. 2020).

  • MED (Rabiee et al. 2016) \(\Rightarrow\) The videos of the MED (Motion Emotion Dataset) were recorded at 30 fps using a fixed video camera elevated at a height, overlooking individual walkways. The MED consists of ground truth emotion labels. It also comprises crowd behavior annotations. All videos start with normal behavior frames and end with abnormal ones.

  • LV (Leyva et al. 2017) \(\Rightarrow\) The LV (Live Videos) dataset was captured by surveillance cameras from different view angles at various resolutions and frame rates (Leyva et al. 2017). Abnormal events include fighting, robbery, accidents, murder, kidnapping, and an illegal U-turn. Each abnormal event highly depends on the nature of its environment. The abnormal events are localized by specifying the regions of interest in a separate sequence.

  • CUHK-Avenue17 (Hinami et al. 2017) \(\Rightarrow\) It is a subset of CUHK-Avenue (Lu et al. 2013a). The training set is identical to that of CUHK-Avenue (Lu et al. 2013a); however, the testing set is smaller. Hinami et al. (2017) argued that the CUHK-Avenue (Lu et al. 2013a) testing set contains five videos (i.e., videos 1, 2, 8, 9, and 10) with static abnormal objects that are not properly annotated (Ionescu 2018). Hence, they evaluated their approach on a subset (called Avenue17) that excludes these five videos.

  • ShanghaiTech Campus (Luo et al. 2017b) \(\Rightarrow\) The data collection was performed at ShanghaiTech University campus considering 13 different scenes with various lighting conditions and camera angles (Asad et al. 2021). This dataset is one of the biggest and most challenging datasets available for video anomaly detection (Doshi and Yilmaz 2022b). Anomalous events are produced by strange objects in the scenes, such as pedestrians moving at anomalous speeds (e.g., running and loitering) and in unexpected directions (Sanchez et al. 2020). The dataset has 130 abnormal events in 13 scenes (Tech 2022). All abnormal videos are in the testing set because the dataset is proposed for unsupervised learning. To adapt to the weakly supervised setting, Zhong et al. (2019) reorganized the videos into 238 and 199 training and testing videos, respectively.

  • UCF-QNRF (Idrees et al. 2018) \(\Rightarrow\) The UCF-QNRF is a large-scale crowd-counting dataset. It contains a variety of scenarios and scenes with high resolutions and densities. Usually, it has extremely congested scenes, where the maximum count of an image can reach up to 12865 (Idrees et al. 2018). It also includes buildings, plants, the sky, and paths.

  • UCF-Crime (Sultani et al. 2018) \(\Rightarrow\) It consists of 1900 long and untrimmed real-world surveillance videos. Usually, trimming refers to taking off either part of the beginning or end of a video clip. Anomalies include abuse, arrest, arson, assaults, road accidents, burglaries, explosions, fighting, robberies, shooting, stealing, shoplifting, and vandalism (Central Florida 2022).

  • FDST (Fang et al. 2019) \(\Rightarrow\) Fang et al. (2019) proposed a large-scale crowd-counting video dataset named Fudan-ShanghaiTech (FDST) with frame-wise ground truth annotation. The FDST contains many different scenes, including shopping malls, squares, and hospitals. It took more than 400 h to annotate the FDST dataset (Fang et al. 2019).

  • IITB-Corridor (Rodrigues et al. 2019) \(\Rightarrow\) The videos in the IIT Bombay campus were captured with a single camera. The scene consists of a corridor with many normal and abnormal activities. Normal activities include walking and standing, whereas abnormal activities include protests, unattended baggage, cycling, sudden running, fighting, chasing, loitering, suspicious objects, hiding, and playing with a ball. The annotations for normal and abnormal video frames are provided at the frame level.

  • MVTec (Bergmann et al. 2019) \(\Rightarrow\) All images of the MVTec anomaly detection dataset were captured by a \(2048\times 2048\)-pixel high-resolution industrial RGB sensor. The training set contains non-anomalous objects, whereas the testing set includes various types of anomalies and non-anomalous samples. The dataset includes a detailed ground truth with pixel-wise mask annotations for each anomalous region. It comprises almost 1900 manually annotated regions.

  • StreetScene (Ramachandra and Jones 2020) \(\Rightarrow\) This dataset consists of 46 training video sequences and 35 testing video sequences taken from a static USB camera looking down on a two-lane street with bike lanes and pedestrian sidewalks. Videos were collected from the camera during daylight at various times in two consecutive summers. The main recorded activities are cars driving, turning, stopping, and parking; pedestrians walking, jogging, and pushing strollers; and bikers riding in bike lanes.

  • CitySCENE (Lao 2020) \(\Rightarrow\) This dataset consists of a variety of real-world anomalies, which include carrying objects, crowds, graffiti, sweeping, smoking, and walking dogs. The training and testing sets are trimmed and untrimmed, respectively. It can be used to compare algorithms for general and specific real-world anomaly detection (Lao 2020).

  • AI-CityChallenge20:T4 (Naphade et al. 2020) \(\Rightarrow\) The 4th annual edition of the AI City Challenge has four challenging tracks. Track 4 (T4) addressed traffic anomaly detection (Naphade et al. 2020). The anomalous behaviors mainly consisted of vehicles driving off the road, stalled vehicles, and crashes. The viewing angles, weather, and lighting conditions of each video produced a unique and challenging dataset (Doshi and Yilmaz 2020b). More than 25 h of video data were captured on highways in Iowa, USA.

  • RWF2000 (Cheng et al. 2020) \(\Rightarrow\) The RWF2000 (Real-World Fighting) dataset comprises 2000 real-life surveillance-video clips downloaded from youtube.com. Each video clip has a duration of 5 s with a frame rate of 30 fps. Because the clips were extracted from roughly 1000 unique videos, the authors (Cheng et al. 2020) manually checked the training and testing sets to avoid data leakage. Anomalous behaviors mainly involve in-crowd, two-person, and multiple-person scenarios, making this a complex and demanding identification-modeling task (Ullah et al. 2022).

  • JHU-CROWD++ (Sindagi et al. 2020) \(\Rightarrow\) Sindagi et al. (2020) introduced a large-scale unconstrained crowd-counting dataset named JHU-CROWD++, which was collected under a variety of diverse scenarios and environmental conditions, including weather-based degradation and illumination variations. The JHU-CROWD++ consists of a set of annotations at both the image and head levels. The images were annotated with the help of Amazon Mechanical Turk workers (Wikipedia 2022).

  • AI-City Challenge21:T4 (Naphade et al. 2021) \(\Rightarrow\) The fifth AI City Challenge has five tracks. Track 4 consists of 100 training videos and 150 testing videos, each with an approximate length of 15 min. The videos were captured at a frame rate of 30 fps. The purpose of the challenge is to devise an algorithm that is capable of identifying all anomalies with minimum false alarms and detection delays. More than 62 h of video data were captured on highways in Iowa, USA.

  • NeuroAED (Chen et al. 2021b) \(\Rightarrow\) This dataset was acquired with a stationary neuromorphic vision sensor (DAVIS346) mounted on top of a retractable tripod with a maximum elongation of five meters. Pan-tilt was used to adjust the camera angle to ensure the coverage of the entire region of interest. It consists of four sub-datasets, namely walking, campus, square, and stair. The abnormal events present are labeled as bikes or motorcycles.

  • X-MAN (Szymanowicz et al. 2021) \(\Rightarrow\) The X-MAN (eXplanations of Multiple sources of ANomalies) dataset evaluates anomaly explanation methods. It consists of 22722 manually labeled frames in UCSD Ped2 (Chan et al. 2008) (1648), CUHK-Avenue (Lu et al. 2013a) (3712), and ShanghaiTech Campus (Luo et al. 2017b) (17362). Each frame holds between one and five explanation labels. Each label corresponds to a distinct reason why the frame is anomalous.

  • NOLA (Doshi and Yilmaz 2022b) \(\Rightarrow\) This dataset focused on a single-scene setup with 110 training video segments in 11 splits and 50 test segments captured during day and night, as well as on various days of the week, using a single moving camera from a famous street in New Orleans, USA.

7.2.3 Challenges of real-world crowd datasets

To analyze video sequences, ground truth video sequences are essential. Unfortunately, producing and annotating such ground truth video sequences at a reasonable level of detail is very time-consuming. For example, in the experience of Blunsden and Fisher (2010), one hour of video (roughly 90000 frames) took approximately six person-months to annotate at the level of individual bounding boxes and frame-by-frame behavior. Moreover, it is extremely hard to collect a sufficient number of videos for specific abnormal crowd behaviors. As a dataset increases in size, more challenges arise (e.g., the problem of manually labeling videos with different behaviors). In addition to confidentiality and time-sensitivity issues, some common challenges in real-world crowd datasets arise from:

  • The limited number of training samples;

  • The inclusion of only crowd images—a deep network can erroneously predict crowd even in scenes that do not contain crowd;

  • Limited annotations;

  • A lack of diversity regarding scenes and viewing angles (Luo et al. 2017b);

  • Not including adverse weather conditions, such as haze, snow, and rain (Sindagi et al. 2020);

  • Not including shifting illumination and man-made changes;

  • Not including as many anomaly events as possible for anomaly detection cases;

  • Neglecting to cover a wide range of geographical regions (Jafarzadeh et al. 2021) and camera devices.

7.3 Comparison with specifications of crowd datasets

Most of the aforementioned crowd datasets present considerable challenges. Sorting them by year, Table 5 compares their numerous specifications. Figure 5 depicts the usage frequency of the datasets considering Equation (16) and Table 6.

Fig. 5 Dataset usage frequency from 2020 to 2022 considering Table 6. The UCSD became the key standard benchmark dataset for anomaly detection

Figure 5 shows that CUHK-Avenue (Lu et al. 2013a), ShanghaiTech Campus (Luo et al. 2017b), UCSD (Chan et al. 2008), and (Minnesota 2022) are the most popular datasets. Furthermore, it is noticeable that the real-world dataset UCSD (Chan et al. 2008) has become the predominant standard benchmark for anomaly detection. This dataset contains two shooting angles. The first shooting angle involves pedestrians approaching or moving away from the surveillance camera, whereas the second shooting angle shows pedestrians moving parallel to the camera. Only pedestrians appear in normal events, and abnormal events include bikes and trucks. The major challenge of this dataset is that the density of pedestrians on the road constantly changes from sparse to crowded (Li et al. 2022a). Nonetheless, the utilization of Ped2 in the literature is higher than that of Ped1 because of its resolution. For example, Hinami et al. (2017) selected Ped2 because Ped1 has a significantly lower frame resolution of \(158\times 240\), which would have made it difficult to capture objects in their model. Furthermore, in Ped1, there is an acute angle between the camera view and the sidewalk, and the height of the camera is lower than that of Ped2. Thus, the apparent size of a person varies, which also affects otherwise easily perceived anomalies (e.g., skateboarders and bicycles) (Hao et al. 2022). Besides, some characteristics are needlessly labelled as anomalies (Cai et al. 2021).

Notably, Chang et al. (2022b) employed only Ped2 because some UCSD events (Chan et al. 2008) are labeled as normal in the training set but are considered anomalous in the testing set. Moreover, the camera in Ped2 is mounted higher, so the camera viewpoint is approximately perpendicular to the sidewalk and the size of a person is comparatively fixed in Ped2 (Hao et al. 2022). Anomalies in the UCSD (Chan et al. 2008) videos are assumed to be simple, so real-world anomalous events are not sufficiently reflected by this surveillance dataset (Mohammadi et al. 2021). Despite being extensively used as a benchmark dataset, most anomalies in UCSD (Chan et al. 2008) are obvious and can be effortlessly recognized from a single frame (Doshi and Yilmaz 2022b).

The real-world CUHK-Avenue dataset (Lu et al. 2013a) secured the second-best benchmark position among crowd datasets. The key reason could be that authors want to validate the performance of their models against the real-world challenges of this dataset. The challenges of the CUHK-Avenue (Lu et al. 2013a) dataset include: (1) a slight camera shake present in frames 1051–1100 of test video 2 (Li et al. 2022a); (2) the varying size and scale of people in the dataset because of the changing positions and view angles of the cameras (Li et al. 2022a; Chen et al. 2021a; Hao et al. 2022; Chang et al. 2022b); and (3) the inclusion of a few outliers in the training data, in addition to some normal patterns seldom appearing in the training data (Li et al. 2022a).

The real-world ShanghaiTech Campus dataset (Luo et al. 2017b) became the third most frequently utilized benchmark. It differs from the UCSD and CUHK-Avenue datasets because it covers 13 different scenes (e.g., streets, squares, entrances of cafeterias, etc.) for both training and testing (Feng et al. 2021b; Chen et al. 2021a; Li et al. 2022a; Hao et al. 2022). Its abnormal events span various situations. In addition to abnormal objects (e.g., vehicles and bicycles), abnormal behaviors (e.g., fighting and robbing) are also collected for testing (Hao et al. 2022). The ratios of each scene in the training and test sets can vary (Feng et al. 2021b). All these distinctive challenges make the ShanghaiTech Campus (Luo et al. 2017b) dataset unique for validating many algorithms. However, the videos are captured from 13 distinct cameras, which joins them into a multi-scene composition. Additionally, handling it as 13 individual datasets restricts the number of accessible training frames for each scene (Doshi and Yilmaz 2022b). Furthermore, a runtime error can occur if the dataset's size exceeds the computer memory.

Although the UCSD (Chan et al. 2008), CUHK-Avenue (Lu et al. 2013a), and ShanghaiTech Campus (Luo et al. 2017b) datasets are very popular, they contain relatively clean data that were recorded at similar times of day and in clear-weather conditions. As a result, these datasets are not substitutes for real-world surveillance footage, where external factors including weather and the time of day can affect the quality of the accumulated frames (Leroux et al. 2022). Additionally, Gudovskiy et al. (2022) did not use the Subway (Adam et al. 2008) dataset because different works (Lu et al. 2013a; Hasan et al. 2016) annotate dissimilar ground truths, and non-identical ground truth complicates the performance assessment of different methods (Gudovskiy et al. 2022).

7.4 Our observations

7.4.1 Usage of existing datasets

The performance of deep learning models can often be improved by feeding them more training data. In essence, the UCSD dataset is simple. However, a deep learning model's poor temporal feature extraction sometimes leads to low performance on UCSD (Esquivel and Zavaleta 2022). Ped1 holds additional anomalies that can best be captured by learning the normal temporal features. For example, skateboards in Ped1 have a shape that is difficult for the human eye to recognize, but they are noticed due to their speed. Usually, a deep learning model can perform well in sparsely populated scenes (e.g., UCSD). However, the same model can lose its superiority in dense crowd scenes (e.g., ShanghaiTech Campus). Thus, the probability of success when testing any new deep learning model on the UCSD dataset is higher than on the ShanghaiTech Campus dataset. Furthermore, both real-world (e.g., UCSD) and synthetic (e.g., Minnesota 2022) datasets can help to show the robustness of the underlying models. To this end, a great number of researchers (see Table 6) validated their models using both the UCSD and (Minnesota 2022) datasets. All these factors not only made UCSD a top choice among existing datasets but also made (Minnesota 2022) more popular than other available synthetic crowd datasets.

7.4.2 Own problem of datasets

Neural networks trained on popular datasets can suffer from overinterpretation. In overinterpretation, algorithms make confident predictions based on details that do not make sense to humans (e.g., random patterns and image borders). For example, neural models trained on the CIFAR-10 (Harwood 2009) dataset made confident predictions even when 95 percent of the input images were missing, with the remaining pixels being meaningless to humans (AcademicGates 2021). Deep learning models can latch onto both meaningful and meaningless subtle signals. When deep learning classifiers are trained on crowd datasets (e.g., ImageNet (Deng et al. 2009)), they can make seemingly reliable predictions based on both sensible and senseless signals. Thus, overinterpretation is fundamentally a dataset problem. This problem cannot be diagnosed using typical evaluation methods based on model accuracy. Regarding datasets, a common question can be asked: How can we adapt any crowd dataset so that deep learning models can be trained to closely mimic how a human thinks when categorizing images from real-world scenarios?

Fig. 6 A novel taxonomy of deep crowd anomaly detection models based on the principal part of each model in Table 6

8 Succinct survey of deep crowd anomaly detection

There are different types of deep learning algorithms, but not all of them are employed in crowd anomaly detection. Recently, a long list of deep learning algorithms that achieved valuable results for crowd anomaly detection has appeared in the literature. This section reviews the literature available from 2020 to 2022. First, we provide a qualitative and quantitative summary of deep learning-based crowd anomaly detection methods in Table 6. We then classify the existing algorithms based on the information in Table 6. At the end, we highlight our observations.

8.1 Taxonomy

Sanchez et al. (2020) grouped deep anomaly detection models depending on their nature when performing anomaly detection tasks, e.g., extracting deep features and then applying either OCSVM or Gaussian models, training a deep learning model capable of reconstructing the original image from its compressed representation and then detecting an anomaly from the reconstruction, making use of an anomaly score directly (i.e., end-to-end) without utilizing deep learning models as feature extractors, etc. While the taxonomy of Sanchez et al. (2020) entails a common view of crowd behavior analysis, our work uses it differently and more comprehensively by considering only more recent studies. Our taxonomy is based on the key modules or techniques (e.g., CNN, LSTM, AE, etc.) used in each model. According to the information shown in Table 6, we primarily classify the deep crowd anomaly detection algorithms into CNN-, LSTM-, AE-, GAN-, U-Net-, YOLO-, and AN-based models. Usually, a deep crowd anomaly detection model consists of more than one module or technique, and sometimes the roles of the modules are inseparable when sorting out their principal parts. This makes the categorization of deep learning models arduous. The taxonomy presented in Fig. 6 is based on the principal part of each model in Table 6. For example, Hasan et al. (2016) applied an AE as a GAN to utilize frame-level generation errors (Wang et al. 2020c) to detect anomalies. In this case, we regard the GAN as the principal module, and thus the model belongs to the GAN category. When the roles of two or more principal modules of a model carry similar weight, the model falls into a hybrid category. The sundry category contains many algorithms of different kinds.

8.2 CNNs- or ConvNets-based methods

CNNs or ConvNets were introduced in the 1980s by Yann LeCun. The early version of ConvNet was called LeNet, named after LeCun, and could recognize handwritten digits (LeCun et al. 1989). Currently, the CNN is a popular model in computer vision, and it has the advantage of making good use of the correlation information in data. It is mainly composed of convolution layers, most often followed by an activation function, pooling layers, and some fully connected layers. The output nodes of the convolution and pooling layers are arranged into a 2D plane named a feature map. Conventionally, each plane of a layer originates from the combination of one or more planes of preceding layers. The nodes of a plane are connected to a small portion of each attached plane of the preceding layer. Each node of the convolution layer extracts features from the input images through convolution operations on the input nodes. In a convolutional layer, feature maps from preceding layers are convolved with learnable kernels. The output of the kernels is passed through a linear or non-linear activation function (e.g., Softmax, sigmoid, hyperbolic tangent, or rectified linear) to form the output feature maps. Every output feature map can be fused with more than one input feature map. Mathematically, this can be expressed by a function f(.) using Eq. 17 as:

$$\begin{aligned} output_{k} = f\left( \sum _{j\in map_k} (output_j)(kernel_{jk}) + bias_k\right) , \end{aligned}$$
(17)

where \(output_{k}\), \(output_j\), \(kernel_{jk}\), \(bias_k\), and \(map_k\) indicate the current layer output, the previous layer output, the kernel for the current layer, the bias for the current layer, and the selection of input maps, respectively. Figure 7 demonstrates a typical CNN architecture. It includes an input layer, multiple alternating convolution (with ReLU activation) and pooling layers, as well as one fully connected layer. The convolution and pooling layers act as feature extractors from the input image, while the fully connected layer acts as a classifier. Each layer of the network takes the output of its immediately preceding layer as its input and passes its own output as the input to the next layer.

Fig. 7 A typical CNN architecture
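As a concrete illustration of the layer arrangement sketched in Fig. 7, the following is a minimal PyTorch-style example; the channel counts, input size, and number of classes are illustrative assumptions rather than the configuration of any surveyed model:

```python
import torch
import torch.nn as nn

# Minimal CNN mirroring Fig. 7: alternating convolution + ReLU and pooling
# layers acting as a feature extractor, followed by one fully connected
# classifier layer. All sizes below are illustrative assumptions.
class TypicalCNN(nn.Module):
    def __init__(self, in_channels=3, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                      # halves the spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # for 64x64 inputs

    def forward(self, x):
        x = self.features(x)                      # feature extraction
        x = torch.flatten(x, start_dim=1)         # flatten per sample
        return self.classifier(x)                 # classification

# Example: a batch of four 64x64 RGB frames.
logits = TypicalCNN()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```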

The input to a CNN is either the raw or normalized pixel values of an image of size \(\Psi \times \Psi \times C\), where \(\Psi\) denotes the height and width (assumed equal) and C denotes the number of color channels. The convolution layer contains \(\zeta\) filters of size \(\xi \times \xi\), with \(\xi <\Psi\). Each filter is convolved with the input image by sliding the filter with a stride of \(\mathcal {S}\) pixels. This results in a feature map whose size can be calculated by Eq. 18 as:

$$\begin{aligned} \textit{The size of feature map} = \left( \frac{\Psi -\xi +2\mathcal {P}}{\mathcal {S}}+1\right) \left( \frac{\Psi -\xi +2\mathcal {P}}{\mathcal {S}}+1\right) , \end{aligned}$$
(18)

where \(\mathcal {P}\) indicates the amount of zero padding applied on the perimeter of the input image. Padding is mostly employed during convolution operations to ensure that the input and output feature maps have identical dimensions; zero padding is the most common choice for this purpose. Note that \(\zeta\), \(\mathcal {S}\), \(\mathcal {P}\), and \(\xi\) are user-defined parameters.
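A small helper makes Eq. 18 easy to check for concrete settings; the sample values below are illustrative assumptions:

```python
def feature_map_size(psi, xi, padding, stride):
    """Side length of the output feature map per Eq. 18."""
    return (psi - xi + 2 * padding) // stride + 1

# Illustrative example: a 64x64 input with 3x3 filters and padding 1
# keeps the spatial size at 64x64 for stride 1 and halves it for stride 2.
print(feature_map_size(64, 3, 1, 1))  # 64
print(feature_map_size(64, 3, 1, 2))  # 32
```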

The convolutional layer is the core building block of a CNN, and it represents the backbone of many deep learning models. A CNN adopts the convolution product as the main operation of each layer. Convolutional layers convolve the input and pass the result to the next layer. The convolutional and pooling layers of a CNN can operate on data with a Euclidean or grid-like structure (e.g., images). Two accepted options exist when training a CNN on images. The first option is to train the domain-specific problem statement from scratch. The second option is to use a pre-trained model, which is frequently referred to as transfer learning (Tripathi et al. 2020). A CNN has a number of parameters called hyperparameters, which determine the network structure (e.g., the number of hidden units). Hyperparameters are variables such as the number of hidden layers, the learning rate, the batch size, or the number of epochs. The depth of a network is a crucial feature to consider when attempting to achieve optimal outcomes (Mansour et al. 2021). The number of layers used in deep learning ranges from five to more than a thousand (Ferentinos 2018; Szegedy et al. 2015; Simonyan and Zisserman 2015). The values of the hyperparameters are configured before training the network (Radhakrishnan 2021).

In some cases, the number of network parameters can be considered to compare the CNN models (Alom et al. 2019). The number of parameters of the \(i^{th}\) layer can be calculated using either Eq. 19 or Eq. 20 as:

$$\begin{aligned} \textit{Number of parameters without bias} = \xi \times \xi \times L_{i-1} \times L_{i}, \end{aligned}$$
(19)
$$\begin{aligned} \textit{Number of parameters with bias} = ((\xi \times \xi \times L_{i-1})+1) \times L_{i}, \end{aligned}$$
(20)

where \(L_{i-1}\) and \(L_{i}\) denote the number of filters in the previous layer (i.e., the total number of input feature maps or channels) and the number of filters in the current layer (i.e., the total number of output feature maps), respectively. In Eq. 20, 1 is added due to the bias term for each filter. If the \(i^{th}\) layer has \(\xi =4\), \(L_{i-1}=32\), and \(L_{i}=64\), then the total number of parameters with a bias for this layer is \(((4\times 4\times 32)+1)\times 64\) \(=32832\).
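These two formulas can be wrapped in a short helper that reproduces the worked example above:

```python
def conv_params(xi, filters_prev, filters_curr, bias=True):
    """Number of learnable parameters of a convolution layer (Eqs. 19-20)."""
    per_filter = xi * xi * filters_prev + (1 if bias else 0)
    return per_filter * filters_curr

# Worked example from the text: xi = 4, 32 input maps, 64 output maps.
print(conv_params(4, 32, 64, bias=False))  # 32768 (Eq. 19)
print(conv_params(4, 32, 64, bias=True))   # 32832 (Eq. 20)
```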

There exist many popular pre-trained CNN models, e.g., VGGNet (Simonyan and Zisserman 2015), GoogleNet (Szegedy et al. 2015), ResNet50 (He et al. 2016), DenseNet (Huang et al. 2017), etc. Notably, the features picked up by pre-trained CNN models can be quickly employed for a number of different problems that they were not initially designed to resolve (Gutoski et al. 2021). Choosing a suitable CNN structure is important for the trained model (Mansour et al. 2021). The Adam optimizer (Kingma and Ba 2015) is frequently utilized for CNNs as it performs significantly better than the Nesterov momentum optimizer (Hur et al. 2021).

8.2.1 Pre-trained 2DCNN

Instead of constructing a model from scratch to solve a similar problem, we can use a pre-trained model that has already been built and trained to solve certain problems. For example, VGGNet16 (Simonyan and Zisserman 2015) is a pre-trained CNN model that can be employed to extract spatial features and to achieve high-accuracy image recognition because of the depth of its network (Vignesh et al. 2017). Basically, the VGGNet16 (Simonyan and Zisserman 2015) model highlights that the depth of a network is a critical component for gaining better recognition or classification accuracy in CNNs. Figure 8 (a) shows its cardinal building block. The number of convolution layers varies across VGGNets, e.g., VGGNet16 and VGGNet19 have 13 and 16 convolution layers, respectively. The final layer of the VGGNet model is a Softmax layer for classification. All VGG variants share the same last three fully connected layers.

Fig. 8 (a) and (b) depict the fundamental building blocks of VGGNet and ResNet, respectively

Ahmed et al. (2021) used VGGNet16 for crowd detection and for analyzing gender and age. As VGGNet16 alone makes it difficult to accurately represent the temporal relationships of the input video sequences, Ye et al. (2020) employed VGGNet16 to extract the spatial features, and the obtained feature maps were then fed into an LSTM to further extract the temporal features of the input video clips. Al-Dhamari et al. (2020) applied the pre-trained VGGNet19 (Simonyan and Zisserman 2015) model to extract descriptive features. They showed that VGGNet19 had higher detection accuracy on the Ped1 and Minnesota (2022) datasets than other pre-trained networks, such as GoogleNet (Szegedy et al. 2015), ResNet50 (He et al. 2016), AlexNet (Krizhevsky et al. 2012), and VGGNet16 (Al-Dhamari et al. 2020). Rezaei and Yazdi (2021) combined the advantages of both the pre-trained VGGNet and a multi-layer non-negative matrix factorization for anomaly detection in crowd video surveillance. AlexNet is considered a significant breakthrough in the computer vision area (Krizhevsky et al. 2017). It reduced the classification error rate from 26% to 15%, which was a major improvement (Al-Dhamari et al. 2020). The knowledge transfer of the low-level features from AlexNet is entirely feasible (Franzoni et al. 2020). To this end, Almazroey and Jarraya (2020) employed the pre-trained AlexNet to extract high-level features for abnormal event and behavior detection in crowd scenes.
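As a sketch of how such a pre-trained network is typically reused as a frozen spatial feature extractor (assuming a torchvision-style API; the exact weight-loading argument and the chosen input size are assumptions that vary across library versions and studies):

```python
import torch
import torchvision.models as models

# Load VGGNet16 pre-trained on ImageNet and freeze its parameters
# (torchvision API; the weights argument differs slightly between versions).
vgg16 = models.vgg16(weights="IMAGENET1K_V1")
vgg16.eval()
for p in vgg16.parameters():
    p.requires_grad = False

# Keep only the convolutional feature extractor and drop the classifier head.
feature_extractor = vgg16.features

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)   # one illustrative RGB frame
    feats = feature_extractor(frame)       # spatial feature maps
print(feats.shape)                         # torch.Size([1, 512, 7, 7])
```

The extracted maps can then be passed to a downstream temporal model (e.g., an LSTM) or to a shallow classifier, which is the reuse pattern followed by several of the works cited above.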

ResNet (He et al. 2016) was developed with the aim of designing ultra-deep networks that do not suffer from the vanishing gradient problem. It was developed with various numbers of layers, including 34, 50, 101, 152, and 1202. Fundamentally, the ResNet is a standard feedforward network with residual connections. In general, the ResNet consists of several basic residual blocks. Figure 8 (b) shows an elementary diagram of the residual block. The output of any residual layer \(z_l\) can be defined using Eq. 21 as:

$$\begin{aligned} z_l = \mathcal {F}(z_{l-1}) + z_{l-1}, \end{aligned}$$
(21)

where \(z_{l-1}\) indicates the output of the preceding layer, whereas \(\mathcal {F}(z_{l-1})\) denotes the output after performing several operations (e.g., convolutions with filters of distinct sizes and batch normalization followed by an activation function). The ResNet has been successfully applied to crowd anomaly detection tasks. For example, Pang et al. (2020) performed an initial detection to create pseudo anomalous and normal frame sets from a set of unlabeled videos, and then those sets were used to train a (pre-trained) ResNet50 model and a fully connected network in an end-to-end fashion. Ilyas et al. (2021) extracted spatiotemporal deep features from video frames using two ResNet101 models for crowd anomaly detection. Upon performing panoptic segmentation (Kirillov et al. 2019), Wu et al. (2020a) used pre-trained models (e.g., ResNet101) to generate high-level features from video streams. Mansour et al. (2021) employed ResNet as the baseline network to act as an effective feature extractor for intelligent video anomaly detection and classification. Doshi and Yilmaz (2021a) trained a Squeeze and Excitation Network (SENet) (Hu et al. 2017) with a depth of 152 for anomaly detection in traffic videos. However, due to the lack of ground truth, they were unable to evaluate the performance of their model. Aljaloud and Ullah (2021) compared the performance of their irregularity-aware semi-supervised deep learning model, the baseline CNN model, and the WideResNet model (Zagoruyko and Komodakis 2016) for the detection of unusual events. They claimed that their model provided superior results by learning useful information from unlabeled data during the training stage. Tripathi et al. (2020) applied a variation of the Xception (Chollet 2017) model for crowd emotion analysis. For the detection of anomalies in crowd videos, Mehmood (2021) used a two-stream CNN structure (Simonyan and Zisserman 2014) consisting of a spatial and a temporal stream. A modified Xception was used as the pre-trained 2D CNN in each of the two streams, and it outperformed Inception-v1 (Szegedy et al. 2015) and DenseNet (Huang et al. 2017). Bahrami et al. (2021) used a single-frame anomaly detection model that classified images by reconstructing them using a pre-trained Xception encoder and a decoder similar to Xception, with the order of the layers partially reversed. To detect video anomalies, Gutoski et al. (2021) performed a comparative study of transfer learning approaches using 12 different pre-trained 2DCNN models (e.g., GoogleNet, ResNet50, AlexNet, VGGNet, InceptionV3, DenseNet, etc.) and 7 benchmark datasets. They demonstrated that CNNs pretrained on ImageNet yield powerful feature extractors, leading to approximately state-of-the-art performance on video anomaly detection. Chen et al. (2022) adopted a 2D convolutional backbone (e.g., ResNet101) for extracting the general-purpose feature vector. Their main objective was to train a task-driven feature extractor by applying an MLP for distribution alignment of video data. Gudovskiy et al. (2022) suggested using a conditional normalizing flow framework for estimating the exact data likelihoods. They applied ResNet and WideResNet (Zagoruyko and Komodakis 2016) as examples of their CNN feature extractors. Using VGGNet16, Tsai et al. (2022) trained their model to extract representative patch features from normal images. They concluded that taking into account the global and local context of an image simultaneously leads to excellent representation learning for image anomaly detection.
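The shortcut connection of Eq. 21 maps naturally to code; the block below is a minimal sketch with assumed channel counts, not a reproduction of any specific ResNet variant cited above:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: z_l = F(z_{l-1}) + z_{l-1} (Eq. 21)."""
    def __init__(self, channels=64):
        super().__init__()
        # F(.): two convolutions with batch normalization and an activation.
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, z_prev):
        return self.relu(self.f(z_prev) + z_prev)  # identity shortcut

out = ResidualBlock()(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```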

8.2.2 Pre-trained 3DCNN

When 2DCNNs are applied to detect anomalous events in video sequences of crowd scenes, temporal-domain behavior characteristics are neglected. In 3D convolution, a 3D filter can move in all three directions (e.g., the height, width, and channel of the image). The 3DCNNs are structured by changing the internal convolution layers from 2D to 3D operations (Lee et al. 2021). The input and output feature maps are illustrated in Fig. 9 (a) and Fig. 9 (b) for 2DCNN and 3DCNN, respectively. Usually, the performance of a 3DCNN is better than that of a 2DCNN, as the 3DCNN uses an additional dimension to represent the temporal changes across consecutive frames.

Fig. 9 (a) and (b) indicate the comparison of convolution operations between 2DCNN and 3DCNN, respectively

Explicitly, in 3DCNNs, convolution and pooling operations are performed spatiotemporally, while in 2DCNNs they are done only spatially. The 2DCNN loses the temporal information of the input signal immediately after every convolution operation (Tran et al. 2015). Although 3D kernels tend to overfit because of their large number of parameters, the performance of 3DCNNs is greatly improved by using larger video databases (Hara et al. 2017). The 3DCNNs are more suitable for spatiotemporal feature learning than 2DCNNs, and thus their performance is better than that of 2DCNNs. Explicitly, 3DCNN models can encapsulate information related to shapes and motions in video sequences better than 2DCNN-based models, thus boosting the anomaly detection accuracy of 3DCNN models. Li et al. (2020c) proposed an anomaly detection algorithm based on 3D fully convolutional networks (e.g., 3DCNN; Tran et al. 2015). After background subtraction, Nasaruddin et al. (2020) fed the attention regions into a pre-trained feature extractor (e.g., 3DCNN; Tran et al. 2015) for deep anomaly detection. Zaheer et al. (2020) proposed a weakly-supervised approach to learn anomalous events using video-level labels. They computed spatiotemporal features by employing a pre-trained feature extractor model (e.g., 3DCNN). Zahid et al. (2020) applied an ensemble framework for anomaly detection in surveillance videos. They segmented videos into 60-frame clips to localize anomalies temporally using Inception-v3 and a pre-trained 3DCNN feature extractor. Hu et al. (2020b) used a pre-trained 3D VGGNet16 model for anomaly detection and localization in crowded scenes. Their suggested model could achieve promising performance, especially under the pixel-level criterion and the dual-pixel-level criterion. Sarker et al. (2021) extracted spatiotemporal features for each video segment using a T-C3D feature extractor (Liu et al. 2018a). The T-C3D model is pre-trained on the Kinetics dataset (Sarker et al. 2021). The architecture of 3DCNNs is relatively shallow compared with many deep 2D-based CNNs (e.g., ResNets) (Hara et al. 2017). The 2D ResNets introduce shortcut connections that bypass a signal from one layer to the next. Furthermore, Hara et al. (2017) simply extended the 2D-based ResNets to 3D ones to generate a standard pre-trained model for spatiotemporal recognition. The difference between the 3D ResNet and the original 2D ResNet is the number of dimensions of the convolutional kernels and pooling, i.e., 3D ResNets perform 3D convolution and pooling. The model trained on the Kinetics dataset (Kay et al. 2017) performs well without overfitting despite a large number of model parameters (Hara et al. 2017). Despite the very high computational time required to train the 3D ResNets (e.g., three weeks for the Kinetics dataset; Kay et al. 2017), some authors (e.g., Gao et al. 2021; Lin et al. 2021a) applied 3D ResNets for anomaly detection in crowds. Hao et al. (2022) also employed the 3D ResNet (Hara et al. 2017) for video anomaly detection with remarkable performance. In addition, Degardin and Proença (2021) described an iterative learning model using a 3DCNN for abnormal event detection. Zheng et al. (2022) proposed a model that learned both anomaly scores and certain abnormal event classes with only the supervision of video-level abnormal event labels, which could ameliorate the generalization ability and save detailed human annotations. They adopted a two-stream 3DCNN as the backbone network for their video classification tasks. Zhao et al. (2022) divided videos into small clips to avoid the complexity of their two-stream structure and employed a 3DCNN model to obtain clip-based spatiotemporal video features for detecting abnormal events in smart city management.
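To make the dimensional difference concrete, the snippet below contrasts the tensor shapes handled by 2D and 3D convolutions; the clip length and channel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A 2D convolution sees one frame at a time: (batch, channels, H, W).
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
frame = torch.randn(1, 3, 64, 64)
print(conv2d(frame).shape)       # torch.Size([1, 8, 64, 64])

# A 3D convolution also slides over time: (batch, channels, T, H, W),
# so temporal changes across consecutive frames are encoded in the output.
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
clip = torch.randn(1, 3, 16, 64, 64)   # an assumed 16-frame clip
print(conv3d(clip).shape)        # torch.Size([1, 8, 16, 64, 64])
```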

8.2.3 Fine-tuned CNN

Fine-tuning is an iterative process that reduces error rates. It takes a model pre-trained on a given task and then tweaks it for a second, similar task. Singh et al. (2020) used an ensemble of different fine-tuned CNNs based on the hypothesis that dissimilar CNN architectures learn different levels of semantic representations from crowd videos and that, hence, an ensemble of pre-trained CNNs (e.g., AlexNet, GoogleNet, and VGGNet trained via SGD (Robbins and Monro 1951) and AGD (Duchi et al. 2011)) can enable enriched feature sets to be extracted. Ullah et al. (2022) employed a self-pruned, fine-tuned lightweight CNN for classifying normal and anomalous events. A minimal sketch of the fine-tuning workflow is given below.
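The sketch assumes a torchvision-style pre-trained ResNet50 and a binary normal/anomalous head; the freezing strategy and layer names are illustrative assumptions, not the procedures of the cited works:

```python
import torch.nn as nn
import torchvision.models as models

# Start from a model pre-trained on ImageNet (torchvision API).
model = models.resnet50(weights="IMAGENET1K_V1")

# Freeze the pre-trained backbone so only the new head is updated at first.
for p in model.parameters():
    p.requires_grad = False

# Replace the classifier head for the second, similar task
# (here: normal vs. anomalous).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are handed to the optimizer; in later
# fine-tuning rounds, deeper layers can be unfrozen with a smaller
# learning rate to adapt the pre-trained features to the new task.
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # number of tuned parameters
```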

8.3 LSTMs-based methods

LSTM (Hochreiter and Schmidhuber 1997) units are an advancement of the generic building blocks of the RNN (Nawaratne et al. 2020). An LSTM is designed to model temporal sequences. It keeps special units called memory blocks in the recurrent hidden layer. The memory blocks hold memory cells with self-connections, which store the temporal state of the network, along with gates (i.e., special multiplicative units) that manage the flow of information. Each memory block in the original architecture contained an input and an output gate. The input gate manages the flow of input activations into the memory cell. The output gate supervises the output flow of cell activations into the rest of the network. The forget gate was later attached to the memory block (Gers et al. 2000). An LSTM works differently from a CNN because an LSTM is customarily used to process and make predictions from a given sequence of data. Figure 10 illustrates the principal structure of the LSTM unit. Its central idea is the cell state. The LSTM adds information to or removes it from the cell state through structures called gates. The input gate (\(i_t\)), forget gate (\(f_t\)), and output gate (\(o_t\)) can be mathematically formulated using Eqs. 22-27 as:

$$\begin{aligned} f_t&= \sigma (W_f\cdot [h_{t-1},x_t] + b_f) \end{aligned}$$
(22)
$$\begin{aligned} i_t&= \sigma (W_i\cdot [h_{t-1},x_t] + b_i) \end{aligned}$$
(23)
$$\begin{aligned} \widetilde{C}_t&= tanh(W_c\cdot [h_{t-1},x_t] + b_c) \end{aligned}$$
(24)
$$\begin{aligned} C_t&= f_t \,C_{t-1} + i_t\,\widetilde{C}_t \end{aligned}$$
(25)
$$\begin{aligned} o_t&= \sigma (W_o\cdot [h_{t-1},x_t] + b_o) \end{aligned}$$
(26)
$$\begin{aligned} h_t&= o_t\, tanh(C_t) \end{aligned}$$
(27)

where \(x_t\), \(h_{t-1}\), \(\sigma\), \(W_*\), \(b_*\), \(\widetilde{C}_t\), and \(C_t\) indicate the input vector at time step t, the hidden state at time step \(t-1\), the sigmoid activation function, the weight matrix corresponding to each gate, the bias corresponding to each gate, the candidate cell state, and the cell state at time t, respectively.
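For completeness, a single LSTM time step following Eqs. 22-27 can be sketched in NumPy as follows; the input and hidden dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step per Eqs. 22-27; W and b hold per-gate weights and biases."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate, Eq. 22
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate, Eq. 23
    c_tilde = np.tanh(W["c"] @ z + b["c"])       # candidate state, Eq. 24
    c_t = f_t * c_prev + i_t * c_tilde           # cell state update, Eq. 25
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate, Eq. 26
    h_t = o_t * np.tanh(c_t)                     # hidden state, Eq. 27
    return h_t, c_t

# Illustrative sizes: 8-dimensional input, 16-dimensional hidden state.
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((16, 24)) * 0.1 for k in "fico"}
b = {k: np.zeros(16) for k in "fico"}
h, c = lstm_step(rng.standard_normal(8), np.zeros(16), np.zeros(16), W, b)
print(h.shape, c.shape)  # (16,) (16,)
```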

Fig. 10 The principal design of the LSTM unit. The cell remembers values over arbitrary time intervals, while the three gates regulate the flow of information into and out of the cell

In many anomaly detection methods for videos, time modeling mostly adopts 3D convolutions (Zhao et al. 2017) or ConvLSTM structures (Shi et al. 2015; Chong and Tay 2017; Luo et al. 2017a). However, methods using 3D convolutions need to calculate a large number of parameters and require time-consuming training (Cai et al. 2021). As a result, the LSTM can be more suitable for temporal information modeling.

8.3.1 Classical LSTM

Xia and Li (2021b) used LSTM to decode historical feature sequences with temporal attention for predicting the features. However, Xia and Li (2021a) applied LSTM to predict current-dimension-reduced HOG (histogram of oriented gradients) as the appearance feature and HOF (histogram of optical flow) as the movement feature. Moustafa and Gomaa (2020) proposed an LSTM-based approach for pathway and crowd anomaly detection, where the crowd scene was divided into a number of static-overlapped spatial regions.

8.3.2 CNN-LSTM

Deep CNNs are a kind of common deep neural network that is suitable for spatial relationship learning on raw input data. CNNs, RNNs, and other deep learning models can learn better feature representations than hand-crafted feature models (Ye et al. 2020). For example, Arifoglu and Bouchachia (2019) combined 2D-CNN and LSTM to detect abnormal behavior in dementia sufferers. Among numerous CNN models, a CNN called VGGNet16 can be used to extract spatial features and for high-accuracy image recognition due to the depth of the network (Vignesh et al. 2017). Yet, it is tough for VGGNet16 to accurately represent the temporal relationship of the input video sequences. To overcome this limitation, Ye et al. (2020) applied 2D-VGGNet16 and LSTM models to extract the spatiotemporal features of video frames and then constructed the feature expectation subgraph for each key frame of every video. Sabih and Vishwakarma (2022) combined CNN and LSTM to solve crowd anomaly detection problems.

8.3.3 ConvLSTM

In 2015, Shi et al. (2015) proposed ConvLSTM for obtaining convolutional structures in both the input-to-state and state-to-state transitions by extending the fully connected LSTM (FC-LSTM). The difference between ConvLSTM (Shi et al. 2015) and FC-LSTM (Graves 2013; Srivastava et al. 2015) is that the matrix operations of FC-LSTM are replaced with convolutions. This operation enables ConvLSTM to perform better with images than FC-LSTM (Zhou et al. 2020a). As the LSTM encoder–decoder generally fails to account for the global context of the learned representations with a fixed dimension representation, Zhou et al. (2020a) proposed two composite LSTM encoder–decoder models with a ConvLSTM unit to learn spatiotemporal features and detect abnormal events. Nawaratne et al. (2020) employed a series of CNN layers to learn spatial representations, as well as a series of ConvLSTM layers to learn temporal representations for anomaly detection. Results from experiments conducted on three benchmark datasets demonstrated the accuracy, robustness, low computational overhead and contextual indicators of their model. Li et al. (2020b) applied a motion model to efficiently detect anomalies in surveillance-video footage. They replaced the 3D convolutional layers of their motion model with ConvLSTM to study the performance of their model. They observed that the model employing ConvLSTM layers had delayed detection results and failed to detect the beginning of anomalous events. Furthermore, it also reported more false alarms after the anomalous events due to its slow response. However, the 3D convolutional layers might fit the data better and thus extract more representative features. As ConvLSTM is capable of forecasting behaviors and remembering the spatial data over time to avoid any crowd-related catastrophe, Varghese et al. (2020) devised a ConvLSTM-based model for predicting nine distinct crowd behaviors learned from their fuzzy computational model. Liu et al. (2022a) adopted a bidirectional (Song et al. 2018) ConvLSTM for handling temporal information when detecting video anomalies.
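As a rough sketch of the core idea of replacing the FC-LSTM matrix products with convolutions, the cell below fuses the four gate computations into one convolution over the concatenated input and hidden maps; the gate fusion and all sizes are implementation assumptions, not the formulation of any cited work:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: matrix products replaced by 2D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)      # cell state keeps its spatial layout
        h = o * torch.tanh(c)
        return h, c

# Illustrative use on a 10-frame clip of 32x32 feature maps.
cell = ConvLSTMCell(3, 16)
h = torch.zeros(1, 16, 32, 32)
c = torch.zeros(1, 16, 32, 32)
for t in range(10):
    h, c = cell(torch.randn(1, 3, 32, 32), h, c)
print(h.shape)  # torch.Size([1, 16, 32, 32])
```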

8.3.4 SFA-ConvLSTM

SFA-ConvLSTM learns multi-scale motion information. It performs attention encoding to focus on movement regions in surveillance environments where events occur locally. The fields of spatiotemporal encoding can be controlled by adjusting the dilation rate of its convolutional kernels. The anomaly detection model developed by Lee et al. (2020) consisted of a stack of four SFA-ConvLSTMs, a bidirectional multi-scale encoder, a scale-selective aggregator (e.g., ConvLSTM), and a spatial decoder.

8.3.5 Social-LSTM

Inspired by the application of RNNs in variegated sequence forecasting tasks (Graves et al. 2013; Vinyals et al. 2015; Cao et al. 2015), Alahi et al. (2016) introduced Social-LSTM for human trajectory forecasting in crowded spaces. Social-LSTM (Alahi et al. 2016) is an LSTM with a novel social-pooling layer that captures the social interactions of nearby pedestrians. Kothari et al. (2021) took up the challenge of modelling social interactions between pedestrians in crowds. They employed a Social-LSTM-based model for human trajectory forecasting in crowds (Alahi et al. 2016).

8.4 AE-based methods

AEs were first introduced in the 1980s by Hinton and the PDP group (Rumelhart and McClelland 1987) to address the problem of back-propagation without a teacher by treating the input data as the teacher (Baldi 2012). Currently, AEs are used to learn efficient codings of unlabeled data in deep architectures for transfer learning. An AE consists of two key parts, namely an encoder and a decoder. The encoder maps the input into the code, whereas the decoder maps the code to a reconstruction of the input.

Fig. 11 (a) and (b) depict a traditional CAE and a skip-connected CAE, respectively. Through skip connections, the feature maps of the corresponding encoder and decoder layers are summed element-wise

Figure 11a shows the encoder and decoder networks of a standard CAE. The encoder network consists of a stack of four hidden layers with 64, 128, 256, and 512 convolutional filters, kernel sizes of 5, 5, 3, and 3, and strides of (1,2), (2,2), (2,2), and (2,2), respectively. The decoder has four transposed convolutional layers that mirror the encoder layers. Skip connections between the corresponding encoder and decoder layers allow the network to converge to a better optimum in pixel-wise prediction problems. Figure 11b demonstrates the skip connections between two corresponding convolutional and deconvolutional layers. The response from a convolutional layer is directly propagated to the corresponding mirrored deconvolutional layer, both forwardly and backwardly. Let the outputs from the encoder layer and the corresponding decoder layer be \(Output_{encoder_i}\) and \(Output_{decoder_i}\), respectively. The input to the next decoder layer \(Input_{decoder_{i+1}}\) is calculated by Eq. 28 as:

$$\begin{aligned} Input_{decoder_{i+1}} = Output_{encoder_i} \oplus Output_{decoder_i}, \end{aligned}$$
(28)

where \(\oplus\) denotes element-wise summation. The learning process of an AE minimizes the reconstruction error. In unsupervised anomaly detection, the AE is trained on normal segments by minimizing their reconstruction errors (Shi et al. 2021), and a thresholded reconstruction error is then utilized to detect anomalies. It is generally assumed that the reconstruction error will be lower for normal segments because they are close to the training data, whereas the reconstruction error becomes higher for abnormal segments (Hasan et al. 2016; Gong et al. 2019; Cho et al. 2022).
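A minimal sketch of this thresholded reconstruction-error scoring follows, with an assumed toy encoder-decoder and an illustrative threshold (not the configuration of any cited model):

```python
import torch
import torch.nn as nn

# A deliberately small convolutional AE; real models stack more layers
# (cf. the encoder/decoder configuration described for Fig. 11).
autoencoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),            # encoder
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # decoder
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
)

def anomaly_score(frame):
    """Per-frame reconstruction error; high values suggest an anomaly."""
    with torch.no_grad():
        recon = autoencoder(frame)
    return torch.mean((frame - recon) ** 2).item()

# After training on normal frames only, frames whose score exceeds a
# threshold chosen on validation data are flagged as anomalous.
threshold = 0.05                        # illustrative assumption
score = anomaly_score(torch.randn(1, 1, 64, 64))
print(score > threshold)
```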

8.4.1 DAE

The DAE is a variant of the basic AE (Vincent 2011). A DAE is trained to reconstruct a clean input from a corrupted version of it, and a basic DAE is learned by minimizing a reconstruction loss. A deep DAE can be built by stacking multiple hidden layers, whose multiple feature-representation spaces allow it to learn complicated distributions from the given samples (Lu et al. 2013b). The backpropagation algorithm (LeCun et al. 2015) is used to train the DAE (Wu et al. 2020a). For anomaly detection, Wu et al. (2021a) mainly focused on the behavior analysis of pedestrians by applying a DAE. They applied pre-trained deep learning models to extract high-level concept and context features for training the DAE, which required a short training time (i.e., within 10 s on the Ped1 and Ped2 datasets running on a computer with the 64-bit Windows 10 OS and equipped with 16 GB DDR4 RAM and an Intel Core i7-9750H CPU at 2.60 GHz) while achieving detection performance comparable with alternative methods.

8.4.2 VAE

Sabokrou et al. (2016) introduced a cascaded anomaly detection model, which detected abnormal events based on the reconstruction error of a standard autoencoder. However, such methods are based on deep reconstruction and simply treat samples that differ from normal samples as anomalies, ignoring the fact that abnormal events occur with small probability; consequently, a large number of normal samples are often misjudged as abnormal, leading to false alarms (Ma 2021). The VAE can provide an effective solution in this respect. A VAE is a directed probabilistic graphical model that describes the latent space in a probabilistic way. Similar to the architecture of traditional autoencoders, a VAE also includes two neural networks: a probabilistic encoder and a generative decoder. The VAE employs a backpropagation algorithm to train the model. Ma (2021) assumed that the distribution of all normal samples complied with a Gaussian distribution, with abnormal samples appearing with a lower probability in this Gaussian distribution. Therefore, Ma (2021) proposed an end-to-end deep learning framework based on the VAE for abnormal event detection. Shi et al. (2021) utilized a GRU as the basic encoder and decoder unit to construct a VAE and used the reconstruction probability as the anomaly score. Sharma et al. (2022) proposed a generalized version of the VAE (Kingma and Welling 2014) framework for abnormality detection.

8.4.3 AEVB

Kingma and Welling (2014) introduced the AEVB algorithm for independent and identically distributed datasets with continuous latent variables per data point. The AEVB is used to form a probability distribution of normal data through probabilistic inference and reconstruction. Yan et al. (2020) adopted a two-stream recurrent AEVB, which provided a semi-supervised solution for abnormal event detection in videos.

8.4.4 DpAE

A simple AE contains one hidden layer between the input and output, whereas a DpAE can have multiple hidden layers. Wang et al. (2020a) built a cascaded DpAE based on the deep autoencoder (Wang et al. 2018b) for detecting abnormal video events.

8.4.5 AAE

Song et al. (2020) combined an attention-based AE with a GAN model, called Ada-Net, to learn normal patterns for abnormal event detection in an unsupervised way. The decoder of their AE is treated as the generator of the GAN. Ada-Net is simple to implement and produces high-quality models. It is also a TensorFlow-based lightweight framework for learning high-quality models automatically with minimal expert interaction. It can be trained in an end-to-end manner and provides a comprehensive framework for learning not only the neural network design but also how to aggregate models to obtain even better results. Le and Kim (2022) proposed an attention-based residual AE for video anomaly detection, which encoded both spatial and temporal information in a unified way.

8.4.6 2D-CAE

A CAE is a variant of CNNs developed for the unsupervised learning of convolution filters. It is a feed-forward multi-layer neural network in which the desired output is the input itself. In the architecture of Yang et al. (2020), the encoder consists of three convolutional layers and two pooling layers, and the decoder consists of three deconvolutional layers and two unpooling layers, forming a symmetric structure. The 2D-CAE reduces dimensionality and learns temporal regularity (Kuen et al. 2015). Yang et al. (2020) used a 2D-CAE to extract the features of input video frames and to compute reconstruction errors. The reconstruction error is relatively small for normal frames, while the error is higher for abnormal frames. The 2D-CAE was also utilized for crowd anomaly detection from surveillance videos (Pawar and Attar 2021). Fan et al. (2020) employed a fully convolutional network, which did not contain a fully connected layer, allowing the encoder–decoder structure to preserve the relative spatial coordinates between the input image and the output feature map. They employed a two-stream network framework to combine appearance and motion anomalies, using RGB frames for appearance and dynamic flow images for motion. Wang and Yang (2022) employed a recurrent-type 2D-CAE to explicitly model the normal dynamics in video sequences for anomaly detection.
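The symmetric encoder–decoder layout described above can be sketched as follows in PyTorch; the channel counts are illustrative rather than those of Yang et al. (2020), and max pooling with plain upsampling stands in for the pooling/unpooling pair.

```python
import torch.nn as nn

class CAE2D(nn.Module):
    """A symmetric 2D convolutional autoencoder in the spirit of the description above."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # first pooling layer
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # second pooling layer
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),            # "unpooling" approximated by upsampling
            nn.ConvTranspose2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.ConvTranspose2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

Training such a model with a mean squared error loss on normal frames, and scoring test frames with the thresholding scheme sketched earlier, reproduces the basic 2D-CAE anomaly detection pipeline.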

8.4.7 3D-CAE

The 2D-CAE struggles to capture temporal information. For example, Hasan et al. (2016) employed two 2D-CAEs and stacked multiple video frames in lieu of the image-channel dimension. They claimed it was more computationally efficient than approaches that adopted sparse coding on large video datasets. However, after the first 2D convolution operation, the temporal information was completely lost, as 2D convolutions are only effective in learning spatial features. To overcome this problem, Asad et al. (2021) used a 3D-CAE for extracting spatiotemporal features. Moreover, Deepak et al. (2021a) applied a residual spatiotemporal autoencoder (R-STAE), which consisted of 3D convolution, deconvolution, and ConvLSTM layers, to learn the patterns of normal activities from surveillance videos. The R-STAE performed unsupervised learning of the spatiotemporal representation of normal patterns and reconstructed them with low error. Deepak et al. (2021b) also applied a 3D spatiotemporal autoencoder (3D-STAE) that consisted of 3D convolutional and ConvLSTM layers to learn the intricate appearance and motion dynamics involved in training videos. Hu et al. (2022) employed an improved unsupervised learning model based on a 3D-CAE for the characteristics of continuous video frames. Their model showed excellent performance in detecting abnormal behavior. However, it did not generalize across scenarios: a model with fixed parameters corresponded to a single scene, and the model had to be retrained whenever the scene changed.

8.4.8 TS-AE

The two-stream hypothesis argues that human beings possess two distinct visual systems (Goodale and Milner 1992). This hypothesis is a widely accepted and influential model of neural vision processing (Eyesenck and Keane 2010). The two-stream convolutional network developed by Simonyan and Zisserman (2014) decomposed videos into spatiotemporal components for action recognition. However, the typical two-stream architecture adopts supervised learning, which requires labeling incoming frames for anomaly detection and is therefore inconsistent with current anomaly detection datasets (Li et al. 2021c). AEs and two-stream networks can perform well. Nevertheless, existing AE frameworks take the original RGB frames as input, and these vary in appearance, such as in color or lighting. To mitigate this challenge, Li et al. (2021c) combined a pre-trained spatial stream and a temporal stream to construct a deep spatiotemporal autoencoder (DSTAE) for anomaly detection. The spatial stream feeds stacked continuous RGB frames into the DSTAE to extract appearance characteristics, whereas the temporal stream feeds stacked continuous optical flow frames into the DSTAE to extract motion patterns.

Xie et al. (2018) used Top-Heavy and Bottom-Heavy 3DCNNs. The Top-Heavy model starts with 2DCNN convolutions and ends with 3DCNN convolutions, whereas the Bottom-Heavy model starts with 3DCNN convolutions and ends with 2DCNN convolutions. The advantage of combining the dissimilar convolution dimensions is obtaining better results with less training data, which is crucial for smaller video surveillance datasets. Moreover, 2D convolutional layers are faster than 3D convolutional layers and less prone to overfitting. For that reason, the Top-Heavy AE is easier to train and is faster and smaller than a classical 3D AE. Accordingly, Esquivel and Zavaleta (2022) employed a Top-Heavy AE model for anomaly detection in video surveillance. Liu et al. (2022b) extracted appearance and motion features from videos; they applied a spatiotemporal-memories-augmented TS-AE framework for exploring the correlations between spatial and temporal patterns to detect video anomalies.

8.4.9 MoAE

Most two-stream-based convolutional networks apply warped optical flow as the source for motion modeling (Tu et al. 2018). Although the motion feature is very useful, the costly computation of optical flow estimation prevents such methods from being used in many real-time applications. Chang et al. (2022b), inspired by Wang et al. (2019), exploited a motion representation that simulates optical flow and is obtained directly from the differences of the RGB values between video frames. They employed both U-Net and a 2DCNN as the backbone of their motion autoencoder to process consecutive video clips for anomaly detection. In a different direction, Feng et al. (2022) designed a MoAE to exploit spatiotemporal features.

8.4.10 A3D-CAE

Sun et al. (2021) adopted denoising reconstruction errors to train a 3D-CAE. Nevertheless, this omitted certain higher-order moments, which might produce extra errors. As a result, they employed an adversarial learning strategy based on GANs to train their AE via an extra discriminator for learning more accurate data distributions. Their 3D-CAE encoder consisted of 3D convolutional layers. Specifically, their A3D-CAE model captured the low-level appearance and motion information that are simultaneously needed for accurate abnormal event detection in videos.

8.4.11 CdAE

Zhao et al. (2017) cascaded the 3D convolution and ConvLSTM operations onto a classical AE for extracting the temporal information of video events. However, they only utilized a single-AE architecture for anomaly detection. Li et al. (2021b) built a better cascaded classifier for video anomaly detection in crowded scenes, which consisted of a spatiotemporal adversarial autoencoder (STA-AE) and a spatiotemporal convolutional autoencoder (ST-CAE). The STA-AE was composed of a classical CAE and a discriminator (Goodfellow et al. 2014) and was intended to make the latent representation of inputs match an arbitrary prior. Both the STA-AE and ST-CAE employed 3D convolution and deconvolution in the encoder and decoder of the convolutional autoencoder, respectively. Three-dimensional convolution and deconvolution enhanced the ability of the CdAE to extract effective patterns from the temporal dimension.

8.4.12 RNN-AE

The optimization of temporally-coherent sparse coding with the sequential iterative soft-thresholding algorithm (Wisdom et al. 2016) is equivalent to a special type of stacked RNN (sRNN). Subsequently, Luo et al. (2021b) proposed to learn an sRNN-AE for both spatial and temporal features to detect video anomalies. Wang et al. (2020b) applied a multi-path encoder–decoder architecture with RNNs and explicitly captured the temporal information in object and semantic motions via the multiple ConvGRUs of different resolutions with non-local blocks (Wang et al. 2018c) for unsupervised video anomaly detection.

8.5 GANs-based methods

Unlike CNNs, GANs are based on a game-theoretic approach. GANs are unsupervised learning algorithms in their ideal form; hence, no labeled data are needed for their training. In 2014, Goodfellow et al. (2014) introduced GANs. The main idea of a GAN is that there are two adversaries, namely a Generator and a Discriminator, which are in a ceaseless battle to achieve more error-free predictions throughout their training process. The generator takes noise as input and generates samples. The discriminator receives samples from both the generator and the training data and must be able to distinguish the two data sources. In the training phase, the generator learns to produce samples that are close to the ground truth, while the discriminator learns to distinguish the generated data from the ground truth. These two networks are trained at the same time, i.e., the generator must not be trained without updating the discriminator. Training a GAN does not require any Monte Carlo approximations (Han et al. 2020). Adversarial learning has shown success in improving image and video generation (Zhou et al. 2020b). A back-propagation algorithm (Hinton et al. 2012) is employed to obtain gradients. No inference is needed during learning, and a wide array of functions can be integrated into the GAN. These properties have made GANs effective models for image generation and video prediction, especially in anomaly detection (Xia et al. 2022). Figure 12 represents the block diagram of a traditional GAN.

Fig. 12: Block diagram of a standard GAN
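The alternating training procedure described above can be summarized by the following PyTorch sketch, in which the generator and discriminator are deliberately tiny placeholder networks (not the architectures of any cited work) and the real samples are assumed to be flattened frames.

```python
import torch
import torch.nn as nn

# Placeholder networks: a 100-dim noise vector is mapped to a 784-dim "frame".
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                                   # real: (B, 784) flattened frames
    z = torch.randn(real.size(0), 100)
    fake = G(z)

    # 1) Discriminator update: tell real samples from generated ones.
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator update: fool the discriminator into labelling fakes as real.
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

In the anomaly detection setting discussed below, such a network is trained only on normal data, and a large generation or discrimination error at test time signals an anomaly.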

8.5.1 Standard-GAN

As a GAN can create data that do not exist in the dataset, the base model can be trained on abundant generated data that closely resemble the actual data. Consequently, the problems produced by the short supply of labeled data are minimized. Wang et al. (2020c) applied the generation error of a generative neural network to detect anomalies. They first trained a GAN to generate normal samples and then judged samples with large generation errors as anomalies.

8.5.2 LeastSquare-GAN

Usually, GANs are known to be successful in creating realistic images and videos. Nevertheless, standard GANs face the vanishing-gradient problem during learning, as they model the discriminator as a classifier with the sigmoid cross-entropy loss function. To overcome this problem, Mao et al. (2017) proposed a modified version of GAN called LeastSquare-GAN. Doshi and Yilmaz (2021b) employed the LeastSquare-GAN (Mao et al. 2017) for detecting online anomalies in surveillance videos.

8.5.3 Relativistic-GAN

Zhang et al. (2021b) utilized an autoencoder as a generator. To avoid the vanishing-gradient problem when combining it with the self-attention mechanism (Vaswani et al. 2017), they adopted a discriminator based on a relativistic GAN. In the training phase, the generator produced a future frame based on the historical clips of a video, and the predicted future frame together with its ground truth was fed into the discriminator. If the requirement of the discriminator was not met, training continued until the generator produced a frame that sufficiently confused the discriminator. In the testing phase, they computed the error between the future frame generated by the generator and its ground truth. If the error was greater than a known threshold, the generator had failed to predict the development of the event, and the frame was marked as an abnormal event.

8.5.4 NM-GAN

Chen et al. (2021a) proposed an end-to-end pipeline called NM-GAN for video anomaly detection. Their NM-GAN combined the reconstruction-based (Ribeiro et al. 2018) and GAN-based approaches (Sabokrou et al. 2018). Although the reconstruction-based approach is criticized for the uncertainty it brings to unobserved samples, it is more conducive to obtaining results in real-world applications (Chen et al. 2021a).

8.5.5 DE-GAN

A double-encoder network enables the deep learning model to generate images of the underlying representation in the training phase. Han et al. (2020) proposed a DE-GAN architecture to detect abnormal crowd events. They removed the fully connected hidden layers to obtain a deeper architecture and used ReLU activation in the generator for all layers except the output. In the discriminator, Tanh and Leaky ReLU activations were employed for all layers.

8.5.6 PatchGAN

PatchGAN (Isola et al. 2017) is a type of discriminator for GANs that only penalizes structures at the scale of local image patches. A standard GAN discriminator maps an input image to a single scalar in the range [0,1] representing the probability that the image is real or fake, while PatchGAN (Isola et al. 2017) outputs a matrix in which each element signifies whether its corresponding patch is real or fake (Tang et al. 2020). For pixel-level tasks, Tang et al. (2020) followed the PatchGAN discriminator (Isola et al. 2017) to predict the broad locations of abnormal events. Zhang et al. (2022b) also employed PatchGAN (Isola et al. 2017) to generate high-quality frames.
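A hedged sketch of such a patch-level discriminator is shown below; because it is fully convolutional, its output is a grid of scores rather than a single scalar. The layer widths are illustrative and do not reproduce the exact configuration of Isola et al. (2017).

```python
import torch
import torch.nn as nn

# Fully convolutional discriminator: every element of the output map scores one local patch.
patch_discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=1),  # (B, 1, H', W') patch scores (logits)
)

scores = patch_discriminator(torch.randn(1, 3, 256, 256))   # e.g. -> torch.Size([1, 1, 63, 63])
```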

8.5.7 P-GAN

Anomaly detection methods based on frame prediction (e.g., Luo et al. 2017b) usually use a few previous frames to predict the target frame. Compared with frame reconstruction methods (e.g., Sabokrou et al. 2018), frame prediction methods consider the anomaly not only in appearance and location but also in motion. Inspired by this advantage, rather than reconstructing training data for anomaly detection, Liu et al. (2018b) identified abnormal events by comparing them with their expectations, introducing a future video-frame prediction-based anomaly detection model referred to as P-GAN (Liu et al. 2018b), which was also applied in Chen et al. (2020a). P-GAN adopted U-Net as the generator of the GAN to predict future frames (Liu et al. 2018b). Park et al. (2020) applied four successive video frames to predict the fifth frame using P-GAN (Liu et al. 2018b) in their unsupervised anomaly detection model. Zhong et al. (2022) also employed a kind of P-GAN (Liu et al. 2018b) to detect anomalies in videos.

8.5.8 C-GAN

Mirza and Osindero (2014) introduced conditional versions of GANs. The data modes generated by a GAN cannot be controlled (Goodfellow et al. 2014), whereas the C-GAN (Isola et al. 2017) involves the conditional generation of images by a generator model. Vu et al. (2021) utilized four C-GANs to generate multi-type future-appearance and motion information for anomaly detection in surveillance videos. Cai et al. (2021) applied a C-GAN (Isola et al. 2017) to optimize predicted-image generation; compared with ConvLSTM, ConvGRU has fewer parameters with a similar structure and modeling effect, so the ConvGRU module was chosen for temporal modeling. Ganokratanaa et al. (2022) proposed a variation of the C-GAN to enhance the accuracy and quality of the synthesized images.

8.5.9 BR-GAN

Yang et al. (2021) developed a BR-GAN for anomaly detection in videos. Their model consisted of a generator and two discriminators. As CAE models are trained separately, they are unable to learn the relation between different local information (e.g., appearance and gradient) (Roy et al. 2020). Considering this issue, Roy et al. (2020) proposed an unsupervised two-staged object-centric GAN for local anomaly detection in videos. The first stage of their method learns the normal local gradient appearance correspondences, and the second stage learns to classify events in an unsupervised manner.

8.6 U-Net-based methods

A U-Net is a U-shaped structure transformed from a fully convolutional network (Long et al. 2015). The first half of the network is used for feature extraction, which is similar to the VGG network (Russakovsky et al. 2015) in structure, and the second half is used for up-sampling. In 2015, Ronneberger et al. (2015) proposed the first classical U-Net for biomedical image segmentation. In general, the frame prediction methods outperform the frame generation methods.

Fig. 13: Block diagram of a traditional U-Net

Figure 13 shows a traditional U-Net model in which the kernel sizes of all convolution and deconvolution layers are configured as \(3\times 3\) and the max pooling layers as \(2\times 2\). The U-Net contains a contraction path and an expansion path. Skip connections act as shortcuts between high-level and low-level layers, keeping the output resolution unchanged for each pair of convolution layers, which reduces gradient vanishing and increases information symmetry.
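A minimal one-level U-Net, written as a hedged PyTorch sketch with illustrative channel counts (it is not the configuration of Ronneberger et al. 2015), makes the contraction path, expansion path, and skip connection explicit.

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 convolutions, as in the contraction/expansion blocks described above.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    """A one-level U-Net sketch: contraction, expansion, and a skip connection."""
    def __init__(self):
        super().__init__()
        self.down = double_conv(3, 64)
        self.pool = nn.MaxPool2d(2)                     # 2x2 max pooling
        self.bottom = double_conv(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.decode = double_conv(128, 64)              # 128 = 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, 3, 1)

    def forward(self, x):
        d = self.down(x)
        b = self.bottom(self.pool(d))
        u = self.up(b)
        u = torch.cat([d, u], dim=1)                    # skip connection from encoder to decoder
        return self.head(self.decode(u))
```

Deeper variants simply repeat the down/up pair, which is why the structure lends itself so naturally to the frame prediction and reconstruction tasks discussed next.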

U-Net plays a vital role in frame prediction because the consecutive frames of one clip of surveillance video usually have the same background and a similar foreground (Chen et al. 2020a). Although U-Net architecture can be widely employed for the tasks of reconstruction and frame prediction (Liu et al. 2018b), the skip connections in it may not be able to extract salient features from the video frames to a great extent for the reconstruction task (Park et al. 2020).

8.6.1 Classical U-Net

Considering the challenges posed by the insufficient utilization of motion patterns, which results in instability across different datasets, Chen et al. (2020a) proposed a U-Net-based (Ronneberger et al. 2015) bidirectional prediction model for anomaly detection. Many AE-based anomaly detection approaches assume that the AE will be unable to accurately reconstruct anomalous regions. However, in practice, neural networks generalize to anomalies and reconstruct them efficiently, resulting in reduced detection capabilities. Accurate reconstruction is unlikely if the anomaly pixels are not visible to the AE (Zavrtanik et al. 2021). To address this issue, Zavrtanik et al. (2021) used a U-Net-based (Ronneberger et al. 2015) encoder–decoder network to reconstruct the removed regions. Yu et al. (2020) adopted U-Net as the basic network architecture of generative DNNs, which were optimized by the default Adam optimizer (Kingma and Ba 2015) in PyTorch (Paszke et al. 2019). Zhang et al. (2021c) designed a U-Net-based (Ronneberger et al. 2015) future frame prediction model with two branches: one acted as the discriminator of a GAN and the other served as an encoder. During the testing phase, if a frame agreed with its prediction in both the image and latent spaces, there was a high probability it was a normal event; otherwise, it was likely to be an abnormal event. Lu et al. (2020) applied U-Net to predict the future frame and passed the prediction to a ConvLSTM to retain the information of the previous steps. To learn high-level features, Leroux et al. (2022) applied a U-Net-type AE that was trained to reconstruct individual input frames. Park et al. (2022a) employed a U-Net with skip connections between the encoder and the decoder, which boosted the generation ability by preventing the vanishing-gradient problem and thereby achieving information symmetry. Alafif et al. (2022) adopted a U-Net to detect abnormal behavior in Hajj-like massive crowd videos.

8.6.2 Multi-scale U-Net

Saypadith and Onoye (2021) employed inception modules and residual skip connections inside their framework, named multi-scale U-Net, to learn higher-level features. The multi-scale U-Net was utilized within a GAN to extract spatial features, with inception modules inside the multi-scale U-Net enabling the network to learn higher-level image features. Finally, the multi-scale U-Net reduced the number of training and testing parameters while considerably improving the anomaly detection accuracy.

8.6.3 ST-U-Net

Wang et al. (2021b) were motivated by the prevalent cloze test and by learning neural network architectures and ensembles for improved performance. Therefore, they proposed an approach named visual-cloze-completion for video anomaly detection. Explicitly, they designed an ST-U-Net to realize their proposed visual-cloze-completion model. Compared with the standard U-Net, the ST-U-Net incorporates a recurrent network structure to accumulate temporal context information in ST-cubes and produce high-level feature maps that learn richer video semantics.

8.6.4 VQ-U-Net

The VQ-U-Net builds on the vector quantization module (van den Oord et al. 2017), in which the encoder network outputs discrete rather than continuous information. To learn discrete representations of video by predicting future frames, Szymanowicz et al. (2022) mainly employed a baseline U-Net. They formed a VQ-U-Net by quantizing the output of the encoder, which then served as the input to the decoder. The VQ-U-Net can produce high-quality saliency maps.

8.6.5 Non-local U-Net

A non-local operation supports inputs of variable sizes and maintains the corresponding size in the output. The non-local operations can directly capture remote dependencies, and they can also improve the correlation of distant pixels for gaining a richer feature map. Wang et al. (2018c) wrapped the non-local operation into a non-local block.

Fig. 14: A space-time non-local block

Figure 14 illustrates a space-time non-local block (Wang et al. 2018c) with the embedded Gaussian. The input feature maps are presented as tensors with the shape \(T\times H\times W\times C\), i.e., the input dimension of X is \(T\times H\times W\times C\). The red boxes denote \(1\times 1\times 1\) convolutions. This space-time non-local block is comparable with the block in ResNet; consequently, the non-local operation can be smoothly inserted into existing network structures. A convolution with a kernel size of \(1\times 1\times 1\) is performed to obtain the outputs of the three branches (\(\theta\), \(\varphi\), and \(\textbf{g}\)) with the dimension \(T\times H\times W\times (C/2)\). The outputs of these three branches, with dimension \(THW\times (C/2)\), are obtained through a tensor-to-matrix conversion (reshaping) process. The output of dimension \(T\times H\times W\times C\) from the final \(1\times 1\times 1\) convolution layer and the original input of dimension \(T\times H\times W\times C\) are summed element-wise to obtain the final output Z. A new non-local block can be inserted without changing the initial behavior of any pre-trained model (Wang et al. 2018c). To this end, Zhang et al. (2022b) and Sharif et al. (2023b) applied 3 and 4 non-local blocks, respectively, in their U-Net models for surveillance-video anomaly detection.
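The following PyTorch sketch mirrors the shapes in Fig. 14 (PyTorch uses channels-first tensors, so \(T\times H\times W\times C\) becomes (B, C, T, H, W)); it is a minimal embedded-Gaussian non-local block and omits details such as batch normalization and the zero-initialization of the output convolution that allows insertion into a pre-trained model without changing its initial behavior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock3D(nn.Module):
    """Space-time non-local block (embedded Gaussian), following the shapes in Fig. 14."""
    def __init__(self, channels):
        super().__init__()
        self.inter = channels // 2
        self.theta = nn.Conv3d(channels, self.inter, 1)      # 1x1x1 convolutions
        self.phi = nn.Conv3d(channels, self.inter, 1)
        self.g = nn.Conv3d(channels, self.inter, 1)
        self.out = nn.Conv3d(self.inter, channels, 1)

    def forward(self, x):                                     # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)      # (B, THW, C/2)
        phi = self.phi(x).flatten(2)                          # (B, C/2, THW)
        g = self.g(x).flatten(2).transpose(1, 2)              # (B, THW, C/2)
        attn = F.softmax(theta @ phi, dim=-1)                 # pairwise similarities over all positions
        y = (attn @ g).transpose(1, 2).reshape(b, self.inter, t, h, w)
        return x + self.out(y)                                # element-wise summation with the input
```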

8.7 YOLO-based methods

Object detectors can be grouped into single-stage and two-stage detectors. Single-stage detectors (e.g., SSD (Liu et al. 2016) and YOLO (Redmon et al. 2016)) solve a simple regression problem and directly output the bounding box coordinates, whereas two-stage detectors (e.g., Faster R-CNN (Ren et al. 2017)) first employ an RPN to create regions of interest and then perform object classification and bounding box regression. YOLO analyzes the image only once to detect diverse objects, and it is also very fast.

8.7.1 YOLOv1

YOLO (Redmon et al. 2016) is a pre-trained object detection system (Doshi and Yilmaz 2020a). It is capable of processing higher fps on a GPU while providing the same or even better accuracy compared with ResNet or SSD (Liu et al. 2016). As the whole detection pipeline is a single network, it can be directly optimized end-to-end on detection performance.

8.7.2 YOLOv3

The YOLOv3 detection rate is 30 fps (Bai et al. 2019), whereas the base YOLO model processes images in real-time at 45 fps (Redmon et al. 2016). Both appearance and motion information, along with modeling latent distributions, are all-important for precisely detecting video anomalies. Shine et al. (2020) developed a method for selecting anomaly candidates by scrutinizing 14 background frames per video using YOLOv3 (Redmon and Farhadi 2018). Doshi and Yilmaz (2020b) employed YOLOv3 (Redmon and Farhadi 2018) for anomaly detection in traffic videos, as speed is an important consideration when detecting traffic-surveillance anomalies. Doshi and Yilmaz (2020a) also employed YOLOv3 (Redmon and Farhadi 2018) to obtain location and appearance features when detecting objects in real-time from surveillance videos. Georgescu et al. (2020) utilized YOLOv3 (Redmon and Farhadi 2018) to detect objects, and then trained a 3DCNN to produce discriminative anomaly-specific information by jointly learning multiple proxy tasks. Ouyang and Sanchez (2020) mainly performed image patch generation, encoding/decoding via two DAEs, density estimation, and anomaly inference to detect video anomalies. The YOLOv3 (Redmon and Farhadi 2018) detector was applied to extract patches from the current frame.

8.7.3 YOLOv4

Doshi and Yilmaz (2022a) used the YOLOv4 (Bochkovskiy et al. 2020) pre-trained object detector to extract the bounding box (location) and class probabilities (appearance) for each object detected in a given frame. Explicitly, they adopted the extracted bounding boxes to construct a feature embedding to demonstrate the spatiotemporal activities observed in the frame (Doshi and Yilmaz 2022b). Shortly after the release of YOLOv4, YOLOv5 was introduced using the Pytorch framework (Ultralytics 2020). YOLOv5 outperformed v3 and v4 in terms of ACC using the Microsoft COCO dataset (Lin et al. 2014). YOLOv5 was used for miscellaneous computer vision tasks. For example, YOLOv5 (Ultralytics 2020) showed considerable performance with impressive interpretability on drone-captured scenarios (Zhu et al. 2021).

8.8 AN-based methods

Attention mechanisms can quickly extract important features from small amounts of data (Zhang et al. 2021b). An attention-based model enables the neural network to dynamically shift or select attributes so that the overall decision-making is more reliable (Zhou et al. 2020b). Currently, attention-based approaches involve a variety of vision-based applications, including segmentation (Chen et al. 2016b) and image classification (Wang et al. 2017; Zou et al. 2022).

8.8.1 Attention-driven loss

Zhou et al. (2020b) built an attention map by combining a mask map and the background for anomaly detection in video surveillance. They constructed a generative model using U-Net, which avoided the vanishing-gradient problem along with information imbalance, and then they utilized a patch discriminator. In the discriminator, each output scalar of the discriminative network corresponded to a patch of an input image. Chang et al. (2022a) adopted an attention consistency loss for video anomaly detection. Their attention module boosted anomalous prediction; consequently, their model demonstrated improved frame-level AUC over the state-of-the-art methods. Gong et al. (2022) followed the weakly supervised video anomaly detection paradigm and applied a dynamic multiple-instance learning loss to enlarge the inter-class distance between anomalous and normal instances.

8.8.2 Self-attention

As crowd sizes become larger, traditional detection-based approaches perform poorly in occluded environments. To address this limitation, Sarathi Das et al. (2020) developed an attention-based deep learning framework by adopting a multi-column CNN architecture (Zhang et al. 2016b). First, they generated basic detection-based sitting and standing density maps to capture local information. Thereafter, they created a crowd-counting-based density map as a global counting feature. Finally, they performed a cross-branch segregating refinement phase that separated the crowd density map into final sitting and standing density maps using an attention mechanism (Hossain et al. 2019). Wang et al. (2021a) proposed a video anomaly detection algorithm based on future frame prediction using a GAN and a self-attention mechanism (Oktay et al. 2018). A U-Net model was modified with an added attention module for the generator. A Markov GAN discrimination model with a self-attention mechanism was used as the discriminator, which can influence the generator and improve the generation quality of the future video frame. The main difference between their model and the classical U-Net is that, in each layer, they kept the feature-map size unchanged across two adjacent convolution operations. Normally, direct implementations of the self-attention sliding-window strategy are costly. However, cyclic shifting is cheaper, although the system becomes more sophisticated (Goled 2022).
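For reference, the scaled dot-product self-attention underlying these mechanisms can be written in a few lines; the sketch below is generic (the tokens could be flattened frame patches or feature-map positions) and does not reproduce any specific cited architecture.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (N, D) tokens; w_q, w_k, w_v: (D, D) learned projection matrices.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # pairwise similarities, scaled
    return F.softmax(scores, dim=-1) @ v                       # attention-weighted mixture of values
```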

8.9 Hybrid networks

A hybrid network is a combination of two or more networks that perform a task more effectively than when they work separately.

8.9.1 GAN-AE-3DCNN

Because the GAN discriminator extracts features only to classify real and fake data, it may fail to extract representative features of the data itself (Kim et al. 2018). To this end, Shin et al. (2020) combined a GAN (Shin and Cho 2018), an AE, and a 3DCNN for robust anomaly detection in video surveillance. They converted the GAN generator into an AE model based on a 3DCNN and used the encoder of the AE as the base model.

8.9.2 U-Net-AE-GAN

Li et al. (2022a) integrated the strengths of the U-Net (Kohl et al. 2018), conditional variational AE (Sohn et al. 2015), and Wasserstein-GAN (Arjovsky et al. 2017) models to build a probabilistic latent variable model for detecting variational abnormal behaviors. Dong et al. (2020) proposed a semi-supervised approach with a dual-discriminator-based GAN structure for video anomaly detection. They applied a generator implemented on U-Net for predicting future frames, one discriminator for discriminating frames (appearance), and another for discriminating between true and false motion. They implemented the discriminators with PatchGAN (Isola et al. 2017).

8.9.3 ConvLSTM-GAN-AE

Ji et al. (2020) considered both motion and appearance for video anomaly detection. In the motion generative branch, the corresponding optical flow map was generated by a ConvLSTM-based CGAN (Isola et al. 2017) from consecutive frames to learn normal motion patterns. To learn appearance patterns, consecutive frames were reconstructed by an AE (Shi et al. 2021) in the reconstruction branch. In a different direction, Wang et al. (2022) employed both a ConvLSTM and a VAE for video anomaly detection.

8.10 Sundry networks

Sundry networks consist of a variety of models (e.g., GCN, RPN, TNN, Q-CNN, HTM, MLAD, and MatchNet), which work separately to solve the crowd anomaly detection problem.

8.10.1 GCN

Although it is difficult to apply CNNs to graphs due to their arbitrary sizes, CNNs can be generalized to GCNs. A GCN consists of several convolutional and pooling layers for feature extraction, followed by final fully connected layers. Convolutions on graphs are defined through the graph Fourier transform. The key idea of a GCN is to take a weighted average of the features of all neighboring nodes, including the node itself, with lower-degree nodes receiving larger weights, and then to feed the resulting feature vectors through a neural network for training, as sketched below. Several deep learning models (Sun et al. 2020; Dengxiong et al. 2021; Luo et al. 2021a) have been successfully developed to extract and predict information on data graphs for detecting video anomalies. Sun et al. (2020) built a GCN-based (Kipf and Welling 2016) deep Gaussian mixture model (Zong et al. 2018) for normal and abnormal scene clustering in videos. Dengxiong et al. (2021) suggested a GCN-based method to localize abnormal videos. Luo et al. (2021a) applied a ResNet mechanism on spatiotemporal GCNs for video anomaly detection. Cao et al. (2022) proposed a weakly supervised adaptive GCN to model the contextual relationships among video segments. As the adjacency matrix of a GCN is a hand-crafted prior, a GCN baseline can achieve promising results (Wu et al. 2021b).
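The neighborhood-averaging rule just described corresponds to one graph convolution layer. Below is a hedged PyTorch sketch of such a layer in the spirit of Kipf and Welling (2016); the dense adjacency matrix, toy graph, and single-layer setup are simplifications for illustration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: symmetrically normalized neighborhood averaging
    (self-loops included) followed by a learned linear map."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x, adj):                        # x: (N, F) node features, adj: (N, N)
        a_hat = adj + torch.eye(adj.size(0))          # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)       # lower-degree nodes receive larger weights
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return torch.relu(self.linear(a_norm @ x))    # aggregate neighbors, then transform

x = torch.randn(5, 16)                                # e.g. 5 video-segment nodes, 16-dim features
adj = torch.ones(5, 5) - torch.eye(5)                 # fully connected toy graph
out = GCNLayer(16, 8)(x, adj)                         # (5, 8) node embeddings
```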

8.10.2 RPN

An RPN is a fully convolutional network that concurrently forecasts object bounds and objectness scores at each position. The RPN is trained end-to-end to create high-quality region proposals. Ren et al. (2017) merged RPN and Fast Regional-CNN (Girshick 2015) into a single network by sharing their convolutional features for real-time object detection. To analyze abnormal behavior, Hu et al. (2020a) detected and localized all objects (e.g., pedestrians and vehicles) from the crowded scene by using the RPN-based model created by Ren et al. (2017).

8.10.3 SCN

Sparse-coding-based anomaly detection aims to learn a dictionary that encodes all normal events with small reconstruction errors (Zhao et al. 2011; Lu et al. 2013a); an abnormal event is then associated with a large reconstruction error, although this formulation does not consider the temporal coherence between neighboring frames within normal or abnormal events. Wu et al. (2020b) proposed a two-stream neural network to extract spatiotemporal fusion features in hidden layers. With these features, they employed a fast sparse coding network to build a normal dictionary for anomaly detection.

8.10.4 HTM

The HTM is a bio-inspired machine intelligence model introduced by Hawkins and Blakeslee (2005). It aims to lay out a theoretical framework for understanding the neocortex by capturing its structural and algorithmic properties. It makes a prediction by matching new inputs with previously learned patterns to differentiate between normal and abnormal patterns (Hawkins et al. 2010). A typical HTM network is a tree-shaped hierarchy of levels. Currently, HTM is predominantly applied to anomaly detection in streaming data. For example, Bamaqa et al. (2020) introduced an HTM-based anomaly detection framework for detecting anomalies in crowds to avoid suspicious situations (e.g., overcrowding or potential accidents).

8.10.5 MLAD

Vu et al. (2019) introduced MLAD using multi-level representations of both intensity and motion data. Typically, MLAD is composed of two DAEs (Vincent et al. 2008) for extracting high-level representations of low-level data (e.g., pixel intensity and optical flow features) and two CGANs (Isola et al. 2017) for detecting anomalous objects at each representation level. MLAD's training and testing code is publicly available (Vu et al. 2019). It can produce pixel-wise detection, while many existing models only provide frame-wise detection. Wu et al. (2021c) chose MLAD (Vu et al. 2019) as the baseline model for video anomaly detection. However, MLAD-based methods are not sufficient for analyzing motion in videos, which may also create optical flow maps of abnormal events (Ji et al. 2020).

8.10.6 MatchNet

MatchNet (Han et al. 2015) is a unified model that combines learning feature representations and comparison functions. It consists of a deep convolutional network that extracts features from patches and a network of three fully connected layers that computes the similarity between the extracted features. In a similar spirit to MatchNet (Han et al. 2015), Ramachandra et al. (2021) used a source set of labeled video anomaly detection datasets to generate similar and dissimilar video patch pairs. The labeled datasets used to generate training examples should be disjoint from the target video anomaly detection dataset on which testing will eventually be performed.

8.10.7 TNN

Vaswani et al. (2017) proposed the TNN to solve sequence-to-sequence tasks while handling long-range dependencies with ease. TNNs utilize attention layers as their core building blocks. Since the advent of Transformers in natural language processing (NLP), the computer vision community has been focusing intently on how to employ Transformers in vision-based applications. The Vision Transformer (ViT) introduced by Dosovitskiy et al. (2020) was an architecture directly inherited from NLP (Vaswani et al. 2017), but it was employed for image classification with raw image patches as input (Touvron et al. 2021b). Recently, the range of Transformer applications in the computer vision domain has broadened (Sharif et al. 2023a).

Existing anomaly detection approaches seldom explicitly consider local consistency at the frame level in addition to the global coherence of temporal dynamics in video sequences. To this end, Feng et al. (2021b) presented a convolutional transformer to perform future frame prediction based on past frames for video anomaly detection. It consisted of three key components, i.e., a convolutional encoder to capture the spatial information of the input video clips, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to integrate spatiotemporal features and predict the future frame. Yuan et al. (2021) modified the ViT (Dosovitskiy et al. 2020) to make it capable of video prediction. Then, they combined U-Net and ViT to capture richer temporal information and more global context for video anomaly detection. Zhang et al. (2022a) designed a transformer-enabled feature encoder to convert input task-agnostic features into discriminative task-specific features by mining the semantic correlation and position relation between video snippets. Their model could perform more accurate anomaly detection for the current video snippet based on the learned discriminative features.

8.10.8 Q-CNN

Quantum computing is a new computational paradigm that underlies quantum deep learning. Combining quantum computing with artificial intelligence tools and algorithms promises easier handling of huge volumes of data. A CNN is challenging to train efficiently if the dimensionality of the data or the model becomes too large. Q-CNN extends the key features and structures of existing CNNs to quantum systems (Oh et al. 2020). Quantum computing remains challenging in this respect, as it is hard to implement non-linearities with quantum unitaries (Schuld et al. 2014). Tang (2018) presented quantum-inspired classical algorithms for principal component analysis and supervised clustering. Recently, Blekos and Kosmopoulos (2021) developed a quantum counterpart of the 3D-CNN for quantum video classification. It was based on efficiently computing the difference between successive video frames on a quantum device and then training a quantum convolutional neural network in which the convolution operation is replaced with quantum inner-product estimation (Kerenidis et al. 2020).

8.11 A synthesis of key points

8.11.1 CNN is the best performing model in 2020–2022

Here, we show which models (listed in Fig. 6) produced the highest average performance (i.e., ACC and AUC scores) in recent years (e.g., during the years 2020–2022) for crowd anomaly detection. Figure 15 plots the average ACC and AUC values by considering the taxonomy in Fig. 6 and the values of ACC and AUC from Table 6. Figure 15a and Fig. 15b show that, on average, the CNN-based models have the highest average ACC and AUC among all models in Fig. 16.

Fig. 15: Group-wise comparison of average ACC and AUC considering Fig. 6 and Table 6. The CNN-based models possessed the maximum average ACC and AUC among all existing deep learning models

CNNs, constructed from layers of artificial neurons, compute weighted sums of their inputs to produce outputs in the form of activation values. Various methods and techniques, including quantization and pruning, have been developed to address the issue of CNN complexity. When training a CNN-based model, the central challenges include overfitting, exploding gradients, and class imbalance. In spite of that, CNNs are well-suited for computer vision and image classification problems. The possible reasons include the following:

  • It is fast to implement;

  • It accepts data of any dimensionality (Casalegno 2021);

  • It is good for extracting local and position-invariant features. A well-trained CNN can detect an object from an image even if it is smaller, larger, rotated, or translated from the original image;

  • It is a translation equivariant architecture with shared kernel parameters (Gudovskiy et al. 2022);

  • It automatically detects the important features, thus decreasing the human effort required to develop its functionalities;

  • The number of parameters needed for training can be reduced without sacrificing performance;

  • It accepts pixel values to output various visual features;

  • As CNN has feature-parameter sharing and dimensionality reductions, the number of parameters is reduced. Consequently, it is computationally efficient;

  • Convolutional layers of CNN take advantage of the inherent image properties.

8.11.2 CNN is the most popular model in 2020–2022

The taxonomies in Fig. 6 and Table 6 show the most frequently used models from 2020 to 2022 for crowd anomaly detection. Considering Equation (16), Fig. 6, and Table 6, Figs. 16 and 17 demonstrate the usage frequency of individual and group-wise deep crowd anomaly detection methods, respectively. It is noticeable that the pre-trained 2DCNN-based models have become the most frequently used models. From a group-wise view, CNN-based models emerged as the most frequently used models, and AEs took second place.

Fig. 16: Individual-basis comparison of usage frequency. The pre-trained 2DCNN-based models became the most commonly used models during the years 2020–2022 for crowd anomaly detection

Fig. 17: Group-wise comparison of usage frequency. The CNN-based models were the most frequently used model group during the years 2020–2022 for crowd anomaly detection

CNNs are supervised models, whereas AEs are unsupervised models. AEs can be trained on unlabeled data. If the data are labeled, then the CNN-based model can be employed for better performance.

8.11.3 Can CNNs overcome biased datasets?

Deep learning models can accomplish tasks rapidly; however, this does not mean they always do so reasonably. What would be the consequence if the datasets used to train deep learning models contained biased data? The deep learning models would likely exhibit the same bias when they make decisions in practice. For instance, if a dataset contains mostly images of white men, then a facial-recognition model trained with these data may be less accurate for women or people with different skin tones (News 2022). In 2021, a team of researchers (Madan et al. 2021) from MIT, Harvard University, and Fujitsu Ltd sought to understand when and how a deep learning model is capable of overcoming this kind of dataset bias. Basically, they posed the following question (Madan et al. 2021): “Can a network shown only the Ford Thunderbird from the front and the Mitsubishi Lancer from side generalize to classify the category and viewpoint for a Thunderbird seen from the side?”. Explicitly, if each Ford Thunderbird in the training dataset is shown from the front and the trained model is then given an image of a Ford Thunderbird from the side, the result may be a misclassification, even if the model was trained on hundreds of thousands of car images. The researchers (Madan et al. 2021) used an approach from neuroscience to study how training data affect whether an artificial neural network can learn to recognize objects it has not seen before. Their results showed that the generalizability of a deep learning model is influenced by both the diversity of the data and the way the model is trained. Can CNNs perform well as training data diversity grows? They demonstrated that CNNs generalize better to out-of-distribution category-viewpoint combinations (i.e., combinations not seen during training) as the training data diversity grows (Madan et al. 2021).

8.11.4 Can transformers outperform CNNs?

Trockman and Kolter (2022) scrutinized the ViT. They noticed that the Transformer-based models (e.g., ViT) might exceed the performance of CNNs in some settings. Nevertheless, due to the quadratic runtime of the self-attention layers in Transformers, ViTs demand the use of patch embeddings. To this end, they posed the following question (Trockman and Kolter 2022): “Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?” They showed some evidence for the latter. Specifically, they introduced an extremely straightforward model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer (Tolstikhin et al. 2021), called ConvMixer. For their ConvMixer, they used only standard convolutions to carry out the mixing steps. They claimed that despite the simplicity of their ConvMixer, it outperformed both the ViT and MLP-Mixer. They further stated that ConvMixer was competitive compared with ResNets, DeiTs (Data-efficient image Transformers) (Touvron et al. 2021b) and ResMLPs (Residual Multi-Layer Perceptrons) (Touvron et al. 2021a).

8.11.5 Are CNNs and transformers complementary technologies?

In 2022, a team of researchers (Liu et al. 2022c) from Facebook AI Research (FAIR) and UC Berkeley studied the differences between CNNs and Transformers to discover the confounding variables while comparing their network performances. As per the team, the objective of their study was to “bridge the gap between the pre-ViT and post-ViT eras for ConvNets, as well as to test the limits of what a pure ConvNet can achieve.” The team proposed a family of pure CNNs called ConvNeXt. The team found that ConvNeXts could compete well with Transformers in terms of accuracy, robustness, and scalability. In addition, the ConvNeXt possesses the efficiency standard of CNNs. Due to the fully-convolutional nature of training and testing, its implementation is easy. As computer vision applications are very diverse, the ConvNeXt may be more suited for some tasks in computer vision, while Transformers may be more flexible for other tasks. Specifically, Transformers may be more flexible when employed for tasks demanding discretized, sparse, or structured outputs (Liu et al. 2022c).

8.11.6 Are we shifting from CNN to transformer technologies?

Steadily, Transformers have begun to catch up with CNNs for computer vision tasks. One of the key reasons why people are choosing hierarchical Transformers over CNNs is the poor scalability of CNNs, with multi-head attention being the key enabling component (Goled 2022). Carion et al. (2020) proposed the DEtection TRansformer (DETR), which mainly consists of a set-based global loss that makes unique predictions via bipartite matching and a Transformer encoder–decoder architecture. Despite his own invention of LeNet (i.e., an early version of the CNN), Yann LeCun has expressed a preference for the DETR architecture (Goled 2022). Are we on the verge of shifting from CNN to Transformer technologies? The debate between CNNs and Transformers is ongoing.

8.11.7 Will the quantum computing be the next paradigm?

Quantum computing is one of the fastest-growing technologies. Currently, the computing capacities of traditional computers restrict the computational capabilities of deep learning algorithms. Quantum computing can process vast datasets at much faster speeds and feed data to artificial intelligence technologies, which can examine data at an ultra-fine level to find diverse patterns and anomalies. A qubit is the basic unit of quantum information; it is the quantum-computing counterpart of the bit (binary digit) of classical computing. In 2021, IBM and QuEra Computing were among the first businesses to build quantum computers with more than 100 qubits (IBM 2022). One of the key attractions of quantum computers is that they may allow more sophisticated analyses and larger deep learning models. Compared with traditional computers, quantum computers can use data more efficiently, allowing researchers to gain a better understanding of their working data and models. In the literature, there is a common consensus that quantum computers will assist in solving previously intractable problems, notably in the areas of data science and artificial intelligence (Abdelgaber and Nikolopoulos 2020; Mahmud et al. 2022; Ayoade et al. 2022; Perez et al. 2022; Arafath and Kumar 2023).

9 Architectural impacts of 2DCNN models

Because 2DCNN-based models were the most popular anomaly detection approaches from 2020 to 2022 (see Fig. 16), we studied their architectural impacts on various crowd datasets for anomaly detection. The architectural influence of pre-trained CNN models on video anomaly detection is not a new research area (Pang et al. 2020; Al-Dhamari et al. 2020; Gutoski et al. 2021). For example, Pang et al. (2020) employed ResNet50 to exclusively examine appearance-based anomalies. Without counting RTM, they compared the performance of ResNet50, VGGNet, and 3DCNN for Ped1-Ped2 in the UCSD, Entrance-Exit of Subway, and Minnesota (2022) datasets. In a different vein, using only Ped1 and Minnesota (2022) datasets, and without counting RTM, Al-Dhamari et al. (2020) argued that VGGNet19 had the highest detection ACC among the GoogleNet, ResNet50, AlexNet, and VGGNet16 models. Gutoski et al. (2021) supported this position by taking 12 pre-trained CNN models on ImageNet as feature extractors and then employing the obtained features to seven video anomaly detection benchmark datasets. Without examining RTM, they performed a simple statistical analysis of their results. Nevertheless, they reached a promising conclusion that the architectural differences are negligible when choosing a pre-trained model to detect video anomalies.

In this section, we performed an experiment similar to that of Gutoski et al. (2021) by additionally considering the RTM of six pre-trained 2DCNN models (see Table 6) and conducting rigorous statistical analyses. We took RTM into account because software performance depends on the features of the computing environment, including RTM (Schmidt 2013).

9.1 Essential techniques

The pre-trained 2DCNN models in Table 6 do not necessarily provide sufficient performance scores on their own. Today, numerous pre-trained 2DCNN models exist, which can transfer their learning from the ImageNet dataset to other models. In a CNN, each layer has two types of parameters, namely weights and biases. Table 1 presents the essential information of the six pre-trained 2DCNN models we used (Keras 2021). In a deep network model, parameters (e.g., batch size, learning rate, etc.) and functions (e.g., activation, optimization, and loss functions) should be chosen appropriately (Wu et al. 2021c). Consequently, we compared the performance of the pre-trained 2DCNNs by employing them as feature extractors and then training an OCSVM (Scholkopf et al. 1999) over the extracted features to learn the normal patterns. The OCSVM learned the smallest region of the feature space containing the normal data, and during testing, new samples located outside this region were classified as anomalies. We took the videos, extracted their frames, and categorized the frames into two folders for training and testing. The training folder contained only positive samples (i.e., normal images), while the testing folder contained both abnormal and randomly chosen normal samples.

Table 1 Summary of the six pre-trained models

9.2 Hardware and software specifications

The hardware specification involved a Tesla P100 GPU with 16 GB VRAM, along with 13 GB RAM and a 2-core 2 GHz Intel Xeon CPU. The software specification encompassed the Kaggle platform (Kaggle 2021) with Python 3.9 and machine learning packages including Pandas (Beazley 2012), Numpy (van der Walt et al. 2011), and Scikit-Learn (Pedregosa et al. 2011). These packages work cooperatively: Pandas (Beazley 2012) helps to load, clean, or manipulate a data frame; the data frame can be translated into a Numpy (van der Walt et al. 2011) array; and Scikit-Learn (Pedregosa et al. 2011) functions accept and return Numpy arrays.

9.3 Experimental setup

We considered the Ped1 (Chan et al. 2008), Ped2 (Chan et al. 2008), Minnesota (2022), CUHK-Avenue (Lu et al. 2013a), ShanghaiTech Campus (Luo et al. 2017b), and UCF-Crime (Sultani et al. 2018) datasets. A dataset such as Minnesota (2022) contains slightly more than 7000 images, and all images have to be converted to a Numpy array, which the available memory configuration cannot accommodate. As a result, we placed 4000 and 2550 sample images in the training and testing folders, respectively. We used a total of 6550 sample images from each dataset, which is a fair distribution of the datasets required for the statistical test. All 4000 images in the training folder were normal, whereas the 2550 images in the testing folder consisted of both anomalous and randomly chosen normal images. The classification was based on learning from the \(\mathcal {X}\) class and predicting anything outside the \(\mathcal {X}\) class as negative. In our pre-processing, we fetched the images, re-sized each of them into a (224,224,3) array, augmented them by re-scaling, and then converted them from a list into a Numpy array. The array of images was divided into training and testing sets. We then used these data to extract features from the image arrays using a pre-trained 2DCNN.

For DenseNet121, as an example, each image yielded a (1,7,7,1024) feature map in the format [samples, rows, columns, channels]; we appended these to a list and converted the list into a Numpy array for efficiency. At the end of the deep feature extraction, we obtained (4000,7,7,1024) and (2550,7,7,1024) feature matrices for the training and testing data, respectively. Because this was a one-class classification task, we divided the data into x_train and x_outliers: x_train contained only positive samples, whereas x_outliers contained negative samples randomly selected from the test folder. We flattened the data from (2000,7,7,1024) into (2000,50176) to obtain the 2D input of our OCSVM classifier. We applied a standard scaler to the data and used principal component analysis to reduce the feature space dimensionality. We set the parameters of the OCSVM to (gamma='scale', kernel='rbf', nu=0.01). The predictions returned by the OCSVM were of the form {-1, 1}, where -1 and 1 denote anomalous and normal cases, respectively.
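
The following sketch summarizes this one-class pipeline; the random placeholder feature arrays and the PCA dimensionality are illustrative assumptions, while the scaler, PCA, and OCSVM parameters follow the description above:

```python
# One-class pipeline sketch: flatten deep features, scale, reduce with PCA,
# and fit an OCSVM on normal samples only. The random feature arrays and the
# PCA dimensionality (64) are placeholders; the OCSVM parameters follow the
# text (gamma='scale', kernel='rbf', nu=0.01).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

feat_train = np.random.rand(200, 7, 7, 1024).astype("float32")  # placeholder
feat_test = np.random.rand(50, 7, 7, 1024).astype("float32")    # placeholder

def flatten(feats):
    # (N, 7, 7, 1024) -> (N, 50176) for the 2D input of the OCSVM
    return feats.reshape(feats.shape[0], -1)

scaler = StandardScaler()
pca = PCA(n_components=64)                       # assumed reduced dimensionality
x_train = pca.fit_transform(scaler.fit_transform(flatten(feat_train)))
x_test = pca.transform(scaler.transform(flatten(feat_test)))

ocsvm = OneClassSVM(gamma="scale", kernel="rbf", nu=0.01)
ocsvm.fit(x_train)                               # trained on normal samples only
pred = ocsvm.predict(x_test)                     # +1 = normal, -1 = anomaly
```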

9.4 Experimental results

Table 2 records the experimental effectiveness scores of ACC, PRS, RES, F1S, and AUC, along with the RTM. Figures 18a–f visualize the scores of RTM, ACC, PRS, RES, F1S, and AUC, respectively, using the data in Table 2. From Fig. 18, it is easy to observe that the scores vary depending on the underlying datasets and models. As a result, it is extremely hard to determine which model is superior to the alternatives. Nonetheless, non-parametric statistical tests can provide a principled way to compare them.

Table 2 Experimental results and analysis
Fig. 18

Plotting of RTM, ACC, PRS, RES, F1S, and AUC using data from Table 2. The scores differ depending on the underlying models and datasets

9.5 Average ranking of 2DCNN models

The Friedman (1937) test and its derivatives (e.g., the Iman and Davenport 1980 test) are among the most well-known non-parametric tests for multiple comparisons. Consequently, we performed the Friedman test in our study. The Friedman test ranks the performance of a set of algorithms in descending order. However, it can only indicate whether differences exist among all samples of results under comparison. Its alternatives, e.g., Friedman's aligned rank test (Hodges and Lehmann 1962) and the Quade (1979) test, can provide further information by contrasting the algorithms through their rankings, and they can yield better results depending on the characteristics of a given experimental study.

We applied the Friedman, Friedman's aligned rank, and Quade non-parametric tests to the experimental results in Table 2 to determine whether there are significant differences among the models. These tests rank the models on each individual dataset, i.e., the best-performing model receives rank 1, the second-best rank 2, and so on. The mathematical formulation and further explanation of these non-parametric procedures can be found in Quade (1979) and Westfall and Young (2004).
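
As an illustration (not the exact code we used), the average ranking can be computed by ranking the models within each dataset and averaging the ranks; the score matrix below is a hypothetical stand-in for the values in Table 2:

```python
# Sketch of the average-ranking computation: rank the models per dataset
# (rank 1 = best), average the ranks, and run the Friedman test to check
# whether the differences are significant. The scores are placeholders.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows = datasets, columns = models (e.g., DenseNet121, ResNet50, ...)
scores = np.array([
    [0.71, 0.68, 0.66, 0.64, 0.69, 0.70],
    [0.65, 0.70, 0.67, 0.66, 0.68, 0.69],
    [0.72, 0.74, 0.70, 0.69, 0.71, 0.73],
    [0.60, 0.62, 0.64, 0.61, 0.65, 0.63],
    [0.58, 0.57, 0.59, 0.60, 0.56, 0.61],
    [0.69, 0.66, 0.68, 0.67, 0.70, 0.72],
])

ranks = rankdata(-scores, axis=1)              # higher score -> lower rank number
print("average rank per model:", ranks.mean(axis=0))

stat, p = friedmanchisquare(*scores.T)         # one array of scores per model
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```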

Table 3 shows the average ranking computed by using the Friedman, Friedman’s aligned rank, and Quade non-parametric statistical tests.

Table 3 Average ranking of each model using non-parametric statistical tests

DenseNet121 achieved the best scores on the UCF-Crime dataset and comparable results on the Ped1 and CUHK-Avenue datasets, whereas ResNet50 attained the best scores on the UMN dataset. Using the Friedman test, the ranking scores of DenseNet121 across the six datasets were clustered within \(\sigma =1.5217\) of the mean, whereas for ResNet50 the corresponding value was \(\sigma =1.0755\). In other words, the ResNet50 scores across datasets were 29% less spread out than those of DenseNet121. Similarly, ResNet50 obtained 26% and 22% more condensed ranking score distributions than DenseNet121 under the aligned Friedman and Quade tests, respectively. DenseNet121 aims to make deep networks even deeper while keeping them efficient to train by introducing short connections among the layers; it requires fewer parameters and allows feature reuse. The key building block of ResNet50 is the residual block. ResNet50 adopts summation, whereas DenseNet121 relies on concatenation. However, the dense concatenation of DenseNet121 demands high GPU memory and more training time (Zhang et al. 2021a). On the other hand, the identity shortcut that stabilizes training in ResNet50 limits its representation dimensions (Zhang et al. 2021a). In short, for many applications there is a trade-off between ResNet50 and DenseNet121 in terms of performance and GPU resources (Zhang et al. 2021a).

InceptionV3 is 48 layers deep. It reached comparable scores on the Ped1, Ped2, UMN, and ShanghaiTech Campus datasets. MobileNet (Howard et al. 2017) is based on a streamlined architecture that applies depth-wise separable convolutions to build a lightweight deep neural network. It gained the best scores on the CUHK-Avenue dataset and also performed excellently on the ShanghaiTech Campus dataset. InceptionV3 obtained 41% and 47% more condensed distributions of ranking scores than MobileNet under the Friedman and Quade tests, respectively. However, MobileNet attained a 2% more condensed distribution of ranking scores than InceptionV3 under the aligned Friedman test.

There are various variants of VGGNet (e.g., VGG16 and VGG19) that differ exclusively in the total number of layers in the network. VGGNet16 did not achieve the best scores on any dataset, whereas VGGNet19 achieved the best scores on the Ped1 and Ped2 datasets. However, VGGNet16 performed better than VGGNet19 on the UCF-Crime, CUHK-Avenue, and UMN datasets. VGGNet16 achieved 5%, 48%, and 14% more condensed distributions of ranking scores than VGGNet19 under the Friedman, aligned Friedman, and Quade tests, respectively. With an increasing number of layers in a CNN model, the potential for the model to fit more sophisticated functions rises; accordingly, more layers can result in better CNN performance. The pre-trained 2DCNN models VGG16 and VGG19 accommodate 16 and 19 layers, respectively. Compared with VGGNet16, VGGNet19 is a considerably larger neural network in terms of the number of parameters to be trained. Nevertheless, Fig. 19 shows that the VGG16 and VGG19 models demonstrated razor-thin performance differences. Although VGGNet is slow to train, many studies employed VGG16 in lieu of VGG19 (see Table 6).

Fig. 19

Average ranking using the mean values in Table 3. In spite of the variation in the number of layers, the performance difference between the two best pre-trained 2DCNN models, VGG16 and VGG19, was very small

Figure 19 depicts the average ranking based on the mean values from Table 3. The Friedman and aligned Friedman tests suggested that VGGNet16 was the best model among its counterparts, whereas the Quade test preferred VGGNet19. In practice, the differences among the various models are not very significant. Furthermore, it is noticeable that none of the aforementioned 2DCNN models showed consistent superiority across all datasets.

A \(\sigma\) of the mean values equal to 0 would imply that all employed 2DCNN models presented identical performance scores across all datasets. In practice, the \(\sigma\) of the mean values, i.e., 0.9899, 2.3241, and 0.3261 obtained by the Friedman, aligned Friedman, and Quade tests, respectively, indicates that the 2DCNN models showed non-identical but very similar performance scores across datasets. Consequently, pre-trained 2DCNN architectural variations play a minor role in detecting crowd video anomalies. Furthermore, computing a larger number of parameters is time-consuming in a sophisticated model, whereas a less sophisticated model uses comparatively fewer parameters yet attains nearly the same accuracy. Depending on the application, there is thus a trade-off between computational complexity and accuracy.

9.6 Our observations

CNNs are the preferred option for computer vision applications. They take pixel values as input and output numerous visual features. Does it matter if simple pre-trained CNN models provide slightly lower accuracy while computing fewer parameters than complex pre-trained 2DCNN models? The rigorous statistical analysis above shows that the architectural dissimilarities of the pre-trained 2DCNN models have an insignificant influence on crowd video anomaly detection. As a consequence, a CNN model with fewer layers can achieve approximately the same performance as a deeper one while reducing computational time. For example, VGGNet16 can be used instead of VGGNet19, saving 3.7% of parameters. Although VGGNet19 can be utilized to extract motion features from complex and noisy surveillance scenes, and showed the best mean performance in our experimental set-up, we do not recommend its use for small-size datasets as it demands a large set of training samples (Al-Dhamari et al. 2020).
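
As a quick sanity check of this parameter-budget argument (a sketch, not part of our original experiments), both architectures can be instantiated without weights and their parameter counts compared:

```python
# Sanity check of the VGG16-vs-VGG19 parameter budgets: build both
# architectures without downloading weights and compare their sizes. The
# roughly 4% saving refers to the full classification networks.
from tensorflow.keras.applications import VGG16, VGG19

vgg16 = VGG16(weights=None, include_top=True)
vgg19 = VGG19(weights=None, include_top=True)

p16, p19 = vgg16.count_params(), vgg19.count_params()
print(f"VGG16: {p16:,} parameters")                    # about 138 million
print(f"VGG19: {p19:,} parameters")                    # about 144 million
print(f"relative saving: {100 * (p19 - p16) / p19:.1f}%")
```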

10 Open challenges and future prospects

The subjectivity of the anomaly definition, the rarity of anomalies, and the need for big datasets and high computational power make crowd anomaly detection a critical and challenging task. Deep learning enables automatic feature extraction. Thus, deep learning-based solutions for crowd anomaly detection have performed significantly better than traditional solutions in terms of complexity and accuracy. However, they encounter obstacles when devising reliable methods that can be applied to real-world problems. Various challenges exist in this research area; here, we briefly review several notable facets that could be addressed in the future.

  • Lack of anomaly definition \(\Rightarrow\) The definition of an anomaly is fully subjective. Depending on the time and place, the same event can be either normal or abnormal. The most popular datasets (e.g., UCSD, CUHK-Avenue, and ShanghaiTech Campus) assume that whatever is unseen in the training data is marked as anomalous. In effect, this causes a very restricted nominal class and a superficial depiction of an anomaly (Doshi and Yilmaz 2022b). As anomaly detection is a fault-finding task, the definition of an anomaly should be fully rational in the corresponding context. Can anomalies in crowds be reliably identified from CCTV footage without a prior definition of anomalies and the need for the extraction of handcrafted features? This is one of the biggest challenges for crowd anomaly detection.

  • Lack of realistic datasets availability \(\Rightarrow\) Deep learning methods demand large datasets for training. However, the existing datasets are not sufficient for accurate training or testing. It is important to build a dataset with a larger amount of data and a wider behavior categorization. In addition to data collection, ensuring correct annotation is a challenging task for a bigger video dataset. The availability of video datasets is also a big issue. Few datasets are publicly accessible, while many datasets are still not available for open research. In addition, publicly available crowd datasets suffer from certain restrictions. For example, the Minnesota (2022) dataset is straightforward, and the performance of methods on it is saturated. Furthermore, the videos in the Ped1 and Ped2 datasets were captured in just one location, and hence the camera is fixed; additionally, the resolution of the video frames is extremely low. Abnormal scenarios are also sensitive to collect and use; for example, the usage of crime-related videos is strictly regulated by governments or authorities.

  • Lack of ground truth annotation \(\Rightarrow\) Some existing datasets only provide frame-level ground truth annotation. Hence, we have to be satisfied with F-AUC and F-EER without computing P-AUC and P-EER. For example, the Minnesota (2022) dataset does not provide pixel-level ground truth annotation, so P-AUC and P-EER cannot be computed (Han et al. 2020). Furthermore, due to the unavailability of labeled data, it is very arduous to create a benchmark for anomaly explanation (Szymanowicz et al. 2021).

  • Lack of hardware applications \(\Rightarrow\) In addition to methodological breakthroughs and accessible big-data training, successful solutions for video anomaly detection are also due to recent advances in hardware. Crowd anomaly detection demands the processing of a large amount of video data, which requires a powerful GPU. The volume of accessible video data is effectively unlimited, whereas the accessible hardware processing capacity is in short supply. Therefore, there are challenges regarding hardware constraints. Currently, big artificial intelligence systems are notoriously difficult and expensive to train because the underlying hardware is not fast enough.

  • Lack of computing power \(\Rightarrow\) The demand for artificial intelligence models is increasing rapidly. Ten years ago, the largest models had some 10 million parameters and could be trained in a few hours on a single GPU. Today, the largest models have over 10 trillion parameters and can take up to a year to train across tens of thousands of machines. Soon, highly sophisticated artificial intelligence systems could become the exclusive domain of large companies and administrations. For various deep learning and large-scale analytics applications, power dissipation across the many components of the computing infrastructure is expected to be an order of magnitude higher than in current systems. In traditional hardware, data are sent along electrical wires, which consume far more energy and transfer far less data over long distances than optical links. A proposed solution to this problem is luminous computing, which is built on light-based artificial intelligence accelerator chips (Angelini 2022). In theory, such light-based chips may lead to higher performance levels because light produces less heat than electricity; furthermore, light propagates faster and is less susceptible to changes in temperature and electromagnetic fields. However, light-based chips are physically larger than their electronic counterparts, and their architectures still mainly rely on electronic control circuits, which can produce bottlenecks.

  • Lack of unsupervised learning \(\Rightarrow\) As with machine learning in general, deep learning methods can be grouped into supervised, semi-supervised, and unsupervised categories. Supervised learning methods are widely used in video anomaly detection. However, they entail the manual labeling of a large number of samples, and the exponential growth of data makes manual labeling increasingly challenging. Unsupervised and deep reinforcement learning both require more attention to realize automatic learning from unlabeled videos. Moreover, the training of deep neural networks should remain humanly understandable.

  • Lack of static backgrounds \(\Rightarrow\) Surveillance cameras capture videos against various dynamic backgrounds, which are subject to illumination variation, occlusion, and viewpoint alteration. Analyzing such videos is a challenging task. Rainy, sunny, and snowy conditions can often cause anomalies. These environmental set-ups are closely related to background features, and neglected background features can affect model accuracy. Ideally, detection models should consider both foreground and background features.

  • Lack of high-quality videos \(\Rightarrow\) Religious events and airport arrival and departure terminals are busy places where occlusions happen very frequently. Due to the long distance between cameras and subjects, the subjects appear relatively small, producing poor-quality videos. The relatively low quality of long-distance videos makes the detection process more challenging.

  • Lack of frequency of normal events \(\Rightarrow\) For crowd anomaly detection, the false positive rate should be as low as possible. Discovering a method with a high detection rate along with a low false positive rate remains a considerable area of research.

  • Lack of model direct applicability \(\Rightarrow\) Some deep learning models are not directly applicable to real-world applications. For example, the GAN is used for generating anomalous data, but its inability to generate realistic anomalies has been highlighted in the literature. It assumes that a previously unseen activity creates a higher prediction error (Doshi and Yilmaz 2022b). Most of its generated data merely imitate random noise. Accordingly, instead of using it directly, the anomaly detection problem can be transformed into a binary classification problem (PourReza et al. 2021).

  • Lack of direct training capability \(\Rightarrow\) Some GAN-based deep learning models are dedicated to one-class classification tasks, in which the GAN discriminator can be utilized as a deformity detector. Yet, such solutions require trial and error to resolve potential problems during training.

  • Lack of fairness \(\Rightarrow\) The factors of skewed samples, limited features, tainted examples, and disparities of sample sizes and proxies can bias the training sets (Bolukbasi et al. 2016), resulting in an unfair output from a trained deep neural network. For example, Zhang and Davidson (2021) investigated fairness for the anomaly detection problem and reported that the deep-support-vector data description model (Ruff et al. 2018) failed the fairness evaluation.

  • Lack of assurance \(\Rightarrow\) Deep neural networks are vulnerable to adversarial attacks, which are not the same as abnormal events (Mohammadi et al. 2021).

  • Lack of lengthy video segments \(\Rightarrow\) Contemporary state-of-the-art methods inherently assume that each test video segment comprises anomalous activity. To meet this assumption, the video segments would need to be exceedingly long because, in real-life scenes, abnormal events only occasionally occur. In practice, the video segments in most of the existing benchmark datasets are a few minutes long and consistently contain labeled abnormal frames, which does not faithfully reflect real-life abnormalities. Consequently, such methods raise false alarms in real-world scenarios.

  • Lack of negative value count \(\Rightarrow\) The ReLU cuts off negative values and can hence limit the diversity of feature representations (Park et al. 2020). For example, Park et al. (2020) and Szymanowicz et al. (2022) addressed this problem by removing the last BN (Ioffe and Szegedy 2015) and ReLU (Krizhevsky et al. 2012) layers in their encoders to minimize this effect.

  • Lack of temporality richness \(\Rightarrow\) An activity that is a crystal-clear abnormal event in one scene may constitute a normal event in a completely dissimilar scene of a significantly larger and more complex multi-scene dataset. The sheer heterogeneity of a multi-scene dataset creates such mismatch issues (Ramachandra and Jones 2020). Comprehensive data over time, rather than over various scenes, can minimize this problem (Doshi and Yilmaz 2022b).

  • Lack of model understandability \(\Rightarrow\) The ability of a system to automatically detect anomalous events and to recover humanly readable explanations for the detected anomalies is important (Rad et al. 2021). Researchers have therefore worked on interpreting deep learning models to explain the outcomes of deep neural networks. In reality, refraining from treating the deep neural network as a black box brings about better understandability and reliability (Holzinger et al. 2018).

  • Lack of solvable mathematical paradox \(\Rightarrow\) Comparing the limitations of human and machine intelligence, humans are frequently effective at recognizing when they get things wrong, whereas deep learning models are not. Although deep learning is the leading artificial intelligence technology for pattern recognition, many deep learning models are untrustworthy and easy to fool; they do not realize when they make errors. Sometimes, it is even more laborious for a deep learning model to know when it is making an error than to generate a correct result. Alan Turing and Kurt Gödel, two notable mathematicians, discovered a paradox at the heart of mathematics, namely that it is impossible to prove whether certain mathematical statements are true or false; in addition, some computational problems cannot be addressed with algorithms. Regardless of the data accuracy, we cannot obtain perfect information for constructing a required neural network, and, generally, deep learning models suffer from inherent limitations due to this century-old mathematical paradox (Cambridge 2022).

  • Lack of anonymization \(\Rightarrow\) Some jurisdictions have very strict privacy laws that preclude the application of any automated video analytics on raw video data. Anonymization is a practical solution to preserve privacy (Majeed and Lee 2021), as it removes or modifies personally identifiable information so that the data remain suitable for research and data mining. Removing identifying features, such as faces (Wereszczynski et al. 2017; Ding et al. 2022), clothing, accessories, and objects carried by individuals, from video feeds is thus imperative prior to any further processing. However, it must be performed in such a way that it both preserves mandated privacy and does not discard the salient video features necessary for anomaly detection. The extent to which a trade-off exists between anomaly detection performance and the level of anonymization remains an open question. Since both anonymization and determining salient features are essentially abstraction operations, the ideal approach would directly use the result of anonymization as input features for automatic crowd anomaly detection. Although anonymization may be imperative, it does not, on its own, guarantee general acceptance of real-time surveillance in private and public spaces.

11 Conclusion

Computer vision and deep learning technology enable immense research opportunities for solving real-world problems, including crowd anomaly detection. We were motivated to produce this comprehensive survey by the lack of contemporary reviews of advanced deep learning-based crowd anomaly detection methods for video clips, and we aimed to update similar surveys in light of the fast evolution of this research area since 2020. In essence, we studied a plethora of deep learning methods from the recent literature published in mainstream conferences and journals between 2020 and 2022. We presented techniques for feature selection, performance metrics, datasets typically employed in benchmarking, and target methods with their performance comparison. Our review revealed several novel facts regarding datasets, taxonomy, and deep learning model architectures, along with their performances. We confirmed that CNNs are the de facto model of choice for computer vision tasks. We predominantly found that the heterogeneities of pre-trained convolutional models have an insignificant impact on crowd video anomaly detection performance. Furthermore, we discussed open research challenges and prospects for future study.