Anomaly detection for Time Series Analysis

Anomaly detection by Author with ideogram.ai

Time series analysis is a very useful and powerful technique for studying data that changes over time, such as sales, traffic, climate, etc. Anomaly detection is the process of identifying values or events that deviate from the normal trend of the data. In this article, I will explain what a time series is, what its components are, how it differs from other types of data, how anomalies in a time series can be detected, and what are the most common techniques for doing so.

Introduction to Time Series Analysis

Time series are data that record the value of one or more variables at different points in time. For example, the number of visitors to a website every day, the average temperature of a city every month, the price of a stock every hour, etc. Time series are very important because they allow us to analyze the past, understand the present, and predict the future. In addition, time series help us uncover hidden patterns and trends in data, which can be used to improve decisions and strategies.

However, time series analysis also presents challenges and differences from non-temporal data analysis. One of the main differences is that time series are time-dependent, i.e. the order and spacing of the observations are relevant and cannot be ignored or changed. Another difference is that time series are often non-stationary, i.e. their statistical properties (such as mean and variance) change over time. This makes it difficult to apply traditional statistical methods, which assume that the data are stationary.

In addition, time series analysis requires a different approach to anomaly detection. Anomalies are values or events that deviate significantly from the normal trend of the data. Anomalies can be caused by measurement errors, structural changes, fraudulent activities, exceptional occurrences, etc. Anomaly detection is important because it can provide valuable insights into problems or opportunities hidden in the data. However, detecting anomalies in time series is more complex than in non-temporal data, because the time dependence, non-stationarity, and dynamic nature of the data must be taken into account.

Basic Concepts of Time Series Analysis

Before we get into the details of the techniques for time series analysis and anomaly detection, we need to define what a time series is and what its components are. A time series is a sequence of values of one or more variables measured at different points in time.

Each record in a time series has three main elements: the date, the time, and the characteristics. The date and time indicate when the value of the variable was measured. The characteristics are the variables we want to analyze. In the website example above, the date is the day of the month, the time is the day of the week, and the characteristic is the number of visitors.

To be able to analyze a time series, we need to meet certain requirements. The first requirement is to have a sufficient number of data points, i.e. observations of the variable over time. The number of data points needed depends on the type of analysis we want to do and how often the data is collected. For example, if we want to analyze the seasonality of data, i.e. the periodic variation of data as a function of time, we need to have at least one complete cycle of observations that covers all possible seasons. If data is collected every day, we need to have at least a year’s worth of data to be able to analyze annual seasonality.

The second requirement is to have a good understanding of the domain of the data, i.e. the context in which the data is generated and the meaning of the variables. This helps us to interpret the results of the analysis and identify possible causes of anomalies. For example, if we analyze the number of visitors to a website, we need to know what the type of website is, what its target audience is, what its goals are, what are the factors that influence traffic, etc.

The third requirement is to have a clear definition of the goals of the analysis, i.e. what we want to find out from the data and how we want to use it. The goals of the analysis can be different depending on the use case and research question. For example, we may want to analyze a time series to:

  • Describe the behavior of data over time and its main characteristics
  • Predict future data values based on past values
  • Detect anomalies in data and their causes
  • Test hypotheses about data and their relationships
  • Optimize data-driven decisions and actions

Understanding anomalies in time series

Before we look at how to detect anomalies in time series, we need to understand what anomalies are and how they manifest themselves in the data. An anomaly is a value or event that deviates significantly from the normal trend of the data. Anomalies can be of two types: point or collective. Point anomalies are isolated values that are very different from the other values in the time series. Collective anomalies are groups of consecutive values that, taken together, differ from the rest of the time series.
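The distinction between the two types can be illustrated with a small sketch. It is only one possible heuristic, not a standard algorithm: values are flagged when they lie far from the median (measured with the median absolute deviation), and consecutive flagged values are then grouped, so that an isolated flag counts as a point anomaly and a run of flags counts as a collective anomaly. The function names, the threshold k=3 and the visitor counts are assumptions made for illustration.

```python
from statistics import median

def flag_outliers(values, k=3.0):
    """Flag values that lie more than k scaled MADs away from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data
    scale = 1.4826 * mad
    if scale == 0:
        # Degenerate case: more than half the values are identical
        return [v != med for v in values]
    return [abs(v - med) > k * scale for v in values]

def group_runs(flags):
    """Return (start, end) index pairs for each run of consecutive True flags."""
    runs, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(flags) - 1))
    return runs

# Hypothetical daily visitor counts: one spike and one four-day slump
visitors = [100, 102, 98, 101, 250, 99, 103, 100, 30, 28,
            31, 29, 101, 97, 102, 100, 99, 101, 98, 100]

runs = group_runs(flag_outliers(visitors))
point = [start for start, end in runs if start == end]
collective = [(start, end) for start, end in runs if end > start]

print("point anomalies at indices:", point)
print("collective anomalies at index ranges:", collective)
```

This sketch ignores trend and seasonality, so it is only suitable for a series that fluctuates around a stable level. A long collective anomaly would also shift the median itself, which is one reason model-based approaches are usually preferred.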

For example, in the following figure, we can see a time series that records the number of visitors to a website every day for a month. Point anomalies are indicated in red and collective anomalies are indicated in blue.

Anomalies can have different causes and meanings. Some anomalies may be due to measurement, transmission or data processing errors. These anomalies are often called noise and can be ignored or corrected. Other anomalies may be due to structural changes, fraudulent activity, exceptional events, or other factors affecting the data. These anomalies are often called signals and can be important to detect and analyze.

In order to detect anomalies in time series, we first need to have expectations about the normal movement of data over time. These expectations are based on the analysis of the main components of a time series, which are:

  • The trend, i.e. the direction and speed of data change over the long run. For example, an increasing trend indicates that the data is increasing over time, while a decreasing trend indicates that the data is decreasing over time.
  • Seasonality, i.e. the periodic variation of data as a function of time. For example, annual seasonality indicates that the data has a pattern that repeats itself every year, such as toy store sales increasing in December and decreasing in January.
  • Cyclicality, i.e. fluctuations in the data that recur without a fixed period. For example, economic cyclicality indicates that data rises and falls depending on external factors, such as GDP, inflation, unemployment, etc.
  • Noise, i.e. the random variation of data as a function of time. For example, noise can be caused by measurement, transmission, or processing errors.

In the following figure, we can see an example of a time series that has an increasing trend, annual seasonality, and noise.

When analyzing a time series, we need to take these components into account and understand how they change over time. A change in one or more components may indicate the presence of an anomaly in the data.
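To make the components more concrete, here is a minimal sketch of a classical additive decomposition, in which the series is assumed to be trend + seasonality + noise. The trend is estimated with a centered moving average over one full period, the seasonal effect is the average detrended value at each position in the cycle, and whatever remains is treated as noise. The function names, the period of 7 and the noise-free sample data are assumptions chosen so the result is easy to check; the simple moving average used here only works for odd periods.

```python
def moving_average(values, window):
    """Centered moving average for an odd window; edges are left as None."""
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out

def decompose_additive(values, period):
    """Split values into trend, seasonal and residual parts (additive model)."""
    trend = moving_average(values, period)
    # Collect detrended values by their position within the cycle
    buckets = [[] for _ in range(period)]
    for i, (v, t) in enumerate(zip(values, trend)):
        if t is not None:
            buckets[i % period].append(v - t)
    seasonal_means = [sum(b) / len(b) for b in buckets]
    # Center the seasonal effects so they sum to zero over one cycle
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[i % period] - offset for i in range(len(values))]
    residual = [None if t is None else v - t - s
                for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, residual

# Hypothetical data: linear trend plus a weekly pattern, with no noise
pattern = [0, 1, 2, 1, 0, -2, -2]
values = [10 + 0.5 * i + pattern[i % 7] for i in range(28)]

trend, seasonal, residual = decompose_additive(values, period=7)
print("trend (middle of series):", trend[3:8])
print("seasonal pattern:", seasonal[:7])
```

On real data the residual will not be zero, and unusually large residuals are natural candidates for anomalies.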

Data Requirements for Time Series Analysis

As we have seen, in order to be able to analyze a time series and detect anomalies, we need to have data that meets certain requirements. The first requirement is to have a sufficient number of data points, i.e. observations of the variable over time. The number of data points needed depends on the type of analysis we want to do and how often the data is collected. For example, if we want to analyze the trend of the data, we need to have at least a dozen data points that cover a fairly long time frame. If we want to analyze the seasonality of the data, we need to have at least one complete cycle of observations that covers all possible seasons. If we want to analyze data noise, we need to have at least twenty data points that are sufficiently variable.

The second requirement is to have data that captures changes over time, i.e. that reflects changes in the variable as a function of time. This means that data should be collected at regular and consistent intervals, without skipping or duplicating some observations. In addition, the data must be time-aligned, i.e., each observation must correspond to the time when the variable was measured. This implies that the data must be converted to the appropriate format for time series analysis, such as the datetime format.
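As a rough illustration of this requirement, the sketch below parses timestamps into datetime objects and reports any pair of consecutive observations whose spacing differs from the expected interval, which catches both skipped and duplicated observations. The function name, the ISO date strings and the daily interval are assumptions for the example.

```python
from datetime import datetime, timedelta

def find_irregular_steps(timestamps, expected_step):
    """Return (previous, current) pairs whose spacing differs from expected_step."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    problems = []
    for prev, curr in zip(parsed, parsed[1:]):
        if curr - prev != expected_step:
            problems.append((prev, curr))
    return problems

# Hypothetical daily observations with one missing day and one duplicate
stamps = ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05", "2024-01-05"]

issues = find_irregular_steps(stamps, timedelta(days=1))
for prev, curr in issues:
    print(f"irregular step: {prev.date()} -> {curr.date()}")
```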

The third requirement is to have data that meet the minimum requirements for the analysis of the main components of a time series, i.e. trend, seasonality and noise. These requirements vary depending on the model we want to use for analysis. For example, if we want to use a linear model for the trend, we need to have data that has a linear relationship between the variable and time. If we want to use an exponential model for the trend, we need to have data that has an exponential relationship between the variable and time. If we want to use an ARIMA model, we need to have data that is stationary or that can be made stationary by differencing.

Differencing in Time Series Analysis

As we have seen, one of the main challenges in time series analysis is the presence of non-stationarity in the data, i.e. the fact that the statistical properties of the data (such as mean and variance) change over time. This makes it difficult to apply traditional statistical methods, which assume that the data are stationary. In order to use these methods, we must first transform the data so that it becomes stationary, or at least approximately stationary. One of the most common techniques for doing this is differencing (sometimes loosely called differentiation).
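Before transforming anything, it can help to get a rough feel for non-stationarity. The sketch below is a deliberately crude check and not a statistical test: it splits the series in two halves and compares their mean and variance. The function name and the sample series are assumptions for illustration.

```python
from statistics import mean, pvariance

def split_half_summary(values):
    """Compare mean and variance of the first and second half of a series."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return {
        "mean": (mean(first), mean(second)),
        "variance": (pvariance(first), pvariance(second)),
    }

# Hypothetical series with a steady upward trend
trending = [float(i) for i in range(20)]

summary = split_half_summary(trending)
print(summary)
```

Here the two halves have the same variance but clearly different means, which is what a trend looks like through this lens. A formal unit-root test is still needed to draw conclusions on real, noisy data.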

Differencing consists of subtracting the previous value from each value in the time series, resulting in a new time series that represents the change in the data from one observation to the next. For example, if we have a time series {x1, x2, x3, …}, its first difference is {x2 - x1, x3 - x2, …}. Differencing can be repeated, giving the second difference, the third difference, etc. Its purpose is to remove the trend and seasonality components from the time series, which are the main causes of non-stationarity. If the data has a trend or seasonality, its values are strongly correlated with earlier values: with the previous value in the case of a trend, and with the value one season earlier in the case of seasonality. Subtracting those values reduces or eliminates this dependence.

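For example, consider a time series that has an increasing trend and an annual seasonality. Its first difference removes a linear trend, but not the seasonality. To remove the seasonality we take a seasonal difference, i.e. we subtract from each value the value observed one full season earlier (for example, 12 months earlier for monthly data). Applying both a first difference and a seasonal difference removes both trend and seasonality.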
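A small sketch makes the effect of each operation visible. The helper name `difference`, the period of 4 and the noise-free sample series are assumptions chosen so that the results can be verified by hand.

```python
def difference(values, lag=1):
    """Return the series x[t] - x[t - lag] for t = lag, lag + 1, ..."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

# Hypothetical series: linear trend (slope 2) plus a seasonal pattern of period 4
pattern = [0, 3, 1, -4]
series = [2 * t + pattern[t % 4] for t in range(12)]

first_diff = difference(series)            # trend removed, seasonal pattern remains
seasonal_diff = difference(series, lag=4)  # seasonal pattern removed, trend becomes a constant
both = difference(seasonal_diff)           # the remaining constant is removed as well

print("original:      ", series)
print("first diff:    ", first_diff)
print("seasonal diff: ", seasonal_diff)
print("both:          ", both)
```

The first difference still repeats every four steps, which shows that ordinary differencing alone does not remove seasonality, while the lag-4 difference does.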

Differencing is the basis of one of the most widely used models for time series analysis and anomaly detection: the ARIMA model.

Introduction to the ARIMA model

The ARIMA model is one of the most widely used models for time series analysis and anomaly detection. ARIMA stands for Autoregressive Integrated Moving Average. This model combines three main components:

  • The autoregressive (AR) component, which models the correlation between time series values and previous values. For example, if the data is cyclical, the time series values will be affected by the past values.
  • The integrated (I) component, which refers to differencing the time series to make it stationary. For example, if the data has a trend, differencing removes it from the time series.
  • The moving average (MA) component, which models the correlation between time series errors and previous errors. For example, if the data has noise, the time series errors will be affected by past errors.

The ARIMA model has three main parameters: p, d and q. The p parameter indicates the number of autoregressive terms to use in the model. The d parameter indicates the number of times the time series must be differenced to make it stationary. The q parameter indicates the number of moving average terms to use in the model. For example, an ARIMA(1,1,1) model uses one autoregressive term, one difference, and one moving average term.
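For readers who prefer a formula, the general ARIMA(p,d,q) model is commonly written with the backshift operator B, defined by B x_t = x_{t-1}. The symbols c (a constant), φ and θ (the coefficients) and ε_t (the error at time t) are notation introduced here and do not appear elsewhere in this article. The second line spells out the ARIMA(1,1,1) case in terms of the first difference y_t.

```latex
\begin{aligned}
\phi(B)\,(1 - B)^{d}\,x_t &= c + \theta(B)\,\varepsilon_t,
\qquad
\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p},
\quad
\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^{q} \\[4pt]
\text{ARIMA}(1,1,1):\quad
y_t &= c + \phi_1\, y_{t-1} + \varepsilon_t + \theta_1\, \varepsilon_{t-1},
\qquad y_t = x_t - x_{t-1}
\end{aligned}
```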

The ARIMA model can be used to describe, predict, and detect anomalies in a time series. To do this, we need to follow a few steps:

  • First, we need to check whether the time series is stationary or not. We can use statistical tests, such as the augmented Dickey-Fuller test, whose null hypothesis is that the series has a unit root (i.e. is non-stationary); a small p-value is evidence in favor of stationarity.
  • Second, we need to difference the time series until it is stationary. We can use graphs, such as the plots of the autocorrelation function and the partial autocorrelation function, to help with this: an autocorrelation that decays very slowly suggests that further differencing is needed.
  • Third, we need to estimate the parameters of the ARIMA model using optimization methods, such as the maximum likelihood method. We can use model selection criteria, such as the Akaike information criterion or the Bayesian information criterion, to compare candidate values of p and q; d is usually fixed beforehand by the stationarity check.
  • Fourth, we need to validate the ARIMA model by examining its residuals. The Ljung-Box test checks that the residuals are not autocorrelated, and the Jarque-Bera test checks that they are approximately normally distributed. We can also use graphs, such as the residuals graph or the forecast graph, to check if the model fits well with the data and if there is any anomaly in the data.
  • Fifth, we need to use the ARIMA model to describe the main features of the time series, predict future time series values, and detect anomalies in the time series. We can use measures of accuracy, such as mean squared error or mean absolute error, to evaluate the quality of the predictions.
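The estimation and forecasting steps above can be sketched for the simplest non-trivial case, an ARIMA(1,1,0) model. This is a simplified illustration under stated assumptions: the coefficients are estimated by ordinary least squares on the differenced series rather than by maximum likelihood, there is no moving average term, and the function names and the noise-free sample data are hypothetical. In practice a dedicated statistics library would handle estimation, diagnostics and model selection.

```python
def fit_ar1(y):
    """Least-squares estimate of c and phi in y[t] = c + phi * y[t-1] + e[t]."""
    x, z = y[:-1], y[1:]
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    phi = sxz / sxx
    c = mean_z - phi * mean_x
    return c, phi

def fit_arima_110(series):
    """Difference once (d = 1), then fit one autoregressive term (p = 1)."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    c, phi = fit_ar1(diffs)
    return c, phi, diffs

def forecast_next(series, c, phi):
    """One-step-ahead forecast: last value plus the predicted next difference."""
    last_diff = series[-1] - series[-2]
    return series[-1] + c + phi * last_diff

# Hypothetical noise-free series whose first differences follow
# y[t] = 1 + 0.5 * y[t-1], so the fitted values can be checked by eye
diffs_true = [1.0]
for _ in range(9):
    diffs_true.append(1 + 0.5 * diffs_true[-1])
series = [10.0]
for step in diffs_true:
    series.append(series[-1] + step)

c, phi, diffs = fit_arima_110(series)
prediction = forecast_next(series, c, phi)
print(f"c = {c:.3f}, phi = {phi:.3f}, next value = {prediction:.3f}")
```

Because the sample data contain no noise, the fit recovers the constant and the coefficient used to generate them. With real data the estimates would carry uncertainty, and the residual checks from the fourth step become essential.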

Time Series Anomaly Detection

After estimating and validating the ARIMA model for our time series, we can use it to detect anomalies in the data. An anomaly is a value or event that deviates significantly from the normal trend of the data. To detect anomalies, we need to compare the observed time series values with the values predicted by the ARIMA model. If the difference between the two values is greater than a certain threshold, we can consider the observed value as an anomaly.

The threshold for defining anomalies depends on several factors, such as the confidence level, the distribution of errors, the frequency of the data, etc. In general, we can use an interval around the forecast (strictly speaking a prediction interval, although it is often loosely called a confidence interval) to determine the threshold. This interval is a range around the predicted value that is expected to contain the observed value with a certain probability. For example, with a 95% interval, the observed value is expected to fall inside the range 95% of the time if the model is correct. If the observed value is outside the interval, we can consider it an anomaly.
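A minimal sketch of this thresholding rule follows. It assumes that forecast errors are roughly normal with zero mean, and that their standard deviation has been estimated on a period believed to be free of anomalies; both are assumptions, as are the function name and all the numbers, which stand in for the output of a fitted model.

```python
from statistics import NormalDist, pstdev

def detect_anomalies(observed, predicted, sigma, confidence=0.95):
    """Return indices where |observed - predicted| exceeds the interval half-width."""
    # Two-sided normal quantile, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    threshold = z * sigma
    return [i for i, (o, p) in enumerate(zip(observed, predicted))
            if abs(o - p) > threshold]

# Hypothetical residuals from a clean training period, used to estimate sigma
training_residuals = [-1, 1, 1, -1, 1, -1, 1, -1]
sigma = pstdev(training_residuals)

# Hypothetical observations and model forecasts for a new period
observed = [100, 102, 101, 99, 103, 100, 140, 101, 98, 102]
predicted = [101, 101, 100, 100, 102, 101, 101, 100, 99, 101]

anomalies = detect_anomalies(observed, predicted, sigma)
print("anomalous indices:", anomalies)
```

Estimating sigma on a separate clean period matters: if the anomalous values themselves were included, they would inflate the estimate and could hide the very points we are looking for.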

When we detect anomalies in a time series, we must also try to understand their causes and meanings. As noted earlier, anomalies caused by measurement, transmission or data processing errors are noise and can be ignored or corrected, while anomalies caused by structural changes, fraudulent activity, exceptional events, or other factors affecting the data are signals that can be important to detect and analyze.

To understand the causes and meanings of anomalies, we need to use our knowledge of the domain of the data, i.e. the context in which the data is generated and the meaning of the variables. In addition, we need to use additional sources of information, such as other related time series, historical data, news, reports, etc. This helps us interpret the results of anomaly detection and identify possible actions to take.

Conclusion

In this article, we’ve seen how to use the ARIMA model for time series analysis and anomaly detection. We’ve seen what a time series is, what its components are, how it differs from other types of data, how anomalies in a time series can be detected, and what the most common techniques for doing so are. We have seen how to verify the stationarity of data, how to difference the time series, how to estimate and validate the ARIMA model, how to use the ARIMA model to describe, predict, and detect anomalies in the time series, and how to interpret the results of anomaly detection.

Time series analysis and anomaly detection are very useful and powerful techniques for studying data that changes over time, such as sales, traffic, climate, etc. These techniques allow us to analyze the past, understand the present, and predict the future. Additionally, these techniques help us uncover hidden patterns and trends in data, which can be used to improve decisions and strategies. Finally, these techniques help us identify hidden problems or opportunities in the data, which may be caused by anomalies in the data.

I hope you found this article useful and interesting. If you have any questions or comments, please feel free to contact me. Thank you for reading my article.

Carlo C.
π€πˆ 𝐦𝐨𝐧𝐀𝐬.𝐒𝐨

Data scientist, avidly exploring ancient philosophy as a hobby to enhance my understanding of the world and human knowledge.