
Change Point Detection in Time Series using ChatGPT

István Szatmári
8 min read · Mar 5, 2023


In recent months, the Internet has exploded with ChatGPT and the many possible uses for it. My colleagues and I talk about this often. I regularly analyze time series in my data analysis work, so I was curious to see what I could get out of ChatGPT. I thought, let’s start with something “simple”: detecting changes in time series. The problem is easy to understand and initial results come quickly, yet there are also complex algorithms, and some harder problems may require in-depth research work.

Motivations

  • to play and explore the various possibilities of ChatGPT and how it can be used in my data analysis work,
  • to get first-hand experience and discuss the potential applications of ChatGPT with my colleagues,
  • to explore how it can be used in in-depth research work for more complex problems in the future,
  • for education: a good introduction for our junior colleagues to gain experience with the tool,
  • I’m simply interested in detecting changes in time series.

For reference, change point detection marks the locations where the underlying properties of a time series (statistical characteristics such as the mean and variance) shift abruptly. As an example, see the following image, where the changes are marked with vertical red lines.

Time series example with marked change points
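As a minimal numeric illustration (my own sketch, not taken from the article’s notebook), consider a signal whose mean shifts abruptly at a known index:

```python
import numpy as np

rng = np.random.default_rng(42)
# Two segments with different means: the (single) change point is at index 100
signal = np.concatenate([rng.normal(0.0, 1.0, 100),
                         rng.normal(4.0, 1.0, 100)])
# A change point detector should locate the boundary between the segments,
# where the underlying mean jumps from roughly 0 to roughly 4
```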

Experiments

Let’s get into it. This was my first request:

QUERY: I need a python code that runs on time series data, which is a one-dimensional signal. The routine should detect abrupt changes in the time series signal. The parameters for the abrupt change define a) the change in percent in the range of the whole amplitude, and b) the maximum time interval within the change should happen. This is given in a number of samples. The time series data is given in Pandas DataFrame, in one column called “signal”.

import pandas as pd

def detect_abrupt_changes(signal, percent_change, max_time_interval):
    """
    Detect abrupt changes in a time series signal based on the specified parameters.

    Parameters:
    signal (pandas.Series): One-dimensional time series signal.
    percent_change (float): The minimum percentage change in the signal range that is considered an abrupt change.
    max_time_interval (int): The maximum time interval in samples between two data points to be considered part of the same change.

    Returns:
    pandas.Series: A boolean mask indicating the positions of the abrupt changes in the input signal.
    """
    # Calculate the absolute change in the signal.
    abs_change = abs(signal.diff())

    # Calculate the threshold for the minimum change required to be considered an abrupt change.
    change_threshold = (signal.max() - signal.min()) * percent_change

    # Initialize a mask of False values to indicate no abrupt changes have been detected yet.
    abrupt_changes = pd.Series(False, index=signal.index)

    # Loop over each data point in the signal.
    for i in range(1, len(signal)):
        # If the absolute change is greater than the threshold, mark this data point as the start of an abrupt change.
        if abs_change[i] >= change_threshold:
            abrupt_changes[i] = True

            # Keep track of the end of the current change.
            end_of_change = i

            # Continue checking subsequent data points to see if they are still part of the same change.
            for j in range(i+1, min(i+max_time_interval, len(signal))):
                if abs_change[j] >= change_threshold:
                    # If the change is still above the threshold, mark this data point as part of the same change.
                    abrupt_changes[j] = True

                    # Update the end of the current change.
                    end_of_change = j
                else:
                    # If the change has fallen below the threshold, stop checking subsequent data points.
                    break

            # Skip checking data points that are already part of the current change.
            i = end_of_change

    return abrupt_changes

Pros:

  • well-documented code,
  • a plain algorithm without using any package (Why is it positive? I like to start with a simple method so that I have a baseline and can compare the results.),
  • good function and parameter names (or at least skillfully derived from my request).

Cons:

  • comparing only consecutive samples,
  • rather strange logic derived from my “b) the maximum time interval …” condition (a reminder of the importance of clear wording).
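For comparison, the consecutive-difference core of this baseline can be written without explicit loops. The following vectorized sketch is my own rewrite (the function name is made up); it deliberately keeps the same limitation of comparing only adjacent samples:

```python
import pandas as pd

def detect_abrupt_changes_vectorized(signal, percent_change):
    """Flag any sample whose jump from the previous sample exceeds
    percent_change of the full signal range (consecutive-diff baseline)."""
    change_threshold = (signal.max() - signal.min()) * percent_change
    return signal.diff().abs() >= change_threshold
```

This produces the same kind of boolean mask, but a slow drift spread over many samples still goes undetected, since each individual step stays below the threshold.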

QUERY: To test this routine, generate a random pandas time series.

import pandas as pd
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

# Generate a random time series with 1000 data points
signal = pd.Series(np.random.normal(0, 1, 1000), index=pd.date_range('2022-01-01', periods=1000, freq='D'))

# Add some abrupt changes to the time series
signal.loc['2022-02-05':'2022-02-15'] -= 5
signal.loc['2022-03-10':'2022-03-20'] += 3.5
signal.loc['2022-04-05':'2022-04-08'] -= 4

QUERY: OK, I need a graphical representation of the result.

import matplotlib.pyplot as plt

# Plot the time series
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(signal, label='signal')

# Plot the abrupt changes
abrupt_changes = detect_abrupt_changes(signal, 0.3, 10)
abrupt_changes = abrupt_changes.astype(int)
ax.plot(signal[abrupt_changes==1], 'ro', label='abrupt changes')

# Add labels, grid, and legend
ax.set_xlabel('Time')
ax.set_ylabel('Signal')
ax.set_title('Time Series with Abrupt Changes')
ax.legend()
ax.grid(True)

# Show the plot
plt.show()

Wow, that was fast. Using Google Colab, I put them into a Jupyter Notebook and checked the results.

First automatic example and result

I didn’t expect too much, and it’s interesting that it didn’t detect some bigger jumps that I would have noted at first glance.

  • 1st Remark: I had to modify the visualization code a few times to get it to display properly because I got some errors. But that’s another story, and it’s funny how ChatGPT explains its “work”.
  • 2nd Remark: The source notebook file for the previous examples and the following ones can be found here.

Now I changed the signal to the one shown in the first image and tested the routine.

signal generation:

import numpy as np
import pandas as pd

def generate_signal(num_points):
    """
    Generate a stochastic signal with multiple change points in mean and standard deviation.
    """
    s = np.concatenate([np.random.normal(loc=0, scale=1, size=num_points // 6),
                        np.random.normal(loc=1.0, scale=1, size=num_points // 6),
                        np.random.normal(loc=1.0, scale=2, size=num_points // 6),
                        np.random.normal(loc=5, scale=1, size=num_points // 6),
                        np.random.normal(loc=5, scale=2, size=num_points // 6),
                        np.random.normal(loc=-2, scale=1, size=num_points // 6)])

    # Wrap the values in a pandas Series with a daily date index.
    # Use len(s) here: num_points // 6 rounding can make s shorter than num_points.
    signal = pd.Series(s, index=pd.date_range('2022-01-01', periods=len(s), freq='D'))
    return signal

visualization:

# Detect abrupt changes
abrupt_changes = detect_abrupt_changes(signal, 0.3, 10)
abrupt_changes = abrupt_changes.astype(int)

# Plot the time series
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(signal, label='signal')

# Plot the abrupt changes
for i in signal.index[abrupt_changes == 1]:
    ax.axvline(i, color='r')

# Add labels, grid, and legend
ax.set_xlabel('Time')
ax.set_ylabel('Signal')
ax.set_title('Time Series with Abrupt Changes')
ax.legend()
ax.grid(True)

# Show the plot
plt.show()
Result applying the change detection routine

I experimented a bit with the parameters, but the result was never truly “striking”.

Now came an exciting part: what different variations could I get out of ChatGPT, e.g. comparing changes to a moving average or taking local variances into account. The textual instructions were typically like this: “Modify the routine …”, “Consider …”, “Apply a rolling window …”, and “Compare it to the moving average …”.

I’ll show you a few examples to give you a taste.

Change detection relative to the moving average
Change detection relative to the moving average and variance
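As a rough sketch of the moving-average idea (my own reconstruction, not ChatGPT’s actual output; `window` and `n_sigmas` are assumed parameter names), a rolling z-score-style detector might look like:

```python
import pandas as pd

def detect_changes_rolling(signal, window=20, n_sigmas=3.0):
    """Flag samples that deviate from the trailing rolling mean by more
    than n_sigmas trailing rolling standard deviations."""
    mean = signal.rolling(window).mean().shift(1)  # shift(1): use past values only
    std = signal.rolling(window).std().shift(1)
    return (signal - mean).abs() > n_sigmas * std
```

Shifting the rolling statistics by one sample keeps the detector causal: each point is compared only against history, never against itself.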

Of course, we could have achieved much better results if we changed the logic more, tried to fine-tune the parameters, etc. What was very impressive was how quickly it was possible to try different variations, without putting a lot of emphasis on coding.

Topic research

QUERY: Are there specific change detection algorithms?

QUERY: Explain how CUSUM works and how it can be used for change detection.
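For context, CUSUM accumulates deviations from a reference level and raises an alarm once the cumulative sum exceeds a threshold. Here is a minimal two-sided sketch (my own illustration, not ChatGPT’s answer; `threshold` and `drift` are tuning assumptions):

```python
import numpy as np

def cusum(signal, threshold=5.0, drift=0.5):
    """Two-sided CUSUM: accumulate positive and negative deviations from a
    reference level and report an alarm when either sum exceeds threshold."""
    reference = signal[0]          # initial reference level
    g_pos, g_neg = 0.0, 0.0
    alarms = []
    for i, x in enumerate(signal):
        g_pos = max(0.0, g_pos + (x - reference) - drift)
        g_neg = max(0.0, g_neg - (x - reference) - drift)
        if g_pos > threshold or g_neg > threshold:
            alarms.append(i)
            g_pos, g_neg = 0.0, 0.0
            reference = x          # re-anchor to the new level after an alarm
    return alarms
```

The `drift` term suppresses alarms from small random fluctuations, while `threshold` trades detection delay against false alarms.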

QUERY: Are there specific python packages for change detection?

I was sure that sooner or later we would get to the ruptures package ✔.

QUERY: Create a python function to detect changes in time series. Return locations of changes. Use ruptures package.

Function detect_changes using ruptures package

My problem was with the parameter n_bkps, which specifies the number of change points to detect. I asked it to suggest parameter settings where this does not need to be predefined. I also wondered what the other parameters were good for. All of this was resolved quite quickly.

Parameters explained in ruptures package

To make a long story short, I ended up creating a simple routine. To tell the truth, I received several error messages due to different package versions, so in the end, I had to edit the code manually. But that should be the least of my worries 😊.

import numpy as np
import pandas as pd
import ruptures as rpt

def detect_changes(time_series):
    # Convert time series to a numpy array
    signal = time_series.values

    # Perform change point detection using the Pelt algorithm
    algo = rpt.Pelt(model="rbf", min_size=1, jump=10).fit(signal)
    result = algo.predict(pen=2)

    # Remove the location if it is equal to len(signal)
    change_points = [i for i in result if i < len(signal)]

    # Return the list of change point locations
    return change_points

And the final result — not bad:

Detected change points

Conclusions

Overall, I was very satisfied and impressed.

  • By using a simple conversational interface, we can access information more easily.
  • I think this lowers the threshold for conducting data analysis and makes it more accessible to a wider range of users.
  • Even experts can process, analyze, and visualize data more efficiently with the help of chat functions.

The source code can be found here. Thanks for reading.

Give ChatGPT a try for your data analysis needs and discover a very exciting way to analyze your data.

UPDATE on March 28: some readers pointed out that there were logical errors in the code and several pieces of false information in the text generated by ChatGPT. Thank you for writing; this is very important, and I appreciate it. For example, there is no CUSUM algorithm in the ruptures package, and bcp is an existing module, but it has nothing to do with change point detection. I left all of them in the text because they highlight a very important point: ChatGPT generates text “blindly”, with no content control or validation. The name ‘bcp’ merely sounds right for ‘Bayesian Change Point’. We need critical thinking. Thank you very much for your comments.


István Szatmári

Data Analytics competence group leader at evopro Innovation Kft