Hctsa Manual
Table of Contents
Introduction 1.1
List of included code files 1.1.1
Installing and setting up 1.2
Structure of the hctsa framework 1.2.1
Overview of an hctsa analysis 1.2.2
Compiling binaries 1.2.3
Running hctsa computations 1.3
Input files 1.3.1
Performing calculations 1.3.2
Inspecting errors 1.3.3
Working with hctsa files 1.3.4
Analyzing and visualizing results 1.4
Assigning group labels to data 1.4.1
Filtering and normalizing 1.4.2
Clustering rows and columns 1.4.3
Visualizing the data matrix 1.4.4
Plotting the time series 1.4.5
Low dimensional representation 1.4.6
Finding nearest neighbors 1.4.7
Investigating specific operations 1.4.8
Exploring classification accuracy 1.4.9
Finding informative features 1.4.10
Interpreting features 1.4.11
Working with short time series 1.4.12
Working with a mySQL database 1.5
Setting up the mySQL database 1.5.1
The database structure 1.5.2
Populating the database with time series and operations 1.5.3
Adding time series 1.5.4
Retrieving from the database 1.5.5
Computing operations and writing back to the database 1.5.6
Cycling through computations using runscripts 1.5.7
Clearing or removing data 1.5.8
Retrieving data from the database 1.5.9
Error handling and maintenance 1.5.10
Introduction
List of included code files
The full default set of over 7700 features in hctsa is produced by running all of the code files
below, many of which produce a large number of outputs (e.g., some functions fit a time-
series model and then output statistics including the parameters of the best-fitting model,
measures of the model's goodness of fit, the optimal model order, and autocorrelation
statistics on the residuals). In our default feature set, each function is also run with different
sets of input parameters, with each parameter set yielding characteristic outputs. For
example, different inputs to CO_AutoCorr determine the method by which autocorrelation is
computed, as well as the time lag at which autocorrelation is calculated (e.g., lag 1, lag 2, lag
3, etc.); WL_dwtcoeff has inputs that set the mother wavelet to use and the level of wavelet
decomposition; and FC_LocalSimple has inputs that determine the time-series forecasting
method to use and the size of the training window. The code files below, and the input
parameters that define the default hctsa feature set, are listed in the INP_mops.txt file of the
hctsa repository.
Distribution
Code summarizing properties of the distribution of values in a time series (disregarding their
sequence through time).
DN_Quantile: Quantiles of the distribution of values in the time series data vector.
DN_RemovePoints: How time-series properties change as points are removed.
DN_SimpleFit: Fit distributions or simple time-series models to the data.
DN_Spread: Measure of spread of the input time series.
DN_TrimmedMean: Mean of the trimmed time series using trimmean .
DN_Unique: The proportion of the time series that are unique values.
DN_Withinp: Proportion of data points within p standard deviations of the mean.
DN_cv: Coefficient of variation.
DN_pleft: Distance from the mean at which a given proportion of data are more distant.
EN_DistributionEntropy: Distributional entropy.
HT_DistributionTest: Hypothesis test for distributional fits to a data vector.
Correlation
Code summarizing basic properties of how values of a time series are correlated through
time.
CO_AddNoise: Changes in the automutual information with the addition of noise.
CO_AutoCorr: Compute the autocorrelation of an input time series.
CO_AutoCorrShape: How the autocorrelation function changes with the time lag.
CO_Embed2: Statistics of the time series in a 2-dimensional embedding space.
CO_Embed2_AngleTau: Angle autocorrelation in a 2-dimensional embedding space.
CO_Embed2_Basic: Point density statistics in a 2-d embedding space.
CO_Embed2_Dist: Analyzes distances in a 2-d embedding space of a time series.
CO_Embed2_Shapes: Shape-based statistics in a 2-d embedding space.
CO_FirstMin: Time of first minimum in a given correlation function.
CO_FirstZero: The first zero-crossing of a given autocorrelation function.
CO_NonlinearAutocorr: A custom nonlinear autocorrelation of a time series.
CO_StickAngles: Analysis of line-of-sight angles between time-series data points.
CO_TranslateShape: Statistics on datapoints inside geometric shapes across the time series.
CO_f1ecac: The 1/e correlation length.
CO_fzcglscf: The first zero-crossing of the generalized self-correlation function.
CO_glscf: The generalized linear self-correlation function of a time series.
CO_tc3: Normalized nonlinear autocorrelation function, tc3 .
CO_trev: Normalized nonlinear autocorrelation, trev function of a time series.
DK_crinkle: Computes James Theiler's crinkle statistic.
DK_theilerQ: Computes Theiler's Q statistic.
DK_timerev: Time reversal asymmetry statistic.
NL_embed_PCA: Principal components analysis of a time series in an embedding space.
Automutual information:
CO_RM_AMInformation: Automutual information (Rudy Moddemeijer implementation).
CO_HistogramAMI: The automutual information of the distribution using histograms.
IN_AutoMutualInfoStats: Statistics on the automutual information function of a time series.
EN_Randomize: How time-series properties change with increasing randomization.
EN_SampEn: Sample Entropy of a time series.
EN_mse: Multiscale entropy of a time series.
EN_rpde: Recurrence period density entropy (RPDE).
EN_wentropy: Entropy of a time series using wavelets.
MF_FitSubsegments: Robustness of model parameters across different segments of a time series.
MF_GARCHcompare: Comparison of GARCH time-series models.
MF_GARCHfit: GARCH time-series modeling.
MF_GP_FitAcross: Gaussian Process time-series modeling for local prediction.
MF_GP_LocalPrediction: Gaussian Process time-series model for local prediction.
MF_GP_hyperparameters: Gaussian Process time-series model parameters and goodness of fit.
MF_StateSpaceCompOrder: Change in goodness of fit across different state-space models.
MF_StateSpace_n4sid: State-space time-series model fitting.
MF_arfit: Statistics of a fitted AR model to a time series.
MF_armax: Statistics on a fitted ARMA model.
MF_hmm_CompareNStates: Hidden Markov Model (HMM) fitting to a time series.
MF_hmm_fit: Fits a Hidden Markov Model to sequential data.
MF_steps_ahead: Goodness of model predictions across prediction lengths.
FC_LocalSimple: Simple local time-series forecasting.
FC_LoopLocalSimple: How simple local forecasting depends on window length.
FC_Surprise: How surprised you would be of the next data point given recent memory.
PP_ModelFit: Investigates whether AR model fit improves with different preprocessings.
SY_DynWin: How stationarity estimates depend on the number of time-series subsegments.
SY_KPSStest: The KPSS stationarity test.
SY_LocalDistributions: Compares the distribution in consecutive time-series segments.
SY_LocalGlobal: Compares local statistics to global statistics of a time series.
SY_PPtest: Phillips-Perron unit root test.
SY_RangeEvolve: How the time-series range changes across time.
SY_SlidingWindow: Sliding window measures of stationarity.
SY_SpreadRandomLocal: Bootstrap-based stationarity measure.
SY_StatAv: Simple mean-stationarity metric, StatAv .
SY_StdNthDer: Standard deviation of the nth derivative of the time series.
SY_StdNthDerChange: How the output of SY_StdNthDer changes with the order parameter.
SY_TISEAN_nstat_z: Cross-forecast errors of zeroth-order time-series models.
SY_VarRatioTest: Variance ratio test for random walk.
Step detection:
CP_ML_StepDetect: Analysis of discrete steps in a time series.
CP_l1pwc_sweep_lambda: Dependence of step detection on the regularization parameter.
CP_wavelet_varchg: Variance change points in a time series.
NL_DVV: Delay Vector Variance method for real and complex signals.
NL_MS_fnn: False nearest neighbors of a time series.
NL_MS_nlpe: Normalized drop-one-out constant interpolation nonlinear prediction error.
NL_TISEAN_c1: Information dimension.
NL_TISEAN_d2: d2 routine from the TISEAN package.
NL_TSTL_GPCorrSum: Correlation sum scaling by the Grassberger-Procaccia algorithm.
NL_TSTL_LargestLyap: Largest Lyapunov exponent of a time series.
NL_TSTL_PoincareSection: Poincare section analysis of a time series.
NL_TSTL_ReturnTime: Analysis of the histogram of return times.
NL_TSTL_TakensEstimator: Takens' estimator for the correlation dimension.
NL_TSTL_acp: acp function in TSTOOL.
NL_TSTL_dimensions: Box-counting, information, and correlation dimension of a time series.
NL_crptool_fnn: Analyzes the false-nearest-neighbors statistic.
SD_SurrogateTest: Analyzes test statistics obtained from surrogate time series.
SD_TSTL_surrogates: Surrogate time-series analysis.
TSTL_delaytime: Optimal delay time using the method of Parlitz and Wichard.
TSTL_localdensity: Local density estimates in the time-delay embedding space.
Fluctuation analysis:
SC_MMA: Physionet implementation of multiscale multifractal analysis.
SC_fastdfa: Matlab wrapper for Max Little's ML_fastdfa code.
SC_FluctAnal: Implements fluctuation analysis by a variety of methods.
Properties of the time-series power spectrum, wavelet spectrum, and other periodicity
measures.
WL_fBM: Parameters of fractional Gaussian noise/Brownian motion in a time series.
WL_scal2frq: Frequency components in a periodic time series.
Symbolic transformations
Properties of a discrete symbolization of a time series.
SB_MotifThree: Motifs in a coarse-graining of a time series to a 3-letter alphabet.
SB_MotifTwo: Local motifs in a binary symbolization of the time series.
SB_TransitionMatrix: Transition probabilities between different time-series states.
SB_TransitionpAlphabet: How transition probabilities change with alphabet size.
ST_MomentCorr: Correlations between simple statistics in local windows of a time series.
ST_SimpleStats: Basic statistics about an input time series.
Others
Other properties, like extreme values, visibility graphs, physics-based simulations, and
dependence on pre-processings applied to a time series.
PH_Walker: Simulates a hypothetical walker moving through the time domain.
PP_Compare: Compare how time-series properties change after pre-processing.
PP_Iterate: How time-series properties change in response to iterative pre-processing.
Installing and setting up
Getting started
The hctsa package can be used completely within Matlab, allowing users to analyse time-
series datasets quickly and easily. Here we will focus on this Matlab-based use of the
software, but note that, for larger datasets requiring distributed computing set-ups, or
datasets that may grow with time, hctsa can also be linked to a mySQL database, as
described in a dedicated chapter.
After installation, future use of the package can begin by opening Matlab, navigating to the
hctsa package, and then loading the paths required by the hctsa package by running the
startup script.
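For example (a minimal sketch; the install location shown here is illustrative):

cd('~/hctsa')  % navigate to your local copy of the hctsa repository
startup        % load the paths required by the hctsa package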
Structure of the hctsa framework
Overview
The hctsa framework consists of three basic objects:
1. Master Operations specify pieces of code (Matlab functions) and their inputs to be
computed. Taking in a single time series, master operations can generate a large
number of outputs as a Matlab structure, each of which can be identified with a single
operation (or 'feature').
2. Operations (or 'features') are a single number summarizing some measure of structure
in a time series. In hctsa, each operation links to an output from a piece of evaluated
code (a master operation).
3. Time series are univariate, uniformly sampled, time-ordered measurements.
For example, a single master operation might compute the autocorrelation of an input time
series (x) at lags 1, 2, ..., 5. Each operation (or 'feature') is then a single number that draws
on this set of outputs, such as the autocorrelation at lag 1, which is named AC_1 .
In the hctsa framework, master operations, operations, and time series are stored as
structure arrays that contain all of their associated keywords and metadata (and actual time-
series data in the case of time series).
For a given hctsa analysis, the user must specify a set of code to evaluate (master
operations), their associated individual outputs to measure (operations), and a set of time
series to evaluate the features on (time series).
Having specified a set of master operations, operations, and time series, the results of
computing all operations on all time series are stored in three matrices: TS_DataMat (the
output of each operation on each time series), TS_CalcTime (the calculation time of each
operation on each time series), and TS_Quality (quality labels indicating errors or
special-valued outputs).
Quality labels
Quality labels are used to indicate when operations take non-real values, or when fatal
errors were encountered. Quality labels are stored in the Quality column of the Results
table in the mySQL database, and in local Matlab files as the TS_Quality matrix.
When the quality label is nonzero, this indicates that a special-valued output occurred. In this
case, the output value of the operation is set to zero, as a convention, and the quality label
codes the special output value:
Quality label: Description
0: No problems with the calculation.
1: A fatal error was encountered.
2: The output was NaN.
3: The output was Inf.
4: The output was -Inf.
5: The output was a complex number.
6: The output was empty.
7: The field specified for this operation did not exist in the master operation output structure.
Overview of an hctsa analysis
1. Initialize an HCTSA.mat file, which contains all of the information about the set of time
series and operations in your analysis, as well as the results of applying all operations
to all time series, using TS_init .
2. Compute these operations on your time-series data using TS_compute . The results are
structured in the local HCTSA.mat file containing matrices (that store the results of the
computations) and structure arrays (that store information about the time-series data
and operations), as described here.
Your time series should first be formatted into an input file (e.g., INP_ts.mat ) containing the
time-series data, names, and keyword labels, as described here. You then initialize an hctsa
calculation using the default library of features:
>> TS_init('INP_ts.mat');
This generates a local file, HCTSA.mat , containing the associated metadata for your time
series, as well as information about the full time-series feature library ( Operations ) and the
set of functions and code to call to evaluate them ( MasterOperations ), as described here.
Next you want to evaluate the code on all of the time series in your dataset. For this you can
simply run:
>> TS_compute;
as described here; or, for larger datasets, use a script that regularly saves results back to the
local file (cf. sample_runscript_matlab ).
Having run your calculations, you may then want to label groups of your data using the
keywords you provided (in the case that you have labeled groups of time series):
>> TS_LabelGroups;
and then normalize and filter the data using the default sigmoidal transformation:
>> TS_normalize;
A range of visualization scripts are then available to analyze the results, such as plotting a
low-dimensional principal components representation of the data matrix:
>> TS_plot_pca;
Or to determine which features are best at classifying the labeled groups of time series in
your dataset:
>> TS_TopFeatures;
Compiling binaries
Some external code packages require compiled binary code to be used. Compilation of the
mex code is handled by compile_mex as part of the install script, but the TISEAN
package binaries need to be compiled separately in the command line.
Once mex is set up, the mex functions used in the time-series code repository can be
compiled by navigating to the Toolboxes directory and then running compile_mex . The
TISEAN binaries are compiled separately from the command line, using the standard build
sequence from within the directory containing the TISEAN source:
$ ./configure
$ make clean
$ make
$ make install
This should install the TISEAN binaries in your ~/bin/ directory (you can instead install into a
system-wide directory, /usr/bin , for example, by running ./configure --prefix=/usr ).
Additional information about the TISEAN installation process is provided on the TISEAN
website.
If installation was successful then you should be able to access the newly-compiled binaries
from the commandline, e.g., typing the command which poincare should return the path to
the TISEAN function poincare . Otherwise, you should check that the install directory is in
your system path, e.g., by adding the following to your shell profile:
export PATH=$PATH:$HOME/bin
The path where TISEAN is installed will also have to be in Matlab's environment path, which
is added by startup.m , assuming that the binaries are stored in ~/bin . The startup.m
code also adds the DYLD_LIBRARY_PATH, which is required for TISEAN to function
properly.
If you choose to use a custom location for the TISEAN binaries, that is not in the default
Matlab system path ( getenv('PATH') in Matlab), then you will have to add this path
manually. You can test that Matlab can see the TISEAN binaries by typing, for example, the
following into Matlab:
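For example (one way to perform this check, using Matlab's system command):

>> system('which poincare')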
If Matlab's system paths are set up correctly, this command should return the path to your
compiled TISEAN binary, poincare .
If TISEAN cannot be installed, the operations that depend on it can instead be removed from
your feature library:
1. To filter a local Matlab hctsa file (e.g., HCTSA.mat ), you can use the following:
TS_local_clear_remove('ops',TS_getIDs('tisean','raw','ops'),1,'raw'); , which will
remove all operations with the 'tisean' keyword from the hctsa dataset in HCTSA.mat .
2. If using a mySQL database, TISEAN functions can be removed from the database as
follows: SQL_clear_remove('ops',SQL_getids('ops',0,'tisean',{}),1) .
Running hctsa computations
Running Computations
An hctsa analysis requires setting up a library of time series, master operations, and
operations, and generating a HCTSA.mat file (using TS_init ), as described here. Once this
is set up, computations are run using TS_compute .
These steps, as well as how to inspect the results of an hctsa analysis and how to work with
HCTSA*.mat files, are described in this chapter.
Input files
To use the default library of operations, you can initialize a time-series dataset (e.g., as
specified in the .mat file INP_test_ts.mat ) using the following:
TS_init('INP_test_ts.mat');
To specify the sets of master operations and operations explicitly, you can provide all three input files:
TS_init('INP_ts.mat','INP_mops.txt','INP_ops.txt');
TS_init produces a Matlab file, HCTSA.mat , containing all of the structures required to
understand the set of time series, operations, and the results of their computation (explained
here).
Through this initialization process, each time series will be assigned a unique ID, as will
each master operation, and each operation.
When formatting a time series input file, two formats are available:
.mat file input, which is suited to data that are already stored as variables in Matlab,
.txt file input, which is better suited to when each time series is already stored as an
individual text file.
Note that when using the .mat file input method, time-series data is stored in the database to
six significant figures. However, when using the .txt file input method, time-series data
values are stored in the database as written in the input text file of each time series.
The .mat file input should contain the following variables:
timeSeriesData : either a N x 1 cell (for N time series), where each element contains a
vector of time-series values, or a N x M matrix, where each row specifies the values of a
time series (all of length M).
labels : a N x 1 cell of unique strings containing a named label for each time series.
keywords : a N x 1 cell of strings, where each string contains a comma-delimited set of
keywords for the corresponding time series (with no whitespace).
An example involving two time series is sketched below. In this example, we add two time
series (showing only the first two values of each), which are labeled according to .dat files
from a hypothetical EEG experiment, and assigned keywords (which are separated by
commas and no whitespace). In this case, both are assigned the keywords 'subject1' and
'eeg' and, additionally, the first time series is assigned 'trial1', and the second 'trial2' (these
keywords can be used later to retrieve individual time series). Note that the labels do not
need to specify filenames, but can be any useful label for a given time series.
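A minimal sketch of constructing such an input file (the file names and data values here are illustrative):

% Time-series data: an N x 1 cell, where each element is a vector of values
% (only the first two values of each time series are shown here):
timeSeriesData = {[1.45, 0.22]; [0.91, -0.53]};
% A unique name for each time series:
labels = {'eeg_subject1_trial1.dat'; 'eeg_subject1_trial2.dat'};
% Comma-delimited keywords for each time series (no whitespace):
keywords = {'subject1,eeg,trial1'; 'subject1,eeg,trial2'};
% Save the three variables to a .mat input file:
save('INP_test.mat','timeSeriesData','labels','keywords');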
% Initialize a new hctsa analysis using these data and the default feature library:
TS_init('INP_test.mat');
When using a text file input, the input file now specifies filenames of time series data files,
which Matlab will then attempt to load (using dlmread ). Data files should thus be accessible
in the Matlab path. Each time-series text file should have a single real number on each row,
specifying the ordered values that make up the time series. Once imported, the time-series
data is stored in the database; thus the original time-series data files are no longer required,
and can be removed from the Matlab path.
The input text file should be formatted as rows with each row specifying two whitespace
separated entries: (i) the file name of a time-series data file and (ii) comma-delimited
keywords.
For example, consider the following input file, containing three lines (one for each time
series to be added to the database):
gaussianwhitenoise_001.dat noise,gaussian
gaussianwhitenoise_002.dat noise,gaussian
sinusoid_001.dat periodic,sine
Using this input file, a new analysis will contain 3 time series: gaussianwhitenoise_001.dat
and gaussianwhitenoise_002.dat will be assigned the keywords 'noise' and 'gaussian', and
the data in sinusoid_001.dat will be assigned the keywords 'periodic' and 'sine'. Note that
keywords should be separated only by commas (and no whitespace).
Master operations are specified in a separate input file. The (potentially many) outputs from
a master operation can then be mapped to individual operations (or features): single real
numbers summarizing a time series, which make up individual columns of the resulting data
matrix.
Two example lines from the input file, INP_mops.txt (in the Database directory of the
repository), are as follows:
CO_tc3(y,1) CO_tc3_y_1
ST_length(x) ST_length
Each line in the input file specifies two pieces of information, separated by whitespace: (i) a
piece of code to execute (with its input parameters), and (ii) a unique label for that master
operation.
We use the convention that x refers to the input time series and y refers to a z-scored
transformation of the input time series (i.e., y = zscore(x) ). In the example above, the first
line thus adds an entry in the database for running the code CO_tc3 using a z-scored time
series as input (y), with 1 as the second input, under the label CO_tc3_y_1 ; the second
line adds an entry for running the code ST_length using the non-z-scored time series, x,
under the label ST_length .
When the time comes to perform computations on data using the methods in the database,
Matlab needs to have path access to each of the master operations functions specified in
the database. For the above example, Matlab will attempt to run both CO_tc3(y,1) and
ST_length(x) , and thus the functions CO_tc3.m and ST_length.m must be in the Matlab
path. Recall that the script startup.m , which should be run at the start of each session
using hctsa, handles the addition of paths required for the default code library.
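Operations are specified in a similar input file (e.g., INP_ops.txt ). A sketch of some example lines (the operation names and structure fields shown here are illustrative):

CO_tc3_y_1.raw CO_tc3_1_raw correlation,nonlinear
CO_tc3_y_1.abs CO_tc3_1_abs correlation,nonlinear
CO_tc3_y_1.num CO_tc3_1_num correlation,nonlinear
CO_tc3_y_1.absnum CO_tc3_1_absnum correlation,nonlinear
CO_tc3_y_1.denom CO_tc3_1_denom correlation,nonlinear
ST_length length raw,lengthDependent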
The first column references a corresponding master label and, in the case of master
operations that produce a structure, the particular field of the structure to reference (after the
full stop); the second column denotes the label for the operation; and the final column is a
set of comma-delimited keywords (that must not include whitespace). Whitespace is used to
separate the three entries on each line of the input file. In this example, the master operation
labeled CO_tc3_y_1 outputs a structure, with fields that are referenced by the first five
operations listed here, and the ST_length master operation outputs a single number (the
length of the time series), which is referenced by the operation named length here. The
two keywords 'correlation' and 'nonlinear' are added to the CO_tc3_1 operations, while the
keywords 'raw' and 'lengthDependent' are added to the operation called length . These
keywords can be used to organize and filter the set of operations used for a given analysis
task.
Performing calculations
Calculations are performed using the function TS_compute , which stores results back into
the matrices in HCTSA.mat . This function can be run without inputs to compute all missing
values in the default hctsa file, HCTSA.mat :
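TS_compute;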
TS_compute will begin evaluating operations on time series in HCTSA.mat for which elements
in TS_DataMat are NaN (i.e., computations that have not been run previously). Results are
stored back in the matrices of HCTSA.mat : TS_DataMat (output of each operation on each
time series), TS_CalcTime (calculation time for each operation on each time series), and
TS_Quality (labels indicating errors or special-valued outputs).
(2) Computing across a custom range of time-series IDs ( ts_id ) and operation IDs
( op_id ). This can be achieved by setting the second and third inputs:
% Compute missing values in HCTSA.mat for ts_ids from 1:10 and op_ids from 1:1000
TS_compute(false,1:10,1:1000);
% Compute all values that have never been computed before (default):
TS_compute(false,[],[],'missing');
% Compute all values that have never previously been computed OR have previously been computed but returned an error:
TS_compute(false,[],[],'error');
(5) Suppress commandline output. All computations are displayed to screen by default
(which can be overwhelming but is useful for error checking). This functionality can be
suppressed by setting the final (6th) input to false :
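For example (a sketch, assuming the input order doParallel, ts_ids, op_ids, computeWhat, customFile, beVocal):

% Compute all missing values without displaying output to the screen:
TS_compute(false,[],[],'missing',[],false);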
Times may vary on individual machines, but the above plot can be used to estimate the
computation time per time series, and thus help decide on an appropriate computation
strategy for a given dataset.
Note that if computation times are too long for the computational resources at hand, one can
always choose a reduced set of features, rather than the full set of >7000, to get a
preliminary understanding of the dataset. One such reduced set of features is in
INP_ops_reduced.txt . We plan to provide additional reduced feature sets in the future.
On a single machine
If only a single machine is available for computation, there are a couple of options:
1. For small datasets, when it is feasible to run all computations in a single go, it is easiest
to run everything within a single Matlab session (e.g., with a single call to TS_compute ).
Inspecting errors
Some errors are not problems with the code, but represent issues with applying particular
sets of code to particular time series, such as when a Matlab fitting function reaches the
maximum number of iterations and returns an error. Other errors are genuine problems with
the code that need to be corrected. Both cases are labeled as errors in our framework.
It can be good practice to visualize where special values and errors are occurring after a
computation to see where things might be going wrong, using TS_InspectQuality . This can
be run in four modes:
1. TS_InspectQuality('summary'); [default] Summarizes the proportion of special-valued
outputs in each operation as a bar plot, ordered by the proportion of special-valued
outputs.
2. TS_InspectQuality('master'); Plots which types of special-valued outputs were
encountered for each master operation.
3. TS_InspectQuality('full'); Plots the full data matrix (all time series as rows and all
operations as columns), and shows where each possible special-valued output can
occur (including 'error', 'NaN', 'Inf', '-Inf', 'complex', 'empty', or a 'link error').
4. TS_InspectQuality('reduced'); As 'full' , but includes only columns where special
values occurred.
Note that some pieces of code (e.g., mex functions, or routines from external command-line
packages like TISEAN) can produce a fault that crashes Matlab or the system. We have
performed some basic testing on all mex functions, but for some unusual time series, such
faults may still occur. These situations must be dealt with by either identifying and fixing the
problem in the original source code and recompiling, or by removing the problem code.
Troubleshooting errors
When getting information on operations that produce special-valued outputs (getting IDs
listed from TS_InspectQuality ), it can be useful to then test examples by re-running pieces
of code with the problematic data. The function TS_WhichProblemTS can be used to retrieve
time series from an hctsa dataset that caused a problem for a given operation.
Usage is as follows:
% Find time series that failed for the operation with ID = 684.
[ts_ind, dataCell, codeEval] = TS_WhichProblemTS(684);
This provides the list of time series IDs ( ts_ind ), their time-series data vectors ( dataCell ),
and the code to evaluate the given operation (in this case, the master operation code
corresponding to the operation with ID 684).
You can then pick an example time series (e.g., the first problem time series: x =
dataCell{1}; y = zscore(x) ), and then copy and paste the code in codeEval into the
command line to evaluate the code for this time series. This method allows easy debugging
and inspection of examples of time-series data that caused problems for particular
operations flagged through the TS_InspectQuality process.
Working with hctsa files
The hctsa package contains a range of functions for tasks like clearing or removing parts of
an hctsa dataset, extracting subsets, and combining datasets; these work directly with hctsa
.mat files and are described below. Note that these types of tasks are easier to manage
when hctsa data are stored in a mySQL database.
Most filtering functions (such as those listed in this section) require you to specify a range of
IDs of TimeSeries or Operations. Recall that each TimeSeries and Operation is assigned a
unique ID (stored as the ID field in the corresponding structure array). To quickly get the
IDs of time series or operations that match a given keyword, the following function can be
used:
TimeSeriesIDs = TS_getIDs(theKeyword,'HCTSA_N.mat');
OperationIDs = TS_getIDs('entropy','norm','ops');
These IDs can then be used in the functions below (e.g., to clear data, or extract a subset of
data).
Note that to get a quick impression of the unique time-series keywords present in a dataset,
use the function TS_WhatKeywords , which gives a text summary of the unique keywords in an
hctsa dataset.
For example, often you want to remove from your operation library operations that are
dependent on the location of the data (e.g., its mean: 'locdep' ), that only operate on
positive-only time series ( 'posOnly' ), that require the TISEAN package ( 'tisean' ), or that
are stochastic (i.e., they give different results when repeated, 'stochastic' ).
The function TS_local_clear_remove achieves these tasks when working directly with .mat
files (NB: if using a mySQL database, SQL_clear_remove should be used instead).
TS_local_clear_remove loads in an hctsa .mat data file, clears or removes the specified
time series or operations, and then writes the result back to the file.
Example 1: Clear all computed data from time series with IDs 1:5 from HCTSA.mat
(specifying 'raw' ):
TS_local_clear_remove('ts',1:5,0,'raw');
Example 2: Remove all operations with the keyword 'tisean' (that depend on the TISEAN
package) from HCTSA.mat :
TS_local_clear_remove('ops',TS_getIDs('tisean','raw','ops'),1,'raw');
Example 3: Remove all operations that require positive-only data (the 'posOnly' keyword)
from HCTSA.mat :
TS_local_clear_remove('ops',TS_getIDs('posOnly','raw','ops'),1,'raw');
Example 4: Remove all operations that are location dependent (the 'locdep' keyword)
from HCTSA.mat :
TS_local_clear_remove('ops',TS_getIDs('locdep','raw','ops'),1,'raw');
See the documentation in the function file for additional details about the inputs to
TS_local_clear_remove .
A subset of an hctsa dataset can be extracted using TS_subset .
Example 1: Import data from 'HCTSA_N.mat', then save a new dataset containing only time
series with IDs in the range 1--100, and all operations, to 'HCTSA_N_subset.mat' (see the
documentation for all inputs):
TS_subset('norm',1:100,[],1,'HCTSA_N_subset.mat')
Note that the subset in this case will have been normalized using the full dataset of all time
series, even though just this subset (with IDs up to 100) is now being analyzed. Depending
on the normalization method used, different results would be obtained if the subsetting were
performed prior to normalization.
Example 2: Save a new dataset containing only time series with the 'healthy' keyword to
'HCTSA_healthy.mat':
TS_subset('raw',TS_getIDs('healthy','raw'),[],1,'HCTSA_healthy.mat')
When analyzing a growing dataset, sometimes new data needs to be combined with
computations on existing data. Alternatively, when computing a large dataset, sometimes
you may wish to compute sections of it separately, and may later want to combine each
section into a full dataset.
To combine hctsa data files, you can use the TS_combine function.
TS_combine('HCTSA_healthy.mat','HCTSA_disease.mat',0,'HCTSA_combined.mat')
The third input, compare_tsids , controls the behavior of the function in combining time
series. By setting this to 1, TS_combine assumes that the TimeSeries IDs are comparable
between the datasets (e.g., most common when using a mySQL database to store hctsa
data), and thus filters out duplicates so that the resulting hctsa dataset contains a unique set
of time series. By setting this to 0 (default), the output will contain a union of time series
present in each of the two hctsa datasets. In the case that duplicate TimeSeries IDs exist in
the combination file, a new index will be generated in the combined file (where IDs assigned
to time series are re-assigned as unique integers using TS_ReIndex ).
In combining operations, this function works differently when data have been stored in a
unified mySQL database, in which case operation IDs can be compared meaningfully and
combined as an intersection. However, when hctsa datasets have been generated using
TS_init , the function will check that the same set of operations have been used in both
files.
Analyzing and visualizing results
The type of analysis employed should then be motivated by the specific time-series analysis
task at hand. Setting up the problem, guiding the methodology, and interpreting the results
requires strong scientific input that should draw on domain knowledge, including the
questions asked of the data, experience performing data analysis, and statistical
considerations.
The first main component of an hctsa analysis involves filtering and normalizing the data
using TS_normalize , described here, which produces a file called HCTSA_N.mat.
Information about the similarity of pairs of time series and operations can be computed using
TS_cluster , described here, which stores this information in HCTSA_N.mat. The suite of
plotting and analysis tools that we provide with hctsa work with this normalized data, stored
in HCTSA_N.mat, by default.
Group label information (assigned using TS_LabelGroups ) is also stored in the
HCTSA*.mat file, and used by default in the various plotting and analysis functions provided.
Additional analysis functions are provided for basic time-series classification tasks:
Explore the classification performance of the full library of features using TS_classify .
Determine the features that (individually) best distinguish between the labeled groups
using TS_TopFeatures .
Assigning group labels to data
The example below assigns labels to two groups of time series in HCTSA.mat (specifying
the shorthand 'raw' for this default, un-normalized data), corresponding to those labeled
as 'parkinsons' and those labeled as 'healthy':
TS_LabelGroups({'parkinsons','healthy'},'raw');
The first input is a cell specifying the keyword string to use to match each group.
To automatically detect unique keywords for labelling, TS_LabelGroups can be run with an
empty first input, as TS_LabelGroups([],'raw');
By default, this function saves the group indices back to the data file (in this example,
HCTSA.mat ), by adding a new field, Group , to the TimeSeries structure array, which
stores the group label assigned to each time series. Group indices stay with the time series
they are assigned to after filtering and normalizing the data (using TS_normalize ). Group
labels can be reassigned at any time by re-running the TS_LabelGroups function.
Filtering and normalizing
Filtering and normalizing are performed using TS_normalize , e.g.:
TS_normalize('scaledRobustSigmoid',[0.8,1.0]);
The first input controls the normalization method, in this case a scaled, outlier-robust
sigmoidal transformation, specified with 'scaledRobustSigmoid'. The second input controls
the filtering of time series and operations based on minimum thresholds for good values in
the corresponding rows (corresponding to time series; filtered first) and columns
(corresponding to operations; filtered second) of the data matrix.
In the example above, time series (rows of the data matrix) with more than 20% special
values (specifying 0.8) are first filtered out, and then operations (columns of the data matrix)
containing any special values (specifying 1.0) are removed. Columns with approximately
constant values are also filtered out. After filtering the data matrix, the outlier-robust
scaledRobustSigmoid sigmoidal transformation is applied to all remaining operations
(columns). The filtered, normalized matrix is saved to the file HCTSA_N.mat .
Note that the 'scaledRobustSigmoid' transformation does not tolerate distributions with an
interquartile range of zero; operations with such distributions are filtered out.
Analysis can be performed on the data contained in HCTSA_N.mat in the knowledge that
different settings for filtering and normalizing the results can be applied at any time by simply
rerunning TS_normalize , which will overwrite the existing HCTSA_N.mat with the results of
the new normalization and filtration settings.
Clustering rows and columns
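Rows and columns of the data matrix can be reordered according to their pairwise similarity using TS_cluster , e.g., with default settings:

TS_cluster;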
This function reads in the data from HCTSA_N.mat , and stores the re-ordering of rows and
columns back into HCTSA_N.mat in the ts_clust and op_clust structures (and, if the
size is manageable, also the pairwise distance information). Visualization functions (such as
TS_plot_DataMatrix and TS_SimSearch ) can then take advantage of this information,
using the clustered ordering by default.
Visualizing the data matrix
TS_plot_DataMatrix
This will produce a colored visualization of the data matrix such as that shown below.
When data is grouped according to a set of distinct keywords and stored as group metadata
(using the TS_LabelGroups function), these can also be visualized using
TS_plot_DataMatrix('colorGroups',1) .
In this plot, black rectangles label missing values, and other values are shown from low
(blue) to high (red) after normalization using the scaled outlier-robust sigmoidal
transformation. Due to the size of the matrix, operations are not labeled.
Examples of time series segments are shown to the left of the plot, and when the middle plot
is zoomed, the time-series annotations remain matched to the data matrix:
By reordering rows and columns, this representation reveals correlated patterns of outputs
across different types of operations, and similar sets of properties between different types of
time series.
In this example, we consider a set of 20 periodic and 20 noisy periodic signals. We assigned
the time series to groups (using TS_LabelGroups('orig',{'periodic','noisy'},'ts') ),
normalized the data matrix ( TS_normalize ), and then clustered it ( TS_cluster ). So now we
have a clustered data matrix containing thousands of summaries of each time series, as well
as pre-assigned group information as to which time series are periodic and which are noisy.
When the time series have been assigned to groups, this grouping can be displayed by
setting the 'colorGroups' input to 1, as sketched below:
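% A sketch of both calls, based on the 'colorGroups' input described above:
TS_plot_DataMatrix('colorGroups',0); % plot without group information
TS_plot_DataMatrix('colorGroups',1); % color the assigned groups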
When group information is not used (the left plot), the data is visualized in the default
blue/yellow/red color scheme, but when the assigned groups are colored (right plot), we see
that the clustered dataset separates perfectly into the periodic (green) and noisy (blue) time
series, and we can visualize the features that contribute to the separation.
Plotting the time series
Basic plotting
For example, to plot a set of time series that have not been assigned groups, we can run the
following:
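A minimal sketch, assuming normalized data in HCTSA_N.mat :

TS_plot_timeseries('norm');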
For our assorted set of time series, this produces the following:
Showing the first 400 samples of 10 selected time series, equally-spaced through the
TimeSeries IDs in HCTSA_N.mat .
Freeform plotting
Many more custom plotting options are available by passing an options structure to
TS_plot_timeseries , including the 'plotFreeForm' option, which allows very many time
series to be plotted together in a single, free-form figure:
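A sketch of such a call (the inputs shown, 40 time series and 300 samples, follow the description below; see the function file for the exact input order and supported options):

% Plot many time series together in a single free-form figure:
plotOptions = struct('plotFreeForm',true,'displayTitles',false);
TS_plot_timeseries('norm',40,[],300,plotOptions);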
producing an overview picture of the first 300 samples of 40 time series (spaced through the
rows of the data matrix).
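When group labels have been assigned, time series can also be plotted per group; a sketch (the inputs shown, 5 time series per group and 500 samples, follow the description below):

TS_plot_timeseries('norm',5,[],500);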
In this case the two labeled groups of time series are recognized by the function: red (noisy),
blue (no noise), and then 5 time series in each group are plotted, showing the first 500
samples of each time series.
Low dimensional representation
TS_plot_pca('norm','ts');
This uses the normalized data (specifying 'norm' ), plotting time series (specifying 'ts' ) in
the reduced, two-dimensional principal components space of operations (the leading two
principal components of the data matrix).
By default, the user will be prompted to select 10 points on the plot to annotate with their
corresponding time series, which are annotated as the first 300 points of that time series
(and their names by default).
Customizing annotations
Annotation properties can be altered in some detail by specifying properties in the
annotateParams input variable, for example:
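A sketch (the fields and input order here are assumptions; see the function file for the full set of supported annotation properties):

% e.g., annotate 6 time series, showing up to 300 samples of each:
annotateParams = struct('n',6,'maxL',300);
TS_plot_pca('norm','ts',true,annotateParams);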
yields:
Consider the sample dataset containing 20 periodic signals with additive noise (given the
keyword noisy in the database), and 20 purely periodic signals (given the keyword periodic
in the database). After retrieving and normalizing the data, we store the two groups in the
metadata for the normalized dataset HCTSA_N.mat:
TS_LabelGroups('norm',{'noisy','periodic'},'ts');
We found:
noisy -- 20 matches
periodic -- 20 matches
Saving group labels and information back to HCTSA_N.mat... Saved.
Now when we plot the dataset using TS_plot_pca , it will automatically distinguish the
groups in the plot and attempt to classify the difference in the reduced principal components
space. The function then directs you to select 6 points to annotate with their time series,
producing the following:
Notice how the two labeled groups have been distinguished as red and blue points, and a
linear classification boundary has been added (with in-sample misclassification rate
annotated to the title and to each individual principal component). If marginal distributions
are plotted (setting showDistribution = 1 above), they are labeled according to the same
colors.
Finding nearest neighbors
The hctsa framework provides a way to easily compute distances between pairs of time
series, e.g., as a Euclidean distance between their normalized feature vectors. This allows
very different time series (in terms of their origin, their method of recording and
measurement, and their number of samples) to be compared straightforwardly according to
their properties, measured by the algorithms in our hctsa library.
For this, we use the TS_SimSearch function, specifying the ID of the time series of interest
(i.e., the ID field of the TimeSeries structure) as the first input and the number of
neighbors with the 'numNeighbors' input specifier (default: 20). By default, data is loaded
from HCTSA_N.mat , but a custom source can be specified using the 'whatDataFile' input
specifier (e.g., TS_SimSearch('whatDataFile','HCTSA_custom.mat') ).
After specifying the target and how many neighbors to retrieve, TS_SimSearch outputs the
list of neighbors and their distances to screen, and the function also provides a range of
plotting options to visualize the neighbors. The plots to produce are specified as a cell using
the 'whatPlots' input.
To investigate the pairwise relationships between all neighbors retrieved, you can specify
the 'matrix' option of the TS_SimSearch function, as in the call below; an example output
using a publicly-available EEG dataset follows.
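TS_SimSearch(1,'whatPlots',{'matrix'});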
The specified target time series ( ID = 1 ) is shown as a white star, and all 14 neighbors are
shown, as labeled on the left of the plot with their respective IDs, and a 100-sample subset
of their time traces.
Pairwise distances are computed between all pairs of time series (as a Euclidean distance
between their feature vectors), and plotted using color, from low (red = more similar pairs of
time series) to high (blue = more different pairs of time series).
Because this dataset contains 3 classes that were previously labeled (using TS_LabelGroups
as: TS_LabelGroups({'seizure','eyesOpen','eyesClosed'}) ), the function shows these class
assignments using color labels to the left of the plot (purple, green, and orange in this case).
In this case we see that the purple and green classes are relatively similar under this
distance metric (eyes open and eyes closed), whereas the orange time series (seizure) are
distinguished.
Network of neighbors
Another way to visualize the similarity (under our feature-based distance metric) of all pairs
of neighbors is using a network visualization. This is specified as:
TS_SimSearch(1,'whatPlots',{'network'});
The strongest links are visualized as blue lines (by default, the top 40% of strongest links are
plotted, cf. the legend showing 0.9, 0.8, 0.7, and 0.6 for the top 10%, 20%, 30%, and 40% of
links, respectively).
The target is distinguished (as purple in this case), and the other classes of time series are
shown using color, with names and time-series segments annotated. Again, you can see
that the EEG time series during seizure (blue) are distinguished from eyes open (red) and
eyes closed (green).
TS_SimSearch(1,'whatPlots',{'scatter'});
The scatter setting visualizes the relationship between the target and each of 12 time series
with the most similar properties to the target. Each subplot is a scatter of the (normalized)
outputs of each feature for the specified target (x-axis) and the match (y-axis). An example is
shown below.
Other details
Multiple output plots can be produced simultaneously by specifying many types of plots as
follows:
TS_SimSearch(1,'whatPlots',{'matrix','network','scatter'})
Note that pairwise distances can be pre-computed and saved in the HCTSA*.mat file using
TS_PairwiseDist for custom distance metrics (which is done by default in TS_cluster for
datasets containing fewer than 1000 objects). TS_SimSearch checks for this information in
the specified input data (containing the ts_clust or op_clust structure), and uses it to
retrieve neighbors. If distances have not previously been computed, distances from the
target are computed as Euclidean distances (time series) or absolute correlation distances
(operations) between feature vectors within TS_SimSearch .
Investigating specific operations
Simple questions about specific features of interest (e.g., how the values of a given
operation are distributed across a dataset, and which time series receive high or low values)
can be investigated using the TS_FeatureSummary function.
The function takes an operation ID as its input (and can also take inputs specifying a
custom data source, or custom annotation parameters), and produces a distribution of
outputs from that operation across the dataset, with the ability to then annotate time series
onto that plot.
For example, running:
TS_FeatureSummary(100)
produces the following plot (where 6 points on the distribution have been clicked on to
annotate them with short time-series segments):
You can visually see that time series with more autocorrelated patterns through time receive
higher values from this operation.
Because no group information is present in this dataset, the time series are colored at
random.
As another example:
annotateParams = struct('maxL',500);
TS_FeatureSummary(4310, 'raw', 1, annotateParams);
This plots the distribution of feature 4310 from HCTSA.mat as a violin plot, with ten 500-point
time series subsegments annotated at different points through the distribution, shown to the
right of the plot:
When time series groups have been labeled (using TS_LabelGroups as:
TS_LabelGroups({'seizure','eyesOpen','eyesClosed'},'raw'); ), TS_FeatureSummary will plot
the distribution of feature values separately for each labeled class.
Simpler distributions
TS_SingleFeature provides a simpler way of seeing the class distributions, without
annotated time-series segments. One mode shows the distributions with a classification bar
underneath (indicating where a linear classifier would classify different parts of the space as
either noisy or periodic), as in the sketch below:
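% A sketch, assuming an operation with ID 3016 and data in HCTSA.mat ('raw');
% the third input is assumed to toggle the violin-plot version of the visualization:
TS_SingleFeature('raw',3016,false);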
A second mode shows the distributions as a violin plot, with means annotated and a
classification bar to the left, as in the sketch below:
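% As above, but with the violin-plot option switched on:
TS_SingleFeature('raw',3016,true);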
Note that the title gives an indication of the 10-fold cross-validated balanced accuracy of a
linear classifier in the space; this is computed from a single 10-fold split and is stochastic,
so it can yield slightly different results when repeated. For a more rigorous analysis than this
simple indication, the procedure should be repeated many times to give a converged
estimate of the balanced classification accuracy.
Exploring classification accuracy
The classification performance of the full feature library in distinguishing the labeled groups
of a dataset can be explored using TS_classify . On a small, labeled three-class dataset,
for example, it produces output like the following:
Classification rate (3-class) using 5-fold svm classification with 8192 features:
73.333 +/- 14.907%
In this case, the function has attempted to learn a linear svm classifier on the features to
predict the labels assigned to the data, using 5-fold cross validation (note that the default is
10-fold, but for smaller datasets such as this one, fewer folds are used automatically). The
results show that using 8192 features we obtain a mean classification accuracy of 73.3%
(with a standard deviation over the 5-folds of 14.9%).
If the 3rd input to the function is set to 1 (or fewer than 3 inputs are provided), the function
then computes the top 10 PCs of the data matrix, and uses them to classify the time series,
yielding:
In the resulting plot, the classification accuracy is shown for all features (green, dashed), and
as a function of the number of leading PCs included in the classifier (black circles).
Because we have so few examples of time series in this case (5 time series from each of 3
classes), attempting to learn a classifier for the dataset using thousands of features is
overkill -- we have nowhere near enough data to constrain such a classifier. Indeed, these
results show that a classifier using just a single feature (the first PC of the data matrix)
reproduces the accuracy of a classifier using the full set of 8192 features (both achieving
73.3% on this small dataset).
Finding informative features
A simple way to determine which individual features are useful for distinguishing the labeled
classes of your dataset is to compare each feature individually in terms of its ability to
separate the labeled classes of time series. This can be achieved using the TS_TopFeatures
function.
TS_TopFeatures()
By default this function will compare the in-sample linear classification performance of all
operations in separating the labeled classes of the dataset (individually), produce plots
summarizing the performance across the full library (compared to an empirical null
distribution), characterize the top-performing features, and show their dependencies on the
dataset. Inputs to the function control how these tasks are performed (including the statistic
used to measure individual performance, what plots to produce, and whether to produce
nulls).
First, we get an output to screen, showing the mean linear classification accuracy across all
operations, and a list of the operations with the top performance (ordered by their test
statistic, with their ID shown in square brackets, and keywords in parentheses). Here we see
that measures of entropy dominate this top list, including measures of the time-series
distribution.
An accompanying plot summarizes the performance of the full library on the dataset,
compared to a set of nulls (generated by shuffling the class labels randomly):
Here we can see that the default feature library (blue: 7275 features remaining after filtering
and normalizing) is performing much better than the randomized features (null: red).
Next we get a set of plots showing the class probability distributions for the top features. For
example:
This allows us to interpret the values of features in terms of the dataset. After some
inspection of the listed operations, we find that various measures of EEG Sample Entropy
are higher for healthy individuals with eyes open (blue) than for seizure signals (purple).
Finally, we can visualize how the top operations depend on each other, to try to deduce the
main independent behaviors in this set of 25 top features for the given classification task:
In this plot, the magnitude of the linear correlation, R (as 1-|R| ) between all pairs of top
operations is shown using color, and a visualization of a clustering into 5 groups is shown
(with green boxes corresponding to clusters of operations based on the dendrogram shown
to the right of the plot). Blue boxes indicate a representative operation of each cluster
(closest to the cluster centre).
Here we see that most pairs of operations are quite dependent (at |R| > ~0.8 ), and the
main clusters are those measuring distribution entropy ( EN_DistributionEntropy_ ), Lempel-
Ziv complexity ( EN_MS_LZcomplexity_ ), Sample Entropy ( EN_SampEn_ ), and a one-step-
ahead surprise metric ( FC_Surprise_ ).
This function produces all of the visualizations required to achieve an initial understanding
of the differences between the labeled classes of the dataset, and should be considered a
first step in a more detailed analysis. For example, in this case we may investigate the
top-performing features in more detail, to understand what they are measuring, and how.
This process is described in the Interpreting features section.
Interpreting features
Functions like TS_TopFeatures are helpful in showing us how these different types of
features might cluster into groups that measure similar properties (as shown in the previous
section). This helps us inspect sets of similar, inter-correlated features together as a group;
but even when we have isolated such a group, how can we start to interpret and
understand what these features are actually measuring? Some features in the list may be
easy to interpret directly (e.g., rms in the list above is simply the root-mean-square of the
distribution of time-series values), and others have clues in the name (e.g., features starting
with WL_coeffs are to do with measuring wavelet coefficients, features starting with EN_mse
correspond to measuring the multiscale entropy, mse, and features starting with
FC_LocalSimple_mean are related to time-series forecasting using local means of the time
series). Below we outline a procedure for how a user can go from a time-series feature
selected by hctsa towards a deeper understanding of the type of algorithm that feature is
derived from, how that algorithm performs across the dataset, and thus how it can provide
interpretable information about your specific time-series dataset.
Inspecting keywords
The simplest way of interpreting what sort of property a feature might be measuring is from
its keywords, that often label individual features by the class of time-series analysis method
from which they were derived. In the list above, we see keywords listed in parentheses, as
'forecasting' (methods related to predicting future values of a time series), 'entropy' (methods
related to predictability and information content in a time series), and 'wavelet' (features
derived from wavelet transforms of the time series). There are also keywords like 'locdep'
(location dependent: features that change under mean shifts of a time series), 'spreaddep'
(spread dependent: features that change under rescaling about their mean), and 'lengthdep'
(length dependent: features that are highly sensitive to the length of a time series).
Inspecting code
To find more specific detailed information about a feature, beyond just a broad categorical
label of the literature from which it was derived, the next step is find and inspect the code file
that generates the feature of interest. For example, say we were interested in the top
performing feature in the list above:
We know from the keyword that this feature has something to do with forecasting, and the
name provides clues about the details (e.g., FC_ stands for forecasting, the function
FC_LocalSimple is the one that produces this feature, which, as the name suggests,
performs simple local time-series prediction). We can use the feature ID (3016) provided in
square brackets to get information from the Operations structure array:
>> disp(Operations([Operations.ID]==3016));
ID: 3016
Name: 'FC_LocalSimple_mean3_taures'
Keywords: 'forecasting'
CodeString: 'FC_LocalSimple_mean3.taures'
MasterID: 836
Inspecting the text before the dot, '.', in the CodeString field ( FC_LocalSimple_mean3 ) tells us
the name that hctsa uses to describe the Matlab function and its unique set of inputs that
produces this feature. Whereas the text following the dot, '.', in the CodeString field
( taures ), tells us the field of the output structure produced by the Matlab function that was
run. We can use the MasterID to get more information about the code that was run using
the MasterOperations structure array:
>> disp(MasterOperations([MasterOperations.ID]==836));
ID: 836
Label: 'FC_LocalSimple_mean3'
Code: 'FC_LocalSimple(y,'mean',3)'
This tells us that the code used to produce our feature was FC_LocalSimple(y,'mean',3) . We can get more information about this function at the Matlab command line by running a help command:
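>> help FC_LocalSimple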
Simple predictors using the past trainLength values of the time series to
predict its next value.
---INPUTS:
y, the input time series
trainLength, the number of time-series values to use to forecast the next value
We can also inspect the code in FC_LocalSimple.m directly for more information. Like all code files for computing time-series features, FC_LocalSimple.m is located in the Operations directory of the hctsa repository. Inspecting the code file, we see that running FC_LocalSimple(y,'mean',3) performs forecasting using local estimates of the time-series mean (since the second input to FC_LocalSimple , forecastMeth , is set to 'mean' ), using the previous three time-series values to make each prediction (since the third input, trainLength , is set to 3 ).
To understand the specific output quantity from this code that came up as highly informative in our TS_TopFeatures analysis, we need to look for the output labeled taures in the output structure produced by FC_LocalSimple .
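Searching the code, we find the lines where these outputs are computed. A paraphrased sketch (with illustrative variable names, not the verbatim source):

% After forming rolling local-mean forecasts, yPredict, of the actual
% values, yActual, the forecast residuals are summarized:
res = yPredict - yActual; % residuals of the rolling local forecasts
out.ac1 = CO_AutoCorr(res,1); % residual autocorrelation at lag 1
out.ac2 = CO_AutoCorr(res,2); % residual autocorrelation at lag 2
out.taures = CO_FirstZero(res,'ac'); % first zero of the residual ACF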
This shows us that, after doing the local mean prediction, FC_LocalSimple outputs some features quantifying any remaining autocorrelation structure in the residuals of the rolling predictions (the outputs labeled ac1 , ac2 , and our output of interest: taures ). The code shows that the taures output computes CO_FirstZero of the residuals, which measures the first zero of their autocorrelation function (cf. help CO_FirstZero ). When the local mean prediction leaves a lot of autocorrelation structure in the residuals, our feature, FC_LocalSimple_mean3_taures , will thus take a high value.
Visualizing outputs
Once we've seen the code that was used to produce a feature, and started to think about
how such an algorithm might be measuring useful structure in our time series, we can then
check our intuition by inspecting its performance on our dataset (as described in Investigating specific operations), for example by running:
TS_FeatureSummary(3016,'raw',true);
which produces a plot like that shown below. We have run this on a dataset containing noisy sine waves, labeled 'noisy' (red), and periodic signals without noise, labeled 'periodic' (blue). In the plot on the right, we see how this feature orders time series (with the distribution of values shown on the left, split between the two groups: 'noisy' and 'periodic'). Our
intuition from the code, that time series with longer correlation timescales will have highly
autocorrelated residuals after a local mean prediction, appears to hold visually on this
dataset. In general, the mechanism provided by TS_FeatureSummary to visualize how a given
feature orders time series, including across labeled groups, can be a very useful one for
feature interpretation.
Summary
hctsa contains a large number of features, many of which can be expected to be highly inter-
correlated on a given time-series dataset. It is thus crucial to explore how a given feature
relates to other features in the library, e.g., using the correlation matrix produced by
TS_TopFeatures (cf. Finding informative features), or by searching for features with similar
behavior on the dataset to a given feature of interest (cf. Finding nearest neighbors). In a
specific domain context, the analyst typically needs to decide on the trade-off between more
complicated features that may have slightly higher in-sample performance on a given task,
and simpler, more interpretable features that may help guide domain understanding. The
procedures outlined above are typically the first step to understanding a time-series analysis
algorithm, and its relationship to alternatives that have been developed across science.
Working with short time series
The number of features that return a meaningful output, for time series ranging from as short as 5 samples up to 500 samples, is shown below (the maximum set of 7749 features is shown as a dashed horizontal line):
In each case, over 3000 features can be computed. Note, however, that care is needed when representing a time series of just 5 samples with thousands of features, the vast majority of which will be highly inter-correlated.
Inspecting the time-series plots to the left of the colored matrix, we can see that genes with similar temporal expression profiles are clustered together based on their 2829-long feature-vector representations. These feature vectors thus provide a meaningful representation of these short time series. Further, while these 2829-long feature vectors are shorter than those that can be computed from longer time series, they still constitute a highly comprehensive representation that can serve as a starting point for obtaining interpretable insights into specific domain questions.
Working with a mySQL database
The hctsa software comes with (optional) functionality for linking to a mySQL database, allowing a more powerful, distributed way to compute and store the results of a large-scale computation.
This chapter outlines the steps involved in setting up a linked mySQL database and running hctsa computations with it.
After the database is set up, and the packages required by hctsa are installed (by running
the install script), linking to a mySQL database can be done by running the
install_database script, which:
1. Sets up Matlab to be able to communicate with the mySQL server and creates a new
database to store Matlab calculations in, described here.
2. Populates the database with our default library of master operations and operations, as described here. (NB: in hctsa terminology, a 'master operation' specifies a set of input arguments to an analysis function, and an 'operation' is a single time-series feature; see here.)
Note that the above steps are one-off installation steps; once the software is installed and
compiled, a typical workflow will simply involve opening Matlab, running the startup script
(which adds all paths required for the hctsa software), and then working within Matlab from
any desired directory.
After the computation is complete for a time-series dataset, a range of processing, analysis,
and plotting functions are also provided with the software, as described here.
Setting up the mySQL database
The following outlines the actions performed by the install_jconnector script (including
instructions on how to perform the steps manually):
It is necessary to relocate the mySQL J connector from the Database directory of this code repository (it is also freely available here): the file mysql-connector-java-5.1.35-bin.jar (for version 5.1.35). Instructions are here, are summarized below, and are also described in the Matlab documentation. This .jar file must be added to a static path where it can always be
found by Matlab. A good candidate directory is the java/jarext/ subdirectory of the Matlab
root directory (to determine the Matlab root directory, simply type matlabroot in an open
Matlab command window).
For Matlab to see this file, you need to add a reference to it in the javaclasspath.txt file (an alternative is to modify the classpath.txt file directly, but this may not be supported by newer versions of Matlab). This file can be found (or, if it does not exist, should be created) in Matlab's preferences directory (to determine this location, type prefdir in a command window, or navigate to it within Matlab using cd(prefdir) ).
This javaclasspath.txt file must contain a text reference to the location of the java
connector on the disk. In the case recommended above, where it has been added to the
java/jarext directory, we would add the following to the javaclasspath.txt file:
$matlabroot/java/jarext/mysql-connector-java-5.1.35-bin.jar
ensuring that the version number (5.1.35) matches your version of the J connector (for example, if you are using a more recent version).
Note that javaclasspath.txt can also be placed in Matlab's startup directory (for example, to modify this just for an individual user).
After restarting Matlab, Matlab should then have the ability to communicate with mySQL
servers (we will check whether this works below).
hostname,databasename,username,password
The settings listed here are those used to connect to the mySQL server. Remember that
your password is sitting here in this document in unencrypted plain text, so do not use a
secure or important password.
To check that Matlab can connect to external servers using the mySQL J connector, with the correct host name, username, and password settings, we introduce the Matlab routines SQL_opendatabase and SQL_closedatabase . An example usage is as follows:
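% Open a database connection using the settings in sql_settings.conf
% (a minimal sketch, consistent with the calls described below):
dbc = SQL_opendatabase;
% <interact with the database here>
% Close the connection when finished:
SQL_closedatabase(dbc);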
For this to work, the sql_settings.conf file must be set up properly. This file specifies (in
unencrypted plain text!) the login details for your mySQL database in the form
hostName,databaseName,username,password .
localhost,myTestDatabase,benfulcher,myInsecurePassword
Once you have configured your sql_settings.conf file, and you can run dbc = SQL_opendatabase; and SQL_closedatabase(dbc) without errors, you can smile to yourself: Matlab can communicate successfully with your mySQL server, and you are now ready to set up the database structure!
Note that if your database is not set up on your local machine (i.e., localhost ), then Matlab
can communicate with a mySQL server through an ssh tunnel, which requires some
additional setup (described below).
Note also that the SQL_opendatabase function uses Matlab's Database Toolbox if a license is
available, but otherwise will use java commands; both are supported and should give
identical operational behavior.
Note that one can swap between multiple databases easily by commenting out lines of the
sql_settings.conf file (adding % to the start of a line to comment it out).
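An ssh tunnel can be opened from a terminal using a command of the following form (a sketch only: substitute your own username, server address, and preferred local port):

ssh -L 1234:localhost:3306 myUsername@myServer.edu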
This command connects port 1234 on your local computer to port 3306 (the default mySQL port) on the server. Telling Matlab to connect to localhost through port 1234 will then connect it, through the established ssh tunnel, to the server. This can be achieved by specifying the server as localhost and the port number as 1234 in the sql_settings.conf file (or during the install process); the port number can be given as an (optional) fifth entry, i.e.,:
hostname,databasename,username,password,1234
The database structure
The database is structured around four main components:
1. A list of all the filenames and other metadata of time series (the TimeSeries table),
2. A list of all the names and other metadata of pieces of time-series analysis operations
(the Operations table),
3. A list of all the pieces of code that must be evaluated to give each operation a value,
which is necessary, for example, when one piece of code produces multiple outputs (the
MasterOperations table), and
4. A list of the results of applying operations to time series in the database (the Results
table).
Additional tables are related to indexing and managing efficient keyword labeling, etc.
Time series and operations have their own tables containing metadata associated with each time series and each operation. The results of applying each operation to each time series are stored in the Results table, which has a row for every combination of time series and operation; this table also records calculation times and the quality of outputs (for cases where the output of the operation was not a real number, or where some error occurred in the computation). Note that while the data for each time series is stored in the database, the executable time-series analysis code files are not, so all code files must be in Matlab's path (all required paths can be added by running startup.m ).
Another handy (but dangerous) function to know about is SQL_reset , which will delete all
data in the mySQL database, create all the new tables, and then fill the database with all
the time-series analysis operations. The TimeSeries, Operations, and MasterOperations
tables can be generated by running SQL_create_all_tables , with master operations and
operations added to the database using SQL_add commands (described here).
You now have the basic table structure set up in the database and have done your first bits of mySQL manipulation through the Matlab interface.
It is very useful to be able to inspect the database directly, using a graphical interface. A very good example for Mac is the excellent, free application Sequel Pro (a screenshot is shown below, showing the first 40 rows of the Operations table of our default operations library). Applications similar to Sequel Pro exist for Windows and Linux platforms. Applications that allow the database to be inspected are extremely useful; however, they should not be used to manipulate the database directly. Instead, Matlab scripts should be used to interact with the database, to ensure that the proper relationships between the different tables are maintained (including the indexing of keyword labels).
Populating the database with time series and operations
The following summarizes the terminology used for each type of object in hctsa land:
Master operation: a piece of code and its set of input parameters, which may produce multiple outputs when applied to a time series.
Operation (or feature): a single time-series feature, corresponding to one output of a master operation.
Time series: a univariate time-series data file and its associated keyword metadata.
Using SQL_add
SQL_add has two key inputs that specify:
1. The type of object being imported (time series, 'ts' ; master operations, 'mops' ; or operations, 'ops' ), and
2. The name of the input file that contains appropriately formatted information about the time series, master operations, or operations to be imported.
In this section, we describe how to use SQL_add to add master operations, operations, and
time series to the database.
Users wishing to run the default hctsa code library on their own time-series dataset need only add time series to the database, as the full operation library is added by default by the install.m script. Users wishing to add additional features, using custom time-series code or different types of inputs to existing code files, can either edit the default INP_ops.txt and INP_mops.txt files provided with the repository, or create new input files for their custom analysis methods (as explained for operations and master operations).
REMINDER: Manually editing the database, including adding or deleting rows, is very dangerous, as it can create inconsistencies and errors in the database structure. Adding time series and operations to the database should only be done using SQL_add , which sets up the Results table of the database and ensures that the indexing relationships are properly maintained.
For example, the default library of master operations can be added to the database by running:

SQL_add('mops','INP_mops.txt');
Once added, each master operation is assigned a unique integer, mop_id, that can be used
to identify it. For example, when adding individual operations, the mop_id is used to map
each individual operation to a corresponding master operation.
New pieces of code can be added to the current INP_mops.txt file, or placed in a new input file on which SQL_add is run. Once they are in the database, the software will then run the new pieces of code. Note that SQL_add checks for repeats that already exist in the database, so duplicate database entries cannot be added with SQL_add .
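For reference, each line of a master operations input file pairs a piece of code (and its inputs) with a unique label. A sketch, consistent with the master operation inspected in the Interpreting features section:

FC_LocalSimple(y,'mean',3)    FC_LocalSimple_mean3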
New code added to the database should be checked for the following:
1. The output is a real number or a structure (an output of NaN can be used to assign all of an operation's outputs to NaN).
2. The function is accessible in the Matlab path.
3. Output(s) from the function have matching operations (or features), which also need to
be added to the database.
Corresponding operations (or features) will then need to be added separately, to link to the structured outputs of master operations. For example, the default operation library can be added by running:

SQL_add('ops','INP_ops.txt');
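Each line of an operations input file links an operation name and its keywords to a (master operation).(output) code string. A sketch, consistent with the operation inspected in the Interpreting features section:

FC_LocalSimple_mean3.taures    FC_LocalSimple_mean3_taures    forecasting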
Every operation added to the database will be given a unique integer identifier, op_id, which
provides a common way of retrieving specific operations from the database.
Note that after (hopefully successfully) adding the operations to the database, the SQL_add
function indexes the operation keywords to an OperationKeywords table that produces a
unique identifier for each keyword, and another linking table that allows operations with each
keyword to be retrieved efficiently.
Time series are added analogously, using SQL_add with the 'ts' flag, as described in the next section, Adding time series.
Adding time series
Time series are added using the same function used to add master operations and
operations to the database, SQL_add , which imports time series data (stored in time-series
data files) and associated keyword metadata (assigned to each time series) to the database.
The time-series data files to import, and the keywords to assign to each time series, are specified in either: (i) an appropriately formatted Matlab ( .mat ) file, or (ii) a structured input text file, as explained below.
Time series can be indexed by assigning keywords to them (which are stored in the
TimeSeriesKeywords table and associated index table, TsKeywordsRelate of the
database). Assigning keywords to time series makes it easier to retrieve a set of time series
with a given set of keywords for analysis, and to group time series annotated with different
keywords for classification tasks.
When added to the mySQL database, every time series added to the database is assigned a
unique integer identifier, ts_id, which can be used to retrieve specific time series from the
database.
SQL_add syntax
Adding a set of time series to the database requires an appropriately formatted input file, e.g., INP_ts.txt; the appropriate code is:
% Add time series (stored in data files) using an input text file:
SQL_add('ts','INP_ts.txt');
We provide an example input file in the Database directory as INP_test_ts.txt, which can be added to the database, following the syntax above, using SQL_add('ts','INP_test_ts.txt') , as well as a sample .mat file input as INP_test_ts.mat, which can be added in the same way.
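As a rough sketch of the text-file format (see the Input files section for full details), each row gives a time-series data file name, followed by whitespace and a comma-delimited set of keywords; the file names and keywords here are purely illustrative:

myTimeSeries_001.dat    noisy,sine
myTimeSeries_002.dat    periodic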
Retrieving from the database
There are two main reasons to retrieve data from the database:
1. To calculate as-yet uncalculated entries, to be stored back into the database, and
2. To analyze already-computed data stored in the database in Matlab.
The function SQL_retrieve performs both of these tasks, using different inputs. Here we describe its use for populating uncalculated (NULL) entries of the Results table of the database in Matlab.
For calculating missing entries in the database, SQL_retrieve can be run as follows:
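% Retrieve entries that are uncalculated (NULL) in the database, for
% given vectors of time-series IDs (ts_ids) and operation IDs (op_ids):
SQL_retrieve(ts_ids,op_ids,'null');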
The third input, 'null' , retrieves ts_ids and op_ids from the sets provided that contain
(as-yet) uncalculated (i.e., NULL) elements in the database; these can then be calculated
and stored back in the database. An example usage is given below:
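% For example: retrieve NULL entries for ts_ids 1-5 and all operations:
SQL_retrieve(1:5,'all','null');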
Running this code will retrieve null (uncalculated) data from the database for time series with
ts_ids between 1 and 5 (inclusive) and all operations in the database, keeping only the rows
and columns of the resulting time series x operations matrix that contain NULLs.
When calculations are complete and one wishes to analyze all of the data stored in the database (not just the NULL entries requiring computation), the third input should be set to 'all' to retrieve all entries in the Results table of the database, as described later.
SQL_retrieve writes to a local Matlab file, HCTSA.mat, that contains the data retrieved from
the database.
Computing operations and writing back to the database
Once calculations have been performed using Matlab on local files, the results must be
written back to the database. This task is performed by SQL_store , which reads the data in
HCTSA.mat , checks that the metadata still matches the database, and then begins updating
the Output, Quality, and CalculationTime columns of the Results table in the mySQL
database. This can be done by simply running:
SQL_store;
Depending on database latencies, this can be a relatively slow process, up to 20-25 s per time series, as each row in the Results table is updated individually using mySQL UPDATE statements. However, this cost buys the benefits that the computation can be distributed across multiple compute nodes, and that stored data can be indexed and retrieved systematically. By contrast, keeping results in local Matlab files can be extremely inefficient, and indeed untenable, for large datasets.
Cycling through computations using runscripts
A typical runscript cycles through the following three steps:
1. Retrieve a set of time series and operations from the Results table of the database to a local Matlab file, HCTSA.mat (using SQL_retrieve ).
2. Compute the operations on the retrieved time series in Matlab and store the results
locally (using TS_compute ).
3. Write the results back to the Results table of the database (using SQL_store ).
It is usually most efficient to retrieve a small number of time series at each iteration of the SQL_retrieve / TS_compute / SQL_store loop, and to distribute this computation across multiple machines where possible. An example runscript is given in the code that accompanies this document, as sample_runscript_sql , which retrieves a single time series at a time, computes it, and then writes the results back to the database, in a loop. It can be used as a template for runscripts that perform time-series calculations across the database.
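As a rough illustration of this loop structure (a sketch only, with illustrative IDs and default settings; see sample_runscript_sql for the actual script, whose details may differ):

% Cycle over time series one at a time, computing and storing results:
tsIDs = 1:100; % (a hypothetical range of ts_ids to compute)
for i = 1:length(tsIDs)
    SQL_retrieve(tsIDs(i),'all','null'); % retrieve NULL entries to HCTSA.mat
    TS_compute;                          % compute the missing values locally
    SQL_store;                           % write results back to the database
end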
This workflow is well suited to distributed computing for large datasets, whereby each node
can iterate over a small set of time series, with all the results being written back to a central
location (the mySQL database).
By designating different sections of the database to cycle through, this procedure can also
be used to (manually) distribute the computation across different machines. Retrieving a
large section of the database at once can be problematic because it requires large disk
reads and writes, uses a lot of memory, and if problems occur in the reading or writing
to/from files, one may have to abandon a large number of existing computations.
Clearing or removing data
hctsa provides a function for this, which takes in information about whether to clear (erase any calculated data involving the given time series or operations) or remove (completely delete the given time series or operations from the database).
Retrieving data from the database
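When all calculations are complete, the computed results can be retrieved for analysis by setting the third input of SQL_retrieve to 'all' ; a sketch of the syntax:

SQL_retrieve(ts_ids,op_ids,'all');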
for vectors ts_ids and op_ids , specifying the ts_ids and op_ids to be retrieved from the
database.
Sets of ts_ids and op_ids to retrieve can be selected by inspecting the database, or by retrieving relevant sets of keywords using the SQL_getids function. Running the code in this way, using the 'all' tag, ensures that the full range of ts_ids and op_ids specified is retrieved from the database and stored in the local file, HCTSA.mat, which can then form the basis of subsequent analysis.
The database structure provides much flexibility in storing and indexing the large datasets that can be analyzed using the hctsa approach; however, the process of retrieving and storing large amounts of data from a database can take a considerable amount of time, depending on database latencies.
Note that missing, or NULL, entries in the database are converted to NaN entries in the local
Matlab matrices.
Error handling and maintenance