PCF-GAN: generating sequential data via the characteristic function of measures on the path space
Abstract
Generating high-fidelity time series data using generative adversarial networks (GANs) remains a challenging task, as it is difficult to capture the temporal dependence of joint probability distributions induced by time-series data. Towards this goal, a key step is the development of an effective discriminator to distinguish between time series distributions. We propose the so-called PCF-GAN, a novel GAN that incorporates the path characteristic function (PCF) as the principled representation of time series distribution into the discriminator to enhance its generative performance. On the one hand, we establish theoretical foundations of the PCF distance by proving its characteristicity, boundedness, differentiability with respect to generator parameters, and weak continuity, which ensure the stability and feasibility of training the PCF-GAN. On the other hand, we design efficient initialisation and optimisation schemes for PCFs to strengthen the discriminative power and accelerate training efficiency. To further boost the capabilities of complex time series generation, we integrate the auto-encoder structure via sequential embedding into the PCF-GAN, which provides additional reconstruction functionality. Extensive numerical experiments on various datasets demonstrate the consistently superior performance of PCF-GAN over state-of-the-art baselines, in both generation and reconstruction quality.
1 Introduction
Generative Adversarial Networks (GANs) have been a powerful tool for generating complex data distributions, e.g., image data. The original GAN suffers from optimisation instability and mode collapse, partially remedied later by an alternative training scheme using integral probability metric (IPM) in lieu of Jensen–Shannon divergence. The IPMs, e.g., metrics based on Wasserstein distances or Maximum Mean Discrepancy (MMD), consistently yield good measures between generated and real data distributions, thus resulting in more powerful GANs on empirical data ([14, 2, 24]).
More recently, [1] proposed an IPM based on the characteristic function (CF) of measures on $\mathbb{R}^d$, which has the characteristic property, boundedness, and differentiability. Such properties enable the GAN constructed using this IPM as discriminator (“CF-GAN”) to stabilise training and improve generative performance. However, this CF-metric is ineffective in capturing the temporal dependency of sequential data, and it fails to address high-frequency cases due to the curse of dimensionality. To tackle this issue, we take the continuous time perspective of time series and lift discrete time series to the path space ([28, 29, 23]). This allows us to treat time series of variable length, unequal sampling, and high frequency in a unified approach. We propose a path characteristic function (PCF) to characterise distributions on the path space, and propose the corresponding PCF distance as a novel IPM to quantify the distance between measures on the path space.
Built on top of the unitary feature of paths ([26]), our proposed PCF has theoretical foundations deeply rooted in the rough path theory ([7]), which exploits the non-commutativity and the group structure of the unitary feature to encode information on order of paths. The CF may be regarded as the special case of PCF with linear random path and $1\times 1$ unitary matrix. We show that the PCF distance (PCFD) possesses favourable analytic properties, including boundedness and differentiability in model parameters, and we establish the linkages between PCFD and MMD. These results vastly generalise classical theorems on measures on $\mathbb{R}^d$ ([1]), with much more technically involved proofs due to the infinite-dimensionality of path space.
On the numerical side, we design an efficient algorithm which, by optimising the trainable parameters of PCFD, maximises the discriminative power and improves the stability and efficiency of GAN training. Inspired by [25, 41], we integrate the proposed PCF into the IPM-GAN framework, utilising an auto-encoder architecture specifically tailored to sequential data. This model design enables our algorithm to generate and reconstruct realistic time series simultaneously, which has advantages in diverse applications, including privacy preservation ([35]) and semantic representation extraction for downstream tasks ([10]). To assess the efficacy of our PCF-GAN, we conduct extensive numerical experiments on several standard time series benchmarking datasets for both generation and reconstruction tasks.
We summarize the key contributions of this work below:
- proposing a new metric for the distributions on the path space via PCF;
- providing theoretical proofs for analytic properties of the proposed loss metric which benefit GAN training;
- introducing a novel PCF-GAN to generate and reconstruct time series simultaneously; and
- reporting substantial empirical results validating the out-performance of our approach, compared with several state-of-the-art GANs with different loss functions on various time series generation and reconstruction tasks.
Related work. Given the wide practical use of, and challenges for, realistic time series synthesis ([3, 4]), various approaches have been proposed to improve the quality of GANs for synthetic time series generation. Several works, e.g., [43, 45, 36], are devoted to improving the discriminator of GANs to be better suited to distributions induced by time series. Among them, COT-GAN in [43] shares a similar philosophy with PCF-GAN by introducing a novel discriminator based on causal optimal transport (which can be seen as an improved variant of the Sinkhorn divergence tailored to sequential data), while TimeGAN ([45]) shares a similar auto-encoder structure, which improves the generator’s quality and enables time series reconstruction. Unlike in PCF-GAN, the reconstruction and generation modules of TimeGAN are separated, and TimeGAN additionally uses a stepwise supervised loss and a discriminative loss. In a different vein, CEGEN [36], GT-GAN [17], COSCI-GAN [39], and EWGAN [37] focus primarily on the design of the network framework and generator architecture, and achieve state-of-the-art results on several benchmarking datasets.
2 Preliminaries
The characteristic function of a measure on $\mathbb{R}^d$, namely its Fourier transform, plays a central role in probability theory and analysis. The path characteristic function (PCF) is a natural extension of the characteristic function to the path space.
2.1 Characteristic function distance (CFD) between random variables in $\mathbb{R}^d$
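Let $X$ be an $\mathbb{R}^d$-valued random variable with the law $\mu$. The characteristic function of $X$, denoted as $\Phi_X$, maps each $\lambda\in\mathbb{R}^d$ to the expectation of its complex unitary transform: $\Phi_X(\lambda) = \mathbb{E}_{X\sim\mu}\big[e^{i\langle\lambda, X\rangle}\big]$. Here $U_\lambda(x) := e^{i\langle\lambda, x\rangle}$ is the solution to the linear controlled differential equation:
$$\mathrm{d}U_\lambda(x) = i\,U_\lambda(x)\,\langle\lambda, \mathrm{d}x\rangle, \qquad U_\lambda(\mathbf{0}) = 1, \tag{1}$$
where $\mathbf{0}$ is the zero vector in $\mathbb{R}^d$ and $\langle\cdot,\cdot\rangle$ is the Euclidean inner product on $\mathbb{R}^d$.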
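References [11, 16] studied the squared characteristic function distance (CFD) between two $\mathbb{R}^d$-valued random variables $X$ and $Y$ with respect to another probability distribution $\Lambda$ on $\mathbb{R}^d$:
$$\mathrm{CFD}^2_{\Lambda}(X, Y) = \mathbb{E}_{\lambda\sim\Lambda}\Big[\big|\Phi_X(\lambda) - \Phi_Y(\lambda)\big|^2\Big]. \tag{2}$$
It is proved in [25, 1] that if the support of $\Lambda$ is $\mathbb{R}^d$, then $\mathrm{CFD}_{\Lambda}$ is a distance metric, so that $\mathrm{CFD}^2_{\Lambda}(X, Y) = 0$ if and only if $X$ and $Y$ have the same distribution. This justifies the usage of $\mathrm{CFD}^2_{\Lambda}$ as a discriminator for GAN training to learn finite-dimensional random variables from data.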
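As a rough illustration of the squared CFD, the sketch below estimates it from samples by replacing both expectations with Monte Carlo averages. The function names, the Gaussian choice of frequency distribution, and the sample sizes are assumptions made for this example and are not taken from [11, 16, 25].

```python
import numpy as np

def empirical_cf(samples, freqs):
    """Empirical characteristic function: mean of exp(i<lambda, x>) over samples.

    samples: (n, d) array; freqs: (k, d) array of frequencies lambda.
    Returns a complex array of shape (k,).
    """
    phases = samples @ freqs.T            # (n, k) inner products <lambda_j, x_i>
    return np.exp(1j * phases).mean(axis=0)

def empirical_cfd_sq(x, y, freqs):
    """Monte Carlo estimate of the squared CFD, averaged over the sampled frequencies."""
    diff = empirical_cf(x, freqs) - empirical_cf(y, freqs)
    return float(np.mean(np.abs(diff) ** 2))

rng = np.random.default_rng(0)
freqs = rng.normal(size=(64, 2))          # assumed choice: Gaussian frequencies (full support)
x = rng.normal(size=(2000, 2))
y_same = rng.normal(size=(2000, 2))       # same law as x
y_shift = rng.normal(size=(2000, 2)) + 2.0  # shifted law

cfd_same = empirical_cfd_sq(x, y_same, freqs)
cfd_shift = empirical_cfd_sq(x, y_shift, freqs)
```

With these settings the estimate for two samples of the same law should be close to zero, while the shifted sample should give a clearly larger value. Both are bounded by 4 because each characteristic function has modulus at most 1.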
2.2 Unitary feature of a path
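Let $BV([0,T];\mathbb{R}^d)$ be the space of $\mathbb{R}^d$-valued paths of bounded variation over $[0,T]$. Consider
$$\mathcal{X} := \Big\{\mathbf{x}:[0,T]\to\mathbb{R}^d \,:\, \mathbf{x}(t) = (t, x(t)) \text{ for } x\in BV([0,T];\mathbb{R}^{d-1});\ x(0) = 0\Big\}. \tag{3}$$
For a discrete time series $x = (t_i, x_i)_{i=0}^{N}$, where $0 = t_0 < t_1 < \dots < t_N = T$ and $x_i\in\mathbb{R}^{d-1}$ ($i\in\{0,\dots,N\}$), we can embed it into some $\mathbf{X}\in\mathcal{X}$ whose evaluation at $(t_i)_{i=0}^{N}$ coincides with $x$. This is well suited for sequence-valued data in the high-frequency limit with finer time-discretisation and is often robust in practice ([27, 26]). Such embeddings are not unique. In this work, we adopt the linear interpolation for embedding, following [23, 18, 32].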
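The linear-interpolation embedding can be sketched as follows. This is a minimal illustration under assumed names (`linear_interpolation_path`, `times`, `values`) and is not the authors' implementation. It builds a function of continuous time that agrees with the observations at the sampling times, and unequal sampling is allowed.

```python
import numpy as np

def linear_interpolation_path(times, values):
    """Return a function t -> path value, piecewise linear through (times[i], values[i]).

    times: (N+1,) increasing array; values: (N+1, d) array of observations.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)

    def path(t):
        # np.interp is one-dimensional, so each coordinate is interpolated separately.
        return np.array([np.interp(t, times, values[:, j]) for j in range(values.shape[1])])

    return path

times = np.array([0.0, 0.5, 2.0])                         # unequally spaced sampling times
values = np.array([[0.0, 1.0], [1.0, 3.0], [4.0, 0.0]])   # toy 2-dimensional observations
path = linear_interpolation_path(times, values)
mid = path(1.25)                                          # halfway between t = 0.5 and t = 2.0
```

Because the path is linear between consecutive observations, only the increments of the series matter for the unitary feature introduced next.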
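Let $m\in\mathbb{N}$, $I_m$ be the $m\times m$ identity matrix, and $*$ be conjugate transpose. Write $U(m)$ and $\mathfrak{u}(m)$ for the Lie group of $m\times m$ unitary matrices and its Lie algebra, resp.:
$$U(m) = \{A\in\mathbb{C}^{m\times m} : A^*A = I_m\}, \qquad \mathfrak{u}(m) = \{A\in\mathbb{C}^{m\times m} : A^* + A = 0\}.$$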
Definition 2.1.
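Let $\mathbf{x}\in BV([0,T];\mathbb{R}^d)$ be a continuous path and $M:\mathbb{R}^d\to\mathfrak{u}(m)$ be a linear map. The unitary feature of $\mathbf{x}$ under $M$ is the solution $\mathbf{y}:[0,T]\to U(m)$ to the following equation:
$$\mathrm{d}\mathbf{y}_t = \mathbf{y}_t\cdot M(\mathrm{d}\mathbf{x}_t), \qquad \mathbf{y}_0 = I_m. \tag{4}$$
We write $\mathcal{U}_M(\mathbf{x}) := \mathbf{y}_T$, i.e., the endpoint of the solution path.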
By a slight abuse of notation, $\mathcal{U}_M(\mathbf{x})$ is also called the unitary feature of $\mathbf{x}$ under $M$. Unitary feature is a special case of the Cartan/path development, for which one may consider paths taking values in any Lie group $G$. We take only $G = U(m)$ here; for the general case, see [6, 30].
Example 2.2.
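For $M\in L(\mathbb{R}^d,\mathfrak{u}(m))$ and $\mathbf{x}$ linear, $\mathcal{U}_M(\mathbf{x}) = \exp\big(M(\mathbf{x}_T - \mathbf{x}_0)\big)$. In particular, when $m = 1$, $\mathfrak{u}(1)$ is reduced to $i\mathbb{R}$ and $\mathcal{U}_M(\mathbf{x}) = \exp\big(i\langle\lambda, \mathbf{x}_T - \mathbf{x}_0\rangle\big)$ for some $\lambda\in\mathbb{R}^d$.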
Motivated by the universality and characteristic property of unitary features ([7], see Section A.3), we constructed a unitary layer which transforms any $d$-dimensional time series $x = (x_0,\dots,x_N)$ to the unitary feature of its piecewise linear interpolation $\mathbf{X}$. It is a special case of the path development layer [26], when the Lie algebra is chosen as $\mathfrak{u}(m)$. In fact, the explicit formula holds: $\mathcal{U}_M(\mathbf{X}) = \prod_{i=1}^{N}\exp\big(M(\Delta x_i)\big)$, where $\Delta x_i := x_i - x_{i-1}$ and $\exp$ is the matrix exponential.
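The explicit product formula can be illustrated with a short NumPy sketch. The helper names and the eigendecomposition-based matrix exponential are choices made for this example (valid because each generator is anti-Hermitian). This is not the path development layer of [26], and it omits backpropagation entirely.

```python
import numpy as np

def random_anti_hermitian(rng, d, m):
    """Draw d anti-Hermitian m x m matrices theta_j, i.e. theta_j^* = -theta_j."""
    a = rng.normal(size=(d, m, m)) + 1j * rng.normal(size=(d, m, m))
    return (a - np.conj(np.transpose(a, (0, 2, 1)))) / 2

def expm_anti_hermitian(a):
    """exp(a) for anti-Hermitian a, via the eigendecomposition of the Hermitian matrix -i*a."""
    w, v = np.linalg.eigh(-1j * a)
    return (v * np.exp(1j * w)) @ v.conj().T

def unitary_feature(series, theta):
    """Ordered product of exp(M(x_i - x_{i-1})), with M(v) = sum_j v_j * theta_j.

    series: (N+1, d) array; theta: (d, m, m) array of anti-Hermitian matrices.
    """
    m = theta.shape[-1]
    out = np.eye(m, dtype=complex)
    for dx in np.diff(series, axis=0):
        out = out @ expm_anti_hermitian(np.tensordot(dx, theta, axes=1))
    return out

rng = np.random.default_rng(0)
theta = random_anti_hermitian(rng, d=2, m=3)
series = rng.normal(size=(10, 2)).cumsum(axis=0)   # toy 2-dimensional random walk
u = unitary_feature(series, theta)                 # 3 x 3 unitary matrix
```

For $m = 1$ the matrices commute and the product collapses to $\exp(i\langle\lambda, x_N - x_0\rangle)$, in line with Example 2.2. For $m > 1$ the order of the increments generally matters.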
Convention 2.3.
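The space $L(\mathbb{R}^d,\mathfrak{u}(m))$ in which $M$ of Eq. (4) resides is isomorphic to $\mathfrak{u}(m)^d$, where $\mathfrak{u}(m)$ is a Lie algebra isomorphic, as a real vector space, to $\mathbb{R}^{m^2}$. For each $\theta = (\theta_1,\dots,\theta_d)\in\mathfrak{u}(m)^d$ given by anti-Hermitian matrices $\theta_j$, a linear map $M$ is uniquely induced: $M(x) = \sum_{j=1}^{d}\theta_j\langle e_j, x\rangle$ for $x\in\mathbb{R}^d$, where $(e_j)_{j=1}^{d}$ is the canonical basis of $\mathbb{R}^d$.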
3 Path characteristic function loss
3.1 Path characteristic function (PCF)
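The unitary feature $\mathcal{U}_M(\mathbf{x})$ of a path $\mathbf{x}$ plays a role similar to that played by $e^{i\langle\lambda, x\rangle}$ to an $\mathbb{R}^d$-valued random variable. Thus, for a random path $\mathbf{X}$, the expected unitary feature $\mathbb{E}[\mathcal{U}_M(\mathbf{X})]$ can be viewed as the characteristic function for measures on the path space ([7]).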
Definition 3.1.
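Let $\mathbf{X}$ be an $\mathcal{X}$-valued random variable and $\mathbb{P}_{\mathbf{X}}$ be its measure. The path characteristic function (PCF) of $\mathbf{X}$ of order $m\in\mathbb{N}$ is the map $\Phi^{(m)}_{\mathbf{X}}: L(\mathbb{R}^d,\mathfrak{u}(m))\to\mathbb{C}^{m\times m}$ given by
$$\Phi^{(m)}_{\mathbf{X}}(M) := \mathbb{E}\big[\mathcal{U}_M(\mathbf{X})\big] = \int_{\mathcal{X}}\mathcal{U}_M(\mathbf{x})\,\mathrm{d}\mathbb{P}_{\mathbf{X}}(\mathbf{x}).$$
The path characteristic function (PCF) $\Phi_{\mathbf{X}}$ is defined by the natural grading: $\Phi_{\mathbf{X}}\big|_{L(\mathbb{R}^d,\mathfrak{u}(m))} = \Phi^{(m)}_{\mathbf{X}}$ for each $m\in\mathbb{N}$.
In the above, $\mathcal{U}_M(\mathbf{X})$ is the unitary feature of the path $\mathbf{X}$ under $M$. See Definition 2.1.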
Similarly to the characteristic function of $\mathbb{R}^d$-valued random variables, the PCF always exists. Moreover, we have the following important result, whose proof is presented in Appendix A.
Theorem 3.2 (Characteristicity).
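Let $\mathbf{X}$ and $\mathbf{Y}$ be $\mathcal{X}$-valued random variables. They have the same distribution (denoted as $\mathbf{X}\overset{d}{=}\mathbf{Y}$) if and only if $\Phi_{\mathbf{X}} = \Phi_{\mathbf{Y}}$.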
3.2 A new distance measure via PCF
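We now introduce a novel and natural distance metric, which measures the discrepancy between distributions on the path space via comparing their PCFs. Throughout, $d_{\mathrm{HS}}$ denotes the metric associated with the Hilbert–Schmidt norm $\|\cdot\|_{\mathrm{HS}}$ on $\mathbb{C}^{m\times m}$: $d_{\mathrm{HS}}(A, B) := \|A - B\|_{\mathrm{HS}} = \sqrt{\mathrm{tr}\big[(A-B)(A-B)^*\big]}$.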
Definition 3.3.
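Let $\mathbf{X},\mathbf{Y}:[0,T]\to\mathbb{R}^d$ be stochastic processes and $\mathbb{P}_{\mathcal{M}}$ be a probability distribution on $\mathfrak{u}(m)^d$ (recall Convention 2.3). Define the squared PCF-based distance (PCFD) between $\mathbf{X}$ and $\mathbf{Y}$ with respect to $\mathbb{P}_{\mathcal{M}}$ as
$$\mathrm{PCFD}^2_{\mathcal{M}}(\mathbf{X},\mathbf{Y}) = \mathbb{E}_{M\sim\mathbb{P}_{\mathcal{M}}}\Big[d^2_{\mathrm{HS}}\big(\Phi_{\mathbf{X}}(M), \Phi_{\mathbf{Y}}(M)\big)\Big]. \tag{5}$$
We shall not distinguish between $\mathcal{M}$ and $\mathbb{P}_{\mathcal{M}}$ for simplicity.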
PCFD exhibits several mathematical properties, which provide the theoretical justification for its efficacy as the discriminator on the space of measures on the path space, leading to empirical performance boost. First, PCFD has the characteristic property.
Lemma 3.4 (Separation of points).
Let and . Then there exists , such that if is a -valued random variable with full support, then .
Furthermore, has a simple uniform upper bound for any fixed :
Lemma 3.5.
Let be a -valued random variable. Then, for any -valued random variables and , it holds that
Under mild conditions, is a.e. differentiable with respect to a continuous parameter, thus ensuring the feasibility of gradient descent in training.
Theorem 3.6 (Lipschitz dependence on continuous parameter).
Let and be subsets of , be a metric space, be a Borel probability measure on , and be a Borel probability measure on . Assume that , is Lipschitz in such that . In addition, suppose that and . Then is Lipschitz in . Moreover, it holds that
for any , , , and .
Remark 3.7.
The parameter space is usually taken to be for some . In this case, by Rademacher’s theorem is a.e. differentiable in .
Similarly to metrics on measures over (cf. [2, 24]), we construct a metric based on PCFD, denoted as , on the space of Borel probability measures over the path space, and we prove that it metrises the weak-star topology on . Throughout, denotes the convergence in law.
Theorem 3.8 (Informal, convergence in law).
Let and be -valued random variables with measures supported in a compact subset of . Then .
3.3 Computing PCFD under empirical measures
Now, we shall illustrate how to compute the PCFD on the path space.
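Let $\{\mathbf{x}_i\}_{i=1}^{n}$ and $\{\mathbf{y}_i\}_{i=1}^{n}$ be i.i.d. drawn respectively from $\mathcal{X}$-valued random variables $\mathbf{X}$ and $\mathbf{Y}$. First, for any linear map $M\in L(\mathbb{R}^d,\mathfrak{u}(m))$, the empirical estimator of $\Phi_{\mathbf{X}}(M)$ is the average of unitary features of all observations $\{\mathbf{x}_i\}_{i=1}^{n}$, i.e., $\Phi_{\hat{\mathbf{X}}}(M) = \frac{1}{n}\sum_{i=1}^{n}\mathcal{U}_M(\mathbf{x}_i)$. We then parameterise the $\mathfrak{u}(m)^d$-valued random variable $\mathcal{M}$ via the empirical measure $\mathcal{M}_{\theta_M}$, i.e., $\mathcal{M}_{\theta_M} = \frac{1}{k}\sum_{i=1}^{k}\delta_{M_i}$, where $\theta_M := \{M_i\}_{i=1}^{k}$ are the trainable model parameters. Finally, define the corresponding empirical path characteristic function distance (EPCFD) as
$$\mathrm{EPCFD}^2_{\theta_M}(\hat{\mathbf{X}},\hat{\mathbf{Y}}) = \frac{1}{k}\sum_{i=1}^{k}\big\|\Phi_{\hat{\mathbf{X}}}(M_i) - \Phi_{\hat{\mathbf{Y}}}(M_i)\big\|^2_{\mathrm{HS}}. \tag{6}$$
Our approach to approximating $\mathcal{M}$ via the empirical distribution differs from that in [25], where $\mathcal{M}$ is parameterised by a mixture of Gaussian distributions. In §4.1 and §5, it is shown that, by optimising the empirical distribution, a moderately sized $k$ is sufficient for achieving superior performance, in contrast to a larger sample size required by [25].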
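To make the empirical computation concrete, the sketch below evaluates an EPCFD-style quantity on two batches of discrete time series: it averages unitary features per batch for each of $k$ linear maps and then averages the squared Hilbert–Schmidt distances. All names, the random (untrained) choice of the $k$ maps, and the toy random-walk data are assumptions for illustration. In the method described above the maps are trainable parameters.

```python
import numpy as np

def anti_hermitian(a):
    """Project complex square matrices (stacked on leading axes) onto anti-Hermitian ones."""
    return (a - np.conj(np.swapaxes(a, -1, -2))) / 2

def expm_anti_hermitian(a):
    """Batched exp(a) for anti-Hermitian a, using eigh on the Hermitian matrices -i*a."""
    w, v = np.linalg.eigh(-1j * a)
    return (v * np.exp(1j * w)[..., None, :]) @ np.conj(np.swapaxes(v, -1, -2))

def empirical_pcf(batch, theta):
    """Average unitary feature over a batch of series for one linear map.

    batch: (n, N+1, d) array; theta: (d, m, m) anti-Hermitian matrices.
    """
    n = batch.shape[0]
    m = theta.shape[-1]
    increments = np.diff(batch, axis=1)                       # (n, N, d)
    gens = np.einsum("nsd,dij->nsij", increments, theta)      # M(dx) for every step
    steps = expm_anti_hermitian(gens)                         # (n, N, m, m)
    feats = np.broadcast_to(np.eye(m, dtype=complex), (n, m, m)).copy()
    for s in range(steps.shape[1]):
        feats = feats @ steps[:, s]                           # ordered product over time
    return feats.mean(axis=0)

def epcfd_sq(x, y, thetas):
    """Mean over k linear maps of the squared Hilbert-Schmidt distance between empirical PCFs."""
    dists = []
    for theta in thetas:
        diff = empirical_pcf(x, theta) - empirical_pcf(y, theta)
        dists.append(np.sum(np.abs(diff) ** 2))
    return float(np.mean(dists))

rng = np.random.default_rng(0)
k, d, m = 4, 2, 3
thetas = anti_hermitian(rng.normal(size=(k, d, m, m)) + 1j * rng.normal(size=(k, d, m, m)))

def random_walks(n, scale):
    return np.cumsum(scale * rng.normal(size=(n, 8, d)), axis=1)

x, x2, y = random_walks(500, 0.1), random_walks(500, 0.1), random_walks(500, 0.5)
same = epcfd_sq(x, x2, thetas)        # two batches from the same law
different = epcfd_sq(x, y, thetas)    # batches with different step sizes
```

In this toy setting the value for two batches of the same law should be small, and the value for differently scaled walks noticeably larger. Since each empirical PCF is an average of unitary matrices, its Hilbert–Schmidt norm is at most $\sqrt{m}$, so the sketch's output cannot exceed $4m$.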
4 PCF-GAN for time series generation
4.1 Training of the EPCFD
In this subsection, we apply the EPCFD to GAN training for time series generation as the discriminator. We train the generator to minimise the EPCFD between true and synthetic data distribution, whereas the empirical distribution of $\mathcal{M}$ characterised by $\theta_M$ is optimised by maximising EPCFD.
By an abuse of notation, let (, resp.) denote the data (noise, resp.) space, composed of (, resp.) time series of length . As discussed in §2.2, and can be viewed as path spaces via linear interpolation. Like the standard GANs, our model is comprised of a generator and the discriminator , where is the model parameter of the discriminator, which fully characterises the empirical measure of . The pre-specified noise random variable is the discretised Brownian motion on with time mesh . The induced distribution of the fake data is given by . Hence, the min-max objective of our basic version PCF-GAN is
We apply mini-batch gradient descent to optimise the model parameters of the generator and discriminator in an alternating manner. In particular, to compute gradients of the discriminator parameter $\theta_M$, we use the efficient backpropagation-through-time algorithm introduced in [26], which effectively leverages the Lie group-valued outputs and the recurrence structure of the unitary feature. The initialisation of $\theta_M$ for the optimisation is outlined in Section B.4.1.
Learning time-dependent Ornstein–Uhlenbeck process
Following [19], we apply the proposed PCF-GAN to the toy example of learning the distribution of synthetic time series data simulated via the time-dependent Ornstein–Uhlenbeck (OU) process. Let be an -valued stochastic process described by the SDE, i.e., where is 1D Brownian motion and is the standard normal distribution. We set , , and time discretisation . We generate 10000 samples from to , down-sampled at each integer time point. Figure 2 shows that the synthetic data generated by our GAN model, which uses the EPCFD discriminator, is visually indistinguishable from true data. Also, our model accurately captures the marginal distribution at various time points.
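A simulation of this kind of toy dataset can be sketched with an Euler–Maruyama scheme. The SDE form $\mathrm{d}X_t = (\mu t - \theta X_t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t$ with $X_0\sim\mathcal{N}(0,1)$ is assumed here as the usual time-dependent OU dynamics. The parameter values, time horizon, and number of paths in the code are hypothetical placeholders, not the values used in the experiment.

```python
import numpy as np

def simulate_time_dependent_ou(n_paths, t_end, dt, mu, theta, sigma, rng):
    """Euler-Maruyama scheme for dX_t = (mu * t - theta * X_t) dt + sigma dB_t, X_0 ~ N(0, 1)."""
    n_steps = int(round(t_end / dt))
    x = np.empty((n_paths, n_steps + 1))
    x[:, 0] = rng.normal(size=n_paths)
    for i in range(n_steps):
        t = i * dt
        db = rng.normal(scale=np.sqrt(dt), size=n_paths)   # Brownian increment over dt
        x[:, i + 1] = x[:, i] + (mu * t - theta * x[:, i]) * dt + sigma * db
    return x

rng = np.random.default_rng(0)
dt = 0.1
# Hypothetical parameter values, chosen only so that the example runs.
paths = simulate_time_dependent_ou(1000, t_end=10.0, dt=dt, mu=0.02, theta=0.1, sigma=0.4, rng=rng)
stride = int(round(1 / dt))
downsampled = paths[:, ::stride]   # keep integer time points only
```

Simulating on a fine grid and then keeping every `stride`-th point mirrors the down-sampling step described above.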
4.2 PCF-GAN: learning with PCFD and sequential embedding
In order to effectively learn the distribution of high-dimensional or complex time series, using solely the EPCFD loss as the GAN discriminator fails to be the best approach, due to the computational limitations imposed by the sample size and the order of EPCFD. To overcome this issue, we adopt the approach of [41, 25], and train a generator that matches the distribution of the embedding of time series via the auto-encoder structure. Figure 3 illustrates the mechanics of our model.
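To proceed, let us first recall the generator $G_{\theta_g}:\mathcal{Z}\to\mathcal{X}$ and introduce the embedding layer $F_{\theta_f}$, which maps $\mathcal{X}$ to $\mathcal{Z}$ (the noise space). Here $\theta_f$ denotes the model parameters of the embedding layer, which will be learned from data. To this end, it is natural to optimise the model parameters $\theta_g$ of the generator by minimising the generative loss $L_{\mathrm{generator}}$, which is the EPCFD of the embedding between the true distribution $X$ and the synthetic distribution $G_{\theta_g}(Z)$; in formula,
$$L_{\mathrm{generator}}(\theta_g, \theta_M, \theta_f) = \mathrm{EPCFD}_{\theta_M}\big(F_{\theta_f}(X),\ F_{\theta_f}(G_{\theta_g}(Z))\big). \tag{7}$$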
Encoder-decoder structure: The motivation to consider the auto-encoder structure is based on the observation that the embedding might degenerate when optimising $L_{\mathrm{generator}}$. For example, no matter whether the true and synthetic distributions agree or not, $F_{\theta_f}$ could simply be a constant function to achieve the perfect generator loss 0. Such a degeneracy can be prohibited if $F_{\theta_f}$ is injective. In heuristic terms, a “good” embedding should capture essential information about the real time series $X$ and allow the reconstruction of the time series $X$ from its embedding $F_{\theta_f}(X)$. This motivates us to train the embedding such that $F_{\theta_f}\circ G_{\theta_g}$ is close to the identity map. If this condition is satisfied, it implies that $F_{\theta_f}$ and $G_{\theta_g}$ are pseudo-inverses of each other, thereby ensuring the desired injectivity. In this way, $F_{\theta_f}$ and $G_{\theta_g}$ serve as the encoder and decoder of raw data, respectively.
To impose the injectivity of , we consider two additional loss functions for training as follows:
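Reconstruction loss $L_{\mathrm{recovery}}$: It is defined as the samplewise distance between the original and reconstructed noise by $F_{\theta_f}\circ G_{\theta_g}$, i.e., $L_{\mathrm{recovery}} = \mathbb{E}\big[\,|Z - F_{\theta_f}(G_{\theta_g}(Z))|^2\,\big]$. Note that $L_{\mathrm{recovery}} = 0$ implies that $F_{\theta_f}(G_{\theta_g}(z)) = z$, for any sample $z$ in the support of $Z$ almost surely.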
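Regularization loss $L_{\mathrm{regularization}}$: It is proposed to match the distribution of the original noise variable $Z$ and the embedding of the true distribution $X$. It is motivated by the observation that if the generator is perfect, i.e., $G_{\theta_g}(Z) = X$, and $F_{\theta_f}\circ G_{\theta_g}$ is the identity map, then $Z = F_{\theta_f}(X)$. Specifically,
$$L_{\mathrm{regularization}} = \mathrm{EPCFD}_{\theta'_M}\big(Z,\ F_{\theta_f}(X)\big), \tag{8}$$
where we distinguish $\theta'_M$ from $\theta_M$ in $L_{\mathrm{generator}}$. The regularization loss effectively stabilises the training and resolves the mode collapse [41] due to the lack of injectivity of the embedding.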
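The role of the reconstruction loss can be seen in a toy finite-dimensional sketch. It assumes a squared Euclidean samplewise distance, and it uses a linear generator and linear embeddings on vectors rather than sequential networks on time series, purely for illustration. A left inverse of the generator drives the loss to zero, whereas a constant embedding does not.

```python
import numpy as np

def reconstruction_loss(z, generator, embedding):
    """Mean squared distance between noise samples z and embedding(generator(z))."""
    z_rec = embedding(generator(z))
    per_sample = np.sum((z - z_rec) ** 2, axis=tuple(range(1, z.ndim)))
    return float(np.mean(per_sample))

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 3))                     # toy injective linear generator: R^3 -> R^5
a_pinv = np.linalg.pinv(a)                      # left inverse: a_pinv @ a = identity on R^3

def generator(z):
    return z @ a.T

def good_embedding(x):
    return x @ a_pinv.T                         # recovers the noise exactly

def constant_embedding(x):
    return np.zeros((x.shape[0], 3))            # degenerate embedding discussed above

z = rng.normal(size=(256, 3))
loss_good = reconstruction_loss(z, generator, good_embedding)
loss_constant = reconstruction_loss(z, generator, constant_embedding)
```

The constant embedding is exactly the degenerate case that attains zero generator loss, and a positive reconstruction loss penalises it.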
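Training the embedding parameters $\theta_f$: The embedding layer $F_{\theta_f}$ aims to not only discriminate the real and fake data distributions as a critic, but also preserve injectivity. Hence we optimise the embedding parameter $\theta_f$ by the following hybrid loss function:
$$\max_{\theta_f}\ \big(L_{\mathrm{generator}} - \lambda_1 L_{\mathrm{recovery}} - \lambda_2 L_{\mathrm{regularization}}\big), \tag{9}$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters that balance the three losses.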
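Training the EPCFD parameters $(\theta_M, \theta'_M)$: Note that $L_{\mathrm{generator}}$ and $L_{\mathrm{regularization}}$ have trainable parameters of EPCFD, i.e., $\theta_M$ and $\theta'_M$. Similar to the basic PCF-GAN, we optimise $\theta_M$ and $\theta'_M$ by maximising the EPCFD to improve the discriminative power:
$$\max_{\theta_M}\ L_{\mathrm{generator}}, \qquad \max_{\theta'_M}\ L_{\mathrm{regularization}}. \tag{10}$$
By doing so, we enhance the discriminative power of $\mathrm{EPCFD}_{\theta_M}$ and $\mathrm{EPCFD}_{\theta'_M}$. Consequently, this facilitates the training of the generator such that the embedding of the true data aligns with both the noise distribution and the reconstructed noise distribution.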
The differentiability of EPCFD with respect to the parameters of the embedding layer and the generator is guaranteed by Theorem 3.6, as long as these networks satisfy the Lipschitz condition thereof. Let us also stress two key advantages of our proposed PCF-GAN. First, it possesses the ability to generate synthetic time series with reconstruction functionality, thanks to the auto-encoder structure in PCF-GAN. Second, by virtue of the uniform boundedness of PCFD shown in Lemma 3.5, our PCF-GAN does not require any additional gradient constraints on the embedding layer and EPCFD parameters, in contrast to other MMD-based GANs and Wasserstein-GAN. This helps training efficiency and alleviates the vanishing gradient problem in training sequential networks like RNNs.
We provide the pseudo-code for the proposed PCF-GAN in Algorithm 1.
5 Numerical Experiments
To validate its efficacy, we apply our proposed PCF-GAN to a broad range of time series data and benchmark with state-of-the-art GANs for time series generation using various test metrics. Full details on numerics (dataset, evaluation metrics, and hyperparameter choices) are in Appendix C. Additional ablation studies and visualisations of generated samples are reported in Appendix D.
Baselines: We take Recurrent GAN (RGAN) [12], TimeGAN [45], and COT-GAN [43] as benchmarking models. These are representatives of GANs exhibiting strong empirical performance for time series generation. For fairness, we compare our model to the baselines while fixing the generators and embedding/discriminator to be the common sequential neural network (2 layers of LSTMs).
Dataset: We benchmark our model on four different time series datasets with various characteristics: dimensions, sample frequency, periodicity, noise level, and correlation. (1) Rough Volatility: High-frequency synthetic time series data with a low noise-to-signal ratio. (2) Stock: The daily historical data on ten publicly traded stocks from 2013 to 2021, including as features the volume and high, low, opening, closing, and adjusted closing prices. (3) Beijing Air Quality [47]: A UCI multivariate time series of hourly air pollutant data from different monitoring sites. (4) EEG Eye State [38]: A UCI dataset of high-frequency, continuous EEG eye measurements. We summarise the key statistics of the datasets in Table 1.
Dataset | Dimension | Length | Sample rate | Auto-cor (lag 1) | Auto-cor (lag 5) | Cross-cor |
---|---|---|---|---|---|---|
RV | 2 | 200 | - | 0.967 | 0.916 | -0.014 |
Stock | 5 | 20 | 1day | 0.958 | 0.922 | 0.604 |
Air | 10 | 24 | 1hour | 0.947 | 0.752 | 0.0487 |
EEG | 14 | 20 | 8ms | 0.517 | 0.457 | 0.418 |
Evaluation metrics: The following three metrics are used to assess the quality of generative models. For time series generation/reconstruction, we compare the true and fake/reconstructed distributions via the test metrics below. (1) Discriminative score [45]: We train a post-hoc classifier to distinguish true and fake data. We report the classification error on the test data. A better generative model yields a lower classification error, as it means that the classifier struggles to differentiate between true and fake data. (2) Predictive score [45, 12]: We train a post-hoc sequence-to-sequence regression model on the generated data to predict the latter part of a time series given the first part. We then evaluate and report the mean square error (MSE) on the true time series data. A lower MSE indicates that the generated data is better suited to training a predictive model. (3) Sig-MMD [9, 42]: We use MMD with the signature feature as a generic metric on time series distributions. Smaller values, indicating closer distributions, are better. To compute the three evaluation metrics, we randomly generate 10,000 samples from each of the true and synthetic (reconstructed) distributions. The mean and standard deviation of each metric over 10 repeated random samplings are reported.
5.1 Time series generation
Table 2 indicates that PCF-GAN consistently outperforms the other baselines across all datasets, as demonstrated by all three test metrics. Specifically, in terms of the discriminative score, PCF-GAN achieves a remarkable performance with values of 0.0108 and 0.0784 on the Rough Volatility and Stock datasets, respectively. These values are approximately 60% and 39% lower than those achieved by the second-best model. Regarding the predictive score, PCF-GAN achieves the best result across all four datasets. While COT-GAN surpasses PCF-GAN in terms of the Sig-MMD metric on the EEG dataset, PCF-GAN consistently outperforms the other models on the remaining three datasets. Additionally, to assess the fitting on auto-correlation, cross-correlation and marginal distribution, we include the corresponding numerical results in Table 4 in Appendix D.4. For a qualitative analysis of generative quality, we provide the visualisations of generated samples for all models and datasets in Appendix D without selective bias. Furthermore, to showcase the effectiveness of our auto-encoder architecture for the generation task, we present an ablation study in Appendix D.
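The relative improvement in discriminative score over the second-best model can be recomputed directly from the Table 2 entries. The short script below does this for the Rough Volatility and Stock datasets. It assumes that the first four digits of each table cell are the mean score, and the helper name `relative_reduction` is introduced only for this example.

```python
# Mean discriminative scores for the generation task, copied from Table 2.
scores = {
    "RV": {"RGAN": 0.0271, "COT-GAN": 0.0499, "TimeGAN": 0.0327, "PCF-GAN": 0.0108},
    "Stock": {"RGAN": 0.1283, "COT-GAN": 0.4966, "TimeGAN": 0.3286, "PCF-GAN": 0.0784},
}

def relative_reduction(dataset_scores, model="PCF-GAN"):
    """Fractional reduction of `model`'s score relative to the best competing model."""
    ours = dataset_scores[model]
    runner_up = min(v for name, v in dataset_scores.items() if name != model)
    return (runner_up - ours) / runner_up

reductions = {name: relative_reduction(s) for name, s in scores.items()}
# reductions["RV"] is about 0.60 and reductions["Stock"] is about 0.39
```

In both cases the second-best model is RGAN, so the reductions are measured against 0.0271 and 0.1283 respectively.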
Task: the RGAN, COT-GAN, TimeGAN, and PCF-GAN columns report generation; the columns marked (R) report reconstruction. Entries are mean ± standard deviation.
Dataset | Test metric | RGAN | COT-GAN | TimeGAN | PCF-GAN | TimeGAN (R) | PCF-GAN (R) |
---|---|---|---|---|---|---|---|
RV | Discriminative | .0271 ± .048 | .0499 ± .068 | .0327 ± .019 | .0108 ± .006 | .5000 ± .000 | .2820 ± .082 |
RV | Predictive | .0393 ± .000 | .0395 ± .000 | .0395 ± .001 | .0390 ± .000 | .0590 ± .003 | .0398 ± .001 |
RV | Sig-MMD | .0163 ± .004 | .0116 ± .003 | .0027 ± .004 | .0024 ± .001 | 3.308 ± 1.34 | .0960 ± .050 |
Stock | Discriminative | .1283 ± .015 | .4966 ± .002 | .3286 ± .063 | .0784 ± .028 | .4943 ± .002 | .3181 ± .038 |
Stock | Predictive | .0132 ± .000 | .0144 ± .000 | .0139 ± .000 | .0125 ± .000 | .1180 ± .012 | .0127 ± .000 |
Stock | Sig-MMD | .0248 ± .008 | .0029 ± .000 | .0272 ± .006 | .0017 ± .000 | .7587 ± .186 | .0078 ± .004 |
Air | Discriminative | .4549 ± .012 | .4992 ± .002 | .3460 ± .025 | .2326 ± .058 | .4999 ± .000 | .4140 ± .013 |
Air | Predictive | .0261 ± .001 | .0260 ± .001 | .0256 ± .000 | .0237 ± .000 | .0619 ± .004 | .0289 ± .000 |
Air | Sig-MMD | .0456 ± .015 | .0128 ± .002 | .0146 ± .026 | .0126 ± .005 | .4141 ± .078 | .0359 ± .012 |
EEG | Discriminative | .4908 ± .003 | .4931 ± .007 | .4771 ± .008 | .3660 ± .025 | .5000 ± .000 | .4959 ± .003 |
EEG | Predictive | .0315 ± .000 | .0304 ± .000 | .0342 ± .001 | .0246 ± .000 | .0499 ± .001 | .0328 ± .001 |
EEG | Sig-MMD | .0602 ± .010 | .0102 ± .002 | .0640 ± .025 | .0180 ± .004 | .0700 ± .021 | .0641 ± .019 |
5.2 Time series reconstruction
As TimeGAN is the only baseline model incorporating reconstruction capability, for reconstruction tasks we only compare with TimeGAN. The reconstructed examples of time series using both PCF-GAN and TimeGAN are shown in Figure 4; see Appendix D for more samples.
Visually, the PCF-GAN achieves better reconstruction results than TimeGAN by producing more accurate reconstructed time series samples. Notably, the reconstructed samples from PCF-GAN preserve the temporal dependency of original time series for all four datasets, while some reconstructed samples from TimeGAN in EEG and Stock datasets are completely mismatched. This is further quantified in Table 2 on the reconstruction task, where the reconstructed samples from PCF-GAN consistently outperform those from TimeGAN in terms of all test metrics.
5.3 Training stability and efficiency
Figure 5 shows the training progress of the PCF-GAN on the RV dataset. Compared to the fluctuating generator loss typically observed in traditional GANs, the PCF-GAN yields better convergence by leveraging the autoencoder structure. This is achieved by minimising the reconstruction and regularisation losses, which ensures the injectivity of $F_{\theta_f}$ and enables production of a semantic embedding throughout the training process. The decay of the generator loss in the embedding space directly reflects the improvement in the quality of the generated time series. This is particularly useful for debugging and conducting hyperparameter searches. Furthermore, the decay in both the recovery and regularisation losses signifies the enhanced performance of the autoencoder.
By leveraging the effective critic $F_{\theta_f}$, we achieve enhanced performance with a moderate increase in parameters (ranging from 1200 to 6400) within $\theta_M$ and $\theta'_M$ of EPCFD. Training these additional parameters is highly efficient, and PCF-GAN still outperforms all baseline models. Specifically, our algorithm is approximately twice as fast as TimeGAN (using three extra critic modules) and three times as fast as COT-GAN (with one additional critic module and the Sinkhorn algorithm). However, it takes 1.5 times as long as RGAN due to the extra training required on $\theta_M$ and $\theta'_M$.
6 Conclusion & Broader impact
Conclusion We introduce a novel, principled and efficient PCF-GAN model based on PCF for generating high-fidelity sequential data. With theoretical support, it achieves state-of-the-art generative performance with additional reconstruction functionality in various tasks of time series generation.
Limitation and future work In this work, we use LSTM-based networks for the autoencoder and do not explore other sequential models (e.g., transformers). A suitable choice of network architecture for the autoencoder may further improve the efficacy of the proposed PCF-GAN on more complicated data, e.g., video and skeletal human action sequences, which merits further investigation. As a distance metric on time series, PCFD can be flexibly combined with other advanced generators of time series GAN models, and hence may further improve their performance. For example, one can replace the average cross-entropy loss used in [17, 39] and the Wasserstein distance in [36] by PCFD, with some simple modifications to the discriminators. Furthermore, although we establish the link between PCFD and MMD, it would be interesting to design efficient algorithms to compute the kernel specified in Section B.3.
Broader impact Like other GAN models, this model has the potential to aid data-hungry algorithms by augmenting small datasets. Additionally, it can enable data sharing in domains such as finance and healthcare, where sensitive time series data is plentiful. However, it is important to acknowledge that the generation of synthetic data also carries the risk of potential misuse (e.g. generating fake news).
Acknowledgments and Disclosure of Funding
The research of SL is supported by NSFC (National Natural Science Foundation of China) Grant No. 12201399, and the Shanghai Frontiers Science Center of Modern Analysis. This research project is also supported by SL’s visiting scholarship at New York University-Shanghai. HN is supported by the EPSRC under the program grant EP/S026347/1 and The Alan Turing Institute under the EPSRC grant EP/N510129/1. LH is supported by University College London and the China Scholarship Council under the UCL-CSC scholarship (No. 201908060002). SL and HN are supported by the SJTU-UCL joint seed fund WH610160507/067. HN and HL are supported by the Ecosystem Leadership Award under the EPSRC Grant OobfJ22100020 and The Alan Turing Institute in part. HN is grateful to Jiajie Tao and Zijiu Lyu for proofreading the paper.
References
- [1] Abdul Fatir Ansari, Jonathan Scarlett, and Harold Soh. A characteristic function approach to deep implicit generative modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7478–7487, 2020.
- [2] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
- [3] Samuel A Assefa, Danial Dervovic, Mahmoud Mahfouz, Robert E Tillman, Prashant Reddy, and Manuela Veloso. Generating synthetic data in finance: opportunities, challenges and pitfalls. In Proceedings of the First ACM International Conference on AI in Finance, pages 1–8, 2020.
- [4] Steven M Bellovin, Preetam K Dutta, and Nathan Reitinger. Privacy and synthetic datasets. Stan. Tech. L. Rev., 22:1, 2019.
- [5] Lukas Biewald. Experiment tracking with weights and biases, 2020. Software available from wandb.com.
- [6] Horatio Boedihardjo and Xi Geng. SL_2(R)-developments and signature asymptotics for planar paths with bounded variation. arXiv preprint arXiv:2009.13082, 2020.
- [7] Ilya Chevyrev and Terry Lyons. Characteristic functions of measures on geometric rough paths. The Annals of Probability, 44(6):4049–4082, 2016.
- [8] Ilya Chevyrev, Vidit Nanda, and Harald Oberhauser. Persistence paths and signature features in topological data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):192–202, 2018.
- [9] Ilya Chevyrev and Harald Oberhauser. Signature moments to characterize laws of stochastic processes. Journal of Machine Learning Research, 23(176):1–42, 2022.
- [10] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- [11] Kacper P Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, and Arthur Gretton. Fast two-sample testing with analytic representations of probability measures. Advances in Neural Information Processing Systems, 28:1981–1989, 2015.
- [12] Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633, 2017.
- [13] Marián Fabian, Petr Habala, Petr Hájek, Vicente Montesinos Santalucía, Jan Pelant, and Václav Zizler. Functional analysis and infinite-dimensional geometry, volume 8 of CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer-Verlag, New York, 2001.
- [14] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems, 30:5769–5779, 2017.
- [15] Ben Hambly and Terry Lyons. Uniqueness for the signature of a path of bounded variation and the reduced path group. Annals of Mathematics, pages 109–167, 2010.
- [16] Christopher R Heathcote. The integrated squared error estimation of parameters. Biometrika, 64(2):255–264, 1977.
- [17] Jinsung Jeon, Jeonghak Kim, Haryong Song, Seunghyeon Cho, and Noseong Park. GT-GAN: General purpose time series synthesis with generative adversarial networks. Advances in Neural Information Processing Systems, 35:36999–37010, 2022.
- [18] Patrick Kidger, Patric Bonnier, Imanol Perez Arribas, Cristopher Salvi, and Terry Lyons. Deep signature transforms. Advances in Neural Information Processing Systems, 32:3082–3092, 2019.
- [19] Patrick Kidger, James Foster, Xuechen Li, and Terry J Lyons. Neural SDEs as infinite-dimensional GANs. In International Conference on Machine Learning, pages 5453–5463. PMLR, 2021.
- [20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [21] Achim Klenke. Probability theory: a comprehensive course. Springer Science & Business Media, 2013.
- [22] Erich Leo Lehmann, Joseph P Romano, and George Casella. Testing statistical hypotheses, volume 3. Springer, 2005.
- [23] Daniel Levin, Terry Lyons, and Hao Ni. Learning from the past, predicting the statistics for the future, learning an evolving system. arXiv preprint arXiv:1309.0260, 2013.
- [24] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. Advances in Neural Information Processing Systems, 30:2200–2210, 2017.
- [25] Shengxi Li, Zeyang Yu, Min Xiang, and Danilo Mandic. Reciprocal adversarial learning via characteristic functions. Advances in Neural Information Processing Systems, 33:217–228, 2020.
- [26] Hang Lou, Siran Li, and Hao Ni. Path development network with finite-dimensional Lie group representation. arXiv preprint arXiv:2204.00740, 2022.
- [27] Terry Lyons. Rough paths, signatures and the modelling of functions on streams. arXiv preprint arXiv:1405.4537, 2014.
- [28] Terry J Lyons. Differential equations driven by rough signals. Revista Matemática Iberoamericana, 14(2):215–310, 1998.
- [29] Terry J Lyons, Michael Caruana, and Thierry Lévy. Differential equations driven by rough paths. Springer, 2007.
- [30] Terry J Lyons and Weijun Xu. Hyperbolic development and inversion of signature. Journal of Functional Analysis, 272(7):2933–2955, 2017.
- [31] Hao Ni, Lukasz Szpruch, Marc Sabate-Vidales, Baoren Xiao, Magnus Wiese, and Shujian Liao. Sig-Wasserstein GANs for time series generation. In Proceedings of the Second ACM International Conference on AI in Finance, pages 1–8, 2021.
- [32] Hao Ni, Lukasz Szpruch, Magnus Wiese, Shujian Liao, and Baoren Xiao. Conditional Sig-Wasserstein GANs for time series generation. arXiv preprint arXiv:2006.05421, 2020.
- [33] Kalyanapuram R. Parthasarathy. Probability measures on metric spaces, volume 3 of Probability and Mathematical Statistics. Academic Press, Inc., New York-London, 1967.
- [34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
- [35] Vibhor Rastogi and Suman Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 735–746, 2010.
- [36] Carl Remlinger, Joseph Mikael, and Romuald Elie. Conditional loss and deep Euler scheme for time series generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8098–8105, 2022.
- [37] Jinfu Ren, Yang Liu, and Jiming Liu. EWGAN: Entropy-based Wasserstein GAN for imbalanced learning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 10011–10012, 2019.
- [38] Oliver Roesler, Lucas Bader, Jan Forster, Yoshikatsu Hayashi, Stefan Heßler, and David Suendermann-Oeft. Comparison of EEG devices for eye state classification. Proc. of the AIHLS, 2014.
- [39] Ali Seyfi, Jean-Francois Rajotte, and Raymond Ng. Generating multivariate time series with COmmon Source CoordInated GAN (COSCI-GAN). Advances in Neural Information Processing Systems, 35:32777–32788, 2022.
- [40] Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517–1561, 2010.
- [41] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. Advances in Neural Information Processing Systems, 30:3310–3320, 2017.
- [42] Csaba Toth and Harald Oberhauser. Bayesian learning from sequential data using Gaussian processes with signature covariances. In International Conference on Machine Learning, pages 9548–9560. PMLR, 2020.
- [43] Tianlin Xu, Li Kevin Wenliang, Michael Munn, and Beatrice Acciaio. COT-GAN: Generating sequential data via causal optimal transport. Advances in Neural Information Processing Systems, 33:8798–8809, 2020.
- [44] Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in GAN training. In International Conference on Learning Representations, 2019.
- [45] Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. Advances in Neural Information Processing Systems, 32:5509–5519, 2019.
- [46] Kôsaku Yosida. Functional analysis. Sixth edition, volume 123 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin-New York, 1980.
- [47] Shuyi Zhang, Bin Guo, Anlan Dong, Jing He, Ziping Xu, and Song Xi Chen. Cautionary tales on air-quality improvement in Beijing. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2205):20170457, 2017.
- [48] Robert J Zimmer. Essential results of functional analysis. University of Chicago Press, 1990.
In Appendix A, we collect some notation and properties of paths and of the unitary feature of a path. Appendix B gives a thorough introduction to the distance function via the path characteristic function, together with detailed proofs of the theoretical results on PCFD. Appendix C discusses experimental details and Appendix D presents supplementary numerical results.
Appendix A Preliminaries
A.1 Paths with bounded variation
Definition A.1.
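Let $X:[0,T]\to\mathbb{R}^d$ be a continuous path. The total variation of $X$ on the interval $[0,T]$ is defined by
$$\|X\|_{[0,T]} := \sup_{\mathcal{D}} \sum_{i=0}^{N-1} \left|X_{t_{i+1}} - X_{t_i}\right|, \tag{11}$$
where the supremum is taken over all finite partitions $\mathcal{D} = \{0 = t_0 < t_1 < \dots < t_N = T\}$ of $[0,T]$. When $\|X\|_{[0,T]}$ is finite, we say that $X$ is a path of bounded variation (BV-path) on $[0,T]$ and denote $X \in \mathrm{BV}([0,T];\mathbb{R}^d)$.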
BV-paths can be defined without the continuity assumption, but we shall not seek greater generality in this work. It is well known that
defines a norm (the BV-norm). There is a more general notion of paths of finite $p$-variation for $p \geq 1$ (see [29]), where the case $p = 1$ corresponds to the BV-paths discussed above. We restrict ourselves to $p = 1$, as this is sufficient for the study of sequential data in practice, which are treated as piecewise linear approximations of continuous paths.
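For sequential data, the total variation in Definition A.1 reduces to a finite sum. The following minimal sketch assumes the data are stored as a NumPy array of shape (time steps, dimension) and interpreted as the piecewise linear path through those points; the function name `total_variation` is ours.

```python
import numpy as np

def total_variation(path):
    """Total variation of the piecewise linear path through the rows of `path`.

    `path` has shape (n_steps, d). For a piecewise linear path the supremum
    over partitions is attained on the sampling grid, so the total variation
    is the sum of the Euclidean lengths of the increments.
    """
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)            # shape (n_steps - 1, d)
    return float(np.linalg.norm(increments, axis=1).sum())

# An L-shaped path in the plane: one segment of length 3, one of length 4.
x = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]])
print(total_variation(x))
```

Refining the grid by adding points on existing segments leaves the value unchanged, which is the discrete counterpart of taking the supremum over partitions.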
Definition A.2.
(Concatenation of paths) Let and be two continuous paths. Their concatenation denoted as the path is defined by
Definition A.3 (Tree-like equivalence).
A continuous path is called tree-like if there is an -tree , a continuous function , and a function such that and .
Let denote the time-reversal of continuous path , namely that . We say that and are in tree-like equivalence (denoted as ) if is tree-like.
An important example is when path is a time re-parameterisation of . That is, for , take a nondecreasing surjection , and take .
A.2 Matrix groups and algebras
The unitary group and symplectic group are subsets of the space of matrices:
where and is the identity. Their corresponding Lie algebras are
The unitary group is compact and is a group of isometries of matrix multiplication with respect to the Hilbert–Schmidt norm. Such properties are crucial for establishing theorems and properties related to the path characteristic function (PCF), as discussed in subsequent sections.
The compact symplectic group is the simply-connected maximal compact real Lie subgroup of . It is the real form of , and satisfies
Note that and are both real Lie groups, albeit they have complex entries in general.
A.3 Unitary feature of a path
Recall Definition 2.1 for the unitary feature, reproduced below:
Definition A.4.
Let be a linear map and let be a BV-path. The unitary feature [a.k.a. the path development on the unitary group ] of under is the solution to the equation
We write .
Definition 2.1 is motivated by [7, §4]. Consider with ranging over all finite-dimensional complex Hilbert spaces. Extend by naturality to the tensor algebra over ; that is, define by linearity and the following rule:
Then denote by the totality of such . Any element in is a unitary representation of the Lie group . See [7, p.4059].
The following two lemmas are contained in [26].
Lemma A.5.
[Multiplicativity] Let and . Denote by their concatenation: for and for . Then for all .
We shall compute by Lemma A.5 and Example 2.2 the unitary feature of piecewise linear paths.
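As an illustration of this computation, the sketch below evaluates the unitary feature of a piecewise linear path as an ordered product of matrix exponentials. It rests on assumptions that are ours rather than the paper's code: the linear map is stored as `d` anti-Hermitian matrices (`generators`), a linear piece with increment `dx` contributes `exp(M(dx))`, and the helper names are hypothetical. The matrix exponential is computed through an eigendecomposition so that only NumPy is needed.

```python
import numpy as np

def random_anti_hermitian(n, rng):
    """Draw an n x n anti-Hermitian matrix (A^* = -A) with Gaussian entries."""
    z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (z - z.conj().T) / 2

def exp_anti_hermitian(a):
    """Matrix exponential of an anti-Hermitian matrix.

    If A is anti-Hermitian then H = -iA is Hermitian, H = V diag(w) V^*,
    and exp(A) = V diag(exp(i w)) V^*, which is unitary.
    """
    w, v = np.linalg.eigh(-1j * a)
    return (v * np.exp(1j * w)) @ v.conj().T

def unitary_feature(path, generators):
    """Unitary feature of the piecewise linear path through the rows of `path`.

    `generators` has shape (d, n, n); the linear map sends v in R^d to
    sum_j v[j] * generators[j]. The factors exp(M(dx)) of the linear pieces
    are multiplied in time order, following the multiplicativity lemma.
    """
    y = np.eye(generators.shape[-1], dtype=complex)
    for dx in np.diff(path, axis=0):
        y = y @ exp_anti_hermitian(np.tensordot(dx, generators, axes=1))
    return y

rng = np.random.default_rng(0)
M = np.stack([random_anti_hermitian(3, rng) for _ in range(2)])  # d = 2, n = 3
x = rng.normal(size=(6, 2)).cumsum(axis=0)                       # a random walk
U = unitary_feature(x, M)
print(np.allclose(U @ U.conj().T, np.eye(3)))
```

Because only increments enter the product, the result is unchanged when extra points are inserted on existing segments, which is the discrete form of invariance under time re-parameterisation.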
Lemma A.6 (Invariance under time-reparametrisation).
Let and let be a non-decreasing -diffeomorphism from onto . Define for . Then, for all and for every , it holds that
A key property of the unitary feature is that it completely determines the law of random paths:
Theorem A.7 (Uniqueness of unitary feature).
For any two paths in , there exists an with some such that .
Proof.
For in , by uniqueness of signature over BV-paths (cf. [15]) one has in . Here we use the fact that the signatures of BV-paths are group-like elements in the tensor algebra. Then, as separates points over (cf. [7, Theorem 4.8]), there is such that ; hence . Therefore, by considering the -valued equation with , we conclude that . ∎
Theorem A.8 (Universality of unitary feature).
Let be a compact subset. For any continuous function and any , there exists an and finitely many as well as , such that
(12) |
Proof.
It follows from [26, Theorem A.4] and the universality of signature in [8] that Eq. (12) holds with and and in place of . By a simple approximation via restricting the ranges of and domains of , we may obtain (without relabelling) and that verify Eq. (12). We conclude by the flag structure of and . ∎
Appendix B Path Characteristic loss
B.1 Path Characteristic function
Theorem B.1.
Let be a -valued random variable with associated probability measure . The path characteristic function uniquely characterises .
B.2 Distance metric via path characteristic function
Lemma B.2.
in Eq. (5) defines a pseudometric on the path space for any and . In addition, suppose that is a countable dense subset in . Then the following defines a metric on :
(13) |
In Lemma B.2 above, where is the completion of the projective tensor product and is a Banach space under the norm . Here denotes the -projection of on . Therefore, such a sequence always exists since , being the space of Borel probability measures over a Polish space, is itself a Polish space. See [33].
Proof.
Non-negativity, symmetry, and that are clear. That follows from the triangle inequality of the Hilbert–Schmidt norm and the linearity of expectation. This shows that is a pseudometric for each .
In addition, implies that
So, if is supported on the whole of , then for any .
Now, by density of in , there exists a subsequence such that has full support on for each . Thus, implies that on a dense subset of for every . We conclude by the characteristicity Theorem 3.2 and a continuity argument. ∎
Lemma B.3 (Lemma 3.5).
Let be an -valued random variable. Then for any -valued random variables and , it holds that
Proof.
As is a Hilbert space, from the Pythagorean theorem one deduces that . Both , are expectations of -valued random variables, and for . Thus . We take expectation over to conclude. ∎
The result below is formulated in terms of the Hilbert–Schmidt norm of matrices in . Any other norm on is equivalent to that, modulo a constant depending on only. In fact, the strict inequality for holds. See, e.g., [48, Lemma 3.1.10, p.55].
Lemma B.4.
Let be two linear paths, and let as before. Denote by the usual Euclidean norm on and the operator norm of . Then we have
Proof.
Let with and . This is the linear interpolation between and . Then we have
thanks to an identity for differentiation of matrix exponential and the inequality . Here and take values in , hence of operator norm 1 for any parameters . So we infer that
where the first inequality holds for Bochner integrals. See [46, Corollary 1, p.133]. ∎
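One way to spell out the steps of this argument is sketched below. The notation is an assumption on our part: $\Delta x$ and $\Delta y$ denote the increments of the two linear paths, $A(s)$ is our name for the interpolation, and $\|M\|_{\mathrm{op}}$ is taken with respect to the Euclidean norm on $\mathbb{R}^d$ and the Hilbert–Schmidt norm on the target, so constants in the original statement may be normalised differently.

```latex
\begin{align*}
A(s) &:= (1-s)\,M(\Delta x) + s\,M(\Delta y), \qquad A'(s) = M(\Delta y - \Delta x), \\
\frac{\mathrm{d}}{\mathrm{d}s}\, e^{A(s)}
  &= \int_0^1 e^{(1-r)A(s)}\, A'(s)\, e^{rA(s)} \,\mathrm{d}r, \\
\bigl\| e^{A(1)} - e^{A(0)} \bigr\|_{\mathrm{HS}}
  &\le \int_0^1\!\!\int_0^1
     \bigl\| e^{(1-r)A(s)} \bigr\|_{\mathrm{op}}\,
     \bigl\| A'(s) \bigr\|_{\mathrm{HS}}\,
     \bigl\| e^{rA(s)} \bigr\|_{\mathrm{op}} \,\mathrm{d}r\,\mathrm{d}s \\
  &= \bigl\| M(\Delta y - \Delta x) \bigr\|_{\mathrm{HS}}
   \;\le\; \|M\|_{\mathrm{op}}\, |\Delta y - \Delta x| .
\end{align*}
```

Here $e^{A(0)}$ and $e^{A(1)}$ are the unitary features of the two linear paths, and each $A(s)$ is anti-Hermitian as a real convex combination of anti-Hermitian matrices, so both exponential factors are unitary and have operator norm 1.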
Lemma B.5 (Subadditivity of unitary feature).
Let be BV-paths, and be the unitary feature associated with . For any we have
Proof.
We apply the multiplicative property of unitary feature in Lemma A.5, the triangle inequality, and the unitary invariance of the Hilbert–Schmidt norm to estimate that
∎
Proposition B.6.
For , the unitary feature with satisfies
where denotes the total variation over of the path .
Proof.
Given BV-paths with the same initial point, there are piecewise linear approximations , with common partition converging respectively to and in the -variation metric; . Applying Lemma B.5 recursively, we obtain that
By definition of unitary feature and Lemma B.4, one deduces that
We may now conclude by taking supremum over all partitions and sending with the limit. This is a consequence of continuity of the Itô map with respect to the driving path in -variation topology, since the vector field in (4) is Lipschitz ([27, Theorem 1.20]). ∎
Theorem B.7 (Dependence on continuous parameter).
Let and be subsets of , be a metric space, be a Borel probability measure on , and be a Borel probability measure on . Assume that , is Lipschitz in such that . In addition, suppose that and . Then is Lipschitz in .
Moreover, it holds that
for any , , , and .
Proof.
The unitary feature is universal in the spirit of the Stone–Weierstrass theorem; i.e., continuous functions on paths can be uniformly approximated by linear functionals on unitary features.
As metrises the weak topology on the space of path-valued random variables, it emerges as a more sensible distance metric for training time series generators than metrics without this property; e.g., the Jensen–Shannon divergence.
Theorem B.8 (Metrisation of weak-star topology).
Let be a compact subset. Suppose that is a countable dense subset in . Then defined by Eq. (13) metrises the weak-star topology on . That is, as , where denotes convergence in distribution of random variables.
The metrisability of follows from general theorems in functional analysis: is a compact metric space, hence is separable ([13, Lemma 3.23]). Then, viewing as the unit circle in via Riesz representation, we infer from [13, Proposition 3.24] that is metrisable in the weak-star topology, which is equivalent to the distributional convergence of random variables.
Proof.
The backward direction is straightforward. By the Riesz representation theorem of Radon measures, the distributional convergence is equivalent to that for all continuous . Thus , namely that for each . The unitary feature is bounded as it is -valued for some , so we deduce from the dominated convergence theorem that .
Conversely, suppose that . Then
for any and , in particular for those with full support. In view of the universality Theorem A.8 proved above, for any fixed and any continuous function , by approximating with sum of finitely many (the notations are as in Theorem A.8), one infers that for and sufficiently large, it holds that
By considering those measures with , we deduce that
This is tantamount to the distributional convergence. ∎
Proof.
We first prove the ’if’ direction of the statement. By the Portmanteau theorem [21], convergence in distribution implies, for any bounded continuous map , we have . Therefore, for any , , which implies as . Hence, it follows that, as ,
which completes the proof of ’if’ direction.
Now we proceed with the ’only if’ direction. By the universality of the unitary path development from Theorem A.8, for any continuous function and , there exist and such that
(14) |
or equivalently. For simplicity, we denote and as and respectively. Therefore,
(15) | ||||
(16) | ||||
(17) |
where the operator norm. Since as and is arbitrary, for any continuous bounded function , which implies by the Portmanteau theorem [21]. ∎
B.3 Relation with MMD
We now discuss linkages between PCFD and MMD (maximum mean discrepancy) defined over , the space of Borel probability measures (equivalently, probability distributions) on .
Definition B.9.
Given a kernel function , the MMD associated to is the function given as follows: for independent random variables on , set
The PCFD can be interpreted as an MMD on measures of the path space with a specific kernel. Compare with [40] for the case of .
Proposition B.10 (PCFD as MMD).
Given and -valued random variables and with induced distributions and , resp. Then with kernel .
Throughout, designates concatenation of paths and is the path obtained by running backwards. The operation on the path space is analogous to on . If , then is the null path. See the Appendix for proofs and further discussions.
Remark B.11 (Computational cost complexity).
By Proposition B.10, PCFD is an MMD. However, to compute EPCFD, we may directly calculate the expected distance between the PCFs, without going over the kernel calculations in the MMD approach. Our method is significantly more efficient, especially for large datasets. The computational complexity of EPCFD is linear in sample size, whereas the MMD approach is quadratic.
Proof.
By definition of PCFD, we have
where and , respectively. Using Fubini’s theorem and observing that (as and are -valued, they indeed lie in as is a compact Lie group under the Hilbert–Schmidt metric), we deduce that
The first equality then follows from the identification and the definition of .
On the other hand, by Lemma A.5 and the definition of the Hilbert–Schmidt inner product on , one may rewrite the kernel function as follows:
where denotes the concatenation of paths. The second equality now follows. ∎
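The cost comparison in Remark B.11 and the identity in Proposition B.10 can be checked numerically on empirical measures. The sketch below is illustrative only. It assumes piecewise linear paths stored as arrays, linear maps stored as stacks of anti-Hermitian matrices, an unnormalised Hilbert–Schmidt inner product, and the biased (V-statistic) MMD estimator; all function names are ours.

```python
import numpy as np

def exp_anti_hermitian(a):
    """exp(A) for anti-Hermitian A, via the Hermitian matrix -iA."""
    w, v = np.linalg.eigh(-1j * a)
    return (v * np.exp(1j * w)) @ v.conj().T

def unitary_feature(path, generators):
    """Ordered product of exp(M(dx)) over the increments dx of `path`."""
    y = np.eye(generators.shape[-1], dtype=complex)
    for dx in np.diff(path, axis=0):
        y = y @ exp_anti_hermitian(np.tensordot(dx, generators, axes=1))
    return y

def features(paths, maps):
    """Shape (k, N, n, n): feature of each of N paths under each of k maps."""
    return np.array([[unitary_feature(p, m) for p in paths] for m in maps])

def epcfd_squared(xs, ys, maps):
    """Mean over maps of the squared HS norm of the difference of mean features.

    Only N features per sample set are averaged: linear in sample size.
    """
    diff = features(xs, maps).mean(axis=1) - features(ys, maps).mean(axis=1)
    return float(np.mean(np.sum(np.abs(diff) ** 2, axis=(1, 2))))

def mmd_squared(xs, ys, maps):
    """Biased MMD^2 with k(x, y) = mean over maps of Re tr(U(x) U(y)^*).

    Every pair of paths is visited: quadratic in sample size.
    """
    fx, fy = features(xs, maps), features(ys, maps)

    def gram_mean(a, b):
        # tr(a_i b_j^*) for every pair (i, j) and every map, then averaged.
        return np.einsum("kipq,kjpq->kij", a, b.conj()).real.mean()

    return float(gram_mean(fx, fx) + gram_mean(fy, fy) - 2 * gram_mean(fx, fy))

rng = np.random.default_rng(0)
d, n, k = 2, 3, 4
z = rng.normal(size=(k, d, n, n)) + 1j * rng.normal(size=(k, d, n, n))
maps = (z - np.swapaxes(z.conj(), -1, -2)) / 2           # k maps of d anti-Hermitian matrices
xs = rng.normal(size=(8, 10, d)).cumsum(axis=1)          # 8 random-walk paths
ys = 2.0 * rng.normal(size=(8, 10, d)).cumsum(axis=1)    # 8 paths with larger steps
print(epcfd_squared(xs, ys, maps), mmd_squared(xs, ys, maps))
```

The two printed numbers agree up to floating-point error, while the first is obtained without forming any pairwise kernel matrix.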
B.4 Empirical PCFD
B.4.1 Initialisation of
A linear map can be canonically represented by independent anti-Hermitian matrices . To sample the empirical distribution of from , we propose a sampling scheme over . This can also be used as an effective initialisation of model parameters for the empirical measure of .
In practice, when working with the Lie algebra , i.e., the vector space of complex-valued matrices that are anti-Hermitian (, where is the transpose conjugate of ), we view each anti-Hermitian matrix as an real matrix via the isomorphism of -vector spaces .
Under the above identification, we have the decomposition
(18) |
where is the Lie algebra of anti-symmetric real matrices, is the space of real symmetric matrices, consists of real diagonal matrices and denotes the quotient space of real symmetric matrices by the real diagonal matrices.
The sampling procedure of , is given as follows. First, we simulate valued and i.i.d. random variables and , whose elements are i.i.d. and satisfy the pre-specified distribution in . We have the decomposition , where $D$ and $E$ are a diagonal random matrix and an off-diagonal random matrix, respectively. Then we construct the anti-symmetric matrix and matrix in the quotient space , , and diagonal matrix . Correspondingly, we simulate -valued random variables by virtue of Eq. (18). As the empirical measure of the can be fully characterised by the model parameters , we sample i.i.d. samples which take values in .
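A minimal sketch of such a sampling scheme is given below. It follows the decomposition in Eq. (18) in spirit only: the Gaussian entries, the $1/\sqrt{2}$ normalisation, the way the two real matrices are assigned to the three components, and the function names are all assumptions made for illustration.

```python
import numpy as np

def sample_anti_hermitian(n, rng, scale=1.0):
    """Sample one n x n anti-Hermitian matrix M = A + i(B + D).

    A is real anti-symmetric, B is real symmetric with zero diagonal and
    D is real diagonal, mirroring the three components of the decomposition.
    """
    r1 = rng.normal(scale=scale, size=(n, n))
    r2 = rng.normal(scale=scale, size=(n, n))
    d = np.diag(np.diag(r2))            # diagonal part of r2
    e = r2 - d                          # off-diagonal part of r2
    a = (r1 - r1.T) / np.sqrt(2)        # anti-symmetric component
    b = (e + e.T) / np.sqrt(2)          # symmetric, zero-diagonal component
    return a + 1j * (b + d)

def sample_linear_maps(k, d, n, seed=0):
    """Sample k linear maps from R^d, each stored as d anti-Hermitian n x n matrices."""
    rng = np.random.default_rng(seed)
    return np.array([[sample_anti_hermitian(n, rng) for _ in range(d)]
                     for _ in range(k)])

maps = sample_linear_maps(k=4, d=3, n=5)
print(maps.shape)  # (k, d, n, n)
```

The resulting array can serve either as a fixed set of evaluation points or as the initial value of trainable parameters.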
B.4.2 Hypothesis test
In the following, we illustrate the efficacy of the proposed trainable EPCFD metric in the context of the hypothesis test on stochastic processes.
Example B.12 (Hypothesis testing on fractional Brownian motion).
Consider the 3-dimensional Brownian motion and the fractional Brownian motion with Hurst parameter $H$. We simulated 5000 sample paths of each process with 50 discretized time steps. We apply the proposed optimized EPCFD metric to the two-sample testing problem: the null hypothesis that the two processes have the same law against the alternative that they do not. We compare the optimized EPCFD metric with the EPCFD metric with the prespecified distribution (PCF) and the characteristic function distance (CF) on the flattened time series [25]. The optimized PCFs are trained on a separate set of 5000 training samples to maximise the PCFD. The details of training can be found in Section C.2.
We conduct the permutation test to compute the power of the test (i.e. the probability of correctly rejecting a false null hypothesis) and the Type I error (i.e. the probability of falsely rejecting a true null hypothesis) for varying $H$. Note that when $H = 0.5$, the two processes have the same distribution and hence are indistinguishable. Therefore, the better the test metric, the closer the test power should be to 0 when $H$ is close to 0.5, and the closer it should be to 1 when $H$ is away from 0.5. We refer to [22] for more in-depth information on hypothesis testing and permutation test statistics.
The plot of the test power and Type I error in Figure 6 shows that CF fails in the two-sample tests, whilst both EPCFD and optimised EPCFD can distinguish the samples from the two stochastic processes when $H \neq 0.5$. This indicates that the EPCFD captures the distribution of time series much more effectively than the conventional CF metric. Moreover, optimization of EPCFD increases the test power while decreasing the Type I error, particularly when $H$ is closer to 0.5.
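The permutation test used in this example can be sketched generically as follows. The statistic is passed in as a function. Here `mean_gap` is a toy stand-in of our own, not the EPCFD, and the number of permutations is an arbitrary illustrative choice.

```python
import numpy as np

def permutation_test(xs, ys, statistic, n_permutations=200, seed=0):
    """Permutation p-value for the null hypothesis that xs and ys share one law.

    `statistic(xs, ys)` must return a scalar that is large when the two
    samples look different.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([xs, ys], axis=0)
    n = len(xs)
    observed = statistic(xs, ys)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        exceed += statistic(pooled[perm[:n]], pooled[perm[n:]]) >= observed
    # The +1 terms count the observed split as one of the permutations.
    return (exceed + 1) / (n_permutations + 1)

def mean_gap(xs, ys):
    """Toy statistic: distance between the sample means."""
    return float(np.linalg.norm(xs.mean(axis=0) - ys.mean(axis=0)))

rng = np.random.default_rng(1)
p_same = permutation_test(rng.normal(size=(100, 5)), rng.normal(size=(100, 5)), mean_gap)
p_shifted = permutation_test(rng.normal(size=(100, 5)), 1.0 + rng.normal(size=(100, 5)), mean_gap)
print(p_same, p_shifted)
```

Repeating the test over many independent sample pairs and recording how often the p-value falls below the chosen significance level estimates the test power when the laws differ, and the Type I error when they coincide.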
Appendix C Numerical experiments
C.1 Experimental detail on PCF-GAN
C.1.1 General notes
Code. The code for reproducing all experiments can be found at https://github.com/DeepIntoStreams/PCF-GAN.
Software. We conducted all experiments using PyTorch 1.13.1 [34] and performed hyperparameter tuning with Wandb [5]. To ensure reproducibility, we implemented benchmark models based on open-source code from [45, 43, 12]. We used the Ksig library [42] to calculate the Sig-MMD metrics. The code from [25] was used to compute the characteristic function distance in Example B.12.
Computing infrastructure. The experiments were performed on a computational system running Ubuntu 22.04.2 LTS, comprising three Quadro RTX 8000 and two RTX A6000 GPUs. Each experiment was run independently on a single GPU, with the training phase taking between 6 hours and 3 days, depending on the dataset and models used.
Architectures. To ensure a fair comparison, we employed identical network architectures, with two layers of LSTMs having 32 hidden units, for both the generator and discriminator across all models. For the generator, the output of the LSTM (full sequence) was passed through a Tanh activation function and a linear output layer. All generative models take a multi-dimensional discretized Brownian motion as the noise distribution, scaling it to ensure values were controlled within the range . The dimension and scaling factor varied based on the dataset and were specified in the individual sections as below.
The PCF-GAN uses the development layers on the unitary matrix [26] to calculate the PCFD distance. For all experiments, we fixed the unitary matrix size and coefficient for the regularization loss to 10 and 1, respectively. The number of unitary linear maps and the coefficient of the recovery loss were determined via hyper-parameter tuning, which varied depending on the dataset (see individual section for details).
Regarding TimeGAN, we followed the approach described in [45] and employed embedding, supervisor, and recovery modules. Each of these modules had two layers of LSTMs with 32 hidden units. For COT-GAN, we used two separate modules for discriminators, each with two layers of LSTMs with 32 hidden units. Based on the recommendation from COT-GAN [43] and informal hyperparameter tuning, we set and for all experiments.
Optimisation & training. We used the ADAM optimizer for all experiments [20], with a learning rate of 0.001 for both generators and discriminators. The learning rate for the unitary development network is 0.005. The initial decay rates in the ADAM optimizer are set , . The discriminator was trained for two iterations per iteration of the generator’s training. For TimeGAN, we followed the training scheme for each module as suggested in the original paper. The batch size was 64 for all experiments. These hyperparameters do not substantially affect the results.
To improve the training stability of the GAN, we employed three techniques. Firstly, we applied a constant exponential decay rate of 0.97 to the learning rate every 500 generator training iterations. Secondly, we clipped the norm of gradients in both generator and discriminator to 10. Thirdly, we used the Cesàro mean of the generator weights after certain iterations to improve the performance of the final model, as suggested by [44]. In all cases, we selected the number of training iterations such that all methods could produce stable generative samples. The optimal number of training iterations and weight averaging scheme varied for each dataset. More details can be found in the respective sections.
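The three stabilisation techniques can be sketched in a framework-agnostic way as below. This is not the training code of the paper: parameters and gradients are represented as plain NumPy arrays, and the helper names are ours. The numerical constants are the ones quoted in the text.

```python
import numpy as np

def learning_rate(iteration, base_lr=0.001, decay=0.97, every=500):
    """Step-wise exponential decay: multiply by `decay` once per `every` iterations."""
    return base_lr * decay ** (iteration // every)

def clip_by_norm(grads, max_norm=10.0):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    factor = min(1.0, max_norm / (total + 1e-12))
    return [g * factor for g in grads]

class RunningAverage:
    """Cesàro (uniform) mean of parameter snapshots, updated incrementally."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, params):
        params = np.asarray(params, dtype=float)
        self.count += 1
        if self.mean is None:
            self.mean = params.copy()
        else:
            self.mean += (params - self.mean) / self.count
        return self.mean

# Average the weights over the final iterations only, e.g. the last 3 of 5.
weights_per_iteration = [np.full(2, float(i)) for i in range(5)]
average = RunningAverage()
for w in weights_per_iteration[-3:]:
    average.update(w)
print(learning_rate(1000), average.mean)
```

Restricting the average to a final window, as in the usage lines, corresponds to the per-dataset settings reported in the following subsections.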
Test metrics. Discriminative score. The network architecture of the post-hoc classifier consists of two layers of LSTMs with 16 hidden units. The dataset was split into equal proportions of real and generated time series with labels 0 and 1, with an / train/test split for training and evaluation. The discriminative model was trained for 30 epochs using Adam with a learning rate of 0.001 and a batch size of 64. The best classification error on the test set was reported.
Predictive score. The network architecture of the post-hoc sequence-to-sequence regressor consists of two layers of LSTMs with 16 hidden units. The model was trained on the generated time series and evaluated on the real time series, using the first of the time series to predict the last . The predictive model was trained for 50 epochs using Adam with a learning rate of 0.001 and a batch size of 64. The best mean squared error on the test set was reported.
Sig-MMD. We directly computed the Sig-MMD by taking inputs of the real time series samples and generated time series samples. We used the radial basis function kernel applied to the truncated signature feature up to depth .
C.2 Time dependent Ornstein-Uhlenbeck process
On this dataset, we experimented with the basic version of PCF-GAN, which only utilized the EPCFD as the discriminator without the autoencoder structure. The batch size was 256. The model was trained for 20000 generator training iterations, and weight averaging on the generator was performed over the final 4000 generator training iterations. We used the 2-dimensional discretized Brownian motion as the noise distribution.
C.2.1 Rough volatility model
We followed [31] in considering a rough stochastic volatility model for an asset price process , which satisfies the following stochastic differential equation,
(19) | ||||
(20) |
where denotes the forward variance and denotes the fractional Brownian motion (fBM) given by
where are (possibly correlated) Brownian motions. In our experiments, the synthetic dataset is sampled from Equation 19 with , , , and initial condition . Each sample path is sampled uniformly from with the time discretization , which consists of 200 time steps. We train the generators to learn the joint distribution of the log price and log volatility.
All methods were trained for 30000 generator training iterations, and weight averaging on the generator was performed over the final 5000 generator training iterations. The input noise vectors have 5 dimensions and 200 time steps.
For PCF-GAN, the coefficient for the recovery loss was 50, and the number of unitary linear maps was 6.
C.2.2 Stocks
We selected 10 large market cap stocks, which are Google, Apple, Amazon, Tesla, Meta, Microsoft, Nvidia, JP Morgan, Visa and P&G, from 2013 to 2021. The dataset consists of 5 features, including daily open, close, high, low prices and volume, available at https://finance.yahoo.com/lookup. We truncated the long stock time series into segments of 20 days. The data were normalized with standard Min-Max normalisation on each feature channel. The Stock dataset used in our study is similar to the one employed in [25] but with a broader range of assets. Unlike the previous approach, we avoided sampling the time series using rolling windows with a stride of 1 to mitigate the presence of strong dependencies between samples.
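The preprocessing described above can be sketched as follows. This is an illustration under assumptions: the raw data for one asset is an array of shape (days, features), the windows are taken with stride equal to the window length, and the Min-Max statistics are computed per feature channel over all windows. The order of truncation and normalisation is not stated in the text, and the function names are ours.

```python
import numpy as np

def non_overlapping_windows(series, length):
    """Cut a (T, d) series into consecutive windows of `length` steps.

    Windows do not overlap (stride = length); a trailing remainder is dropped.
    """
    series = np.asarray(series, dtype=float)
    n = len(series) // length
    return series[: n * length].reshape(n, length, series.shape[1])

def min_max_normalise(windows):
    """Rescale each feature channel to [0, 1] using its global min and max."""
    lo = windows.min(axis=(0, 1), keepdims=True)
    hi = windows.max(axis=(0, 1), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant channels
    return (windows - lo) / span

rng = np.random.default_rng(0)
prices = 100.0 + rng.normal(size=(250, 5)).cumsum(axis=0)   # synthetic stand-in data
windows = min_max_normalise(non_overlapping_windows(prices, 20))
print(windows.shape)
```

With a stride of 1 the same series would yield 231 heavily overlapping windows instead of 12 disjoint ones, which is the dependence between samples that the text seeks to avoid.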
All methods were trained for 30000 generator training iterations, and weight averaging on the generator was performed over the final 5000 generator training iterations. The input noise vectors have 5 feature dimensions and 20 time steps.
For PCF-GAN, the coefficient for the recovery loss was 400, and the number of unitary linear maps was 6.
C.2.3 Beijing Air Quality
We used a dataset of the air quality in Beijing from the UCI repository [47], available at https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data. Each sample is a 10-dimensional time series of the SO2, NO2, CO, O3, PM2.5, PM10 concentrations, temperature, pressure, dew point temperature and wind speed. Each time series is recorded hourly over the course of a day. The data were normalized with standard Min-Max normalisation on each feature channel.
All methods were trained for 20000 generator training iterations, and weight averaging on the generator was performed over the final 4000 generator training iterations. The input noise vectors have 5 dimensions and 24 time steps.
For PCF-GAN, the coefficient for the recovery loss was 50, and the number of unitary linear maps was 6.
C.2.4 EEG
We obtained the EEG eye state dataset from https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State. The data come from one continuous EEG measurement on 14 variables with 14980 time steps. We truncated the long time series into shorter ones with 20 time steps. We subtracted the channel-wise mean from the data, divided by three times the channel-wise standard deviation, and then passed the result through a tanh nonlinearity.
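The normalisation described above amounts to a single expression per channel. The sketch below applies it to synthetic data of the same shape as the EEG recording; normalising before truncating, and using non-overlapping windows, are assumptions on our part, as are the function names.

```python
import numpy as np

def squash_standardise(series):
    """Channel-wise tanh((x - mean) / (3 * std)) for a (T, d) series."""
    series = np.asarray(series, dtype=float)
    mean = series.mean(axis=0, keepdims=True)
    std = series.std(axis=0, keepdims=True)
    return np.tanh((series - mean) / (3.0 * std))

def to_windows(series, length):
    """Consecutive non-overlapping windows of `length` steps; remainder dropped."""
    n = len(series) // length
    return series[: n * length].reshape(n, length, series.shape[1])

rng = np.random.default_rng(0)
raw = rng.normal(loc=4000.0, scale=50.0, size=(14980, 14))  # synthetic stand-in
samples = to_windows(squash_standardise(raw), 20)
print(samples.shape)
```

Dividing by three standard deviations keeps most values in the near-linear regime of tanh, while the nonlinearity bounds outliers within (-1, 1).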
All methods were trained for 30000 generator training iterations, and weight averaging on the generator was performed over the final 5000 generator training iterations. The input noise vectors have 8 dimensions and 20 time steps.
For PCF-GAN, the coefficient for the recovery loss was 50, and the number of unitary linear maps was 8.
Appendix D Supplementary results
D.1 Ablation study
An ablation study was conducted on the PCF-GAN model to evaluate the importance of its various components. Specifically, the reconstruction loss and regularization loss were disabled in order to assess their impact on model performance across benchmark datasets and various test metrics. Table 3 shows that the full PCF-GAN model outperformed the ablated versions in most settings, confirming the significance of these two losses in the overall model performance.
Dataset | Test metric | PCF-GAN | w/o reconstruction loss | w/o regularization loss | w/o reconstruction & regularization losses
RV | Discriminative | .0108±.006 | .0178±.017 | .0152±.020 | .0101±.007
RV | Predictive | .0390±.000 | .0389±.000 | .0390±.003 | .0391±.001
RV | Sig-MMD | .0024±.001 | .0037±.001 | .0036±.002 | .0027±.001
Stock | Discriminative | .0784±.028 | .0963±.011 | .2538±.052 | .0815±.001
Stock | Predictive | .0125±.000 | .0123±.000 | .0127±.000 | .0126±.001
Stock | Sig-MMD | .0017±.000 | .0062±.002 | .0024±.001 | .0021±.001
Air | Discriminative | .2326±.058 | .3940±.068 | .4783±.029 | .3875±.009
Air | Predictive | .0237±.000 | .0239±.000 | .0283±.001 | .0240±.000
Air | Sig-MMD | .0126±.005 | .0111±.003 | .0232±.004 | .0163±.004
EEG | Discriminative | .3660±.025 | .4942±.010 | .5000±.000 | .4649±.015
EEG | Predictive | .0246±.000 | .0299±.000 | .0636±.007 | .0248±.000
EEG | Sig-MMD | .0180±.004 | .0296±.008 | 1.197±.234 | .0278±.007
Notably, including the two additional losses significantly improved model performance on the high-dimensional time series datasets, Air Quality and EEG, indicating that the proposed auto-encoder architecture effectively learns meaningful low-dimensional sequential embeddings. Conversely, using the reconstruction loss alone led to a notable decrease in model performance, suggesting that a samplewise distance might not be suitable for time series data. The additional regularization loss mitigates this issue by confining the sequential embedding space to a predetermined noise space, such as the discretized Brownian motion.
D.2 Generated samples
In this section, we present random samples from the four benchmark datasets generated by PCF-GAN, TimeGAN, RGAN, and COT-GAN. Although sample plots of generated time series are difficult to interpret, we observe that PCF-GAN generates time series that capture the temporal dependencies exhibited in the original time series across all datasets. Conversely, COT-GAN generates trajectories that are smoother than the real time series samples, as shown for the Stock and EEG datasets in Figure 8 and Figure 10, respectively. Figure 10 also shows that TimeGAN occasionally produces samples with higher oscillations than those found in the real samples.
D.3 Reconstructed samples
In this section, we present additional reconstructed time series samples generated by PCF-GAN and TimeGAN. Figure 11 illustrates that PCF-GAN consistently outperforms TimeGAN by producing higher-quality reconstructed samples across all datasets.
D.4 Test metrics on (auto-)correlation and marginal distribution
This subsection reports supplementary test metrics that measure how well each model fits the autocorrelation, cross-correlation, and marginal distribution of the real data, as presented in Table 4. The table shows that our proposed PCF-GAN outperforms the benchmarking models on most metrics across all datasets.
Task | Generation | | | | |
---|---|---|---|---|---|
Dataset | Test Metrics | RGAN | COT-GAN | TimeGAN | PCF-GAN |
RV | Auto-cor (lag 1) | .0393±.001 | .0608±.001 | .0031±.001 | .0022±.000 |
RV | Auto-cor (lag 5) | .0134±.002 | .119±.002 | .0035±.002 | .0030±.002 |
RV | Cross-cor (lag 0) | .0193±.007 | .0234±.002 | .0187±.011 | .0264±.011 |
RV | Cross-cor (lag 5) | .0222±.007 | .1441±.012 | .0219±.010 | .0158±.011 |
RV | Marginal Dist | .3111±.13 | .2157±.306 | .1636±.223 | .1234±.126 |
Stock | Auto-cor (lag 1) | .127±.005 | .202±.0035 | .210±.005 | .0123±.005 |
Stock | Auto-cor (lag 5) | .149±.009 | .267±.006 | .104±.006 | .0187±.006 |
Stock | Cross-cor (lag 0) | .145±.031 | .169±.041 | .549±.034 | .1815±.058 |
Stock | Cross-cor (lag 5) | .341±.031 | .456±.053 | .747±.038 | .2510±.062 |
Stock | Marginal Dist | .3276±.044 | .2826±.061 | .4264±.063 | .2730±.033 |
Air | Auto-cor (lag 1) | .1678±.010 | .320±.006 | .1949±.006 | .0927±.003 |
Air | Auto-cor (lag 5) | .3226±.016 | .520±.028 | .5349±.034 | .4739±.023 |
Air | Cross-cor (lag 0) | 2.608±.106 | 1.942±.059 | 2.844±.0812 | 2.687±.149 |
Air | Cross-cor (lag 5) | 3.181±.101 | 2.176±.116 | 2.536±.112 | 2.115±.121 |
Air | Marginal Dist | .5527±.523 | .5142±.600 | .6229±.595 | .5066±.572 |
EEG | Auto-cor (lag 1) | 5.918±.116 | 6.202±.111 | 5.754±.083 | 5.668±.079 |
EEG | Auto-cor (lag 5) | 4.285±.074 | 5.911±.107 | 5.265±.083 | 4.467±.127 |
EEG | Cross-cor (lag 0) | 51.16±.508 | 24.12±.702 | 26.84±.638 | 22.27±.550 |
EEG | Cross-cor (lag 5) | 47.97±.354 | 31.31±.920 | 25.95±.466 | 19.43±.412 |
EEG | Marginal Dist | 15.18±21.94 | 8.518±13.6 | 13.35±21.7 | 10.09±16.6 |