
Right on Time: Revising Time Series Models by Constraining their Explanations

Maurice Kraus
AI and ML Group
Technical University of Darmstadt
maurice.kraus@cs.tu-darmstadt.de
David Steinmann
AI and ML Group
Technical University of Darmstadt
david.steinmann@tu-darmstadt.de
Antonia Wüst
AI and ML Group
Technical University of Darmstadt
Andre Kokozinski
Technical University of Darmstadt
&Kristian Kersting
AI and ML Group
Technical University of Darmstadt
Centre for Cognitive Science
Hessian Center for AI (hessian.AI),
German Center for AI (DFKI)
Equal contribution
Abstract

The reliability of deep time series models is often compromised by their tendency to rely on confounding factors, which may lead to incorrect outputs. Our newly recorded, naturally confounded dataset named P2S from a real mechanical production line emphasizes this. To avoid “Clever-Hans” moments in time series, i.e., to mitigate confounders, we introduce the method Right on Time (RioT). RioT enables, for the first time, interactions with model explanations across both the time and frequency domain. Feedback on explanations in both domains is then used to constrain the model, steering it away from the annotated confounding factors. The dual-domain interaction strategy is crucial for effectively addressing confounders in time series datasets. We empirically demonstrate that RioT can effectively guide models away from the wrong reasons in P2S as well as in popular time series classification and forecasting datasets.

1 Introduction

Time series data is ubiquitous in our world today. Everything that is measured over time generates some form of time series, for example, energy load (Koprinska et al., 2018), sensor measurements in industrial machinery (Mehdiyev et al., 2017) or recordings of traffic data (Ma et al., 2022). Various neural models are often applied to complex time series data (Ruiz et al., 2021; Benidis et al., 2023). As in other domains, these can be subject to confounding factors ranging from simple noise or artifacts to complex shortcut confounders (Lapuschkin et al., 2019). Intuitively, a confounder, also called a “Clever-Hans” moment, is a pattern in the data that is not relevant for the task but correlates with it during model training. A model can incorrectly pick up on this confounder (cf. Fig. 1) and use it instead of the relevant features, e.g., to make a classification. A confounded model does not generalize well to data without the confounder, which is a problem when employing models in practice (Geirhos et al., 2020). For time series, confounders and their mitigation have so far received little attention, and existing works make specific assumptions about settings and data (Bica et al., 2020).

In particular, the mitigation of shortcut confounders, i.e., spurious patterns in the training data used for the prediction, is essential. If a model utilizes confounding factors in the training set, its decision relies on wrong reasons, which makes it fail to generalize to data without the confounder. Model explanations play a crucial role in uncovering confounding factors, but they are not enough on their own to address them. While an explanation can reveal that the model relies on incorrect factors, it does not alter the model’s outcome. To change this, we introduce Right on Time (RioT), a new method following the core ideas of explanatory interactive learning (XIL) (Teso and Kersting, 2019), i.e., using feedback on explanations to mitigate confounders. RioT uses traditional explanation methods like Integrated Gradients (IG) (Sundararajan et al., 2017) to detect whether models focus on the right or the wrong time steps and utilizes feedback on the latter to revise the model (Fig. 1, left).

Confounding factors in time series data are not limited to the time domain. A steady noise frequency in an audio signal can also be a confounder but cannot be pinned to a specific time step. To handle these kinds of confounders, RioT can also incorporate feedback in the frequency domain (Fig. 1, right). To further emphasize the importance of mitigating confounders in time series data, we introduce a new real-world, confounded dataset called Production Press Sensor Data (P2S). The dataset consists of sensor measurements from an industrial high-speed press, which is part of many important manufacturing processes in the sheet metal working industry. The sensor data used to detect faulty production is naturally confounded and thus leads to incorrect predictions after training. Beyond its immediate industrial relevance, P2S is the first time series dataset that contains explicitly annotated confounders, enabling evaluation and comparison of confounder mitigation strategies on real data.

Figure 1: Left: Expert annotations marking the wrong reasons (red) are given based on model explanations to mitigate spatial confounders. The application of RioTsp revises the model, whose adjusted explanations reveal that its output is based on the right reasons (blue) instead. Right: As confounders can also occur in the frequency domain, RioTfreq can use explanations and feedback in this domain as well. Similar to the spatial domain, it guides the model’s focus, mitigating the influence of the confounders on the target prediction.

Altogether, we make the following contributions: (1) We show both on our newly introduced real-world dataset P2S and on several other manually confounded datasets that SOTA neural networks for time series classification and forecasting can be affected by confounders. (2) We introduce RioT to mitigate confounders in time series data. The method can incorporate feedback not only in the time domain but also in the frequency domain. (3) By incorporating explanations and feedback in the frequency domain, we enable a new perspective on XIL, overcoming the important limitation that confounders must be spatially separable. Code is available at https://github.com/ml-research/RioT.

The remainder of the paper is structured as follows. In Sec. 2, a brief overview of related work on explanations for time series and revising model mistakes is given. In Sec. 3, we introduce our method before providing an in-depth evaluation and discussion of the results in Sec. 4. Lastly, in Sec. 5, we conclude the paper and provide some potential avenues for future research.

2 Related Work

Explanations for Time Series. Within the field of explainable artificial intelligence (XAI), various techniques to explain machine learning models and their outcomes have been proposed. While many techniques originate from image or text data, they were quickly adapted to time series (Rojat et al., 2021). While backpropagation- and perturbation-based attribution methods provide explanations directly in the input space, other techniques like symbolic aggregation (Lin et al., 2003) or shapelets (Ye and Keogh, 2011) aim to provide higher-level explanations. For a more in-depth discussion of explanations for time series, we refer to the surveys by Rojat et al. (2021) and Schlegel et al. (2019). While explanation methods are essential for detecting confounding factors, they alone are insufficient to revise a model. Thus, explanations are the starting point of our method, as they enable users to detect confounders and provide feedback to overcome them. In particular, we build upon Integrated Gradients (IG) (Sundararajan et al., 2017), which computes attributions for the input by utilizing model gradients. We selected it because of its desirable properties, such as completeness and implementation invariance, and its wide use, including for time series data (Mercier et al., 2022; Veerappa et al., 2022).

Explanatory Interactive Learning (XIL). While not prevalent for time series data, there has been some work on confounders and how to overcome them in other domains, primarily the image domain. Most notably, there is explanatory interactive learning, which describes the general process of revising a model’s decision process based on human feedback (Teso and Kersting, 2019; Schramowski et al., 2020). Within XIL, the model’s explanations are used to incorporate the feedback into the model, thus revising its mistakes (Friedrich et al., 2023a). XIL can be applied to models that show Clever-Hans-like behavior (being affected by shortcuts in the data) to prevent them from using these shortcuts (Stammer et al., 2020). Several methods apply the idea of XIL to image data. For example, Right for the Right Reasons (RRR) (Ross et al., 2017) and Right for Better Reasons (RBR) (Shao et al., 2021) use human feedback as a penalty mask on model explanations. Instead of penalizing wrong reasons, HINT (Selvaraju et al., 2019) rewards the model for focusing on the correct part of the input. Furthermore, Friedrich et al. (2023b) investigate the use of multiple explanation methods simultaneously. Although various XIL methods are employed to address confounders in image data, their application to time series data remains unexplored. To bridge this gap, we introduce RioT, a method that adapts the core principles of XIL to the unique characteristics of time series data.

Unconfounding Time Series. Besides approaches from interactive learning, there is also other work on unconfounding time series models. This line of work is generally based on causal analysis of the time series model and data (Flanders et al., 2011). Methods like the Time Series Deconfounder (Bica et al., 2020), SeqDec (Hatt and Feuerriegel, 2024) or LipCDE (Cao et al., 2023) perform estimations on the data while mitigating the effect of confounders in covariates of the target variable. They generally mitigate the effect of the confounders through causal analysis and specific assumptions about the data generation. In contrast, in this work we tackle confounders within the target variate itself and make no further assumption besides that the confounder is visible in the model’s explanations, a setting where these previous methods cannot easily be applied.

3 Right on Time (RioT)

Figure 2: This figure depicts the flow of explanation and revision, beginning with input data $\bm{x}$, through the model $f(\bm{x})$ to explanations $e(\bm{x})$, annotated feedback $a(\bm{x})$, and finally back to the model. IG provides the spatial model explanation, which is transformed with the FFT to obtain the frequency explanation. Expert user annotations can be applied in either or both domains. They are utilized by the right-reason losses ($\mathcal{L}_{RR}^{sp}$ and $\mathcal{L}_{RR}^{freq}$) to guide the model away from confounders in both the time and frequency domains.

The core intuition of Right on Time (RioT) is to utilize human feedback to steer a model away from wrong reasons. It follows the general structure of XIL, which has four main steps (Friedrich et al., 2023a). In Select, instances for feedback and subsequent model revision are selected. Following previous XIL methods, we select all samples by default while not necessarily requiring feedback for all of them. Afterwards, Explain covers how model explanations are generated, before a human provides feedback on the selected instances in Obtain. Lastly, in Revise, the feedback is integrated into the model to overcome the confounders. We introduce RioT along these steps in the following (the entire process is illustrated in Fig. 2). But let us first establish some notation for the remainder of this paper.

Given is a dataset $(\mathcal{X},\mathcal{Y})$ and a model $f(\cdot)$ for time series classification or forecasting. The dataset consists of $D$ pairs of $\bm{x}$ and $\bm{y}$. Thereby, $\bm{x}\in\mathcal{X}$ is a time series of length $T$, i.e., $\bm{x}\in\mathbb{R}^{T}$. For $K$-class classification, the ground-truth output is $\bm{y}\in\{1,\dots,K\}$; for forecasting, the ground-truth output is the forecasting window $\bm{y}\in\mathbb{R}^{W}$ of length $W$. The ground-truth output of the full dataset is denoted $\mathcal{Y}$ in both cases. For a datapoint $\bm{x}$, the model generates the output $\hat{\bm{y}}=f(\bm{x})$, where $\hat{\bm{y}}$ has the same dimensions as $\bm{y}$ for both tasks.

3.1 Explain

Given a pair of input $\bm{x}$ and model output $\hat{\bm{y}}$ for time series classification, the explainer generates an explanation $e_{f}(\bm{x})\in\mathbb{R}^{T}$ in the form of attributions to explain $\hat{\bm{y}}$ w.r.t. $\bm{x}$. For an element of the input, a large attribution value means a large influence on the output. In the remainder of the paper, explanations refer to the model $f$, but we drop $f$ from the notation to declutter it, resulting in $e(\bm{x})$. We use IG (Sundararajan et al., 2017) as an explainer, an established gradient-based attribution method. However, we make some adjustments to the base method to make it more suitable for time series and model revision (Eq. 1, further details in SubSec. A.2). In the following, we introduce the modifications to use attributions for forecasting and to obtain explanations in the frequency domain.

$$e(\bm{x}) = |\bm{x}-\bar{\bm{x}}| \cdot \int_{0}^{1} \left.\frac{\partial f(\tilde{\bm{x}})}{\partial \tilde{\bm{x}}}\right|_{\tilde{\bm{x}}=\bar{\bm{x}}+\alpha(\bm{x}-\bar{\bm{x}})} d\alpha \qquad (1)$$

$$e(\bm{x}) = \frac{1}{W}\sum_{i=1}^{W} e^{\prime}_{i}(\bm{x}) \qquad (2)$$

Attributions for Forecasting. In a classification setting, attributions are generated by propagating gradients back from the model output (of its highest activated class) to the model inputs. However, there is often no single model output in time series forecasting. Instead, the model generates one output for each timestep of the forecasting window simultaneously. Naively, one could use these $W$ outputs and generate as many explanations $e^{\prime}_{1}(\bm{x}),\dots,e^{\prime}_{W}(\bm{x})$. This number of explanations would, however, make it even harder for humans to interpret the results, as the size of the explanation increases with $W$ (Miller, 2019). Therefore, we propose to aggregate the individual explanations by averaging them (Eq. 2). Averaging attributions over the forecasting window provides a simple yet robust aggregation of the explanations. Other means of combining them, potentially weighted by how far the forecast lies in the future, are also conceivable. Overall, this allows attributions for time series classification and forecasting to be generated in the same way.
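To make this concrete, the following sketch approximates Eq. 1 with a simple Riemann sum and averages the per-step attributions as in Eq. 2. It is a minimal illustration, not the exact implementation: the model `model`, the number of integration steps, and the zero baseline are assumptions.

```python
import torch

def integrated_gradients(model, x, output_fn, baseline=None, steps=32):
    # Riemann-sum approximation of Eq. 1; |x - baseline| replaces the signed
    # difference (cf. SubSec. A.2 on the magnitude-only multiplication).
    baseline = torch.zeros_like(x) if baseline is None else baseline
    grad_sum = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        x_interp = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        out = output_fn(model(x_interp))  # scalar to explain (class logit or forecast step)
        grad_sum += torch.autograd.grad(out, x_interp)[0]
    return (x - baseline).abs() * grad_sum / steps

def forecast_attribution(model, x, horizon):
    # Eq. 2: average the per-step explanations e'_1(x), ..., e'_W(x) over the forecast window.
    per_step = [integrated_gradients(model, x, output_fn=lambda y, i=i: y[..., i].sum())
                for i in range(horizon)]
    return torch.stack(per_step).mean(dim=0)
```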

Attributions in the Frequency Domain. Time series data is often given in a frequency representation. Sometimes, this format is more intuitive for humans than the spatial representation. As a result, providing explanations in this domain is essential. Vielhaben et al. (2023) showed how to obtain frequency attributions for Layer-wise Relevance Propagation (Bach et al., 2015), even if the model does not operate directly on the frequency domain. We transfer this idea to IG: for an input sample $\bm{x}$, we generate attributions with IG, resulting in $e(\bm{x})\in\mathbb{R}^{T}$ (Eq. 1 for classification or Eq. 2 for forecasting). We then interpret the explanation as a time series with the attribution scores as values. To obtain the frequency explanation, we apply a Fourier transformation to $e(\bm{x})$, resulting in the frequency explanation $\hat{e}(\bm{x})\in\mathbb{C}^{T}$, with $\hat{E}$ denoting the set of frequency explanations for the entire dataset.
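A minimal sketch of this step, assuming a time-domain attribution tensor `e_x` of shape (batch, T) has already been computed:

```python
import torch

def frequency_attribution(e_x):
    # Treat the attribution scores as a signal and Fourier-transform them;
    # the complex result ê(x) is split into real and imaginary parts for Eq. 4.
    e_hat = torch.fft.fft(e_x, dim=-1)
    return e_hat.real, e_hat.imag
```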

3.2 Obtain

The next step of RioT is to obtain user feedback on confounding factors. For an input $\bm{x}$, a user can mark parts that are confounded, resulting in a feedback mask $a(\bm{x})\in\{0,1\}^{T}$. In this binary mask, a $1$ signals a potential confounder at this time step. It is not necessary to have feedback for every sample of the dataset, as a mask $a(\bm{x})=(0,\dots,0)^{T}$ corresponds to no feedback. Feedback can also be given on the frequency explanation in a similar manner, marking which elements in the frequency domain are potential confounders. The resulting feedback mask $\hat{a}(\bm{x})=(\hat{a}(\bm{x})_{re},\hat{a}(\bm{x})_{im})$ can differ between the real part $\hat{a}(\bm{x})_{re}\in\{0,1\}^{T}$ and the imaginary part $\hat{a}(\bm{x})_{im}\in\{0,1\}^{T}$. For the whole dataset, we then have spatial annotations $A$ and frequency annotations $\hat{A}$. As the annotated feedback masks have to come from human experts, obtaining them can be costly. In many cases, however, confounders occur systematically, and it is therefore possible to apply the same annotation mask to multiple samples. This can drastically reduce the number of annotations required in practice.
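For illustration only, spatial and frequency feedback masks could be constructed as follows; the series length, the marked interval, and the marked frequency bin are hypothetical placeholders:

```python
import torch

T = 500                                  # hypothetical series length
a_sp = torch.zeros(T)                    # a(x): 1 marks a confounded time step
a_sp[120:180] = 1.0                      # expert marks a confounded interval

a_re = torch.zeros(T)                    # â(x)_re over the FFT bins
a_im = torch.zeros(T)                    # â(x)_im over the FFT bins
a_re[50] = a_im[50] = 1.0                # e.g. a known noise frequency

# Since confounders often occur systematically, the same mask can be reused for a whole batch.
batch_a_sp = a_sp.expand(32, T)
```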

3.3 Revise

The last step of RioT is integrating the feedback into the model. We apply the general idea of loss-based model revision (Schramowski et al., 2020; Ross et al., 2017; Stammer et al., 2020) based on the explanations and the annotation mask. Given the input data $(\mathcal{X},\mathcal{Y})$, we define the original task (or right-answer) loss as $\mathcal{L}_{RA}(\mathcal{X},\mathcal{Y})$. This loss measures the model performance and is the primary learning objective. To incorporate the feedback, we further use the right-reason loss $\mathcal{L}_{RR}(A,E)$. This loss aligns model explanations $E=\{e(\bm{x})\,|\,\bm{x}\in\mathcal{X}\}$ and user feedback $A$ by penalizing the model for explanations in the annotated areas. To achieve model revision and good task performance, both losses are combined as $\mathcal{L}(\mathcal{X},\mathcal{Y},A,E)=\mathcal{L}_{\mathrm{RA}}(\mathcal{X},\mathcal{Y})+\lambda\,\mathcal{L}_{\mathrm{RR}}(A,E)$, where $\lambda$ is a hyperparameter to balance both parts of the combined loss. Together, the combined loss simultaneously optimizes the primary training objective (e.g., accuracy) and feedback alignment.

Time Domain Feedback. Masking parts of the time domain as feedback is an easy way to mitigate spatially locatable confounders (Fig. 1, left). We use the explanations E𝐸Eitalic_E and annotations A𝐴Aitalic_A in the spatial version of the right-reason loss:

$$\mathcal{L}^{sp}_{RR}(A,E)=\frac{1}{D}\sum_{\bm{x}\in\mathcal{X}}\bigl(e(\bm{x})*a(\bm{x})\bigr)^{2} \qquad (3)$$

As the explanations and the feedback masks are multiplied element-wise, this loss minimizes the explanation values in the marked parts of the input. This effectively trains the model to disregard the marked parts of the input for its computation. Thus, using the loss in Eq. 3 as the right-reason component of the combined loss effectively steers the model away from specific points or intervals in time.
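A sketch of this loss term in PyTorch, under the (assumed) reading that the squared masked attributions are summed per sample and averaged over the dataset:

```python
import torch

def rr_loss_spatial(e_x, a_x):
    # Eq. 3: penalize attribution mass inside the annotated (confounded) regions.
    # e_x, a_x: tensors of shape (batch, T).
    return ((e_x * a_x) ** 2).sum(dim=-1).mean()
```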

Frequency Domain Feedback. However, feedback in the time domain is insufficient to handle every type of confounder. If the confounder is not locatable in time, spatial feedback cannot be used to revise the model’s behavior. Therefore, we utilize explanations and feedback in the frequency domain to tackle confounders like the one in Fig. 1 (right). Given the frequency explanations $\hat{E}$ and annotations $\hat{A}$, the right-reason loss for the frequency domain is:

$$\mathcal{L}^{fr}_{RR}(\hat{A},\hat{E})=\frac{1}{D}\sum_{\bm{x}\in\mathcal{X}}\Bigl(\bigl(\mathrm{Re}(\hat{e}(\bm{x}))*\hat{a}_{re}(\bm{x})\bigr)^{2}+\bigl(\mathrm{Im}(\hat{e}(\bm{x}))*\hat{a}_{im}(\bm{x})\bigr)^{2}\Bigr) \qquad (4)$$

The Fourier transformation is invertible and differentiable, so we can backpropagate gradients to parameters directly from this loss. Intuitively, the frequency right-reason loss causes the masked frequency explanations of the model to be small while not affecting any specific point in time.
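A corresponding sketch for the frequency right-reason loss, assuming `e_hat` is the complex frequency explanation of shape (batch, T) and the masks are binary tensors of matching length:

```python
import torch

def rr_loss_frequency(e_hat, a_re, a_im):
    # Eq. 4: penalize real and imaginary attribution components at annotated frequencies.
    re_term = (e_hat.real * a_re) ** 2
    im_term = (e_hat.imag * a_im) ** 2
    return (re_term + im_term).sum(dim=-1).mean()
```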

Depending on the problem at hand, it is possible to use RioT either in the spatial or the frequency domain. Moreover, it is also possible to combine feedback in both domains, thus including two right-reason terms in the final loss. This results in two parameters $\lambda_{1}$ and $\lambda_{2}$ to balance the right-answer and both right-reason losses.

$$\mathcal{L}(\mathcal{X},\mathcal{Y},A,E)=\mathcal{L}_{\mathrm{RA}}(\mathcal{X},\mathcal{Y})+\lambda_{1}\,\mathcal{L}^{sp}_{\mathrm{RR}}(A,E)+\lambda_{2}\,\mathcal{L}^{fr}_{\mathrm{RR}}(\hat{A},\hat{E}) \qquad (5)$$

It is important to note that giving feedback in the frequency domain allows a new form of model revision through XIL. Even if we effectively still apply masking in the frequency domain, the effect in the original input domain is entirely different. Masking out a single frequency affects all time points without preventing the model from looking at any of them. In general, having an invertible transformation from the input domain to a different representation allows feedback to be given more flexibly than before. The Fourier transformation is a prominent example but not the only possible choice; other transformations, such as wavelets (Graps, 1995), are also possible.

Computational Costs. Including RioT in the training of a model increases the computational cost. Computing the right-reason loss term requires the computation of a mixed partial derivative, $\frac{\partial^{2}f_{\theta}(x)}{\partial\theta\,\partial x}$. Even though this is a second-order derivative, it does not result in any substantial cost increase, as the second-order component of our loss can be formalized as a Hessian-vector product (cf. SubSec. A.4), which is known to be fast to compute (Martens, 2010). We also observed this in our experimental evaluation, as even the naive implementation of our loss in PyTorch scales to large models.
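The following hedged sketch of a combined training step (Eq. 5) illustrates why the second-order term stays manageable: the explanation is built with `create_graph=True`, so differentiating the right-reason terms w.r.t. the parameters reuses ordinary backpropagation (effectively a Hessian-vector product). The gradient-times-|input| explanation is a simplified stand-in for full IG, and all names are assumptions rather than the actual RioT implementation.

```python
import torch

def training_step(model, optimizer, task_loss, x, y, a_sp, a_re, a_im, lam1=1.0, lam2=1.0):
    optimizer.zero_grad()
    x = x.clone().requires_grad_(True)
    y_hat = model(x)
    loss_ra = task_loss(y_hat, y)

    # Explanation with a graph attached, so d(L_RR)/d(theta) includes the
    # mixed derivative d^2 f / (d theta d x) via double backpropagation.
    grads = torch.autograd.grad(y_hat.sum(), x, create_graph=True)[0]
    e_sp = x.abs() * grads                     # simplified stand-in for Eq. 1
    e_fr = torch.fft.fft(e_sp, dim=-1)         # frequency explanation (differentiable)

    loss_rr_sp = ((e_sp * a_sp) ** 2).sum(dim=-1).mean()                                   # Eq. 3
    loss_rr_fr = ((e_fr.real * a_re) ** 2 + (e_fr.imag * a_im) ** 2).sum(dim=-1).mean()    # Eq. 4

    loss = loss_ra + lam1 * loss_rr_sp + lam2 * loss_rr_fr                                 # Eq. 5
    loss.backward()
    optimizer.step()
    return loss.detach()
```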

4 Experimental Evaluations

In this section, we investigate the effectiveness of RioT to mitigate confounders in time series classification and forecasting. Our evaluations include the potential of revising in the spatial domain (RioTsp) and the frequency domain (RioTfreq), as well as both jointly.

4.1 Experimental Setup

Data. We perform experiments on various datasets. For classification, we focus mainly on the UCR/UEA repository (Dau et al., 2018), which holds a wide variety of datasets for this task. The data originates from different domains, e.g., health records, industrial sensor data, and audio signals. We select all available datasets above a minimal size (cf. SubSec. A.3), which results in Fault Detection A, Ford A, Ford B, and Sleep. We omit experiments on the very small datasets of UCR, as these are generally less suited for deep learning (Ismail Fawaz et al., 2020). We use the splits provided by the UCR archive. For time series forecasting, we evaluate on three popular datasets of the Darts repository (Herzen et al., 2022): ETTM1, Energy, and Weather, with 70%/30% train/test splits. These datasets are sufficiently large, allowing us to investigate the impact of confounding behavior in isolation without the risk of overfitting. We standardize all datasets as suggested by Wu et al. (2021), i.e., rescaling the values to zero mean and unit standard deviation.
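A minimal sketch of this standardization, assuming per-variate statistics computed on the training split only:

```python
import torch

def standardize(train, test):
    # Zero mean and unit standard deviation, using training statistics only.
    mean = train.mean(dim=0, keepdim=True)
    std = train.std(dim=0, keepdim=True)
    return (train - mean) / std, (test - mean) / std
```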

Production Press Sensor Data (P2S). RioT aims to mitigate confounders in time series data. To assess our method, we need datasets with annotated real-world confounders. So far, no such datasets are available. To fill this gap, we introduce Production Press Sensor Data (P2S, https://huggingface.co/datasets/AIML-TUDA/P2S), a dataset of sensor recordings with naturally occurring confounders. The sensor data comes from a high-speed press production line for metal parts, one of the most economically significant processes in the sheet metal working industry. The task is to predict whether a run is defective based on the sensor data. The recordings include different production speeds, which, although not affecting part quality, influence process friction and applied forces. Fig. 3 shows samples recorded at different speeds from normal and defect runs, highlighting variations even within the same class. An expert identified regions in the time series that vary with production speed, potentially distracting models from relevant classification indicators, especially when no defect and normal runs of the same speed are in the training data. Thus, the run’s speed is a confounder, challenging models to generalize beyond training. The default P2S setting includes normal and defect runs at different speeds, while an additional unconfounded setting contains only runs at the same speed. Further details on the dataset are available in App. B.

Figure 3: Samples of P2S with normal (left) and defect (right) setting at 80 and 225 strokes per minute. Areas that vary with the stroke rate are considered confounding and are marked in red.

Models. For time series classification, we use the FCN model of Ma et al. (2023), with a slightly modified architecture for Sleep to achieve a better unconfounded performance (cf.  SubSec. A.2). Additionally, we use the OFA model by Zhou et al. (2023). For forecasting, we use the recently introduced TiDE model (Das et al., 2023), PatchTST (Nie et al., 2023) and NBEATS (Oreshkin et al., 2020) to highlight the applicability of our method to a variety of model classes.

Metrics. In our evaluations, we compare the performance of models on confounded and unconfounded datasets with and without RioT. For classification, we report balanced (multiclass) accuracy (ACC), and for forecasting the mean squared error (MSE). The corresponding mean absolute error (MAE) results can be found in SubSec. A.6. We report average and standard deviation over 5 runs.

Table 1: Applying RioT mitigates confounders in time series classification. Performance before and after applying RioT for spatial (SP Conf) and frequency (Freq Conf) confounders separately. Unconfounded represents the ideal scenario where the model is not affected by any confounder.
Model | Config (ACC ↑) | Fault Detection A (Train / Test) | FordA (Train / Test) | FordB (Train / Test) | Sleep (Train / Test)
FCN | Unconfounded | 0.99 ±0.00 / 0.99 ±0.00 | 0.92 ±0.01 / 0.91 ±0.00 | 0.93 ±0.00 / 0.76 ±0.01 | 0.68 ±0.00 / 0.62 ±0.00
FCN | SP Conf | 1.00 ±0.00 / 0.74 ±0.06 | 1.00 ±0.00 / 0.71 ±0.08 | 1.00 ±0.00 / 0.63 ±0.03 | 1.00 ±0.00 / 0.10 ±0.03
FCN | + RioTsp | 0.98 ±0.01 / • 0.93 ±0.03 | 0.99 ±0.01 / • 0.84 ±0.02 | 0.99 ±0.00 / • 0.68 ±0.02 | 0.60 ±0.06 / • 0.54 ±0.05
FCN | Freq Conf | 0.98 ±0.01 / 0.87 ±0.03 | 0.98 ±0.00 / 0.73 ±0.01 | 0.99 ±0.01 / 0.60 ±0.01 | 0.98 ±0.00 / 0.27 ±0.02
FCN | + RioTfreq | 0.94 ±0.00 / • 0.90 ±0.03 | 0.83 ±0.02 / • 0.83 ±0.02 | 0.94 ±0.00 / • 0.65 ±0.01 | 0.67 ±0.05 / • 0.45 ±0.07
OFA | Unconfounded | 1.00 ±0.00 / 0.98 ±0.02 | 0.92 ±0.01 / 0.87 ±0.04 | 0.95 ±0.01 / 0.70 ±0.04 | 0.69 ±0.00 / 0.64 ±0.01
OFA | SP Conf | 1.00 ±0.00 / 0.53 ±0.02 | 1.00 ±0.00 / 0.50 ±0.00 | 1.00 ±0.00 / 0.52 ±0.01 | 1.00 ±0.00 / 0.21 ±0.05
OFA | + RioTsp | 0.96 ±0.08 / • 0.98 ±0.01 | 0.92 ±0.03 / • 0.85 ±0.02 | 0.94 ±0.01 / • 0.65 ±0.04 | 0.52 ±0.22 / • 0.58 ±0.05
OFA | Freq Conf | 1.00 ±0.00 / 0.72 ±0.02 | 1.00 ±0.00 / 0.65 ±0.01 | 1.00 ±0.00 / 0.56 ±0.02 | 0.99 ±0.00 / 0.24 ±0.03
OFA | + RioTfreq | 0.96 ±0.02 / • 0.98 ±0.02 | 0.78 ±0.04 / • 0.85 ±0.04 | 1.00 ±0.00 / • 0.64 ±0.03 | 0.50 ±0.16 / • 0.49 ±0.04

Confounders. To evaluate how well RioT can mitigate confounders in a more controlled setting, we add spatial (sp) or frequency (freq) shortcuts to the datasets from the UCR and Darts repositories. These confounders create spurious correlations between patterns and class labels or forecasting signals in the training data, but are absent in validation or test data. We generate an annotation mask based on the confounder area or frequency to simulate human feedback. More details on the confounders can be found in SubSec. A.5.
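The exact confounder construction is described in SubSec. A.5; purely as an illustration, a spatial shortcut and a frequency shortcut could be injected along the following lines (positions, widths, and frequencies are hypothetical):

```python
import torch

def add_spatial_confounder(x, label, amplitude=2.0, width=20):
    # Class-dependent artifact at a fixed position: the position correlates with the label.
    x = x.clone()
    start = 10 if label == 1 else 60
    x[..., start:start + width] += amplitude
    return x

def add_frequency_confounder(x, label, base_bin=25, amplitude=1.0):
    # Class-dependent sinusoid: a shortcut that is spread over all time steps.
    T = x.shape[-1]
    t = torch.arange(T, dtype=torch.float32)
    freq = base_bin if label == 1 else base_bin + 10
    return x + amplitude * torch.sin(2 * torch.pi * freq * t / T)
```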

4.2 Evaluations

Figure 4: Applying RioT makes the model ignore annotated confounder areas. While FCN primarily focuses on confounder areas, applying RioT with partial feedback (middle) or full feedback (bottom) causes the model to ignore the confounder and focus on the remainder of the input.

Removing Confounders for Time Series Classification. We evaluate the effectiveness of RioT (spatial: RioTsp, frequency: RioTfreq) in addressing confounders in classification tasks by comparing balanced accuracy with and without RioT. As shown in Tab. 1, without RioT, both FCN and OFA overfit to shortcuts, achieving ≈100% training accuracy while having poor test performance. Applying RioT significantly improves test performance for both models across all datasets. In some cases, RioT even reaches the performance of the ideal (unconfounded) reference scenario, as if there were no confounder in the data. Even on FordB, where the drop from training to test performance of the reference indicates a distribution shift, RioTsp is still beneficial. Similarly, RioTfreq enhances performance on frequency-confounded data, though the improvement is less pronounced for FCN on Ford B, suggesting essential frequency information is sometimes obscured by RioTfreq. In summary, RioT (both RioTsp and RioTfreq) successfully mitigates confounders, enhancing test generalization for FCN and OFA models.

Removing Confounders for Time Series Forecasting. Confounders are not exclusive to time series classification and can significantly impact other tasks, such as forecasting. In Tab. 2, we show that spatial confounders cause models to overfit, but applying RioTsp reduces MSE across datasets, especially for Energy, where MSE drops by up to 56%. In the frequency-confounded setting, the training data includes a recurring Dirac impulse as a distracting confounder (cf. SubSec. A.5 for details). RioTfreq alleviates this distraction and improves the test performance significantly. For example, TiDE’s test MSE on ETTM1 decreases by 14% compared to the confounded model.

In general, RioT effectively addresses spatial and frequency confounders in forecasting tasks. Interestingly, for TiDE on the Energy dataset, the performance with RioTfreq even surpasses the unconfounded model. Here, the added frequency acts as a form of data augmentation, enhancing model robustness. A similar behavior can also be observed for NBEATS and ETTM1, where the confounded setting actually improves the model slightly, and RioT even improves upon that.

Table 2: RioT can successfully overcome confounders in time series forecasting. MSE values (MAE values cf. Tab. 7) on the confounded training set and the unconfounded test set with Unconfounded being the ideal scenario where the model is not affected by any confounder.
Model | Config (MSE ↓) | ETTM1 (Train / Test) | Energy (Train / Test) | Weather (Train / Test)
NBEATS | Unconfounded | 0.30 ±0.02 / 0.47 ±0.02 | 0.34 ±0.03 / 0.26 ±0.02 | 0.08 ±0.01 / 0.03 ±0.01
NBEATS | SP Conf | 0.24 ±0.01 / 0.55 ±0.01 | 0.33 ±0.03 / 0.94 ±0.02 | 0.09 ±0.01 / 0.16 ±0.04
NBEATS | + RioTsp | 0.30 ±0.01 / • 0.50 ±0.01 | 0.45 ±0.03 / • 0.58 ±0.01 | 0.11 ±0.01 / • 0.09 ±0.02
NBEATS | Freq Conf | 0.30 ±0.02 / 0.46 ±0.01 | 0.33 ±0.04 / 0.36 ±0.04 | 0.11 ±0.02 / 0.32 ±0.09
NBEATS | + RioTfreq | 0.31 ±0.02 / • 0.45 ±0.01 | 0.33 ±0.04 / • 0.34 ±0.04 | 0.81 ±0.48 / • 0.17 ±0.01
PatchTST | Unconfounded | 0.46 ±0.03 / 0.47 ±0.01 | 0.26 ±0.01 / 0.23 ±0.00 | 0.26 ±0.03 / 0.08 ±0.01
PatchTST | SP Conf | 0.40 ±0.02 / 0.55 ±0.01 | 0.29 ±0.01 / 0.96 ±0.03 | 0.20 ±0.03 / 0.19 ±0.01
PatchTST | + RioTsp | 0.40 ±0.03 / • 0.53 ±0.01 | 0.44 ±0.00 / • 0.45 ±0.01 | 0.55 ±0.20 / • 0.14 ±0.01
PatchTST | Freq Conf | 0.45 ±0.03 / 0.91 ±0.16 | 0.04 ±0.00 / 0.53 ±0.05 | 0.63 ±0.09 / 0.24 ±0.04
PatchTST | + RioTfreq | 0.91 ±0.07 / • 0.66 ±0.04 | 2.45 ±4.59 / • 0.38 ±0.06 | 0.96 ±0.02 / • 0.17 ±0.00
TiDE | Unconfounded | 0.27 ±0.01 / 0.47 ±0.01 | 0.27 ±0.01 / 0.26 ±0.02 | 0.25 ±0.02 / 0.03 ±0.00
TiDE | SP Conf | 0.22 ±0.01 / 0.54 ±0.03 | 0.28 ±0.01 / 1.19 ±0.03 | 0.22 ±0.03 / 0.15 ±0.01
TiDE | + RioTsp | 0.23 ±0.01 / • 0.48 ±0.01 | 0.53 ±0.02 / • 0.52 ±0.02 | 0.25 ±0.03 / • 0.11 ±0.01
TiDE | Freq Conf | 0.06 ±0.01 / 0.69 ±0.08 | 0.07 ±0.01 / 0.34 ±0.08 | 0.79 ±0.09 / 0.31 ±0.09
TiDE | + RioTfreq | 0.07 ±0.01 / • 0.49 ±0.07 | 0.07 ±0.01 / • 0.21 ±0.02 | 1.12 ±0.36 / • 0.22 ±0.01

Removing Confounders in the Real World. So far, our experiments have demonstrated the ability to counteract confounders within controlled environments. However, real-world scenarios often have more complex confounder structures. Our newly proposed dataset P2S presents such real-world conditions. The model explanations for a sample in Fig. 4 (top) reveal a focus on distinct regions of the sensor curve, specifically the two middle regions. With domain knowledge, it is clear that these regions should not affect the model’s output. By applying RioT, we can redirect the model’s attention away from these regions. New model explanations highlight that the model still focuses on incorrect regions, which can be mitigated by extending the annotated area. In Tab. 3, the model performance (exemplarily with FCN) in these settings is presented. Without RioT, the model overfits to the confounder. The test performance already improves with partial feedback (2) and improves even more with full feedback (4). These results highlight the effectiveness of RioT in real-world scenarios, where not all confounders are initially known.

Removing Multiple Confounders at Once. In the previous experiments, we illustrated that RioT is suitable for addressing individual confounding factors, whether spatial or frequency-based. Real-world time series data, however, often present a blend of multiple confounding factors that may simultaneously influence model performance.

Table 3: Applying RioT overcomes the confounder in P2S. Performance on confounded train set and the unconfounded test set. FCN learns the train confounder, resulting in a drop in test performance. Applying RioT with partial feedback (2) already yields good improvements, while adding feedback on the full confounder area (4) is even better. Unconfounded is the ideal scenario, specifically curated so that there is no confounder.
P2S (ACC ↑) | Train | Test
FCN Unconfounded | 0.97 ±0.01 | 0.95 ±0.01
FCN sp | 0.99 ±0.01 | 0.66 ±0.14
FCN sp + RioTsp (2) | 0.96 ±0.01 | 0.78 ±0.05
FCN sp + RioTsp (4) | 0.95 ±0.01 | • 0.82 ±0.06
Table 4: RioT can combine spatial and frequency feedback. If the data is confounded in the time and frequency domain, RioT can combine feedback on both domains to mitigate confounders, superior to feedback on only one domain. Unconfounded represents the ideal scenario when the model is not affected by any confounder.
Sleep (Classification ACC ↑) | Train | Test
FCN Unconfounded | 0.68 ±0.00 | 0.62 ±0.00
FCN freq,sp | 1.00 ±0.00 | 0.10 ±0.04
FCN freq,sp + RioTsp | 0.94 ±0.00 | 0.24 ±0.02
FCN freq,sp + RioTfreq | 1.00 ±0.00 | 0.04 ±0.00
FCN freq,sp + RioTfreq,sp | 0.47 ±0.00 | • 0.48 ±0.03
Energy (Forecasting MSE ↓) | Train | Test
TiDE Unconfounded | 0.28 ±0.01 | 0.26 ±0.02
TiDE freq,sp | 0.16 ±0.01 | 0.74 ±0.02
TiDE freq,sp + RioTsp | 0.20 ±0.01 | 0.61 ±0.02
TiDE freq,sp + RioTfreq | 0.22 ±0.01 | 0.55 ±0.02
TiDE freq,sp + RioTfreq,sp | 0.25 ±0.01 | • 0.47 ±0.01

We thus investigate the impact of applying RioT to both spatial and frequency confounders simultaneously (cf. Tab. 4), exemplarily using FCN and TiDE. When Sleep is confounded in both domains, FCN without RioT overfits and fails to generalize. Addressing only one confounder does not mitigate the effects, as the model adapts to the other. However, combining feedback for both domains (RioTfreq,sp) significantly improves test performance, matching the frequency-confounded scenario (cf. Tab. 1). Tab. 4 (bottom) shows the impact of multiple confounders on the Energy dataset for forecasting. When faced with both spatial shortcut and noise confounders, the model overfits, indicated by the lower training MSE. While applying either spatial or frequency feedback individually already shows some effect, utilizing both types of feedback simultaneously (RioTfreq,sp) results in the largest improvement, as both confounders are addressed. The performance gap between RioTfreq,sp and the non-confounded model is more pronounced than in the single-confounder cases (cf. Tab. 2), suggesting a compounded challenge. Optimizing the deconfounding process in highly complex data environments thus remains an important challenge.

Figure 5: RioT uses feedback efficiently. Even with feedback on only a small percentage of the data, RioT can overcome confounders.

Feedback Generalization. As human feedback is an essential aspect of RioT, we investigate the required annotations and the potential to generalize annotations across samples. Our findings indicate that not every sample needs annotation. Fig. 5 shows that we can significantly reduce the amount of annotated data for classification and forecasting (cf. App. Tab. 6 and Tab. 5 for results on the other datasets). Even minimal feedback, such as annotating just 5% of the samples, substantially improves performance compared to no feedback. Furthermore, the results on P2S highlight that annotations can be generalized across multiple samples. Once the confounder in P2S has been identified on a couple of samples, the expert annotations can be applied to the full dataset. The systematic nature of shortcut confounders suggests that generalizing annotations is an effective way to obtain feedback efficiently. While RioT does rely on human annotations, these findings highlight that it can work without extensive manual human interaction, and that obtained annotations can be utilized efficiently.

Limitations. An important aspect of RioT is the human feedback provided in the Obtain step. Integrating human feedback into the model is a key advantage of RioT, but it can also be a limitation. While we have shown that a small fraction of annotated samples can be sufficient, and that annotations can be applied to many samples, they are still necessary for RioT. Additionally, like many other (explanatory) interactive learning methods, RioT assumes correct human feedback. Thus, considering possible repercussions of inaccurate feedback when applying RioT in practice is important. Another potential drawback of RioT is the increased training cost. RioT requires the computation of a mixed partial derivative to optimize the model’s explanation when using gradient-based attributions. While this affects training cost, the loss can be formulated as a Hessian-vector product, which is fast to compute in practice, making the additional overhead easy to manage.

5 Conclusion

In this work, we present Right on Time (RioT), a method to mitigate confounding factors in time series data with the help of human feedback. By revising the model, RioT significantly diminishes the influence of these factors, steering the model to align with the correct reasons. Using popular time series models on several manually confounded datasets and the newly introduced, naturally confounded, real-world dataset P2S showcases that they are indeed subject to confounders. Our results, however, demonstrate that applying RioT to these models can mitigate confounders in the data. Furthermore, we have unveiled that addressing solely the time domain is insufficient for revising the model to focus on the correct reasons, which is why we extended our method beyond it. Feedback in the frequency domain provides an additional way to steer the model away from confounding factors and towards the right reasons. Extending the application of RioT to multivariate time series represents a logical next step, and exploring the integration of various explainer types is another promising direction. Additionally, we aim to apply RioT, especially RioTfreq, to other modalities as well, offering a more nuanced approach to confounder mitigation. It should be noted that while our method shows potential in its current iteration, interpreting attributions in time series data remains a general challenge.

Acknowledgment

This work received funding from the EU project EXPLAIN, funded by the Federal Ministry of Education and Research (grant 01IS22030D). Additionally, it was funded by the project "The Adaptive Mind" of the Hessian Ministry of Science and the Arts (HMWK), the "ML2MT" project of the Volkswagen Stiftung, and the Priority Program (SPP) 2422 in the subproject "Optimization of active surface design of high-speed progressive tools using machine and deep learning algorithms" funded by the German Research Foundation (DFG). The latter also contributed the data for P2S. Furthermore, this work benefited from the HMWK project "The Third Wave of Artificial Intelligence - 3AI".

References

  • Bach et al. (2015) Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE, 10(7):e0130140, 2015.
  • Benidis et al. (2023) Konstantinos Benidis, Syama Sundar Rangapuram, Valentin Flunkert, Yuyang Wang, Danielle Maddix, Caner Turkmen, Jan Gasthaus, Michael Bohlke-Schneider, David Salinas, Lorenzo Stella, François-Xavier Aubet, Laurent Callot, and Tim Januschowski. Deep Learning for Time Series Forecasting: Tutorial and Literature Survey. ACM Computing Surveys, 55(6):1–36, 2023.
  • Bica et al. (2020) Ioana Bica, Ahmed M. Alaa, and Mihaela Van Der Schaar. Time series deconfounder: estimating treatment effects over time in the presence of hidden confounders. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • Cao et al. (2023) Defu Cao, James Enouen, Yujing Wang, Xiangchen Song, Chuizheng Meng, Hao Niu, and Yan Liu. Estimating treatment effects from irregular time series observations with hidden confounders. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023.
  • Das et al. (2023) Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan Mathur, Rajat Sen, and Rose Yu. Long-term Forecasting with TiDE: Time-series Dense Encoder. ArXiv:2304.08424, 2023.
  • Dau et al. (2018) Hoang Anh Dau, Anthony J. Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn J. Keogh. The UCR time series archive. ArXiv:1810.07758, 2018.
  • Flanders et al. (2011) W. Dana Flanders, M. Klein, L.A. Darrow, M.J. Strickland, S.E. Sarnat, J.A. Sarnat, L.A. Waller, A. Winquist, and P.E. Tolbert. A Method for Detection of Residual Confounding in Time-Series and Other Observational Studies. Epidemiology, 22(1):59–67, 2011.
  • Friedrich et al. (2023a) Felix Friedrich, Wolfgang Stammer, Patrick Schramowski, and Kristian Kersting. A typology for exploring the mitigation of shortcut behaviour. Nature Machine Intelligence, 5(3):319–330, 2023a.
  • Friedrich et al. (2023b) Felix Friedrich, David Steinmann, and Kristian Kersting. One explanation does not fit XIL. ArXiv:2304.07136, 2023b.
  • Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
  • Graps (1995) A. Graps. An introduction to wavelets. IEEE Computational Science and Engineering, 2(2):50–61, 1995.
  • Hatt and Feuerriegel (2024) Tobias Hatt and Stefan Feuerriegel. Sequential deconfounding for causal inference with unobserved confounders. In Proceedings of the Conference on Causal Learning and Reasoning (CLeaR), 2024.
  • Herzen et al. (2022) Julien Herzen, Francesco Lässig, Samuele Giuliano Piazzetta, Thomas Neuer, Léo Tafti, Guillaume Raille, Tomas Van Pottelbergh, Marek Pasieka, Andrzej Skrodzki, Nicolas Huguenin, Maxime Dumonal, Jan Kościsz, Dennis Bader, Frédérick Gusset, Mounir Benheddi, Camila Williamson, Michal Kosinski, Matej Petrik, and Gaël Grosch. Darts: User-Friendly Modern Machine Learning for Time Series. Journal of Machine Learning Research (JMLR), 23(124):1–6, 2022.
  • Ismail Fawaz et al. (2020) Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. Inceptiontime: Finding alexnet for time series classification. Data Mining and Knowledge Discovery (DMKD), 34(6):1936–1962, 2020.
  • Koprinska et al. (2018) Irena Koprinska, Dengsong Wu, and Zheng Wang. Convolutional Neural Networks for Energy Time Series Forecasting. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2018.
  • Lapuschkin et al. (2019) Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1):1096, 2019.
  • Lin et al. (2003) Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the ACM SIGMOD workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), 2003.
  • Ma et al. (2022) Changxi Ma, Guowen Dai, and Jibiao Zhou. Short-Term Traffic Flow Prediction for Urban Road Sections Based on Time Series Analysis and LSTM_bilstm Method. IEEE Transactions on Intelligent Transportation (T-ITS), 23(6):5615–5624, 2022.
  • Ma et al. (2023) Qianli Ma, Zhen Liu, Zhenjing Zheng, Ziyang Huang, Siying Zhu, Zhongzhong Yu, and James T Kwok. A survey on time-series pre-trained models. ArXiv:2305.10716, 2023.
  • Martens (2010) James Martens. Deep learning via hessian-free optimization. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
  • Mehdiyev et al. (2017) Nijat Mehdiyev, Johannes Lahann, Andreas Emrich, David Enke, Peter Fettke, and Peter Loos. Time Series Classification using Deep Learning for Process Planning: A Case from the Process Industry. Procedia Computer Science, 114:242–249, 2017.
  • Mercier et al. (2022) Dominique Mercier, Jwalin Bhatt, Andreas Dengel, and Sheraz Ahmed. Time to Focus: A Comprehensive Benchmark Using Time Series Attribution Methods. ArXiv:2202.03759, 2022.
  • Miller (2019) Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence (AIJ), 267:1–38, 2019.
  • Nie et al. (2023) Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In Proceedings of the International Conference on Learning Representations (ICLR), 2023.
  • Oreshkin et al. (2020) Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  • Rojat et al. (2021) Thomas Rojat, Raphaël Puget, David Filliat, Javier Del Ser, Rodolphe Gelin, and Natalia Díaz-Rodríguez. Explainable Artificial Intelligence (XAI) on TimeSeries Data: A Survey. ArXiv:2104.00950, 2021.
  • Ross et al. (2017) Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017.
  • Ruiz et al. (2021) Alejandro Pasos Ruiz, Michael Flynn, James Large, Matthew Middlehurst, and Anthony Bagnall. The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery (DMKD), 35(2):401–449, 2021.
  • Schlegel et al. (2019) Udo Schlegel, Hiba Arnout, Mennatallah El-Assady, Daniela Oelke, and Daniel A. Keim. Towards a Rigorous Evaluation of XAI Methods on Time Series. ArXiv:1909.07082, 2019.
  • Schramowski et al. (2020) Patrick Schramowski, Wolfgang Stammer, Stefano Teso, Anna Brugger, Franziska Herbert, Xiaoting Shao, Hans-Georg Luigs, Anne-Katrin Mahlein, and Kristian Kersting. Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence, 2(8):476–486, 2020.
  • Selvaraju et al. (2019) Ramprasaath Ramasamy Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded. In Proceedings of the International Conference on Computer Vision (ICCV), 2019.
  • Shao et al. (2021) Xiaoting Shao, Arseny Skryagin, Wolfgang Stammer, Patrick Schramowski, and Kristian Kersting. Right for Better Reasons: Training Differentiable Models by Constraining their Influence Functions. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021.
  • Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not Just a Black Box: Learning Important Features Through Propagating Activation Differences. ArXiv:1605.01713, 2017.
  • Stammer et al. (2020) Wolfgang Stammer, Patrick Schramowski, and Kristian Kersting. Right for the right concept: Revising neuro-symbolic concepts by interacting with their explanations. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
  • Teso and Kersting (2019) Stefano Teso and Kristian Kersting. Explanatory Interactive Machine Learning. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2019.
  • Veerappa et al. (2022) Manjunatha Veerappa, Mathias Anneken, Nadia Burkart, and Marco F. Huber. Validation of XAI explanations for multivariate time series classification in the maritime domain. Journal of Computational Science, 58:101539, 2022.
  • Vielhaben et al. (2023) Johanna Vielhaben, Sebastian Lapuschkin, Grégoire Montavon, and Wojciech Samek. Explainable AI for Time Series via Virtual Inspection Layers. ArXiv:2303.06365, 2023.
  • Wu et al. (2021) Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with Auto-Correlation for long-term series forecasting. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2021.
  • Ye and Keogh (2011) Lexiang Ye and Eamonn Keogh. Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Mining and Knowledge Discovery (DMKD), 22:149–182, 2011.
  • Zhou et al. (2023) Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time series analysis by pretrained LM. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2023.

Appendix A Appendix

A.1 Impact Statement

Our research advances machine learning by enhancing the interpretability and reliability of time series models, significantly impacting how humans interact with AI systems. By developing Right on Time (RioT), which guides models to focus on correct reasoning, we improve transparency and trust in machine learning decisions. While human feedback provides many benefits, one must also be aware that it can be incorrect and carefully evaluate the consequences.

A.2 Implementation and Experimental Details

Adaptation of Integrated Gradients (IG). Part of IG is a multiplication of the model gradient with the input itself, which improves the explanation's quality [Shrikumar et al., 2017]. However, this multiplication makes an implicit assumption about the input format, namely that no inputs are negative. Otherwise, multiplying the attribution score with a negative input would flip the attribution's sign, which is not desired. For images this is unproblematic, as pixel values are always non-negative. In time series, however, negative values can occur, and normalizing them to be positive is not always suitable. To avoid this problem, we use only the input magnitude, not its sign, to compute the IG attributions.
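To make this adaptation concrete, below is a minimal sketch (not the exact implementation) of IG with a zero baseline for a classifier, where the final multiplication uses the input magnitude instead of the signed input; the function name, the loop-based path approximation, and the assumed (batch, classes) output shape are our own illustration.

```python
import torch

def integrated_gradients_magnitude(model, x, target, steps=32):
    """Sketch of IG with a zero baseline; the attribution is scaled by |x|
    instead of the signed input so negative values cannot flip its sign."""
    baseline = torch.zeros_like(x)
    accumulated = torch.zeros_like(x)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        out = model(point)[:, target].sum()  # assumes a (batch, classes) output
        accumulated += torch.autograd.grad(out, point)[0] / steps
    return accumulated * (x - baseline).abs()  # standard IG would use (x - baseline)
```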

Computing Explanations. To compute explanations with Integrated Gradients, we followed the common practice of using an all-zeros baseline. This standard choice worked well in our experiments, so we did not explore other baselines in this work. For the implementation, we used the widely adopted Captum library (https://github.com/pytorch/captum), where we patched the captum._utils.gradient.compute_gradients function so that the gradient with respect to the input can be propagated back into the model parameters.
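For illustration, a minimal, hypothetical usage sketch of Captum's Integrated Gradients with the zero baseline is shown below; the toy model, input shape, and targets are placeholders, not our actual setup. Note that for the right-reason loss, the gradient computation inside Captum must additionally retain the computation graph, which is what the patch mentioned above provides.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Placeholder model and batch, only to illustrate the call pattern.
model = nn.Sequential(nn.Flatten(), nn.Linear(128, 2))
x = torch.randn(4, 1, 128)             # (batch, channels, time)
y = torch.zeros(4, dtype=torch.long)   # per-sample target classes

ig = IntegratedGradients(model)
attributions = ig.attribute(x, baselines=torch.zeros_like(x), target=y)
```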

Model Training and Hyperparameters. To find suitable parameters for model training, we performed a hyperparameter search over batch size, learning rate, and the number of training epochs. We then used these parameters for all model training and evaluation, with and without RioT. In addition, we selected suitable $\lambda$ values for RioT via a hyperparameter search on the respective validation sets. The exact model training parameters and $\lambda$ values can be found in the provided code.

To avoid model overfitting on the forecasting datasets, we performed shifted sampling with a window size of half the lookback window.
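As an illustration, the sketch below implements this shifted sampling under the assumption that "window size of half the lookback window" refers to the stride between consecutive training windows; the function name and return format are our own.

```python
def shifted_forecast_windows(series, lookback, horizon):
    """Cut (lookback, horizon) pairs from a series, shifting the start of
    consecutive windows by half the lookback length."""
    stride = lookback // 2
    samples = []
    for start in range(0, len(series) - lookback - horizon + 1, stride):
        past = series[start:start + lookback]
        future = series[start + lookback:start + lookback + horizon]
        samples.append((past, future))
    return samples
```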

All experiments were executed with our Python 3.11 and PyTorch code, which is included in the provided code repository. To ensure reproducibility and consistency, we utilized Docker. Configurations and Python executables for all experiments are provided in the repository.

Hardware. To conduct our experiments, we utilized single GPUs from Nvidia DGX2 machines equipped with A100-40G and A100-80G graphics processing units.

By maintaining a consistent hardware setup and a controlled software environment, we aimed to ensure the reliability and reproducibility of our experimental results.

A.3 UCR Dataset selection

We focused our evaluation on a subset of UCR datasets of sufficient size. Our selection process was as follows: First, we discarded all multivariate datasets, as we only consider univariate data in this paper. Then we removed all datasets with time series of differing lengths or with missing values. We further excluded all datasets of the category SIMULATED to avoid datasets that are synthetic from the start. We also considered only datasets with fewer than 10 classes, as a per-class confounder on more than 10 classes would result in a very large number of different confounders, which is unlikely in practice. Finally, we discarded all datasets with fewer than 1000 training samples or a per-sample length below 100 to avoid the small UCR datasets. This leaves four datasets: Fault Detection A, Ford A, Ford B, and Sleep.

A.4 Computational Costs of RioT

Training a model with RioT induces additional computational costs, as the right-reason term requires the computation of additional gradients. Given a model $f_{\theta}(x)$, parameterized by $\theta$ with input $x$, computing the right-reason loss with a gradient-based explanation method requires the mixed partial derivative $\frac{\partial^{2} f_{\theta}(x)}{\partial\theta\,\partial x}$, since a gradient-based explanation contains the derivative $\frac{\partial f_{\theta}(x)}{\partial x}$. Although this mixed partial derivative is a second-order derivative, it does not substantially increase the computational cost of our method, for two main reasons. First, we never explicitly materialize the Hessian matrix. Second, the second-order component of our loss can be formulated as a Hessian-vector product:

$\frac{\partial\mathcal{L}}{\partial\theta} = g + \frac{\lambda}{2}\, H_{\theta x}\,\bigl(e(x) - a(x)\bigr)$ (6)

where $g = \frac{\partial\mathcal{L}_{\mathrm{RA}}}{\partial\theta}$ is the partial derivative of the right-answer loss and, if $H$ denotes the full joint Hessian of the loss with respect to $\theta$ and $x$, then $H_{\theta x}$ is the sub-block of this matrix mapping $x$ into $\theta$ (cf. Fig. 6), given by $H_{\theta x} = \frac{\partial^{2} f_{\theta}(x)}{\partial\theta\,\partial x}$. Hessian-vector products are known to be fast to compute [Martens, 2010], enabling the right-reason loss computation to scale to large models and inputs.

Refer to caption
Figure 6: Illustration of the Hessian matrix with its sub-blocks $H_{\theta\theta}$, $H_{\theta x}$, $H_{x\theta}$, and $H_{xx}$. The mapping from $x$ into $\theta$ is highlighted in blue.
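A minimal PyTorch sketch of how Eq. 6 is realized in practice is shown below: the explanation is computed with create_graph=True, and a second backward pass through the right-reason loss implicitly forms the Hessian-vector product without ever building the Hessian. The simplified loss form and the feedback-mask semantics are our own stand-ins for the paper's $e(x)$ and $a(x)$.

```python
import torch

def right_reason_parameter_grads(model, x, mask, lam):
    """Gradients of a simple right-reason loss w.r.t. the model parameters,
    obtained as an implicit Hessian-vector product (no Hessian is built)."""
    x = x.clone().requires_grad_(True)
    output = model(x).sum()
    # Differentiable input-gradient explanation.
    explanation = torch.autograd.grad(output, x, create_graph=True)[0]
    # Penalize attribution inside the annotated (mask == 1) region.
    rr_loss = (lam / 2) * ((mask * explanation) ** 2).sum()
    params = [p for p in model.parameters() if p.requires_grad]
    # allow_unused handles parameters that do not influence the explanation
    # (e.g., a final bias term).
    return torch.autograd.grad(rr_loss, params, allow_unused=True)
```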

A.5 Details on Confounding Factors

For all datasets other than P2S, we added synthetic confounders to evaluate how effectively RioT mitigates them. In the following, we provide details on the nature of these confounders in the four settings:

Classification Spatial. For classification datasets, spatial confounders are specific patterns for each class. The pattern is added to every sample of that class in the training data, resulting in a spurious correlation between the pattern and the class label. Specifically, we replace $T$ time steps with a sine wave according to:

$\mathit{confounder} := \sin(t \cdot (2 + j)\,\pi)$

where $t \in \{0, 1, \dots, T\}$ and $j$ denotes the class index, simulating a spurious correlation between the confounder and the class index.
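A small sketch of this spatial confounder is given below; the normalization of the time index (so the sine pattern is non-degenerate for integer class indices) and the start position are our assumptions.

```python
import numpy as np

def add_spatial_confounder(x, class_idx, start, T):
    """Replace T time steps of a univariate sample with the class-specific
    sine pattern sin(t * (2 + class_idx) * pi)."""
    t = np.arange(T) / T  # assumed normalization of the time index
    x = x.copy()
    x[start:start + T] = np.sin(t * (2 + class_idx) * np.pi)
    return x
```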

Classification Frequency. Similar to the spatial case, frequency confounders for classification are specific patterns added to the entire series, altering all time steps by a small amount. The confounder is a sine wave applied additively to the full sequence ($T = S$):

$\mathit{confounder} := \sin(t \cdot (2 + j)\,\pi) \cdot A$

where $A$ denotes the confounder amplitude.
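Analogously, a sketch of the additive frequency confounder, covering the full sequence and again assuming a normalized time index:

```python
import numpy as np

def add_frequency_confounder(x, class_idx, A):
    """Add the class-specific sine pattern, scaled by amplitude A, to every
    time step of a univariate sample (T = S)."""
    t = np.arange(len(x)) / len(x)
    return x + A * np.sin(t * (2 + class_idx) * np.pi)
```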

Forecasting Spatial. For forecasting datasets, spatial confounders are shortcuts that act as the actual solution to the forecasting problem. Periodically, data from the time series is copied back in time. This “back-copy” is a shortcut for the forecast, as it duplicates the time steps of the forecasting window. Due to the windowed sampling from the time series, this shortcut occurs in every second sample. The exact confounder formulation is outlined in the sketch below (Fig. 7), with an exemplary lookback length of 9, a forecasting horizon of 3, and a window stride of 6. This results in a shortcut confounder in samples 1 and 3 and an overlapping confounder in sample 2.

Sample 1 (lookback 0–8, horizon 9–11): unconfounded lookback 0 1 2 3 4 5 6 7 8; confounded lookback 9 10 11 3 4 5 6 7 8; feedback 1 1 1 0 0 0 0 0 0 (confounder at the copied steps).
Sample 2 (lookback 6–14, horizon 15–17): unconfounded lookback 6 7 8 9 10 11 12 13 14; confounded lookback 6 7 8 9 10 11 21 22 23; feedback 0 0 0 0 0 0 0 0 0 (overlapping confounder, not annotated).
Sample 3 (lookback 12–20, horizon 21–23): unconfounded lookback 12 13 14 15 16 17 18 19 20; confounded lookback 21 22 23 15 16 17 18 19 20; feedback 1 1 1 0 0 0 0 0 0 (confounder at the copied steps).
Figure 7: Schematic overview of how the time series were confounded in the spatial forecasting experiments
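The back-copy of Fig. 7 can be sketched as follows; the helper operates on a single (lookback, horizon) pair and assumes that the number of copied steps equals the forecasting horizon.

```python
import numpy as np

def confound_lookback(past, future, n_copy=3):
    """Overwrite the first n_copy lookback values with the first n_copy
    horizon values, creating the back-copy shortcut of Fig. 7."""
    past = np.asarray(past, dtype=float).copy()
    past[:n_copy] = np.asarray(future, dtype=float)[:n_copy]
    return past
```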

Forecasting Frequency. This setting differs from the previous shortcut confounders. The frequency confounder for forecasting is a recurring Dirac impulse added every $k$ time steps over the entire sequence (of length $S$), including the forecasting windows. This impulse is present throughout all of the training data, distracting the model from the real forecast. The confounder is present at all time steps $i \in \{n \cdot k \mid n \in \mathbb{N} \wedge n \cdot k \leq S\}$ with a strength of $A$:

$\mathit{confounder} := A \cdot \Delta_{i}$
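A minimal sketch of this impulse confounder, assuming the first impulse occurs at time step $k$ (i.e., $n \geq 1$):

```python
import numpy as np

def add_dirac_confounder(series, k, A):
    """Add an impulse of strength A at every k-th time step of the sequence,
    covering lookback and forecasting windows alike."""
    s = np.asarray(series, dtype=float).copy()
    s[k::k] += A
    return s
```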

Note that confounders are only present in the training data, not in the validation or test data. To simulate human feedback, we generate an annotation mask based on the confounded area or frequency. This mask is applied to all confounded samples, except in our feedback scaling experiment.

A.6 Additional Experimental Results

Table 5: Feedback percentage for forecasting across all datasets, reported for the TiDE model. Corresponding to the (test) results shown in Fig. 5; a higher percentage indicates more feedback, and lower metric values are better.
Metric Feedback ETTM1 Energy Weather
Spatial Freq Spatial Freq Spatial Freq
MAE (\downarrow) 0% 0.54 ±0.01 0.74 ±0.06 0.85 ±0.01 0.53 ±0.07 0.29 ±0.01 0.49 ±0.09
5% 0.52 ±0.00 0.63 ±0.03 0.62 ±0.01 0.40 ± 0.02 0.28 ±0.01 0.43 ±0.03
10% 0.52 ±0.00 0.63 ±0.03 0.61 ±0.01 0.40 ± 0.02 0.27 ±0.01 0.43 ±0.03
25% 0.52 ±0.00 0.63 ±0.03 0.58 ±0.01 0.41 ±0.01 0.25 ±0.01 0.43 ±0.04
50% 0.52 ±0.00 0.63 ±0.03 0.57 ± 0.01 0.41 ±0.01 0.24 ± 0.01 0.44 ±0.05
75% 0.52 ±0.01 0.63 ±0.03 0.57 ± 0.01 0.41 ±0.01 0.24 ± 0.01 0.45 ±0.06
100% 0.51 ± 0.01 0.60 ± 0.05 0.58 ±0.01 0.40 ± 0.03 0.24 ± 0.01 0.41 ± 0.02
MSE (\downarrow) 0% 0.54 ±0.03 0.69 ±0.08 1.19 ±0.03 0.34 ±0.08 0.15 ±0.01 0.31 ±0.09
5% 0.54 ±0.01 0.52 ±0.03 0.60 ±0.02 0.20 ± 0.01 0.14 ±0.01 0.24 ±0.02
10% 0.53 ±0.01 0.52 ±0.03 0.57 ±0.02 0.20 ± 0.01 0.14 ±0.01 0.24 ±0.02
25% 0.53 ±0.01 0.52 ±0.03 0.53 ±0.02 0.22 ±0.01 0.11 ± 0.01 0.24 ±0.03
50% 0.53 ±0.01 0.52 ±0.03 0.51 ± 0.02 0.22 ±0.01 0.11 ± 0.01 0.25 ±0.04
75% 0.52 ±0.01 0.51 ±0.03 0.52 ±0.02 0.22 ±0.01 0.11 ± 0.01 0.26 ±0.05
100% 0.48 ± 0.01 0.49 ± 0.07 0.52 ±0.02 0.21 ±0.02 0.11 ± 0.01 0.22 ± 0.01
Table 6: Feedback percentage for classification across all datasets, reported for the FCN model. Corresponding to the results shown in Fig. 5; a higher percentage indicates more feedback, and higher accuracy is better.
Feedback Fault Detection A (ACC \uparrow) FordA (ACC \uparrow) FordB (ACC \uparrow) Sleep (ACC \uparrow)
Spatial Freq Spatial Freq Spatial Freq Spatial Freq
0% 0.74 ±0.06 0.87 ±0.03 0.71 ±0.08 0.73 ±0.01 0.63 ±0.03 0.60 ±0.01 0.10 ±0.03 0.27 ±0.02
5% 0.88 ±0.00 0.88 ±0.01 0.81 ±0.03 0.80 ±0.03 0.66 ±0.03 0.66 ± 0.02 0.53 ±0.03 0.49 ± 0.00
10% 0.89 ±0.02 0.89 ±0.01 0.82 ±0.04 0.79 ±0.02 0.66 ±0.03 0.64 ±0.03 0.48 ±0.09 0.48 ±0.02
25% 0.92 ±0.01 0.89 ±0.01 0.83 ±0.02 0.78 ±0.01 0.67 ±0.02 0.65 ±0.01 0.49 ±0.08 0.42 ±0.08
50% 0.95 ± 0.01 0.88 ±0.01 0.82 ±0.03 0.81 ±0.05 0.67 ±0.02 0.65 ±0.00 0.55 ± 0.03 0.44 ±0.07
75% 0.95 ± 0.01 0.88 ±0.01 0.81 ±0.03 0.80 ±0.04 0.65 ±0.03 0.64 ±0.00 0.54 ±0.04 0.44 ±0.07
100% 0.93 ±0.03 0.90 ± 0.03 0.84 ± 0.02 0.83 ± 0.02 0.68 ± 0.02 0.65 ±0.01 0.54 ±0.05 0.45 ±0.07
Table 7: RioT can successfully overcome confounders in time series forecasting. MAE values on the confounded training set and the unconfounded test set with Unconfounded being the ideal scenario where the model is not affected by any confounder.
Model Config (MAE \downarrow) ETTM1 Energy Weather
Train Test Train Test Train Test
NBEATS Unconfounded 0.39 ±0.01 0.48 ±0.01 0.44 ±0.02 0.38 ±0.01 0.21 ±0.01 0.12 ±0.01
SP Conf 0.34 ±0.01 0.54 ±0.01 0.44 ±0.03 0.77 ±0.01 0.21 ±0.01 0.30 ±0.04
+ RioTsp 0.40 ±0.01 \bullet 0.52 ±0.01 0.53 ±0.02 \bullet 0.62 ±0.01 0.23 ±0.01 \bullet 0.22 ±0.01
Freq Conf 0.39 ±0.01 0.47 ±0.01 0.45 ±0.03 0.45 ±0.03 0.21 ±0.03 0.45 ±0.06
+ RioTfreq 0.40 ±0.01 \bullet 0.47 ±0.01 0.45 ±0.03 \bullet 0.44 ±0.02 0.59 ±0.22 \bullet 0.39 ±0.01
PatchTST Unconfounded 0.50 ±0.01 0.49 ±0.01 0.39 ±0.00 0.38 ±0.01 0.38 ±0.03 0.18 ±0.00
SP Conf 0.46 ±0.00 0.53 ±0.01 0.41 ±0.00 0.78 ±0.01 0.32 ±0.04 0.33 ±0.00
+ RioTsp 0.46 ±0.01 \bullet 0.52 ±0.01 0.51 ±0.00 \bullet 0.53 ±0.01 0.54 ±0.12 \bullet 0.28 ±0.00
Freq Conf 0.53 ±0.01 0.81 ±0.07 0.15 ±0.00 0.64 ±0.03 0.58 ±0.03 0.41 ±0.05
+ RioTfreq 0.92 ±0.05 \bullet 0.80 ±0.02 0.97 ±0.86 \bullet 0.57 ±0.02 0.65 ±0.01 \bullet 0.40 ±0.01
TiDE Unconfounded 0.36 ±0.01 0.48 ±0.01 0.40 ±0.01 0.38 ±0.02 0.36 ±0.02 0.13 ±0.00
SP Conf 0.32 ±0.01 0.54 ±0.01 0.40 ±0.01 0.85 ±0.01 0.32 ±0.03 0.29 ±0.01
+ RioTsp 0.34 ±0.01 \bullet 0.51 ±0.01 0.57 ±0.01 \bullet 0.58 ±0.01 0.35 ±0.03 \bullet 0.24 ±0.01
Freq Conf 0.18 ±0.01 0.74 ±0.06 0.18 ±0.01 0.53 ±0.07 0.65 ±0.05 0.49 ±0.09
+ RioTfreq 0.19 ±0.01 \bullet 0.60 ±0.05 0.18 ±0.01 \bullet 0.40 ±0.03 0.79 ±0.16 \bullet 0.41 ±0.02
Table 8: RioT can combine spatial and frequency feedback. MAE results when applying feedback in time and frequency with RioT. Combining both feedback domains is superior to feedback on only one of the domains. Reference values represent the ideal scenario when the model is not affected by any confounder (mean and std over 5 runs).
Energy (MAE \downarrow) Train    Test
TiDEUnconfounded 0.40 ±0.01 0.38 ±0.02
TiDEfreq,sp 0.30 ±0.01 0.70 ±0.02
TiDEfreq,sp + RioTsp 0.34 ±0.01 0.64 ±0.01
TiDEfreq,sp + RioTfreq 0.36 ±0.01 0.60 ±0.01
TiDEfreq,sp + RioTfreq,sp 0.39 ±0.01 \bullet 0.55 ±0.01

This section provides further insights into our experiments, covering both forecasting and classification tasks. Specifically, it showcases performance through various metrics such as MAE, MSE, and accuracy, and explores different feedback configurations.

Feedback Generalization. Tab. 5 and Tab. 6 detail the provided feedback percentages for forecasting and classification across all datasets, respectively. These tables report the performance of the TiDE and FCN models, highlighting how different levels of feedback impact model outcomes on the various datasets. Tab. 5 reports MAE and MSE for forecasting, while Tab. 6 reports accuracy for classification.

Removing Confounders for Time Series Forecasting. Tab. 7 reports the MAE results of our forecasting experiment across different models, datasets, and configurations. It shows how well each model performs on the confounded training set and on the unconfounded test set, before and after applying RioT, with the Unconfounded configuration representing the ideal scenario unaffected by confounders.

Removing Multiple Confounders at Once. Tab. 8 reports the MAE values and illustrates the effectiveness of combining spatial and frequency feedback via RioT for the TiDE model. The results demonstrate significant improvements in forecasting accuracy when integrating both feedback domains compared to using them separately.

Appendix B Confounded Dataset from a High-speed Progressive Tool

The presence of confounders is a common challenge in practical settings, affecting models in diverse ways. As the research community strives to identify and mitigate these issues, it becomes imperative to test methodologies on datasets that mirror the complexities encountered in real applications. For time series, however, no datasets with explicitly labeled confounders exist, which makes it difficult to assess model performance against the complex nature of practical confounding factors.

To bridge this gap, we introduce P2S, a dataset that represents a significant step forward by featuring explicitly identified confounders. This dataset originates from experimental work on a production line for deep-drawn sheet metal parts, employing a progressive die on a high-speed press. The sections below detail the experimental approach and the process of data collection.

B.1 Real-World setting

The production of parts in multiple progressive forming stages using stamping, deep drawing, and bending operations with progressive dies is one of the most economically significant manufacturing processes in the sheet metal working industry, as it enables the production of complex parts on short process routes with consistent quality. For the tests, a four-stage progressive die was used on a Bruderer BSTA 810-145 high-speed press at varied stroke speeds. The strip material is fed into the progressive die by a BSV300 servo feed unit, synchronized with the press cycle, during the stroke movement while the tools are not engaged. The part to be produced remains permanently connected to the sheet strip throughout the entire production run. The stroke height of the tool is 63 mm and the material feed per stroke is 60 mm. The experimental setup with the progressive die mounted on the high-speed press is shown in Fig. 8.

Refer to caption
Figure 8: Experimental setup with high-speed press and tool as well as trigger for stroke-by-stroke recording of the data

The four stages comprise a pilot punching stage, a round stamping stage, a deep-drawing stage, and a cut-out stage. In the first stage, a 3 mm hole is punched into the metal strip. This hole is used by guide pins in the subsequent stages to position the metal strip. During the stroke movement, the pilot pin always engages the pilot hole first, ensuring the positioning accuracy of the components. In the subsequent stage, a circular blank is cut into the sheet metal strip so that the part can be deep-drawn directly from the strip. The cut produces a round geometry that, during the subsequent deep-drawing step, forms small arms holding the component on the metal strip. In the final stage, the component is separated from the sheet metal strip and the process cycle is completed. All stages operate simultaneously, so each stage performs its operation on every stroke and a part is produced with every stroke. Fig. 9 shows the unfolded upper tool and the forming stages associated with the respective steps on the continuous sheet metal strip.

Refer to caption
Figure 9: Upper tool unfolded and the forming stages associated with the respective steps on the passing sheet metal strip as well as the positions of the piezoelectric force sensors.

B.2 Data collection

An indirect piezoelectric force sensor (Kistler 9240A) was integrated into the upper mould mounting plate of the deep-drawing stage for data acquisition. The sensor is located directly above the punch and records not only the indirect process force but also the blank holder forces, which are applied by spring assemblies between the upper mounting plate and the blank holder plate. The data is recorded at a sampling frequency of 25 kHz. The material used is DC04 with a width of 50 mm and a thickness of 0.5 mm. The voltage signals from the sensors are digitised using a CompactRIO (NI cRIO 9047) with an integrated NI 9215 measuring module (analogue voltage input ±10 V). Data recording starts via an inductive proximity switch when the press ram passes below a defined stroke height during the stroke movement and stops again as the ram passes the proximity switch during the return stroke. Due to the varying process speed caused by the set stroke rates, the recorded time series have differing numbers of data points. Furthermore, there are slight variations in the length of the time series within one experiment. For this reason, all time series are interpolated to a length of 4096 data points, and we discard any time series that deviate considerably from the mean length of a run (i.e., threshold of 3). A total of 12 series of experiments, shown in Tab. 9, were carried out with production rates from 80 to 225 spm. To simulate a defect, the spring hardness of the blank holder was manipulated in the test series marked as defect. These manipulated experiments result in the component bursting and tearing during production. In a real production environment, this would lead directly to the parts being rejected.
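A simple way to realize this resampling is linear interpolation onto a fixed grid, as in the sketch below; the function name is ours and the paper does not specify the interpolation scheme.

```python
import numpy as np

def resample_stroke(force_signal, target_len=4096):
    """Resample one recorded stroke to a fixed length of 4096 points."""
    old_grid = np.linspace(0.0, 1.0, len(force_signal))
    new_grid = np.linspace(0.0, 1.0, target_len)
    return np.interp(new_grid, old_grid, force_signal)
```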

B.3 Data characteristics

Fig. 10 shows the progression of the time series recorded with the indirect force sensor. The force curve characterises the process cycle during one press stroke. The measurement is started by the trigger, which is activated by the ram moving downwards. The downholder plates touch down at point A and press the strip material onto the die. Between point A and point B, the downholder springs are compressed, causing the applied force to increase linearly. The deep-drawing process begins at point B. At point C, the press reaches its bottom dead centre and the reverse stroke begins, so that the punch moves out of the material again. At point D, the deep-drawing punch is released from the material, and the downholder springs relax linearly up to point E. At point E, the downholder plate lifts off again, the component is fed to the next process step, and the process is complete.

Refer to caption
Figure 10: Force curve for one stroke. A) set down downholder plate B) start of deep drawing C) bottom dead centre D) deep drawing process completed E) downholder plates lift off F) measurement stops.
Table 9: Overview of conducted runs on the high-speed press with normal and defect states at different stroke rates.
Experiment # State Stroke Rate (spm) Samples
1 Normal 80 193
2 Normal 100 193
3 Normal 150 189
4 Normal 175 198
5 Normal 200 194
6 Normal 225 188
7 Defect 80 149
8 Defect 100 193
9 Defect 150 188
10 Defect 175 196
11 Defect 200 193
12 Defect 225 190
Total 2264

B.4 Confounders

The presented dataset P2S is confounded by the speed at which the progressive tool is operated. The higher the stroke rate of the press, the more friction occurs and the stronger the impact of the downholder plate. The differences can be observed in Fig. 3. Since we are aware of these physics-based confounders, we can annotate them in our dataset. As the process speed increases, the friction between the die and the material in the deep-drawing stage changes, since the frictional force depends on the frictional speed. This is particularly evident here, as deep-drawing oils, which could optimize the friction condition, were not used in the experiments. The areas affected by punch friction are time steps 1380 to 1600 (start of deep drawing) and 2080 to 2500 (end of deep drawing). In addition, the impulse of the downholder plate acting on the die increases due to the increased process dynamics. With higher process speed, the process force therefore also increases in the ranges from 800 to 950 (downholder plate sets down) and 3250 to 3550 (downholder plate lifts off).
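For illustration, a feedback mask over these physics-based confounder regions could be constructed as follows; this is a sketch based on the ranges named above, and the actual annotation procedure may differ.

```python
import numpy as np

def p2s_feedback_mask(length=4096):
    """Binary mask marking the confounded regions of a P2S sample:
    downholder set-down, punch friction at start/end of deep drawing,
    and downholder lift-off."""
    mask = np.zeros(length)
    for lo, hi in [(800, 950), (1380, 1600), (2080, 2500), (3250, 3550)]:
        mask[lo:hi] = 1.0
    return mask
```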

In the experimental setting of Tab. 4, the training data is selected such that the stroke rate correlates with the class label, i.e., there are only normal experiments with low stroke rates and defect experiments with high stroke rates. Experiments 1, 2, 3, 10, 11, and 12 form the training data and the remaining experiments the test data. To obtain an unconfounded setting, in which the model is not affected by any confounder, we use normal and defect experiments with the same speeds in both training and test data. This places experiments 1, 3, 5, 7, 9, and 11 in the training set and the remaining experiments in the test set.