SC-based data validation can verify whether datasets correspond to the patterns expected of valid data. We achieve this by analyzing how well a prediction model, constrained by expert knowledge, can fit previously unseen data, i.e., how well the data fit our expectations.
In this section, we give an introduction to SCR, clarify how constraints are defined, and indicate which types of constraints are available. This primer is followed by a description of our approach, detailing the necessary steps to implement the approach and an illustrative example to visualize the general idea.
3.1 A Primer on Shape-constrained Regression
SCs enable the integration of expert knowledge during the training of ML models and therefore lead to more trustworthy models [
17]. SCs have long been employed by ML researchers [
62] and have been added to many regression models including cubic smoothing splines [
48], piece-wise linear models (lattice regression) [
24], support vector machines [
35], kernel regression [
5], neural networks [
36], gradient boosted trees [
14], polynomial regression [
18,
26], and symbolic regression [
11,
33]. SCR allows enforcing shape properties of the model function, e.g., that the function output must be positive, monotone, or convex/concave over one or multiple inputs. Table
1 lists some examples of available constraints. These properties can be expressed through bound constraints on the output of the function or its partial derivatives for a given input domain. We know, for example, that the identity function
\(id(x) = x\) is positive only for a positive input domain
\((\forall_{x > 0}~id(x) > 0)\) and that it is monotonically increasing for the full input domain
\((\forall _{x} \tfrac{\partial }{\partial x} id(x) \ge 0)\). Appendix
A provides further details on the definition of constraints.
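As a small, purely illustrative complement to this example, the two constraints on the identity function can be checked numerically by sampling the function (or a finite-difference approximation of its derivative) over the respective domain. The helper below is hypothetical and not part of SCPR, which enforces such constraints during fitting rather than checking them afterwards.
```python
import numpy as np

def check_shape_constraint(f, domain, derivative_order=0, lower=0.0, eps=1e-6):
    """Finite-sample check of a lower-bound constraint on f or its first derivative.

    Illustrative only: SCPR enforces such constraints during model fitting
    rather than verifying them on samples afterwards.
    """
    x = np.linspace(domain[0], domain[1], 1000)
    if derivative_order == 0:
        values = f(x)
    else:
        # first derivative approximated by central finite differences
        values = (f(x + eps) - f(x - eps)) / (2 * eps)
    return bool(np.all(values >= lower))

def identity(x):
    return x

# Positivity holds only on a positive input domain ...
print(check_shape_constraint(identity, domain=(1e-3, 10.0)))                        # True
# ... while monotonicity (first derivative >= 0) holds on the full domain.
print(check_shape_constraint(identity, domain=(-10.0, 10.0), derivative_order=1))   # True
```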
SCs are helpful when expert knowledge is available that can be used for model fitting, in addition to observations of the input variables and the target variable. Fitting a model with SCs can be especially useful when training data is limited or to enforce extrapolation behavior.
We investigated different SCR algorithms in [
6] and selected
shape-constrained polynomial regression (SCPR) for our SC-based data validation approach, because it (1) is deterministic and therefore ensures repeatability and (2) produces reliable results within a short runtime. Moreover, SCPR supports constraints for any order of derivative and for subspaces of the input space, whereas other algorithms such as gradient boosted trees [
14] only support monotonicity constraints over the full input space [
6].
Appendix
B describes SCPR in detail, including a description of the hyper-parameters
\(\alpha , \lambda\), and
\(\text{total degree}~d\). The valid ranges of the hyper-parameters are listed in Table
2.
3.2 The Data Validation Process
The application of SC-based data validation is divided into two phases. During the initialization phase, the prerequisites for SC-based data validation are determined. The second phase consists of the validation of new data and employs the initially gathered information to assess data quality. Phase (1) establishes how well constrained models are able to describe valid data, whereas phase (2) investigates the fit of constrained models on new data, where a bad fit is indicative of erroneous data.
Phase (1): Initialization.
The initialization consists of three steps (1.1, 1.2, and 1.3) and is executed only
once initially. The main goal of phase (1) is to define and describe the underlying system for SC-based data validation. Here, we define the parameters of SC-based data validation, as listed in Table
2, and establish the
expected patterns that are used to assess new data in phase (2). The underlying system represents the source of data to be assessed. Each produced dataset can represent an unseen concept for which no baseline of manually validated data exists. However, all data produced by the same system share the expected patterns defined in phase (1). These patterns are represented by the combination of (1.1) the constraints describing experts’ knowledge of valid behavior, (1.2) the complexity of the underlying system (number of inputs and their interactions) represented by suitable hyper-parameters for the SCR algorithm, and (1.3) the classification threshold
t to distinguish between
valid and
invalid data.
Step (1.1): Determine Constraints.
Shape constraints can be defined in cooperation with experts who are familiar with the causal or statistical dependencies between observables of a system. For example, they might know that certain relationships are monotonic or convex/concave. Another example would be knowledge that a variable approaches a certain equilibrium value for large inputs. This can be formulated using a combination of a monotonicity and a concavity constraint on the function shape. More examples of shape constraints are provided in the description of the real-world application scenario in Section
5.1.2.
Each constraint is specified for an input domain that can be a subspace of the full input domain (e.g., only for input \(x_i \in [\text{lower}, \text{upper}]\)). The domains for multiple constraints may overlap. Therefore, shape constraints are a highly expressive language for describing expected patterns for a system.
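Such a specification could, for illustration, be captured in a small data structure that records the constrained quantity (function output or a partial derivative), the admissible bounds, and the input subdomain. The representation below is a hypothetical sketch and not the interface of an existing SCR implementation.
```python
from dataclasses import dataclass, field
from typing import Dict, Tuple
import math

@dataclass
class ShapeConstraint:
    # () for the function output itself, ("x1",) for d/dx1, ("x1", "x1") for d^2/dx1^2, ...
    derivative: Tuple[str, ...]
    lower: float = -math.inf
    upper: float = math.inf
    # Box subdomain per input; inputs not listed are unrestricted (full domain).
    domain: Dict[str, Tuple[float, float]] = field(default_factory=dict)

# Example expert knowledge: the output is positive, monotonically increasing in x1,
# and concave in x1 for large x1 (approaching an equilibrium value).
constraints = [
    ShapeConstraint(derivative=(), lower=0.0),
    ShapeConstraint(derivative=("x1",), lower=0.0),
    ShapeConstraint(derivative=("x1", "x1"), upper=0.0, domain={"x1": (10.0, 100.0)}),
]
```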
Step (1.2): Determine Algorithm Parameters.
The purpose of this step is to determine suitable hyper-parameters for the SCR algorithm and to adapt the approach to the specific application (cf. Figure
1). In contrast to the example discussed in Section
3.3, the complexity of the observed system is not known in real-world applications. Thus, a wide range of hyper-parameters needs to be investigated to find a model that is expressive enough to capture the interactions of the observed system.
This step of SC-based data validation represents a standard ML hyper-parameter search, as prediction models trained by the SCR algorithm need to capture the interactions represented in valid data with a low training error.
We execute a grid search for the best hyper-parameters using cross-validation. In the case of SCPR (see Appendix
B), a polynomial of higher degree
d is able to capture more complex interactions but might overfit, and thus the hyper-parameters
\(\alpha\) and
\(\lambda\) are used for regularization to counter overfitting.
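A minimal sketch of such a grid search with k-fold cross-validation is shown below. The routine fit_scpr is a hypothetical stand-in for the SCR training algorithm, constraints denotes the constraint set from step (1.1), X_valid and y_valid denote manually validated data, and the parameter grids are only illustrative; in practice they should follow the valid ranges listed in Table 2.
```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def cv_rmse(X, y, degree, alpha, lam, n_splits=5):
    """Mean validation RMSE of constrained models over k folds."""
    rmses = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        # fit_scpr is a hypothetical stand-in for the SCR training routine
        model = fit_scpr(X[train_idx], y[train_idx], constraints,
                         degree=degree, alpha=alpha, lam=lam)
        pred = model.predict(X[val_idx])
        rmses.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))
    return float(np.mean(rmses))

# Illustrative grids; the actual ranges should follow Table 2.
param_grid = itertools.product([2, 3, 4, 5],        # total degree d
                               [0.1, 0.5, 1.0],     # alpha
                               [1e-4, 1e-2, 1e0])   # lambda
best_degree, best_alpha, best_lam = min(
    param_grid, key=lambda p: cv_rmse(X_valid, y_valid, *p))
```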
Step (1.3): Determine an Error Threshold Value.
Once suitable hyper-parameters for the SCR algorithm are established, a value for the error threshold t can be determined. The maximum allowable training error (threshold t) is used to distinguish between valid and invalid datasets. There are two variants for determining a suitable value for t: (1.3.a) via an estimate of the unexplained noise or (1.3.b) empirically, using statistics of valid and invalid data, if available.
Variant (1.3.a): The parameters
\(\alpha , \lambda , d\) affect the fit of the models and thus the training error (e.g., RMSE, see Equation (
1)). Hence, these parameters must be determined prior to
t. The value of
t can be understood as the level of accepted unexplained variance (noise) (cf. Figure
1). Section
3.3 provides a detailed example for this variant.
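For variant (1.3.a), assuming an estimate of the measurement noise is available (e.g., from repeated measurements or from the cross-validation in step (1.2)), the threshold can be set slightly above the expected unexplained error; the numbers and the safety margin below are purely illustrative.
```python
# Purely illustrative numbers: estimated noise level and the cross-validated
# RMSE obtained with the hyper-parameters selected in step (1.2).
noise_sigma = 5.0       # estimated standard deviation of the unexplained noise
cv_rmse_valid = 4.9     # cross-validated RMSE on manually validated data

safety_margin = 1.05    # illustrative tolerance, not prescribed by the approach
t = safety_margin * max(noise_sigma, cv_rmse_valid)   # accepted error threshold
```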
Variant (1.3.b): If labeled valid and invalid datasets are available (cf. Figure
2), the effect of changing
t and the corresponding classification results can be visualized as a
receiver operating characteristic (ROC) curve, as shown in Figure
11. The ROC curve enables an informed selection of
t. The threshold for our real-world example was selected this way, whereas our benchmark suite visualizes the
area under the ROC curve (AUC) to compare classification accuracy. Alternatively,
t may be tuned automatically (cf. [
66]) to optimally discriminate valid and invalid datasets. This is especially relevant in cases where there is an imbalance in the occurrence of classes or when we can assign a different cost per label [
64].
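Variant (1.3.b) could be implemented along the following lines with scikit-learn; the per-dataset training errors and labels below are hypothetical, and maximizing Youden's J statistic is only one of several reasonable selection criteria.
```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical constrained training errors, one per labeled dataset,
# with label 1 for invalid and 0 for valid data.
train_errors = np.array([4.8, 5.1, 4.7, 9.3, 12.0, 5.0, 8.1])
is_invalid = np.array([0, 0, 0, 1, 1, 0, 1])

fpr, tpr, thresholds = roc_curve(is_invalid, train_errors)
print("AUC:", roc_auc_score(is_invalid, train_errors))

# Select the threshold t that maximizes Youden's J = TPR - FPR
# (only one of several possible selection criteria).
t = thresholds[np.argmax(tpr - fpr)]
print("selected threshold t:", t)
```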
Phase (2): Data Validation.
This phase implements the validation of new data of unknown quality. Here, we analyze how well a constrained model can be fit to the new dataset and therefore how well the dataset adheres to our expected patterns for valid data, as illustrated in Figure
3.
The constraints and the best hyper-parameters determined in phase (1) are applied to train a constrained prediction model on each new dataset. The training error is calculated, and the data are marked as invalid if the error threshold t is exceeded. The assumption is that valid data can be described well by constrained models, whereas fitting invalid data violates the constraints and leads to a training error that exceeds t. Critically, this step is executed without the need to establish a baseline of valid data for each new concept represented by new data.
In practice, datasets investigated in phase (2) can even differ widely from prior data (used in phase (1) or (2)) in the value ranges and distributions of observables or their size. SC-based data validation can still reliably detect deviations from expected patterns.
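Phase (2) thus reduces to a short check per new dataset, sketched below; fit_scpr again stands in for the SCR training routine, while constraints, best_params, and t are the results of phase (1).
```python
import numpy as np

def validate_dataset(X_new, y_new, constraints, best_params, t):
    """Classify a new dataset via the training error of a constrained fit.

    fit_scpr is a hypothetical stand-in for the SCR training routine;
    constraints, best_params, and t are the outputs of phase (1).
    """
    model = fit_scpr(X_new, y_new, constraints, **best_params)
    rmse = np.sqrt(np.mean((model.predict(X_new) - y_new) ** 2))
    return "valid" if rmse <= t else "invalid"
```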
3.3 Demonstrating SC-based Data Validation
The idea of our approach is illustrated in Figure
4. Here, we define three third-degree polynomials
\(f(x)\),
\(g(x)\), and
\(h(x)\) as our base functions from which we sample data. In this example, we expect valid data to be monotonically increasing. Both base functions
f and
g fulfill this constraint, whereas the polynomial
h is designed to deviate from this pattern, as it contains a decreasing section for
\(x \in [-2,1.5]\).
From these base functions, we sample observations and add Gaussian noise with a standard deviation of \(\sigma = 5\), thus creating the training datasets F, G, and H. We assume these are the data that require quality assessment and that the functions used for data generation (f, g, and h) are unknown to the data validation process.
This setup emulates real-world applications where the details of the underlying system are unknown. Yet, we can define shared expected patterns in the form of a monotonicity constraint that describes valid data.
The random noise imitates measurement uncertainty, which is usually present in real-world scenarios. Because the data are noisy, simple detection rules, such as asserting strictly increasing values or only positive numeric differences, cannot be used. While a sliding average with a tuned window size can detect the decreasing section in this simple problem, it cannot be applied as a detector for multivariate problems or more complex constraints.
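The setup can be reproduced along the following lines. The exact polynomial coefficients are not stated in the text, so those below are merely illustrative; h is constructed such that its derivative is negative exactly on \([-2, 1.5]\).
```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative third-degree base polynomials (coefficients chosen for this sketch).
def f(x):
    return 0.5 * x**3 + 2.0 * x           # monotonically increasing everywhere

def g(x):
    return 0.2 * x**3 + x**2 + 3.0 * x    # monotonically increasing everywhere

def h(x):
    return x**3 + 0.75 * x**2 - 9.0 * x   # h'(x) = 3(x + 2)(x - 1.5) < 0 on (-2, 1.5)

x = np.linspace(-5, 5, 100)
F = f(x) + rng.normal(scale=5.0, size=x.size)   # Gaussian noise with sigma = 5
G = g(x) + rng.normal(scale=5.0, size=x.size)
H = h(x) + rng.normal(scale=5.0, size=x.size)
```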
For the quality assessment, we apply the SCR algorithm to train two models for each dataset
F,
G, and
H: once without constraints, resulting in the models
\(f_1\),
\(g_1\), and
\(h_1\), and a second time resulting in
\(f_2\),
\(g_2\), and
\(h_2\), which are constrained to be monotonically increasing
\((\tfrac{\partial }{\partial x}f(x) \ge 0)\). All six model predictions are plotted in Figure
4.
We assume suitable algorithm parameters (degree d, \(\lambda\), \(\alpha\)) for SCPR as shown in the figure. In the context of real-world applications, more effort is required, as the hyper-parameters must be determined by a grid search with cross-validation.
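To give a concrete impression of this comparison, the sketch below fits a degree-3 polynomial to the dataset H from the previous sketch, once without and once with a monotonicity constraint, using the cvxpy modeling library. This is a simplified stand-in for SCPR: the derivative constraint is only enforced on a finite grid of points rather than over the entire input domain, and a plain ridge penalty replaces the \(\alpha\)/\(\lambda\) regularization of Appendix B.
```python
import cvxpy as cp
import numpy as np

def fit_poly(x, y, degree=3, lam=1e-3, monotone=False):
    """Least-squares polynomial fit, optionally constrained to be non-decreasing.

    Simplified stand-in for SCPR: the derivative constraint is enforced on a
    dense grid instead of over the whole input domain, and a plain ridge
    penalty replaces the alpha/lambda regularization of Appendix B.
    """
    V = np.vander(x, degree + 1, increasing=True)        # columns 1, x, ..., x^d
    c = cp.Variable(degree + 1)
    objective = cp.Minimize(cp.sum_squares(V @ c - y) + lam * cp.sum_squares(c))
    constraints = []
    if monotone:
        grid = np.linspace(x.min(), x.max(), 200)
        D = np.zeros((grid.size, degree + 1))            # derivative design matrix
        for k in range(1, degree + 1):
            D[:, k] = k * grid ** (k - 1)                # d/dx of x^k
        constraints.append(D @ c >= 0)                   # f'(x) >= 0 on the grid
    cp.Problem(objective, constraints).solve()
    return c.value

def rmse(c, x, y, degree=3):
    return np.sqrt(np.mean((np.vander(x, degree + 1, increasing=True) @ c - y) ** 2))

# h1 (unconstrained) can follow the decreasing section of H, whereas h2
# (constrained to be monotone) cannot, which raises its training RMSE above t.
h1 = fit_poly(x, H, monotone=False)
h2 = fit_poly(x, H, monotone=True)
print(rmse(h1, x, H), rmse(h2, x, H))
```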
On the datasets
F and
G, generated from valid functions, the unconstrained and the constrained models achieve an equally low
root mean squared error (RMSE, Equation (
1)), as the data matches the expected patterns. The RMSE of about 4.9 (see Figure
4) matches the standard deviation of the unexplained Gaussian noise (
\(\sigma = 5\)). However, the model
\(h_2\) is restricted by the constraint defined for valid behavior and has a higher training error, which indicates that the dataset
H deviates from the expected patterns of valid data. A threshold value of
\(t=5\) would be able to distinguish
valid from
invalid data, and would also establish our accepted noise level.
This example highlights the capabilities of SC-based data validation. We require only the monotonicity constraint
\((\tfrac{\partial }{\partial x}f(x) \ge 0)\) and the algorithm parameters of degree
d,
\(\lambda\),
\(\alpha\) to assess the quality of the datasets
F,
G, and
H. It may also be used to validate that data from any other third-degree polynomial (i.e., of the same complexity as the established underlying system, see step (1.2) of Section
3.2) is monotonically increasing. Thereby, our approach contrasts with existing ML-based data quality assessment approaches, as discussed in Section
2.