SC-based data validation can verify whether datasets correspond to the patterns expected of valid data. We achieve this by analyzing how well a prediction model, constrained by expert knowledge, can fit previously unseen data, i.e., how well the data fit our expectations.
In this section, we give an introduction to SCR, clarify how constraints are defined, and indicate which types of constraints are available. This primer is followed by a description of our approach, detailing the necessary steps to implement the approach and an illustrative example to visualize the general idea.
3.1 A Primer on Shape-constrained Regression
SCs enable the integration of expert knowledge during the training of ML models and therefore lead to more trustworthy models [
17]. SCs have long been employed by ML researchers [
62] and have been added to many regression models including cubic smoothing splines [
48], piece-wise linear models (lattice regression) [
24], support vector machines [
35], kernel regression [
5], neural networks [
36], gradient boosted trees [
14], polynomial regression [
18,
26], and symbolic regression [
11,
33]. SCR allows enforcing shape properties of the model function, e.g., that the function output must be positive, monotone, or convex/concave over one or multiple inputs. Table
1 lists some examples of available constraints. These properties can be expressed through bound constraints on the output of the function or its partial derivatives for a given input domain. We know, for example, that the identity function
\(id(x) = x\) is positive only for a positive input domain
\((\forall_{x > 0}~id(x) > 0)\) and that it is monotonically increasing for the full input domain
\((\forall _{x} \tfrac{\partial }{\partial x} id(x) \ge 0)\). Appendix
A provides further details on the definition of constraints.
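As a small, purely illustrative complement to this example, the two constraints on the identity function can be checked numerically by sampling the function (or a finite-difference approximation of its derivative) over the respective domain. The helper below is hypothetical and not part of SCPR, which enforces such constraints during fitting rather than checking them afterwards.
```python
import numpy as np

def check_shape_constraint(f, domain, derivative_order=0, lower=0.0, eps=1e-6):
    """Finite-sample check of a lower-bound constraint on f or its first derivative.

    Illustrative only: SCPR enforces such constraints during model fitting
    rather than verifying them on samples afterwards.
    """
    x = np.linspace(domain[0], domain[1], 1000)
    if derivative_order == 0:
        values = f(x)
    else:
        # first derivative approximated by central finite differences
        values = (f(x + eps) - f(x - eps)) / (2 * eps)
    return bool(np.all(values >= lower))

def identity(x):
    return x

# Positivity holds only on a positive input domain ...
print(check_shape_constraint(identity, domain=(1e-3, 10.0)))                        # True
# ... while monotonicity (first derivative >= 0) holds on the full domain.
print(check_shape_constraint(identity, domain=(-10.0, 10.0), derivative_order=1))   # True
```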
SCs are helpful when expert knowledge is available that can be used for model fitting, in addition to observations of the input variables and the target variable. Fitting a model with SCs can be especially useful when training data is limited or to enforce extrapolation behavior.
We investigated different SCR algorithms in [
6] and selected
shape-constrained polynomial regression (SCPR) for our SC-based data validation approach, because it (1) is deterministic and therefore ensures repeatability and (2) produces reliable results within a short runtime. Moreover, SCPR supports constraints for any order of derivative and for subspaces of the input space, whereas other algorithms such as gradient boosted trees [
14] only support monotonicity constraints over the full input space [
6].
Appendix
B describes SCPR in detail, including a description of the hyper-parameters
\(\alpha , \lambda\), and
\(\text{total degree}~d\). The valid ranges of the hyper-parameters are listed in Table
2.
3.2 The Data Validation Process
The application of SC-based data validation is divided into two phases. During the initialization phase, the prerequisites for SC-based data validation are determined. The second phase consists of the validation of new data and employs the initially gathered information to assess data quality. Phase (1) establishes how well constrained models are able to describe valid data, whereas phase (2) investigates the fit of constrained models on new data, where a bad fit is indicative of erroneous data.
Phase (1): Initialization.
The initialization consists of three steps (1.1, 1.2, and 1.3) and is executed only
once initially. The main goal of phase (1) is to define and describe the underlying system for SC-based data validation. Here, we define the parameters of SC-based data validation, as listed in Table
2, and establish the
expected patterns that are used to assess new data in phase (2). The underlying system represents the source of data to be assessed. Each produced dataset can represent an unseen concept for which no baseline of manually validated data exists. However, all data produced by the same system share the expected patterns defined in phase (1). These patterns are represented by the combination of (1.1) the constraints describing experts’ knowledge of valid behavior, (1.2) the complexity of the underlying system (number of inputs and their interactions) represented by suitable hyper-parameters for the SCR algorithm, and (1.3) the classification threshold
t to distinguish between
valid and
invalid data.
Step (1.1): Determine Constraints.
Shape constraints can be defined in cooperation with experts who are familiar with the causal or statistical dependencies between observables of a system. For example, they might know that certain relationships are monotonic or convex/concave. Another example would be knowledge that a variable approaches a certain equilibrium value for large inputs. This can be formulated using a combination of a monotonicity and a concavity constraint on the function shape. More examples of shape constraints are provided in the description of the real-world application scenario in Section
5.1.2.
Each constraint is specified for an input domain that can be a subspace of the full input domain (e.g., only for input \(x_i \in [\text{lower}, \text{upper}]\)). The domains for multiple constraints may overlap. Therefore, shape constraints are a highly expressive language for describing expected patterns for a system.
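Such a specification could, for illustration, be captured in a small data structure that records the constrained quantity (function output or a partial derivative), the admissible bounds, and the input subdomain. The representation below is a hypothetical sketch and not the interface of an existing SCR implementation.
```python
from dataclasses import dataclass, field
from typing import Dict, Tuple
import math

@dataclass
class ShapeConstraint:
    # () for the function output itself, ("x1",) for d/dx1, ("x1", "x1") for d^2/dx1^2, ...
    derivative: Tuple[str, ...]
    lower: float = -math.inf
    upper: float = math.inf
    # Box subdomain per input; inputs not listed are unrestricted (full domain).
    domain: Dict[str, Tuple[float, float]] = field(default_factory=dict)

# Example expert knowledge: the output is positive, monotonically increasing in x1,
# and concave in x1 for large x1 (approaching an equilibrium value).
constraints = [
    ShapeConstraint(derivative=(), lower=0.0),
    ShapeConstraint(derivative=("x1",), lower=0.0),
    ShapeConstraint(derivative=("x1", "x1"), upper=0.0, domain={"x1": (10.0, 100.0)}),
]
```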
Step (1.2): Determine Algorithm Parameters.
The purpose of this step is to determine suitable hyper-parameters for the SCR algorithm and to adapt the approach to the specific application (cf. Figure
1). In contrast to the example discussed in Section
3.3, the complexity of the observed system is not known in real-world applications. Thus, a wide range of hyper-parameters needs to be investigated to find a model that is expressive enough to capture the interactions of the observed system.
This step of SC-based data validation represents a standard ML hyper-parameter search, as prediction models trained by the SCR algorithm need to capture the interactions represented in valid data with a low training error.
We execute a grid search for the best hyper-parameters using cross-validation. In the case of SCPR (see Appendix
B), a polynomial of higher degree
d is able to capture more complex interactions but might overfit, and thus the hyper-parameters
\(\alpha\) and
\(\lambda\) are used for regularization to counter overfitting.
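A minimal sketch of such a grid search with k-fold cross-validation is shown below. The routine fit_scpr is a hypothetical stand-in for the SCR training algorithm, constraints denotes the constraint set from step (1.1), X_valid and y_valid denote manually validated data, and the parameter grids are only illustrative; in practice they should follow the valid ranges listed in Table 2.
```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def cv_rmse(X, y, degree, alpha, lam, n_splits=5):
    """Mean validation RMSE of constrained models over k folds."""
    rmses = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        # fit_scpr is a hypothetical stand-in for the SCR training routine
        model = fit_scpr(X[train_idx], y[train_idx], constraints,
                         degree=degree, alpha=alpha, lam=lam)
        pred = model.predict(X[val_idx])
        rmses.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))
    return float(np.mean(rmses))

# Illustrative grids; the actual ranges should follow Table 2.
param_grid = itertools.product([2, 3, 4, 5],        # total degree d
                               [0.1, 0.5, 1.0],     # alpha
                               [1e-4, 1e-2, 1e0])   # lambda
best_degree, best_alpha, best_lam = min(
    param_grid, key=lambda p: cv_rmse(X_valid, y_valid, *p))
```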
Step (1.3): Determine an Error Threshold Value.
Once suitable hyper-parameters for the SCR algorithm are established, a value for the error threshold t can be determined. The maximum allowable training error (threshold t) is used to distinguish between valid and invalid datasets. There are two variants for determining a suitable value for t: (1.3.a) via an estimate of the unexplained noise or (1.3.b) empirically, using statistics of valid and invalid data, if available.
Variant (1.3.a): The parameters
\(\alpha , \lambda , d\) affect the fit of the models and thus the training error (e.g., RMSE, see Equation (
1)). Hence, these parameters must be determined prior to
t. The value of
t can be understood as the level of accepted unexplained variance (noise) (cf. Figure
1). Section
3.3 provides a detailed example for this variant.
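For variant (1.3.a), assuming an estimate of the measurement noise is available (e.g., from repeated measurements or from the cross-validation in step (1.2)), the threshold can be set slightly above the expected unexplained error; the numbers and the safety margin below are purely illustrative.
```python
# Purely illustrative numbers: estimated noise level and the cross-validated
# RMSE obtained with the hyper-parameters selected in step (1.2).
noise_sigma = 5.0       # estimated standard deviation of the unexplained noise
cv_rmse_valid = 4.9     # cross-validated RMSE on manually validated data

safety_margin = 1.05    # illustrative tolerance, not prescribed by the approach
t = safety_margin * max(noise_sigma, cv_rmse_valid)   # accepted error threshold
```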
Variant (1.3.b): If labeled valid and invalid datasets are available (cf. Figure
2), the effect of changing
t and the corresponding classification results can be visualized as a
receiver operating characteristic (ROC) curve, as shown in Figure
11. The ROC curve enables an informed selection of
t. The threshold for our real-world example was selected this way, whereas our benchmark suite visualizes the
area under the ROC curve (AUC) to compare classification accuracy. Alternatively,
t may be tuned automatically (cf. [
66]) to optimally discriminate valid and invalid datasets. This is especially relevant in cases where there is an imbalance in the occurrence of classes or when we can assign a different cost per label [
64].
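Variant (1.3.b) could be implemented along the following lines with scikit-learn; the per-dataset training errors and labels below are hypothetical, and maximizing Youden's J statistic is only one of several reasonable selection criteria.
```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical constrained training errors, one per labeled dataset,
# with label 1 for invalid and 0 for valid data.
train_errors = np.array([4.8, 5.1, 4.7, 9.3, 12.0, 5.0, 8.1])
is_invalid = np.array([0, 0, 0, 1, 1, 0, 1])

fpr, tpr, thresholds = roc_curve(is_invalid, train_errors)
print("AUC:", roc_auc_score(is_invalid, train_errors))

# Select the threshold t that maximizes Youden's J = TPR - FPR
# (only one of several possible selection criteria).
t = thresholds[np.argmax(tpr - fpr)]
print("selected threshold t:", t)
```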
Phase (2): Data Validation.
This phase implements the validation of new data of unknown quality. Here, we analyze how well a constrained model can be fit to the new dataset and therefore how well the dataset adheres to our expected patterns for valid data, as illustrated in Figure
3.
The constraints and the best hyper-parameters determined in phase (1) are applied to train a constrained prediction model on each new dataset. The training error is calculated, and the data are marked as invalid if the error threshold t is exceeded. The assumption is that valid data can be described well by constrained models, whereas fitting invalid data violates the constraints and leads to a training error that exceeds t. Critically, this step is executed without the need to establish a baseline of valid data for each new concept represented by new data.
In practice, datasets investigated in phase (2) can even differ widely from prior data (used in phase (1) or (2)) in the value ranges and distributions of observables or their size. SC-based data validation can still reliably detect deviations from expected patterns.
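Phase (2) thus reduces to a short check per new dataset, sketched below; fit_scpr again stands in for the SCR training routine, while constraints, best_params, and t are the results of phase (1).
```python
import numpy as np

def validate_dataset(X_new, y_new, constraints, best_params, t):
    """Classify a new dataset via the training error of a constrained fit.

    fit_scpr is a hypothetical stand-in for the SCR training routine;
    constraints, best_params, and t are the outputs of phase (1).
    """
    model = fit_scpr(X_new, y_new, constraints, **best_params)
    rmse = np.sqrt(np.mean((model.predict(X_new) - y_new) ** 2))
    return "valid" if rmse <= t else "invalid"
```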
3.3 Demonstrating SC-based Data Validation
The idea of our approach is illustrated in Figure
4. Here, we define three third-degree polynomials
\(f(x)\),
\(g(x)\), and
\(h(x)\) as our base functions from which we sample data. In this example, we expect valid data to be monotonically increasing. Both base functions
f and
g fulfill this constraint, whereas the polynomial
h is designed to deviate from this pattern, as it contains a decreasing section for
\(x \in [-2,1.5]\).
From these base functions, we sample observations and add Gaussian noise with a standard deviation of \(\sigma = 5\), thus creating the training datasets F, G, and H. We assume these are the data that require quality assessment and that the functions used for data generation (f, g, and h) are unknown to the data validation process.
This setup emulates real-world applications where the details of the underlying system are unknown. Yet, we can define shared expected patterns in the form of a monotonicity constraint that describes valid data.
The random noise imitates measurement uncertainty, which is usually present in real-world scenarios. Because the data are noisy, simple detection rules, such as asserting strictly increasing values or only positive numeric differences, cannot be used. While a sliding average with a tuned window size can detect the decreasing section in this simple problem, it cannot be applied as a detector for multivariate problems or more complex constraints.
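The setup can be reproduced along the following lines. The exact polynomial coefficients are not stated in the text, so those below are merely illustrative; h is constructed such that its derivative is negative exactly on \([-2, 1.5]\).
```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative third-degree base polynomials (coefficients chosen for this sketch).
def f(x):
    return 0.5 * x**3 + 2.0 * x           # monotonically increasing everywhere

def g(x):
    return 0.2 * x**3 + x**2 + 3.0 * x    # monotonically increasing everywhere

def h(x):
    return x**3 + 0.75 * x**2 - 9.0 * x   # h'(x) = 3(x + 2)(x - 1.5) < 0 on (-2, 1.5)

x = np.linspace(-5, 5, 100)
F = f(x) + rng.normal(scale=5.0, size=x.size)   # Gaussian noise with sigma = 5
G = g(x) + rng.normal(scale=5.0, size=x.size)
H = h(x) + rng.normal(scale=5.0, size=x.size)
```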
For the quality assessment, we apply the SCR algorithm to train two models for each dataset
F,
G, and
H: once without constraints, resulting in the models
\(f_1\),
\(g_1\), and
\(h_1\), and a second time resulting in
\(f_2\),
\(g_2\), and
\(h_2\), which are constrained to be monotonically increasing
\((\tfrac{\partial }{\partial x}f(x) \ge 0)\). All six model predictions are plotted in Figure
4.
We assume suitable algorithm parameters (degree d, \(\lambda\), \(\alpha\)) for SCPR as shown in the figure. In the context of real-world applications, more effort is required, as the hyper-parameters must be determined by a grid search with cross-validation.
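To give a concrete impression of this comparison, the sketch below fits a degree-3 polynomial to the dataset H from the previous sketch, once without and once with a monotonicity constraint, using the cvxpy modeling library. This is a simplified stand-in for SCPR: the derivative constraint is only enforced on a finite grid of points rather than over the entire input domain, and a plain ridge penalty replaces the \(\alpha\)/\(\lambda\) regularization of Appendix B.
```python
import cvxpy as cp
import numpy as np

def fit_poly(x, y, degree=3, lam=1e-3, monotone=False):
    """Least-squares polynomial fit, optionally constrained to be non-decreasing.

    Simplified stand-in for SCPR: the derivative constraint is enforced on a
    dense grid instead of over the whole input domain, and a plain ridge
    penalty replaces the alpha/lambda regularization of Appendix B.
    """
    V = np.vander(x, degree + 1, increasing=True)        # columns 1, x, ..., x^d
    c = cp.Variable(degree + 1)
    objective = cp.Minimize(cp.sum_squares(V @ c - y) + lam * cp.sum_squares(c))
    constraints = []
    if monotone:
        grid = np.linspace(x.min(), x.max(), 200)
        D = np.zeros((grid.size, degree + 1))            # derivative design matrix
        for k in range(1, degree + 1):
            D[:, k] = k * grid ** (k - 1)                # d/dx of x^k
        constraints.append(D @ c >= 0)                   # f'(x) >= 0 on the grid
    cp.Problem(objective, constraints).solve()
    return c.value

def rmse(c, x, y, degree=3):
    return np.sqrt(np.mean((np.vander(x, degree + 1, increasing=True) @ c - y) ** 2))

# h1 (unconstrained) can follow the decreasing section of H, whereas h2
# (constrained to be monotone) cannot, which raises its training RMSE above t.
h1 = fit_poly(x, H, monotone=False)
h2 = fit_poly(x, H, monotone=True)
print(rmse(h1, x, H), rmse(h2, x, H))
```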
On the datasets
F and
G, generated from valid functions, the unconstrained and the constrained models achieve an equally low
root mean squared error (RMSE, Equation (
1)), as the data matches the expected patterns. The RMSE of about 4.9 (see Figure
4) matches the standard deviation of the unexplained Gaussian noise (
\(\sigma = 5\)). However, the model
\(h_2\) is restricted by the constraint defined for valid behavior and has a higher training error, which indicates that the dataset
H deviates from the expected patterns of valid data. A threshold value of
\(t=5\) would be able to distinguish
valid from
invalid data, and would also establish our accepted noise level.
This example highlights the capabilities of SC-based data validation. We require only the monotonicity constraint
\((\tfrac{\partial }{\partial x}f(x) \ge 0)\) and the algorithm parameters of degree
d,
\(\lambda\),
\(\alpha\) to assess the quality of the datasets
F,
G, and
H. It may also be used to validate that data from any other third-degree polynomial (i.e., of the same complexity as the established underlying system, see step (1.2) of Section
3.2) is monotonically increasing. Thereby, our approach contrasts with existing ML-based data quality assessment approaches, as discussed in Section
2.