1. Introduction
Agricultural data acquired through surveys are inherently complex, often characterized by skewed distributions, multivariate relationships, and spatio-temporal dynamics. This complexity, combined with imperfect data collection processes, can lead to anomalies at the entry (cell) level. The presence of anomalous values in a dataset can affect any analyses based on these data, e.g., by introducing bias into modeled estimates or forecasts. The United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) has implemented a semi-automated revision process to improve the accuracy of the data acquired through surveys. The current semi-automated system detects cellwise anomalies (Figure 1, graph on the right) that are subsequently corrected manually. Even though the system uses automated decision rules based on if-else conditions designed by experts in agriculture, it still requires human intervention at several levels. Recently, NASS has been investigating alternative approaches to modernize its anomaly detection system and is extending its traditional editing techniques [1] with novel, more accurate, flexible, and objective methodologies.
Historically, researchers have addressed outliers from two perspectives. From the first perspective, an outlier is typically defined as an observation generated by a mechanism different from the one that produced the majority of observations in a dataset. While this assumption was previously used in [2,3,4] to specify a model for generating outliers, it was Freeman who, in 1980, presented the definition of an outlier as “an observation that has not been generated by the mechanism that generated the majority of observations in the dataset” [5]. Alternatively, from the second perspective, all data are considered observations generated by a single mechanism. A (statistical) model is assumed to fit the observed data, and the outliers arise from the model outputs. From this perspective, the “outlier” status of the data is typically determined through inferences on standardized residuals under the assumed model. After the records are investigated for influential points, they are classified either as outliers or as regular cases based on a score variable. However, algorithms for detecting cellwise outliers (i.e., anomalous cells within a record) have been overlooked in the literature from both perspectives until recent developments [6,7,8].
In addition, most outlier detection methods primarily focus on record-level anomalies and are typically developed under the frequentist framework. A variety of such algorithms generate scores for each individual record and use these scores to detect anomalies at the record level [9,10,11,12,13]. The Detect-Deviating-Cells (DDC) method [8] is known as the first method developed to detect cellwise outliers in multivariate datasets by accounting for the correlations among variables. This method assumes normality but does not consider stratification or historical information. Owing to advances in computational power and tools, more meaningful criteria for outlier detection in complex multivariate data situations have been developed. For example, it is more intuitive to assign a degree of belief that a cell in a multidimensional dataset is an outlier rather than simply classifying that cell as an outlier. This can be achieved through Bayesian methods where, similar to the frequentist framework, outliers are typically identified by analyzing random errors under a given probabilistic model. However, under the Bayesian paradigm, outliers are detected through the analysis of posterior distributions.
The interest in Bayesian procedures for outlier detection has increased in recent decades. A review of the probabilistic methods for record-level outlier detection in a Bayesian setting under a linear model is given in [14]. The authors classify these methods into two groups: (1) methods that investigate the predictive density of the response variable and (2) methods that investigate the posterior (predictive) probabilities of unobserved residuals. In [15], the outlier scores are transformed into probabilities, and two approaches are presented. The first considers a logistic sigmoid as a posterior distribution and estimates its parameters using the outlier score data. The second derives the posterior probabilities by assuming a mixture of exponential and Gaussian distributions for the score data. For completeness, we also mention the following Bayesian contributions on record-level outlier detection. Considering a univariate linear model in [16], the authors propose using the posterior distribution of the squared norm of the realized errors to identify anomalous records. Later, in [17], the approach is extended to a multivariate linear model using Bayes factors to detect outliers. In [18], general measures (i.e., Kullback–Leibler (KL) divergence, chi-square, and L1 divergence) are used to study, via simulations, how individual observations can affect the distribution of the response given the covariates. The authors found that the KL and chi-square measures were monotonic in the L1 divergence and recommended the L1 divergence as a diagnostic measure due to its ease of interpretation [18]. Geisser’s work on the predictive density of one observation given the rest of the data is worth mentioning here as well [19,20,21,22]. However, none of these previous works has addressed the Bayesian identification of anomalous data entries within a record (i.e., cellwise outliers).
Despite the extensive literature on outlier detection, publications on cellwise outlier detection are quite sparse [6,7,8]. While most algorithms are developed to detect anomalies at the record level, classical record-level assumptions are seldom satisfied when an anomaly in data collected through surveys occurs in one (or several) component(s) of a record (as illustrated in Figure 1). Also, most of the existing algorithms proposed for identifying cellwise outliers are designed without considering characteristics of the data, i.e., missing values, skewed marginal distributions, and spatio-temporal dynamics. Sartore et al. [23] introduced an outlier detection approach based on robust estimation methods and fuzzy logic. This approach is applicable to sparse or stratified datasets and can leverage additional information when historical data are available. Implementation of the resulting FuzzyHRT algorithm has automated the editing process for repeated surveys at NASS.
The FuzzyHRT algorithm uses the Bienaymé–Chebyshev inequality [24], which is suitable for nonparametric and empirical inferences. The only assumption made about the distribution of the data is that the first and second moments exist; both can be estimated empirically. Then, for a random variable $X$ with distribution function $F$ and a location parameter estimator $\hat{\mu}$, the Bienaymé–Chebyshev inequality is defined for any $\varepsilon > 0$ as
$$\Pr\left(|X - \hat{\mu}| \geq \varepsilon\right) \leq \frac{\mathbb{E}\left[(X - \hat{\mu})^2\right]}{\varepsilon^2}. \quad (1)$$
In the special case of $\hat{\mu} = \mathbb{E}[X]$ and $\varepsilon = k\sigma$, Equation (1) produces the classical Chebyshev inequality,
$$\Pr\left(|X - \mathbb{E}[X]| \geq k\sigma\right) \leq \frac{1}{k^2},$$
where $\sigma^2 = \operatorname{Var}(X)$. Fuzzy logic is successively applied to detect four different types of cellwise outliers resulting from format inconsistencies and historical, tail, and relational anomalies. Nonetheless, fuzzy logic is not suited for the probabilistic reasoning behind the identification of anomalous cells. Therefore, Bayesian methods are preferred as better-suited methods for quantifying the uncertainty associated with the identification of cellwise outliers. In this paper, two novel Bayesian methods for detecting outliers at the cell level are proposed. Both methods are based on the theory of empirical likelihoods [25], which provides a nonparametric framework for inference without making assumptions on the distribution of the data. First, a Bayesian bootstrap approach (hereafter referred to as the bootstrap approach) is proposed to mitigate the uncertainty associated with the output scores from the FuzzyHRT algorithm [23]. Second, empirical likelihood under a Bayesian framework [26] (hereafter referred to as the empirical likelihood approach) provides a probabilistic reasoning for identifying outliers at the cell level.
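To illustrate the nonparametric character of this bound, the following minimal Python sketch flags cells whose values deviate from the empirical mean by more than $k$ standard deviations; the function name and the choice $k = 3$ are ours and are not part of the FuzzyHRT implementation. With the (biased) sample standard deviation, the empirical version of the bound guarantees that at most a fraction $1/k^2$ of the cells can ever be flagged, whatever the underlying distribution.

```python
import numpy as np

def chebyshev_flags(x, k=3.0):
    """Flag values whose distance from the sample mean exceeds k sample
    standard deviations; by the empirical Chebyshev inequality, at most a
    fraction 1/k**2 of the data can be flagged, for any distribution."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()          # first and second empirical moments
    return np.abs(x - mu) >= k * sigma

rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=0.5, size=1000)   # skewed, like survey data
flags = chebyshev_flags(x, k=3.0)
assert flags.mean() <= 1.0 / 3.0**2                 # the Chebyshev bound holds
```

Because the bound is distribution-free, the same threshold rule applies to skewed survey variables without any normality assumption.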
The rest of the paper is structured as follows. In Section 2, we give a brief review of the FuzzyHRT algorithm and set up the problem. The bootstrap approach is presented in Section 3, and the empirical likelihood approach is introduced in Section 4. Results from a controlled simulation study using NASS survey data for livestock and crops to assess the performances of the two proposed approaches are presented in Section 5. Our concluding remarks are given in Section 6.
3. A Bayesian Bootstrap Approach
Let $\delta_i$ denote an indicator random variable for an outlier cell $i$, taking the value 1 when the cell is anomalous and 0 otherwise. In this section, we focus on inferences about the empirical distribution of the scores produced by FuzzyHRT and explore the uncertainty associated with their $(1-\alpha)$-th empirical percentile $q_{1-\alpha}$. Hereafter, in this section, the scores are thought of as a vector $s = (s_1, \ldots, s_N)^\top$ of length $N$. Then, the conditions in (3) for partitioning the data take the form of thresholding each score $s_i$ at the empirical percentile $q_{1-\alpha}$.
To obtain the empirical distribution of $s$, let $\theta = (\theta_1, \ldots, \theta_M)^\top$ denote a parameter vector of length $M$ whose components are the probabilities associated with the $M$ distinct values of the scores, where $\theta \in \Delta_{M-1}$ and $\Delta_{M-1}$ denotes the $(M-1)$-dimensional unit probability simplex [25]. These probabilities satisfy the two following constraints:
$$\theta_m \geq 0, \quad m = 1, \ldots, M, \qquad \text{and} \qquad \sum_{m=1}^{M} \theta_m = 1.$$
A noninformative Dirichlet prior is selected for $\theta$,
$$\pi(\theta) = c^{-1} \prod_{m=1}^{M} \theta_m^{-1},$$
where $c$ is the normalizing constant. Assuming that the $N$ scores produced from FuzzyHRT are associated with their respective probabilities $\theta_m$, for $m = 1, \ldots, M$, one can write the likelihood as
$$L(s \mid \theta) \propto \prod_{m=1}^{M} \theta_m^{n_m},$$
where $n_m$ denotes the number of scores equal to the $m$-th distinct value. Incorporating the information provided by the final scores $s$ defined in (2), we obtain the posterior distribution of $\theta$ given $s$,
$$\pi(\theta \mid s) \propto \prod_{m=1}^{M} \theta_m^{n_m - 1},$$
which is a $\mathrm{Dirichlet}(n_1, \ldots, n_M)$ distribution.
Remark 2. If $M = N$ (i.e., all the observed scores are distinct), then $n_m = 1$ for every $m$, and the posterior reduces to a flat $\mathrm{Dirichlet}(1, \ldots, 1)$ distribution.
The Bayesian bootstrap for the $(1-\alpha)$-th percentile is based on a resampling scheme performed $B$ times as follows. To obtain a sample for the random vector $\theta$ based on its empirical posterior distribution, first, a sample of size $M$ is drawn from the uniform distribution with support in $(0, 1)$, i.e., $u_m \sim \mathcal{U}(0, 1)$, for any $m = 1, \ldots, M$. Then, for all $m$, the normalized quantity
$$\hat{\theta}_m = \frac{\log u_m}{\sum_{k=1}^{M} \log u_k}$$
is computed, resulting in a realized sample $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_M)^\top$. This sample provides probabilities for the vector of final scores $s$. For practical reasons, without loss of generality, one can assume that the condition of Remark 2 is true, i.e., $M = N$ (and hence, $n_m = 1$ for all $m$), to construct a Monte-Carlo sample of size $N$.
After sorting the scores, the cumulative distribution of the scores is empirically evaluated at $s_i$, for all $i = 1, \ldots, N$, as the sum of the posterior probabilities in (7) that are associated with all scores $s_k$ such that $s_k \leq s_i$,
$$\hat{F}(s_i) = \sum_{k:\, s_k \leq s_i} \hat{\theta}_k.$$
Then, the value of the bootstrapped percentile $\hat{q}_{1-\alpha}^{(b)}$, for all $b = 1, \ldots, B$, is obtained by solving the following equation:
$$\hat{F}\left(\hat{q}_{1-\alpha}^{(b)}\right) = 1 - \alpha,$$
or, similarly, $\hat{q}_{1-\alpha}^{(b)} = \hat{F}^{-1}(1 - \alpha)$. This results in a sample of size $B$ constructed using the bootstrapped statistics in (8). The mean (and other summary statistics) of the bootstrapped percentiles were considered when determining the outliers based on the scores produced by the FuzzyHRT algorithm.
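Under the condition of Remark 2, the resampling scheme above reduces to drawing flat Dirichlet weights (here via normalized exponential spacings) and inverting the weighted empirical distribution function. The following Python sketch illustrates this; the function name and defaults are ours, and the FuzzyHRT scores are replaced by simulated ones.

```python
import numpy as np

def bayesian_bootstrap_percentile(scores, alpha=0.05, B=1000, seed=0):
    """Bayesian-bootstrap sample of the (1 - alpha) percentile of the
    scores, assuming all N scores are distinct (Remark 2), so that the
    posterior on the probability vector is Dirichlet(1, ..., 1)."""
    rng = np.random.default_rng(seed)
    s = np.sort(np.asarray(scores, dtype=float))
    N = len(s)
    out = np.empty(B)
    for b in range(B):
        g = -np.log(rng.uniform(size=N))    # Exp(1) spacings
        theta = g / g.sum()                 # one draw ~ Dirichlet(1, ..., 1)
        cdf = np.cumsum(theta)              # posterior-weighted empirical CDF
        out[b] = s[np.searchsorted(cdf, 1.0 - alpha)]  # invert the CDF
    return out

rng = np.random.default_rng(1)
scores = rng.uniform(size=500)              # stand-in for FuzzyHRT scores
q = bayesian_bootstrap_percentile(scores, alpha=0.05, B=200)
assert abs(q.mean() - np.quantile(scores, 0.95)) < 0.05
```

The spread of the `B` bootstrapped percentiles quantifies the uncertainty of the threshold itself, which is the quantity of interest in this section.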
4. A Bayesian Testing Approach Based on Empirical Likelihoods
In this approach, empirical likelihoods at the cell level are used to study the posterior distribution of the indicator random variable defined in (3) given the data. This indicator describes the outlier status of each cell within a record. Specifically, we construct the distribution of the standardized residuals provided by the FuzzyHRT algorithm for each anomaly type, and this distribution is then used to evaluate the posterior probability that a given cell is an outlier.
In this setting, different from the bootstrap approach, the level $\alpha$ of data contamination is unknown and treated as a random variable. It is reasonable to specify the conditional distribution of $\alpha$ given the outlier status of the cell. This specific choice of distribution for $\alpha$ makes it possible to simplify the successive calculations of the posterior in (12). A noninformative (Bernoulli) hyperprior is adopted for the outlier status random variable at each cell.
One can easily obtain the posterior distribution of the outlier status given the data, in which the contamination level $\alpha$ is integrated out. Incorporating the hyperprior (11) and Equation (10) in (12), one obtains the conditional distribution of the data given the outlier status, where the vector of standardized residuals is defined in (9). This distribution is of an unknown form. Therefore, it is evaluated using the empirical likelihood [25], defined as a constrained maximization of the product of probability weights assigned to the observed residuals, in which the terms of the estimating equations control its shape.
Because the empirical likelihood in (14) does not provide a closed-form analytical expression, its evaluation is performed by solving an optimization problem. However, the direct optimization of (14) is computationally challenging, and hence, the theory of duality [29,30] is used to simplify the optimization problem. Therefore, the quantification of the empirical likelihood is achieved by maximizing the dual objective function in (15) with respect to the probability weights, where the Lagrange multipliers enforce the normalization and estimating-equation constraints. The optimization is performed under boundedness assumptions on the residuals, expressed in terms of the 99.5% quantile of the standard normal cumulative distribution function. By setting the first derivative of (15) with respect to the weights equal to zero, one obtains a closed-form solution for the weights as a function of the Lagrange multipliers. The iterative algorithm used to identify the optimal value of the empirical likelihood for a given outlier status uses suitable initial values for the weights and the Lagrange multipliers; after the weights are updated, the Lagrange multipliers are updated using the Gauss–Newton method. This iterative algorithm stops when either convergence is reached or a maximum number of iterations has been performed. Finally, the likelihood is computed based on (14).
The integral in (13) is evaluated using classical quadrature methods. In this paper, Riemann sums are used to compute the posterior probabilities.
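The paper's estimating equations are specific to the FuzzyHRT residuals, but the dual optimization itself can be illustrated on the simplest case of a single mean constraint $\sum_i w_i (x_i - \mu) = 0$: the weights take the closed form $w_i = [N(1+\lambda(x_i-\mu))]^{-1}$, and the multiplier $\lambda$ is found by Newton iteration. The sketch below is ours (function name, tolerances, and the mean constraint are illustrative choices), not the NASS implementation.

```python
import numpy as np

def empirical_log_likelihood(x, mu, n_iter=50, tol=1e-10):
    """Profile empirical log-likelihood ratio for the mean constraint
    sum_i w_i * (x_i - mu) = 0, solved in the dual: the weights have the
    closed form w_i = 1 / (N * (1 + lam * (x_i - mu))), and lam solves
    the dual score equation by Newton iteration."""
    x = np.asarray(x, dtype=float)
    g = x - mu                                 # estimating function values
    N = len(x)
    lam = 0.0
    for _ in range(n_iter):
        denom = 1.0 + lam * g
        score = np.sum(g / denom)              # derivative of dual objective
        hess = -np.sum(g**2 / denom**2)        # always negative
        step = score / hess
        lam -= step
        if abs(step) < tol:
            break
    w = 1.0 / (N * (1.0 + lam * g))            # closed-form optimal weights
    return np.sum(np.log(N * w))               # 0 when mu is the sample mean

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
assert abs(empirical_log_likelihood(x, x.mean())) < 1e-8
assert empirical_log_likelihood(x, 2.5) < 0    # off-center mu lowers the EL
```

The log-likelihood ratio is maximal (zero) at the sample mean and decreases as the hypothesized value moves away, which is the behavior the cellwise test exploits when comparing outlier statuses.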
5. Simulation Study
As in [23], a controlled simulation study using four national surveys administered by NASS is conducted to assess the performances of the proposed outlier detection approaches. These national surveys cover a wide range of agricultural scenarios (see Table 1 for a short summary). The first two surveys were conducted on sheep-and-goat and cattle inventories, and the last two on row-crop yields and cranberry production. Usually, livestock surveys focus on herd composition, while crop surveys collect information on production and yields. The surveys under consideration were administered between 2021 and 2022, with sample sizes ranging between 218 and 21,154 and numbers of nonnegative continuous variables ranging between 3 and 49.
Because the four datasets shown in Table 1 have been manually edited, there are no ground-truth labels on the anomaly status of each cell. Thus, cellwise outliers have been synthetically introduced into the four datasets to test the robustness of the proposed algorithms across various scenarios. These scenarios have been developed to include different types of outliers to better reflect the anomalies observed in the raw data. This approach allowed us to flag and track anomalous values when evaluating the proposed detection algorithms across a wide variety of potential data conditions.
The generative algorithm used in [23] has been applied to each dataset in Table 1 by randomly replacing a few cells with anomalous values. Three specific modules synthesizing historical, tail, and “relational” anomalies, respectively, were created. Each module randomly selects 5% of the item responses; half of these are replaced by multiplying the current values by random shrinking factors, and the other half by random expansion factors. The random factors are generated from uniform distributions over the intervals shown in Table 2 (more specifically, in columns 2 and 3). Shrinking and expansion ranges are randomly selected with equal probability for each combination of anomaly type and dissimilarity level. Therefore, two distinct datasets have been created for each survey. The datasets marked as “high” contain anomalous cells that are more likely to be identified, while the datasets marked as “low” contain anomalous cells that are more difficult to detect. The “high” and “low” labels describe the level of dissimilarity between the artificial anomalies and the regular data.
In detail, historical anomalies are introduced by replacing a current value with its historical value multiplied by a random factor. Tail and “relational” anomalies are produced by multiplying the original values by their respective random factors. “Relational” anomalies are introduced only for the variables with stronger linear relationships (i.e., those having a correlation coefficient larger than 0.8). Hence, every record is equally likely to receive a historical, tail, or “relational” anomaly in one or several of its item responses.
For the relational outliers, the dataset is organized such that each row represents a record and each column represents a field (item response). The resulting data matrix may contain many missing values and hence be sparse. Sparse matrices arise naturally in data collected through surveys, especially in the Agricultural Yield Survey, which includes multiple crops: respondents in different states and strata grow only certain types of crops, not all of them. As a consequence, about 92% of the values in the Agricultural Yield data matrix are missing.
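A minimal sketch of this contamination scheme in Python is given below; the interval end points, the function name, and the seed are illustrative and do not reproduce the exact ranges of Table 2.

```python
import numpy as np

def inject_anomalies(data, frac=0.05, shrink=(0.1, 0.5), expand=(2.0, 10.0), seed=0):
    """Replace a random fraction of the non-missing cells with anomalous
    values: half shrunk and half expanded by uniform random factors.
    The interval end points are illustrative, not those of the paper."""
    rng = np.random.default_rng(seed)
    out = data.copy()
    rows, cols = np.nonzero(~np.isnan(out))     # contaminate only observed cells
    n_bad = max(2, int(frac * len(rows)))
    pick = rng.choice(len(rows), size=n_bad, replace=False)
    half = n_bad // 2
    factors = np.concatenate([
        rng.uniform(*shrink, size=half),        # shrinking factors
        rng.uniform(*expand, size=n_bad - half) # expansion factors
    ])
    out[rows[pick], cols[pick]] *= factors
    labels = np.zeros_like(out, dtype=bool)
    labels[rows[pick], cols[pick]] = True       # ground truth for evaluation
    return out, labels

rng = np.random.default_rng(2)
clean = rng.lognormal(size=(100, 5))            # stand-in for survey responses
dirty, labels = inject_anomalies(clean)
assert labels.sum() == int(0.05 * clean.size)
assert np.all(dirty[~labels] == clean[~labels])  # untouched cells unchanged
```

Keeping the Boolean label matrix alongside the contaminated data is what allows the confusion-matrix evaluation described in the next subsection.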
5.1. Evaluations
Several accuracy measures have been computed according to the standards found in the literature [31]. The confusion matrix is constructed by comparing the classification results to the ground-truth labels as in a binary classification problem (where the two classes are outliers and nonoutliers). This $2 \times 2$ matrix contains the counts of True Positives (TP) and True Negatives (TN) on the main diagonal, and False Positives (FP) and False Negatives (FN) on the off diagonal. TP refers to the number of true outliers correctly classified as such; TN refers to the number of truly regular data (nonoutliers) correctly classified as such; FP refers to the number of regular data incorrectly classified as outliers; and FN refers to the number of outliers incorrectly classified as nonoutliers. The overall accuracy is computed as the number of correct identifications divided by the total number of units:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
The recall statistics are based on ratios computed by conditioning on the ground-truth labels (for true outliers and truly regular data):
$$\text{Recall}_{\text{outlier}} = \frac{TP}{TP + FN}, \qquad \text{Recall}_{\text{regular}} = \frac{TN}{TN + FP}.$$
The precision statistics are based on ratios computed by conditioning on the labels provided by the two approaches proposed in Section 3 and Section 4. They show the fraction of outlier identifications that are truly outliers,
$$\text{Precision}_{\text{outlier}} = \frac{TP}{TP + FP},$$
or the fraction of regular identifications that are truly regular:
$$\text{Precision}_{\text{regular}} = \frac{TN}{TN + FN}.$$
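These statistics reduce to simple ratios of the confusion-matrix counts, as the following sketch shows (the function name and dictionary keys are ours):

```python
import numpy as np

def outlier_metrics(pred, truth):
    """Confusion-matrix statistics for binary outlier labels (True = outlier)."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.sum(pred & truth)        # true outliers flagged as outliers
    tn = np.sum(~pred & ~truth)      # regular cells left alone
    fp = np.sum(pred & ~truth)       # regular cells wrongly flagged
    fn = np.sum(~pred & truth)       # outliers that were missed
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall_outlier": tp / (tp + fn),       # conditioned on true outliers
        "recall_regular": tn / (tn + fp),       # conditioned on true regular cells
        "precision_outlier": tp / (tp + fp),    # conditioned on flagged cells
        "precision_regular": tn / (tn + fn),    # conditioned on unflagged cells
    }

m = outlier_metrics(pred=[1, 1, 0, 0, 0], truth=[1, 0, 0, 0, 1])
assert m["accuracy"] == 0.6
assert m["recall_outlier"] == 0.5
assert m["precision_outlier"] == 0.5
```

Because outliers are rare, the precision and recall for the outlier class are far more informative than the overall accuracy, which is dominated by the regular cells.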
The proposed outlier detection methodologies have been evaluated for accuracy at the item response level. Table 3 shows the results based on the Bayesian bootstrap approach with a fixed contamination threshold. The overall accuracy ranges between 87% and 94%, the precision for detected outliers ranges between 21% and 70%, and the recall of outliers ranges between 21% and 63%. Table 4 shows the results based on the Bayesian empirical likelihood approach without setting the threshold. At the item response level, the overall accuracy ranges between 83% and 94%, the precision for detected outliers ranges between 10% and 71%, and the recall of outliers ranges between 3% and 73%. Furthermore, the precision for regular cells (i.e., for data entries that are not cellwise outliers) is larger than 87% for both approaches, and the recall for regular cells is larger than 93%. The change in contamination level (from high to low) substantially affects the precision and recall of outliers, with drops of 31–90%; however, the precision and recall for regular cells are quite stable, with differences of 1–6%. The overall accuracy shows a similar behavior, with differences of 2–8%. In general, both proposed methods perform better on datasets with higher contamination levels.
5.2. Comparisons
In this section, the performances of four approaches are compared on the two datasets discussed above for the Agricultural Yield and Cattle Inventory surveys. The first approach, described in Section 2.1, is the FuzzyHRT algorithm: given a user-provided degree of contamination, one can classify the observed data values as outliers or not by using the corresponding empirical percentile of the final scores produced through the FuzzyHRT algorithm. The DDC method [8] is used as a second application with the same contamination threshold. DDC was the first method proposed in the literature to detect cellwise outliers in multivariate datasets by accounting for the correlations among variables. The third method applied is the Bayesian bootstrap (with the same contamination threshold), and the fourth is the Bayesian approach based on empirical likelihoods.
Figure 2 shows the overall accuracies of the four methods for each available state. All datasets are split by state because the DDC drops all variables with more than 50% missing values by default and processes only the few that remain, whereas the other three methods are better suited for sparse matrices and use all available data entries. Therefore, the accuracies of the four methods are compared at the state level. When applied to the Agricultural Yield dataset, the DDC algorithm does not provide overall accuracies for six states due to the high level of sparseness; when applied to the Cattle Inventory dataset, the DDC did not produce results for one state. In contrast, the other three methods identified anomalies in all states. Therefore, the results for the states where the DDC cannot detect outliers are excluded from the comparisons and are not shown in Figure 2. The upper left panel (a) is based on the “low” Cattle Inventory dataset, and the upper right panel (b) is based on the “high” one. The lower left panel (c) is based on the “low” Agricultural Yield dataset, and the lower right panel (d) is based on the “high” one.
As shown by the graphs, the Bayesian bootstrap approach and the Bayesian empirical likelihood approach correctly detected more outliers on the “high” Cattle Inventory dataset and provided generally higher accuracies across all states than the DDC and FuzzyHRT methods. The Bayesian empirical likelihood approach achieved the highest accuracy among the four methods in 45 out of 50 states. As shown in the previous section, similar results are obtained when comparing the bootstrap approach with the DDC and FuzzyHRT methods. The accuracy results are mixed across states for both the Cattle and Agricultural Yield datasets with a “low” level of contamination. Specifically, the Bayesian empirical likelihood approach performed better in 28 out of 50 states on the “low” Cattle dataset, while FuzzyHRT performed better in 15 out of 50 states. Similarly, the Bayesian empirical likelihood approach performed better in 18 out of 50 states on the “low” Agricultural Yield dataset, while FuzzyHRT performed better in 15 out of 50 states. These results are reasonable because the percentage of historical outliers in the Cattle Inventory datasets is larger than that in the Agricultural Yield datasets.
6. Conclusions
In many applications, it is more helpful to check for anomalies at the data-entry level rather than at the record level. The FuzzyHRT algorithm uses the Bienaymé–Chebyshev inequality and fuzzy logic, along with a user-provided level of contamination, to detect four different types of outliers resulting from format inconsistencies and historical, tail, and relational anomalies within a record. The user-provided level of contamination contributes to the uncertainty associated with the outlier detection. In addition, fuzzy logic is not suited for the probabilistic reasoning behind the identification of anomalous cells.
The novelty of this work lies in mitigating the uncertainty associated with the scores produced by the FuzzyHRT algorithm [23]. Two methods are developed under a Bayesian framework. The proposed bootstrap approach explores the uncertainty associated with the output scores, while the empirical likelihood approach provides a probabilistic reasoning behind the detection of anomalous cells. The new algorithm based on empirical likelihoods at the cell level is developed as a better alternative to FuzzyHRT.
Furthermore, the new and improved algorithms can be applied to datasets that potentially suffer from the presence of cellwise anomalies, skewed distributions (with positive support), missing values, and multivariate relationships. The new algorithms effectively cope with sparse and missing data by accounting for zero inflation without removing entire records and/or variables with missing values.
The performance of the proposed algorithms has been illustrated using NASS livestock and crop survey data with randomly generated anomalies. Our simulation study considered four different datasets, where the proposed algorithms detected the cellwise outliers with high accuracy and robustness. Moreover, comparisons on real data with previously reported values illustrate that the proposed approaches have generally higher overall accuracy than the DDC method. When previously reported data are not available, the proposed algorithms are comparable (or even equivalent) to the DDC method (as is the case for the FuzzyHRT algorithm). However, as an advantage, our algorithms are designed to identify cellwise outliers without dropping records or variables with many missing values (as the DDC algorithm does). Lastly, as future work, the algorithms could be updated to produce a candidate (prediction) value for each detected cell anomaly by leveraging administrative, structured, or unstructured data available for the whole or a subset of the surveyed records.