EcoVal: An Efficient Data Valuation Framework for Machine Learning

Ayush K Tarun*
RespAI Lab, India
ayushtarun210@gmail.com

Vikram S Chundawat*
RespAI Lab, India
vikram2000b@gmail.com

Murari Mandal†
RespAI Lab, KIIT Bhubaneswar, India
murari.mandalfcs@kiit.ac.in

Hong Ming Tan
NUS Business School
National University of Singapore
thm@nus.edu.sg

Bowei Chen
Adam Smith Business School
University of Glasgow
bowei.chen@glasgow.ac.uk

Mohan Kankanhalli
School of Computing
National University of Singapore
mohan@comp.nus.edu.sg

Abstract

Quantifying the value of data within a machine learning workflow can play a pivotal role in making more strategic decisions in machine learning initiatives. The existing Shapley value based frameworks for data valuation in machine learning are computationally expensive, as they require a considerable amount of repeated model training to obtain the Shapley value. In this paper, we introduce an efficient data valuation framework, EcoVal, to estimate the value of data for machine learning models in a fast and practical manner. Instead of working directly with individual data samples, we determine the value of a cluster of similar data points. This value is then propagated among all the members of the cluster. We show that the overall data value can be determined by estimating the intrinsic and extrinsic value of each data point. This is enabled by formulating the performance of a model as a production function, a concept which is popularly used to estimate the amount of output based on factors like labor and capital in a traditional free economic market. We provide a formal proof of our valuation technique and elucidate the principles and mechanisms that enable its accelerated performance. We demonstrate the real-world applicability of our method by showcasing its effectiveness for both in-distribution and out-of-sample data. This work addresses one of the core challenges of efficient data valuation at scale in machine learning models.

* These authors contributed equally to this work.
† Corresponding author.

1 Introduction

Data valuation is a pivotal concern in modern machine learning (ML) and data analytics, where the quality and worth of data have profound implications for decision-making, model performance, and data marketplaces. Quantifying the worth of data plays an important role in data pricing and regulation compliance [1, 2], removing low-value/noisy data from the training set [3, 4], and incentivizing data sharing by personal data monetization [5, 6, 7, 8]. In an ML framework, the quality of data determines the effectiveness of the final model. Therefore, identifying high and low value data through data valuation would yield significant benefits for a wide range of machine learning applications.

Background. In recent studies, a cooperative game theory concept, the Shapley value [9], has been frequently used for data valuation in supervised ML [6, 5, 7]. It offers the desirable property of equitable reward allocation. Data Shapley and its extensions [6, 7, 10, 11] have empirically shown the effectiveness of Shapley value based valuation in a fixed dataset as well as in a particular distribution of data, allowing for out-of-time data valuation. The value of a data point in ML relies on its individual contribution to the model’s performance and its relationship with other data points utilized during training. The presence of similar data in the training set can dilute the significance of individual points. To account for these interactions, data Shapley methods evaluate the contribution of each point by determining how its absence affects the overall performance of the model. This process usually involves repeatedly training the model with the selective exclusion of certain instances or subsets, thereby identifying those with the most substantial impact. The impact is measured by observing the change in the performance score of the ML model. However, this incurs a high computational cost, typically requiring model training runs in the order of $O(n^{2})$ in current methodologies, where $n$ is the total number of data points in the dataset.

Motivation. While offering insightful analyses of data point significance and alleviating the issue of poor discrimination of data quality in leave-one-out (LOO) error methods, existing data Shapley based frameworks [6, 7, 12, 11] suffer from a high computational cost. The need to train a model repeatedly, as these methods require, results in inefficiencies in terms of time and resource utilization. Furthermore, this inefficiency translates to an increased carbon footprint due to the energy requirements of training, thereby exacerbating climate change concerns [13]. The development of scalable algorithms capable of handling extensive datasets is essential for practical use of data valuation in real-world applications.

Our Contribution. We adopt a two-step approach where the valuation is first performed at the cluster level and the value is then divided among the cluster members. Similar data points are represented through a cluster, which significantly reduces the total number of data points to deal with during the training phase of the valuation process. At the cluster level, we can use a simple LOO error for valuation since there is minimal possibility (almost zero) of a similar datum being found in other clusters. The difficulty, however, is dividing the value of each cluster among the cluster members. To address this issue, a novel approach is proposed based on production functions in economics. Our two-step approach aims to significantly speed up the valuation process in comparison to the Truncated Monte Carlo (TMC) Shapley.

In this paper, we introduce a novel framework based on Leave Cluster Out (LCO) and production functions for data valuation in machine learning. The framework is computationally efficient, with theoretical and empirical verifications. The following are the key contributions of our work:

Novel Framework: The intuition behind our framework is that we find a group of similar items and estimate this cluster’s marginal contribution. As similar data items are bound to have similar values, we extend this principle to estimate cluster-level value through Leave Cluster Out (LCO).

We introduce a production function formulation representing the relation between the data and its utility in a model. We show that this formulation can be used to estimate the value of individual data based on the value of each cluster.

Computational Efficiency: We estimate the intrinsic and extrinsic value of each data point to determine the individual data value. By checking only the marginal distribution of the representative data point of a cluster, we substantially reduce the overhead of creating multiple subsets containing similar data points. Our approach is scalable to large datasets without being limited by the presence of similar data points in the dataset.

Theoretical Proof: We provide a theoretical proof of our data valuation method. We also show that the valuation obtained by our method has a negligible error margin when compared with the vanilla Shapley value approximation method.

Empirical Evaluation: We conduct experiments with machine learning models on MNIST, CIFAR10, and CIFAR100. We compare the value rankings of our method with the existing state-of-the-art data valuation approaches, namely Data Shapley [6], LOO error, and Distributional Shapley [7], and observe similar or better performance with a significant speed-up in the data valuation process.

2 Related Work

Literature Review of Shapley Value. The Shapley value as formalized in [14] establishes the axiomatic properties and demonstrates its unique ability to fairly allocate gains from cooperation among players. This seminal contribution laid the theoretical groundwork for subsequent developments in cooperative game theory [15, 16, 17, 18]. The Shapley value has been extensively used for applications in economics [19, 20, 21], management science [22, 23], and online advertising [24]. In machine learning, it has been utilized for addressing the challenges in pricing ML training data, feature selection, and ML explainability. [25, 26] proposed to employ Shapley value properties for feature selection. [27, 28] use the Shapley value in a market mechanism to price training data and match buyers to sellers in data marketplace design. [29] introduced the SHAP framework, leveraging Shapley values to provide interpretable explanations for machine learning models. Other works have also explored its utility in explaining black-box model predictions [30, 31, 32].

Data Valuation in ML. Recently, the subfield of data valuation in ML models has attracted significant attention and the existing works have shown promising outcomes. Data Shapley [6, 5] proposed to use the Shapley value from cooperative game theory for valuation of training data. KNN Shapley [33] improved the efficiency of Data Shapley by using a k-nearest neighbor model. Distributional Shapley [7] expanded the scope of valuation to the underlying data distribution instead of only considering the data points. Beta Shapley [11] relaxes the efficiency axiom in Data Shapley and reports the utility of data valuation in detecting mislabeled images in the training data. Data Banzhaf [34] proposes to estimate the Banzhaf value to improve results on noisy label detection. Several works have attempted to improve the efficiency of the Shapley value computation through approximation techniques [10]. Apart from this, other aspects of data value have been studied in [8, 35, 36, 37, 38]. However, approximation of the Shapley value still remains a computationally expensive process, making it difficult to adapt for large models and datasets. The main goal of this work is to develop an alternative efficient data valuation framework to overcome this problem.

Literature Review of Production Functions. [39] offers a detailed outline of the evolution and econometrics of the production function. Aggregate production functions are used in macroeconomics to represent the relationship between the total output of an economy (GDP) and the inputs used to produce that output. These inputs typically include capital ( $K$ ), labor ( $L$ ), and sometimes other factors like technology or natural resources [40, 41]. The simplest production function used in economics is the Cobb-Douglas production function introduced by [42]. [43] identifies all multi-factor production functions with given elasticity of output and given elasticity of production. Production functions have been used in various domains, including health, education, and energy, to name a few [44, 45, 46]. In our study, we adopt the concept of a production function and adapt it for data valuation. This approach draws inspiration from foundational works and recent advancements in the field. [47] develops a theoretical framework that applies the production function to the economics of data, particularly employing data as an input for training machine learning models. Moreover, [48] highlights the role of data as information aimed at reducing forecast errors, which hints at a production function characterized by bounded returns to data. In our paper, we align with these perspectives and further the discourse by specifically focusing on the application of the production function concept in the valuation of data.

3 Preliminaries

Let an ML model $M$ , intended for a task $T$ , be trained on a dataset $B$ of size $m$ . Let $U$ denote the performance metric and $U^{T}$ denote the performance obtained on task $T$ . The overall performance $U$ is achieved after training for a sufficient number of epochs $e$ . Here, a sufficient number of epochs means $|U_{e+i+1}-U_{e+i}|<\gamma$ for all $i\geq 0$ , where $\gamma$ is an arbitrarily small value. It should be noted that $\gamma$ arises due to the randomness within the learning algorithm and not from further training. The value of a data point is denoted by $\Phi$ .

Leave-One-Out (LOO) Error. The LOO error computes the value of a datum $z$ based on the increase in performance obtained by adding it to the training set:

\Phi_{LOO}(z;U,B)=U(B)-U(B\setminus\{z\}).

(1)

The LOO error struggles to differentiate data quality when similar data samples exist in the dataset. For example, if each sample has a duplicate copy in the dataset, LOO will return a value of $0$ for all of the samples. The Shapley value overcomes this limitation by checking the marginal contribution over many subsets of the dataset.
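The duplicate-copy failure mode can be illustrated with a minimal sketch. The utility below is a hypothetical toy stand-in for model performance (the fraction of four target labels covered by the training set); it is an assumption for illustration and not the metric used in the paper.

```python
def utility(points):
    """Toy utility (assumption): fraction of four target labels present in the set."""
    targets = {"a", "b", "c", "d"}
    return len(targets & set(points)) / len(targets)


def loo_values(dataset, utility):
    """Leave-one-out value per position: U(B) - U(B without that item), as in Eq. (1)."""
    full = utility(dataset)
    return [full - utility(dataset[:i] + dataset[i + 1:]) for i in range(len(dataset))]


unique = ["a", "b", "c", "d"]
duplicated = unique * 2  # every sample now has one exact copy

loo_unique = loo_values(unique, utility)
loo_duplicated = loo_values(duplicated, utility)
print(loo_unique)      # each unique sample is worth 0.25
print(loo_duplicated)  # every value collapses to 0.0
```

Removing one copy never changes the toy utility once a duplicate remains, so LOO cannot distinguish useful duplicated points from useless ones.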

Shapley Value. The Shapley value [6] measures the value of a data point $z$ as the weighted average of the performance increase when $z$ is added to different subsets of the dataset $B$ :

\Phi_{s}(z;U,B)=\frac{1}{m}\sum_{k=1}^{m}\frac{1}{\binom{m-1}{k-1}}\sum_{S\subseteq B\setminus\{z\}}\Delta(z;U,S),

(2)

where $|S|=k-1$ for $k\in\mathbb{N}$ and $\Delta(z;U,S)=U(S\cup\{z\})-U(S)$ . Thus, the data Shapley value is the weighted average of the marginal contribution $\Delta(z;U,S)$ . It satisfies the following Shapley value axioms:

• Dummy Player: If $U(S\cup\{z\})=U(S)+e$ for all $S\subseteq B\setminus\{z\}$ and some $e\in\mathbb{R}$ , then $\Phi(z;U,B)=e$ .

• Symmetry: If $U(S\cup\{z\})=U(S\cup\{z^{\prime}\})$ for all $S\subseteq B\setminus\{z,z^{\prime}\}$ , then $\Phi(z;U,B)=\Phi(z^{\prime};U,B)$ .

• Linearity: $\Phi(z;\alpha_{1}U_{1}+\alpha_{2}U_{2},B)=\alpha_{1}\Phi(z;U_{1},B)+\alpha_{2}\Phi(z;U_{2},B)$ for $\alpha_{1},\alpha_{2}\in\mathbb{R}$ .

• Efficiency: $\sum_{z\in B}\Phi(z;U,B)=U(B)$ .

Further details regarding the interpretation of the above axioms in the context of machine learning can be found in [6] and [5].
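As a concrete illustration of Eq. (2), the following minimal sketch computes exact Shapley values by enumerating all subsets of a four-point dataset. The toy `utility` (label coverage) is a hypothetical stand-in for model performance, and exhaustive enumeration is only feasible for very small $m$.

```python
from itertools import combinations
from math import comb


def utility(points):
    """Toy utility (assumption): fraction of four target labels present in the set."""
    targets = {"a", "b", "c", "d"}
    return len(targets & set(points)) / len(targets)


def shapley_values(dataset, utility):
    """Exact data Shapley per Eq. (2); indices identify points so duplicates stay distinct."""
    m = len(dataset)
    values = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for k in range(1, m + 1):
            weight = 1.0 / comb(m - 1, k - 1)
            for subset in combinations(others, k - 1):
                s = [dataset[j] for j in subset]
                # Marginal contribution Delta(z; U, S) = U(S + {z}) - U(S)
                total += weight * (utility(s + [dataset[i]]) - utility(s))
        values.append(total / m)
    return values


data = ["a", "a", "b", "c"]  # the first two points are duplicates
phi = shapley_values(data, utility)
print([round(v, 3) for v in phi])  # [0.125, 0.125, 0.25, 0.25]
```

The two duplicates receive equal, non-zero values (symmetry), and the values sum to the utility of the full dataset (efficiency, given that the toy utility of the empty set is zero).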

Production Function. In economics, a production function expresses the relationship between the specific quantities and combinations of different inputs a company uses and the amount of output it produces. Commonly used production functions include Linear, Leontief, Cobb–Douglas [49, 50], CES, and CRESH [51], each varying in their assumptions for the input and the output. The widespread usage of the Cobb-Douglas production function is attributed to its simplicity and adaptability. It assumes homogeneity of inputs and this principle is consistent with many machine learning setups.

Let $P(g)$ denote the production over a set of goods $g=(g_{1},g_{2},\ldots,g_{n})$ . The Cobb-Douglas production function is defined as

P(g)=A\prod_{i=1}^{n}g_{i}^{x_{i}},

(3)

where $x_{i}$ is the elasticity parameter for good $i$ , and $A$ is the total factor productivity or the quality factor. If the inputs are just labor $L$ and capital $K$ , the production function is then

P=AL^{x}K^{y}.

(4)

It should be noted that the Cobb-Douglas production function also supports diminishing returns in terms of both labor and capital. The Law of Diminishing Returns [52] states that as the amount of a single factor of production is incrementally increased, the marginal output of a production process decreases. This property is analogous to how more data points have diminishing effects on a machine learning model’s performance. We therefore adapt the formulation of production functions in our proposed method to efficiently distribute the value of a cluster among its data members.
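The diminishing-returns behaviour of Eq. (4) can be checked numerically with a short sketch. The exponents $x=y=0.5$, the productivity $A=1$, and the fixed capital of 16 are illustrative assumptions, not values taken from the paper.

```python
def cobb_douglas(labor, capital, a=1.0, x=0.5, y=0.5):
    """Cobb-Douglas output P = A * L^x * K^y, as in Eq. (4)."""
    return a * labor ** x * capital ** y


capital = 16.0  # held fixed while labor grows
marginal_gains = [
    cobb_douglas(labor + 1, capital) - cobb_douglas(labor, capital)
    for labor in range(1, 6)
]
print([round(g, 3) for g in marginal_gains])  # each extra unit of labor adds less output
```

With exponents below one, each additional unit of labor adds less output than the previous one, mirroring how additional data points tend to help a model less as the dataset grows.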

4 Proposed Method

A two-stage approach is proposed for efficient data valuation. First, data points are clustered together based on shared characteristics. Then, a leave cluster out (LCO) technique is applied to estimate the value of each cluster. This cluster value is then distributed among its members to obtain the preliminary individual data valuations. In the following, we delve into the building blocks of the proposed method and discuss its properties compared to the original Shapley methods.

4.1 Leave-Cluster-Out

Cluster analysis is first performed on the given data. The marginal contribution of a cluster $c$ can then be expressed as

V_{c}=U(B)-U(B\setminus c).

(5)

The simple LOO error may provide an underestimated view of the true impact of specific data points, especially when similar data points remain in the dataset even after removal. Data Shapley alleviates this issue but suffers from high computational cost. By organizing data points into clusters based on their similarity, we ensure that when an entire cluster is removed, there are no closely-related points to mask the effect of its absence in other clusters. Consequently, this leads to a more precise assessment of the cluster’s marginal contribution, effectively approximating its value. Furthermore, this clustering approach significantly reduces the number of model training iterations needed in comparison to Data Shapley since evaluations are conducted at the cluster level instead of for each individual data point. Once we have obtained cluster-level valuations, the subsequent step involves efficiently approximating the values of individual data points within each cluster.
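A minimal sketch of the leave-cluster-out computation in Eq. (5) follows. The toy `utility` and the hand-assigned `clusters` labels are assumptions for illustration; in practice the utility would come from retraining the model and the labels from a clustering algorithm.

```python
def utility(points):
    """Toy utility (assumption): fraction of four target labels present in the set."""
    targets = {"a", "b", "c", "d"}
    return len(targets & set(points)) / len(targets)


def leave_cluster_out(dataset, cluster_ids, utility):
    """Cluster value V_c = U(B) - U(B minus c), Eq. (5), for every cluster id."""
    full = utility(dataset)
    values = {}
    for c in sorted(set(cluster_ids)):
        remaining = [z for z, cid in zip(dataset, cluster_ids) if cid != c]
        values[c] = full - utility(remaining)
    return values


data = ["a", "a", "a", "b", "b", "c"]
clusters = [0, 0, 0, 1, 1, 2]  # hypothetical cluster assignment grouping identical points
cluster_values = leave_cluster_out(data, clusters, utility)
print(cluster_values)  # {0: 0.25, 1: 0.25, 2: 0.25}
```

Plain LOO would assign zero to every duplicated point here, whereas removing a whole cluster exposes its contribution with one utility evaluation per cluster plus one for the full dataset.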

4.2 Value Propagation within a Cluster

Production Function for ML. We adapt the Cobb-Douglas production function to approximate the data value for ML. In this context, we can draw an analogy: the labor $L$ corresponds to the available data points for the model; the learning capacity or the number of parameters in the model represents the capital $K$ ; and the final output is the obtained performance on the test set $U^{T}$ . As both data quantity and model complexity exhibit diminishing returns, the Cobb-Douglas production function can be leveraged to effectively model learning performance. Therefore, we propose to approximate the model’s performance after $e$ epochs as

U^{T}(S,N)=Af(S)h^{T}(N),

(6)

where $f(S)$ quantifies the informational utility of the dataset $S$ to the predictive efficacy of the model $U$ , $T$ denotes the task, and $h^{T}(N)$ represents the effect of the model capacity which is dependent on $N$ , the number of parameters of the model.

Then, for a new point $z$ , the performance change $\Delta U$ in the model incurred by the small increase ( $\Delta S=\{z\}$ ) in $S$ can be computed by

\begin{aligned}
\Delta U^{T}(S,N) &= Af(S+\Delta S)h^{T}(N)-Af(S)h^{T}(N) \\
&= A\left[f(S+\Delta S)-f(S)\right]h^{T}(N) \\
&= A\left[\frac{f(S+\Delta S)-f(S)}{o(z)}\right]h^{T}(N)o(z).
\end{aligned}

(7)

To better understand Eq. (7), let us consider $f$ as a smooth function of $x$ as specified in Eq. (6), i.e., $U^{T}(x,N)=Af(x)h^{T}(N)$ . Thus, a minor change in $x$ leads to a change in $U^{T}$ , which can be approximated by $Af^{\prime}(x)h^{T}(N)\Delta x$ . This allows us to interpret the expression enclosed in square brackets of Eq. (7) as effectively serving as the derivative of $f$ with respect to the set $S$ , especially when considering incremental changes to $S$ .

Also, in Eq. (7), $o(z)$ serves as an indicator of how a single data point enhances the model’s overall performance and is a proxy for the $\Delta x$ discussed above. Therefore, the difference $f(S+\Delta S)-f(S)$ captures the marginal impact on the model’s performance when dataset $S$ is augmented by a new data point. Analogous to the concept of derivatives in calculus, this difference, when normalized by the contribution $o(z)$ of the individual point, can be interpreted as the “rate-of-change” of $f$ upon the addition of a new data point. This rate is contingent on both the existing dataset $S$ and the new data point being added. That is,

U(S\cup\{z\})-U(S)=\alpha^{T}(z)\beta(z,S),

(8)

where

\begin{aligned}
\alpha^{T}(z) &= Ah^{T}(N)o(z), \\
\beta(z,S) &= \frac{f(S+\Delta S)-f(S)}{o(z)}.
\end{aligned}

Substituting the above into Eq. (2) then gives

\begin{aligned}
\Phi_{s}(z;U^{T},B) &= \frac{1}{m}\sum_{k=1}^{m}\frac{1}{\binom{m-1}{k-1}}\sum_{\begin{subarray}{c}S\subset B\setminus\{z\}\\ |S|=k-1\end{subarray}}\alpha^{T}(z)\beta(z,S) && (11) \\
&= \alpha^{T}(z)\beta^{*}(z,B), && (12)
\end{aligned}

where

\beta^{*}(z,B)=\frac{1}{m}\sum_{k=1}^{m}\frac{1}{\binom{m-1}{k-1}}\sum_{\begin{subarray}{c}S\subset B\setminus\{z\}\\ |S|=k-1\end{subarray}}\beta(z,S).

(15)

Proposition 1

(Production Function Based Valuation for ML). Let $\alpha^{T}(z)$ denote the intrinsic value of a datum $z$ , i.e., $\alpha^{T}(z)$ is only dependent on the characteristics of $z$ . The interaction of $z$ with the rest of the data points in $B$ is captured by $\beta^{*}(z,B)$ . From the equitable properties of data valuation in [6], we postulate that for every datum $z$ having an intrinsic value $\alpha^{T}(z)$ , $\beta^{*}(z,B)$ acts as a multiplier or extrinsic factor that decreases the value of $z$ if similar data points are present in the dataset. Similarly, it increases the data value if $z$ is a unique datum. Then the data valuation can be performed as below

\Phi(z;U^{T},B)=\alpha^{T}(z)\beta^{*}(z,B).

(16)

To simplify notation, we denote $\alpha^{T}(z)$ with $\alpha(z)$ , and $\Phi(z;U^{T},B)$ with $\Phi(z;U,B)$ for the rest of the discussion, since $T$ is invariant.

Fast Data Valuation. Based on the above setup, we propose an efficient data valuation method that also works as an efficient proxy to Distributional Shapley [7] to predict valuation for unseen data points in the distribution. The existing Data Shapley adheres to two fundamental axioms [12]: symmetry and efficiency. Symmetry states that points $z$ and $z^{\prime}$ that contribute equally to the model’s performance, i.e. $U(S\cup\{z\})=U(S\cup\{z^{\prime}\})$ for all $S\subseteq B\setminus\{z,z^{\prime}\}$ , should have the same value. Efficiency, on the other hand, ensures that the aggregate value of all data points aligns with the overall performance achieved after training on the entire dataset.

Proposition 2

(Fast Data Valuation of Cluster Data Members) The symmetry and efficiency properties, when applied to a specific cluster, imply that the data points within a cluster, characterized by similar features, will likely possess similar values, and that a cluster’s value can be accurately represented as the sum of its constituent data points’ valuations.

Let $V_{c}$ ( $=\Phi_{c}$ ) be the value of cluster $c$ . The initial value assigned to any data point $z_{i}$ within this cluster is:

V_{i}=V_{c}/n_{c},

(17)

where $n_{c}$ is the number of data points in cluster $c$ . Using this cluster-level assignment of initial data value, we estimate the actual data value based on Eq. (16) as

V_{i}^{*}=\alpha_{i}\beta_{i}^{*}.

(18)

Estimating $\alpha$ and $\beta^{*}$ . Assuming each cluster contains an equal number of data points, the distribution of similar and dissimilar samples encountered by each datum becomes roughly uniform. This results in a near-constant extrinsic factor, $\beta^{*}(z,B)$ , across all data points. Thus, the value of each of these data points is directly proportional to $\alpha(z_{i})$ . We use $Q_{i}$ to denote the value of an individual datum to differentiate it from the $V_{i}$ value that is initialized by the cluster value in Eq. (17).

Theorem 4.1

For data point $z_{i}$ , assuming there is no error in $\beta_{i}^{*}$ , its adjusted value $V_{i}^{\Delta\alpha_{i}}$ is

V_{i}^{\Delta\alpha_{i}}=(\alpha_{i}+\Delta\alpha_{i})\beta_{i}^{*}=\Gamma_{\alpha_{i}}\alpha_{i}\beta_{i}^{*},

(19)

where $\Gamma_{\alpha_{i}}$ is an adjustment factor for $\alpha_{i}$

\Gamma_{\alpha_{i}}=1+\frac{Q_{i}}{\sum_{z_{j}\in c}Q_{j}}V_{c}.

(20)

Corollary 4.1.1

When all data points in a cluster are exactly the same, the adjustment factor should be equal to $1$ so that for each point in $c$ , the value becomes $V_{i}$ . But the above formulation of $\Gamma_{\alpha_{i}}$ yields $1+V_{c}/n_{c}$ when all the points are identical, as $Q_{i}$ and $Q_{j}$ will be equal for any $i$ , $j$ . Thus, we normalize $\Gamma_{\alpha_{i}}$ as follows

\Gamma_{\alpha_{i}}=\frac{1}{1+V_{c}/n_{c}}\left(1+\frac{Q_{i}}{\sum_{z_{j}\in c}Q_{j}}V_{c}\right).

(21)
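As a quick worked check of Eq. (21), consider the identical-points case, where $Q_{i}=Q_{j}$ for all $i,j$ and hence $Q_{i}/\sum_{z_{j}\in c}Q_{j}=1/n_{c}$ . Substituting gives

```latex
\Gamma_{\alpha_{i}}
  = \frac{1}{1+V_{c}/n_{c}}\left(1+\frac{1}{n_{c}}V_{c}\right)
  = \frac{1+V_{c}/n_{c}}{1+V_{c}/n_{c}}
  = 1,
\qquad\text{so}\qquad
V_{i}^{\Delta\alpha_{i}} = \Gamma_{\alpha_{i}}\alpha_{i}\beta_{i}^{*} = V_{i} = \frac{V_{c}}{n_{c}}.
```

The normalized factor therefore leaves the initial equal split of Eq. (17) unchanged whenever the cluster members are indistinguishable.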

Similar to $\alpha_{i}$ , we find the adjustment factor for $\beta_{i}^{*}$ , i.e. $\Gamma_{\beta_{i}^{*}}$ . $\beta_{i}^{*}$ measures the interaction of $z_{i}$ with all other data points in $B$ . As all data points similar to $z_{i}$ belong to the same cluster, $\beta_{i}^{*}$ is only affected by the other members in $z_{i}$ ’s cluster. We use the distance between $z_{i}$ and the cluster centroid as a measure of its belongingness to the cluster, or similarity to other points in the cluster.

Theorem 4.2

For data point $z_{i}$ , assuming no error in $\alpha_{i}$ , its adjusted value $V_{i}^{\Delta\beta_{i}^{*}}$ is

V_{i}^{\Delta\beta_{i}^{*}}=\alpha_{i}(\beta_{i}^{*}+\Delta\beta_{i}^{*})=\Gamma_{\beta_{i}^{*}}\alpha_{i}\beta_{i}^{*},

(22)

where $d_{i}$ is the distance of $z_{i}$ from the cluster centroid and $\Gamma_{\beta_{i}^{*}}$ is the adjustment factor represented as

\Gamma_{\beta_{i}^{*}}=\frac{1}{1+V_{c}/n_{c}}\left(1+\frac{d_{i}}{\sum_{z_{j}\in c}d_{j}}V_{c}\right).

(23)

Production Function based Data Value Estimation. The final approximation value $\hat{\Phi}_{i}$ of the data point is

\hat{\Phi}_{i}=(\alpha_{i}+\Delta\alpha_{i})(\beta_{i}^{*}+\Delta\beta_{i}^{*}).

(24)

Ignoring $\Delta\alpha_{i}\Delta\beta_{i}^{*}$ then gives

\hat{\Phi}_{i}\approx(\alpha_{i}+\Delta\alpha_{i})\beta_{i}^{*}+\alpha_{i}(\beta_{i}^{*}+\Delta\beta_{i}^{*})-\alpha_{i}\beta_{i}^{*}.

(25)

By substituting Eq. (18), Eq. (19), and Eq. (22), we obtain

\begin{aligned}
\hat{\Phi}_{i} &= V_{i}^{\Delta\alpha_{i}}+V_{i}^{\Delta\beta_{i}^{*}}-V_{i} \\
&= V_{i}(\Gamma_{\alpha_{i}}+\Gamma_{\beta_{i}^{*}}-1) \\
&= V_{i}\left[\frac{1}{1+V_{c}/n_{c}}\left(1+\frac{Q_{i}}{\sum_{z_{j}\in c}Q_{j}}V_{c}\right)+\frac{1}{1+V_{c}/n_{c}}\left(1+\frac{d_{i}}{\sum_{z_{j}\in c}d_{j}}V_{c}\right)-1\right].
\end{aligned}

(26)

For the reader’s convenience, Algorithm 1 outlines the implementation steps of the EcoVal efficient data valuation framework.
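The value propagation of Eq. (17), (21), (23), and (26) can be sketched for a single cluster as follows. The function name `ecoval_point_values` and the example numbers are hypothetical; the sketch assumes the per-point predictions $Q_{i}$ and centroid distances $d_{i}$ are already available and have non-zero sums.

```python
def ecoval_point_values(v_c, q, d):
    """Distribute a cluster value v_c over its members following Eq. (26).

    q: predicted individual values Q_i (e.g. from a regression model).
    d: distances d_i of each member to the cluster centroid.
    Assumes sum(q) and sum(d) are non-zero.
    """
    n_c = len(q)
    v_i = v_c / n_c                       # Eq. (17): equal initial split
    norm = 1.0 / (1.0 + v_c / n_c)        # shared normalizer in Eq. (21) and Eq. (23)
    q_sum, d_sum = sum(q), sum(d)
    values = []
    for q_i, d_i in zip(q, d):
        gamma_alpha = norm * (1.0 + q_i / q_sum * v_c)  # Eq. (21)
        gamma_beta = norm * (1.0 + d_i / d_sum * v_c)   # Eq. (23)
        values.append(v_i * (gamma_alpha + gamma_beta - 1.0))
    return values


identical = ecoval_point_values(0.3, q=[1.0, 1.0, 1.0], d=[2.0, 2.0, 2.0])
varied = ecoval_point_values(0.3, q=[3.0, 1.0, 1.0], d=[2.0, 2.0, 2.0])
print(identical)  # every member keeps the equal split v_c / n_c
print(varied)     # the member with the larger Q_i receives a larger share
```

Because the normalized factors sum to $n_{c}$ over the cluster, the propagated values in this sketch add back up to $V_{c}$ , which matches the efficiency requirement at the cluster level.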

4.3 Discussion: Comparison with Original Shapley

Let $E(z)$ denote the appropriate embedding from a machine learning model or the pre-final layer of a deep learning model for a data point $z$ . We extend the notion of Lipschitz stability of data Shapley introduced in [7] to estimate the difference in value of different data points. We use the proximity of the embeddings $E(z)$ as a proxy for closeness in the underlying data distribution and formalize this in the following theorem.

Theorem 4.3

For any $z_{j}$ , $z_{k}$ , if $\|E(z_{j})-E(z_{k})\|<\epsilon$ , then $|\Phi(z_{j})-\Phi(z_{k})|\leq\epsilon_{1}$ for very small $\epsilon,\epsilon_{1}\geq 0$ .

From the principle of clustering, a datum $z_{j}$ belongs to cluster $c$ if

\|E(z_{j})-E(z_{k})\|\leq\epsilon,\quad\forall z_{k}\in c,

(27)

then for this cluster

|\Phi(z_{j})-\Phi(z_{k})|\leq\epsilon_{1},\quad\forall z_{k},z_{j}\in c.

(28)

Algorithm 1 EcoVal Data Valuation

Input: $M(\cdot;\psi)$ : Fully trained model
Input: $B$ : Training dataset
Input: $B_{D}$ : Set of available points from the underlying distribution of $B$
Input: $M_{-n}(x;\psi)\leftarrow$ Embedding of data $x$ obtained from the $n^{th}$ last layer of the model

Let $E(x)=M_{-n}(x;\psi)$
Let $A_{c}$ be a clustering algorithm; then $(x_{i},c_{j})\leftarrow A_{c}(B_{D})\ \forall x_{i}\in B_{D}$ , where $c_{j}\in C$ is the cluster associated with $x_{i}$ and $C$ is the set of all clusters
Find valuation at cluster level: $V_{c_{j}}=U(B)-U(B\setminus c_{j})\ \forall c_{j}\in C$
Initialize value $V_{i}$ for each cluster member $x_{i}$ :
    $V_{i}=V_{c_{j}}/n_{c_{j}}$ , where $n_{c_{j}}$ is the number of elements in cluster $c_{j}$ to which $x_{i}$ belongs
Initialize: $D\leftarrow[]$
for $c_{j}\in C$
    Sample $X_{j}=\{x_{1}^{j},x_{2}^{j},\ldots,x_{n_{c}}^{j}\}$ from $c_{j}$
    $D\leftarrow D\cup X_{j}$
end for
Run TMC Shapley [6]:
    $(x_{k},v_{TMC_{k}})\leftarrow TMC(U^{T},D)\ \forall x_{k}\in D$
Train a regression model $R$ on the sampled data $\{(x_{1},v_{TMC_{1}}),(x_{2},v_{TMC_{2}}),\ldots,(x_{\lvert D\rvert},v_{TMC_{\lvert D\rvert}})\}$
for $c_{j}\in C$
    $(x_{i}^{j},q_{i}^{j})\leftarrow R(x_{i}^{j})\ \forall x_{i}^{j}\in c_{j}$
    Let $\bar{x}_{c_{j}}$ be the centroid of the cluster $c_{j}$
    $(x_{i}^{j},d_{i}^{j})\leftarrow distance(x_{i}^{j},\bar{x}_{c_{j}})\ \forall x_{i}^{j}\in c_{j}$
end for
for $x_{i}\in B$
    Find correction term for $\alpha$ :
    $\Gamma_{\alpha_{i}}=\frac{1}{1+V_{c_{j}}/n_{c_{j}}}\left(1+\frac{q_{i}^{j}}{\sum_{z_{k}\in c_{j}}q_{k}^{j}}V_{c_{j}}\right)$
    Find correction term for $\beta_{i}^{*}$ :
    $\Gamma_{\beta_{i}^{*}}=\frac{1}{1+V_{c_{j}}/n_{c_{j}}}\left(1+\frac{d_{i}^{j}}{\sum_{z_{k}\in c_{j}}d_{k}^{j}}V_{c_{j}}\right)$
    Final valuation $=V_{i}(\Gamma_{\alpha_{i}}+\Gamma_{\beta_{i}^{*}}-1)$
end for

This means that all Shapley values within the cluster lie within an $\epsilon_{1}$ interval. Therefore, $\Phi(z_{j})$ for any $z_{j}\in c$ can be expressed as

\Phi(z_{j})=\bar{\Phi}_{c}+\delta(z_{j}),

(29)

where $\delta(z_{j})\leq\epsilon/2$ and $\bar{\Phi}_{c}$ lies somewhere in the $\epsilon_{1}$ interval. The value of this cluster, $V_{c}$ , by the Shapley axioms is

V_{c}=\sum_{z_{j}\in c}\Phi(z_{j})=n_{c}\bar{\Phi}_{c}+\sum_{z_{j}\in c}\delta(z_{j}).

(30)

Theorem 4.4

The difference between the original Shapley value and our proposed approximated data value is

\Delta\Phi_{i}\approx\frac{n_{c}\bar{\Phi}_{c}\delta_{R}}{\sum_{z_{j}\in c}Q_{j}}+\frac{n_{c}^{2}\bar{\Phi}_{c}Q_{i}\delta_{R}}{(\sum_{z_{j}\in c}Q_{j})^{2}}.

(31)

Due to the intrinsic limitations on the magnitudes of average Shapley value within a cluster $\bar{\Phi}_{c}$ and individual point contribution $Q_{i}$ , both values inherently remain within a bounded range. As cluster size $n_{c}$ increases, the predicted aggregate value $\sum_{z_{j}\in c}Q_{j}$ proportionately grows, naturally restricting the potential expansion of $\frac{n_{c}}{\sum_{z_{j}\in c}Q_{j}}$ . Additionally, a moderately accurate regression model ensures a low $\delta_{R}$ error. Therefore, our method produces Shapley value estimates $\Phi_{i}$ with minimal margin of error.

The detailed proof of Theorem 4.4 is provided in the Appendix.

5 Experiments

We show the broad effectiveness of the proposed valuation framework and its general applicability to machine learning models through empirical evidence. We estimate the value of data in a machine learning model on the MNIST, CIFAR10, and CIFAR100 datasets. We compare our method with Data Shapley [6] and Distributional Shapley [7].

Experiment Settings. Following the common practice in previous works, we extract the features from the last layer of a pre-trained network and apply Shapley on this embedded vector. We sample a small subset, i.e. 200 samples, from the original training data and run the baseline methods TMC-Shapley (Data Shapley) and Distributional Shapley. 2000 samples are used for testing and holdout for the Shapley calculation. We keep 10,000 samples which are never seen by the model or the valuation method at any point; we call this the out-of-sample (OOS) set. The rest of the samples are used as the data distribution and exposed to Distributional Shapley and to our method during the clustering step and the $\alpha$ correction step. We use Gaussian Mixture Models (GMM) for clustering. Our proposed method works for both in-distribution and OOS samples. As Data Shapley only works for in-distribution samples, we compare our results with Distributional Shapley for out-of-sample data.

Figure 1: Computation cost in terms of the number of training iterations required for the given dataset size. We compare EcoVal with TMC Shapley (also known as Data Shapley), Distributional Shapley, and a lightweight version of EcoVal. Our method requires a substantially lower number of training iterations for data valuation.

5.1 Comparative Analysis of the Computational Time

The Data Shapley approximation method TMC-Shapley [6] converges in approximately $3\lvert B\rvert$ (or $3\times m$ in Eq. 2) Monte Carlo samples. Each Monte Carlo sample is a random permutation of the data points in the training set. The marginal contribution of a data point $z$ in a given permutation is the performance difference between the model trained on the data points preceding this datum, say $S$, and the model trained on $S\cup\{z\}$. Points are added sequentially, so a single Monte Carlo sample requires $\lvert B\rvert$ training runs. With about $3\lvert B\rvert$ samples, the number of training runs is on the order of $O(\lvert B\rvert^{2})$. The time complexity of Distributional Shapley [7] is similar, with $T$ additional runs needed to obtain an unbiased estimate using different subsets $S$ from the underlying data distribution. This makes the number of training runs of Distributional Shapley $O(T\cdot\lvert B\rvert^{2})$.
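The quadratic cost can be seen in a minimal sketch of permutation-sampling Shapley estimation. This is not the TMC-Shapley implementation of [6]: it omits the truncation step, and the additive `toy_utility` is a hypothetical stand-in for training a model on a subset and scoring it. The counter shows that $3\lvert B\rvert$ permutations cost $3\lvert B\rvert^{2}$ utility evaluations.

```python
import random

def permutation_shapley(n_points, utility, n_permutations, seed=0):
    """Monte Carlo Shapley estimate via random permutations.

    Returns (estimated values, number of utility evaluations).
    Each utility evaluation stands in for one training run.
    """
    rng = random.Random(seed)
    values = [0.0] * n_points
    calls = 0
    for _ in range(n_permutations):
        order = list(range(n_points))
        rng.shuffle(order)
        prefix = []
        prev = utility(prefix)          # score of the empty set, not counted
        for z in order:
            prefix.append(z)
            cur = utility(prefix)       # "train" on S ∪ {z}
            calls += 1
            values[z] += (cur - prev) / n_permutations
            prev = cur
    return values, calls

# Hypothetical additive utility: each point contributes a fixed amount.
weights = [0.10, 0.20, 0.30, 0.15, 0.25]

def toy_utility(subset):
    return sum(weights[i] for i in subset)

n = len(weights)
estimates, runs = permutation_shapley(n, toy_utility, n_permutations=3 * n)
print(runs)  # 75, i.e. 3 * n * n
```

For an additive utility the marginal contribution of a point is the same in every permutation, so the estimates coincide with `weights` up to floating-point error; a real model's utility would not have this property.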

Our method performs clustering, which takes less time than training a machine learning or deep learning model in most real-world scenarios. This is a one-time effort, so we count its cost as $O(1)$. Estimating the value of each cluster requires $O(p)$ training runs. In addition, our method runs Data Shapley on a curated subset containing an equal number of points from each cluster; with subset size $p$, this takes $O(p^{2})$ training runs. The size $p$ of this subset is much smaller than $\lvert B\rvert$. The total number of training runs required is on the order of $O(1)+O(p^{2})+O(p)$. Compared to the existing Data Shapley based methods, EcoVal is significantly faster, as shown in Figure 1. As the dataset size increases, the advantage of EcoVal becomes more evident. Our method without $\alpha$ correction is even faster, with negligible loss in valuation quality.
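To make the comparison concrete, the sketch below draws a subset with an equal number of points per cluster and compares rough training-run counts. The helper names, the factor of 3 permutations per point, and the assumption of one training run per subset point for cluster valuation are illustrative choices of ours; the text above only fixes the asymptotic orders.

```python
import random
from collections import defaultdict

def balanced_subset(cluster_labels, per_cluster, seed=0):
    """Pick `per_cluster` random indices from every cluster."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)
    subset = []
    for label in sorted(members):
        if len(members[label]) < per_cluster:
            raise ValueError(f"cluster {label} has too few points")
        subset.extend(rng.sample(members[label], per_cluster))
    return subset

def tmc_runs(n, perm_factor=3):
    """Approximate runs for permutation-based Shapley on n points."""
    return perm_factor * n * n

def ecoval_runs(p, perm_factor=3):
    """Assumed count: Shapley on p points plus p runs for cluster values."""
    return perm_factor * p * p + p

# Hypothetical setup: 50,000 points spread evenly over 10 clusters.
labels = [i % 10 for i in range(50_000)]
subset = balanced_subset(labels, per_cluster=20)

print(len(subset))                 # 200
print(tmc_runs(len(labels)))       # 7500000000
print(ecoval_runs(len(subset)))    # 120200
```

Under these assumptions the count for the full set grows with $\lvert B\rvert^{2}$, while the subset count depends only on $p$, which is why the gap widens as the dataset grows.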

5.2 Data Point Addition and Removal Experiments

We evaluate the data valuation methods by running the data point addition and removal experiments proposed in [6]. For a given model and dataset, the data points are added in order of predicted value, from highest to lowest, and the model is retrained after each addition. Similarly, another experiment is conducted where we remove samples with high values and observe the performance drop. The impact of removing and adding high-value data points helps us measure the effectiveness of data valuation techniques. We compare our results with the state-of-the-art Data Shapley and Distributional Shapley valuation methods.

Removing Most Valued Data Points. We predict the values of the data points using each valuation method and measure the drop in model performance when the most valued data points under each method are removed. If a method identifies truly valuable points, removing them causes a larger drop in performance. So, for removal of the most valued points, the method resulting in the larger performance drop is the better valuation method.

Adding Most Valued Data Points. This experiment is the converse of the previous one: we add the most valued data points to the training set and observe the increase in performance. A larger increase on adding the top data points indicates a better valuation method. Figure 2 shows the performance drop and increase upon removing and adding the most valued points, respectively. The performance drop for EcoVal is slightly smaller than that of Data Shapley but significantly larger than that of Distributional Shapley, which is the desired behavior. Similar patterns can be observed in the data addition graph.
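The removal and addition protocols can be sketched as two short routines. The `evaluate` callback is a hypothetical placeholder for retraining the model on the given indices and measuring its performance; here it is replaced by a toy function that sums hidden per-point contributions, so the numbers are purely illustrative.

```python
def removal_curve(values, evaluate, max_removed):
    """Performance after removing the k highest-valued points, k = 0..max_removed."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return [evaluate(order[k:]) for k in range(max_removed + 1)]

def addition_curve(values, evaluate, max_added):
    """Performance after adding the k highest-valued points, k = 0..max_added."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return [evaluate(order[:k]) for k in range(max_added + 1)]

# Toy stand-in for "retrain and score": hidden true contribution per point.
true_contrib = [0.30, 0.05, 0.20, 0.10, 0.25, 0.10]

def evaluate(indices):
    return sum(true_contrib[i] for i in indices)

good_values = list(true_contrib)            # a valuation that ranks points correctly
poor_values = [-t for t in true_contrib]    # a valuation that ranks them in reverse

good_drop = removal_curve(good_values, evaluate, 3)
poor_drop = removal_curve(poor_values, evaluate, 3)
good_gain = addition_curve(good_values, evaluate, 3)
poor_gain = addition_curve(poor_values, evaluate, 3)

# The accurate ranking loses more when its top points are removed
# and gains more when its top points are added.
print(good_drop[0] - good_drop[-1] > poor_drop[0] - poor_drop[-1])  # True
print(good_gain[-1] > poor_gain[-1])                                # True
```

In the actual experiments each call to `evaluate` is a full retraining run, so the curves are usually computed at a coarser step than one point at a time.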

Removing Most Valuable Data Points from Out-of-sample Set. As discussed earlier, EcoVal supports data valuation for out-of-sample data as well, which is otherwise supported only by Distributional Shapley. Therefore, we compare the OOS valuation results between the two. Figure 3 shows the performance drop from removing the most valuable points from an out-of-sample set of size 10,000. EcoVal's performance drop is much larger than that of Distributional Shapley. The steep drop in performance after removing the most valuable data points implies better precision of data valuation in our EcoVal framework.

5.3 Effect of the Adjustment Terms in EcoVal

We observe the effect of removing the adjustment terms $\Gamma_{\beta_{i}^{*}}$, $\Gamma_{\alpha_{i}}$, or both from the EcoVal framework and show the results in Figure 4. The overall EcoVal framework with both the $\alpha$ and $\beta$ terms performs the best in general. Eliminating one of the adjustment terms deteriorates the quality of the valuation by a small margin. Removing both corrections significantly impacts the quality of the obtained valuations. This is particularly visible in the initial phase of adding the most significant data points. It should be noted that eliminating $\Gamma_{\alpha_{i}}$ affects the valuation quality only marginally but completely removes the need for model training, giving an even more efficient version of our valuation method.

6 Conclusion

This work presents a focused study on improving the speed of data valuation in machine learning models. We develop an efficient data valuation method that is fast and practical for working with large datasets. Our method works for both in-distribution and out-of-sample data. The proposed EcoVal data valuation framework shows comparable and sometimes even better results than the existing approaches for in-distribution data. For out-of-sample data points, our method significantly outperforms competing methods, thereby establishing a new state-of-the-art. This demonstrates our method's utility in a data market, where new data points analogous to our out-of-sample set are generated continuously. Our valuation also shows a negligible error margin relative to the vanilla Shapley value approximation. Together, these points make the proposed method a robust and scalable approach to estimating data value across a variety of machine learning models.

References

[1] Tarun Wadhwa. Economic impact and feasibility of data dividends, 2020.
[2] Jian Pei. Data pricing – from economics to data science. Association for Computing Machinery, 2020.
[3] Siyi Tang, Amirata Ghorbani, Rikiya Yamashita, Sameer Rehman, Jared A Dunnmon, James Zou, and Daniel L Rubin. Data valuation for medical imaging using shapley value and application to a large-scale chest x-ray dataset. Scientific reports, 11(1):8366, 2021.
[4] Bojan Karlaš, David Dao, Matteo Interlandi, Bo Li, Sebastian Schelter, Wentao Wu, and Ce Zhang. Data debugging with shapley importance over end-to-end machine learning pipelines. arXiv preprint arXiv:2204.11131, 2022.
[5] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. Towards efficient data valuation based on the shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1167–1176. PMLR, 2019.
[6] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International conference on machine learning, pages 2242–2251. PMLR, 2019.
[7] Amirata Ghorbani, Michael Kim, and James Zou. A distributional framework for data valuation. In International Conference on Machine Learning, pages 3535–3544. PMLR, 2020.
[8] Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Understanding predictions with data and data with predictions. In International Conference on Machine Learning, pages 9525–9587. PMLR, 2022.
[9] LS Shapley. A value for n-person games. In Contributions to the Theory of Games (AM-28), Volume II, pages 307–318. Princeton University Press, 1953.
[10] Yongchan Kwon, Manuel A Rivas, and James Zou. Efficient computation and analysis of distributional shapley values. In International Conference on Artificial Intelligence and Statistics, pages 793–801. PMLR, 2021.
[11] Yongchan Kwon and James Zou. Beta shapley: a unified and noise-reduced data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pages 8780–8802. PMLR, 2022.
[12] Jiachen T Wang and Ruoxi Jia. Data banzhaf: A robust data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pages 6388–6421. PMLR, 2023.
[13] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700, 2019.
[14] Lloyd S Shapley et al. A value for n-person games. 1953.
[15] Ehud Kalai and Dov Samet. On weighted shapley values. International journal of game theory, 16:205–222, 1987.
[16] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of game theory, 28:547–565, 1999.
[17] Roger B Myerson. Graphs and cooperation in games. Mathematics of operations research, 2(3):225–229, 1977.
[18] Robert J Aumann and Lloyd S Shapley. Values of non-atomic games. Princeton University Press, 2015.
[19] Faruk Gul. Bargaining foundations of shapley value. Econometrica: Journal of the Econometric Society, pages 81–95, 1989.
[20] Hervé Moulin. An application of the shapley value to fair division with money. Econometrica: Journal of the Econometric Society, pages 1331–1349, 1992.
[21] Alvin E Roth and Robert E Verrecchia. The shapley value as applied to cost allocation: a reinterpretation. Journal of Accounting Research, pages 295–303, 1979.
[22] Eda Kemahlıoğlu-Ziya and John J Bartholdi III. Centralizing inventory in supply chains by using shapley value to allocate the profits. Manufacturing & Service Operations Management, 13(2):146–162, 2011.
[23] Pradeep Dubey, Abraham Neyman, and Robert James Weber. Value theory without efficiency. Mathematics of Operations Research, 6(1):122–128, 1981.
[24] Raghav Singal, Omar Besbes, Antoine Desir, Vineet Goyal, and Garud Iyengar. Shapley meets uniform: An axiomatic framework for attribution in online advertising. In The World Wide Web Conference, pages 1713–1723, 2019.
[25] Shay Cohen, Eytan Ruppin, and Gideon Dror. Feature selection based on the shapley value. In Proceedings of the 19th international joint conference on Artificial intelligence, pages 665–670, 2005.
[26] Mohammad Zaeri-Amirani, Fatemeh Afghah, and Sajad Mousavi. A feature selection method based on shapley value to false alarm reduction in icus a genetic-algorithm approach. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 319–323. IEEE, 2018.
[27] Anish Agarwal, Munther Dahleh, and Tuhin Sarkar. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 701–726, 2019.
[28] Raul Castro Fernandez, Pranav Subramaniam, and Michael J Franklin. Data market platforms: Trading data assets to solve data problems. Proceedings of the VLDB Endowment, 13(11).
[29] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
[30] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. L-shapley and c-shapley: Efficient model interpretation for structured data. In International Conference on Learning Representations, 2018.
[31] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pages 9269–9278. PMLR, 2020.
[32] Amirata Ghorbani and James Y Zou. Neuron shapley: Discovering the responsible neurons. Advances in neural information processing systems, 33:5922–5932, 2020.
[33] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12(11).
[34] Jiachen T. Wang and Ruoxi Jia. Data banzhaf: A robust data valuation framework for machine learning. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 6388–6421. PMLR, 25–27 Apr 2023.
[35] Ki Nohyun, Hoyong Choi, and Hye Won Chung. Data valuation without training of a model. In The Eleventh International Conference on Learning Representations, 2022.
[36] Ian Covert, Scott Lundberg, and Su-In Lee. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209):1–90, 2021.
[37] Ian Covert and Su-In Lee. Improving kernelshap: Practical shapley value estimation using linear regression. In International Conference on Artificial Intelligence and Statistics, pages 3457–3465. PMLR, 2021.
[38] Rui Wang, Xiaoqian Wang, and David I Inouye. Shapley explanation networks. In International Conference on Learning Representations, 2020.
[39] Sudhanshu K Mishra. A brief history of production functions. Available at SSRN 1020577, 2007.
[40] Robert J Barro. Macroeconomics. MIT Press, 1997.
[41] Ronald William Shephard. Theory of cost and production functions. Princeton University Press, 2015.
[42] Kenneth J Arrow, Hollis B Chenery, Bagicha S Minhas, and Robert M Solow. Capital-labor substitution and economic efficiency. The review of Economics and Statistics, pages 225–250, 1961.
[43] Guennadi A Khatskevich and Andrei F Pranevich. Production functions with given elasticities of output and production. 2018.
[44] Eric A Hanushek. Education production functions. In The economics of education, pages 161–170. Elsevier, 2020.
[45] Grant Allan, Michelle Gilmartin, Peter McGregor, Karen Turner, and J Kim Swales. Economics of energy efficiency. In International Handbook on the Economics of Energy. Edward Elgar Publishing, 2009.
[46] David E Bloom, David Canning, and Jaypee Sevilla. The effect of health on economic growth: a production function approach. World development, 32(1):1–13, 2004.
[47] Charles I Jones and Christopher Tonetti. Nonrivalry and the economics of data. American Economic Review, 110(9):2819–2858, 2020.
[48] Maryam Farboodi and Laura Veldkamp. A model of the data economy. Technical report, National Bureau of Economic Research, 2021.
[49] CW Cobb. A theory of production. American Economic Review, 18:139–165, 1928.
[50] Lawrence Blume, Steven Durlauf, and Lawrence E Blume. The new Palgrave dictionary of economics. Palgrave Macmillan, 2008.
[51] Robin C Sickles and Valentin Zelenyuk. Measurement of productivity and efficiency. Cambridge University Press, 2019.
[52] Ronald W Shephard and Rolf Färe. The law of diminishing returns. In Production Theory: Proceedings of an International Seminar Held at the University at Karlsruhe May–July 1973, pages 287–318. Springer, 1974.

Appendix

Appendix A Proof of Theorem 3.4

Suppose an equal number of samples from each cluster is used to run TMC-Shapley. Then the intrinsic value of a datum is independent of the proportion and bias in the data distribution. If $n_{s}$ such samples exist, the value is divided among these $n_{s}$ samples. From the Shapley axioms, the Data Shapley value at the current stage of TMC becomes approximately $\frac{\alpha_{i}}{n_{s}}$.

For the rest of the samples in the TMC, we train a regression model $R$ to predict $\frac{\alpha_{i}}{n_{s}}$ for an input datum. If the predicted Shapley value for any $z_{i}\in c$ is $Q_{i}$, then, assuming no error is introduced by the TMC-Shapley algorithm, this gives us

$$Q_{i}=\frac{\alpha_{i}}{n_{s}}+\delta_{R_{i}}, \tag{32}$$

where $\delta_{R_{i}}$ is the error introduced by the regression model. $Q_{i}$ denotes $b\cdot\alpha_{i}$, where $b$ is some constant. We use this to obtain the adjustment factor for $\alpha_{i}$ in Eq. (19). Assuming $V_{i}$ and $d_{i}$ do not have any error, we differentiate Eq. (26) with respect to $z$:

$$\frac{\partial\Phi_{i}}{\partial z}=\frac{1}{\partial z}\left(\frac{V_{i}}{1+V_{c}/n_{c}}\right)\left[\frac{n_{c}\,\partial Q_{i}}{\sum_{z_{j}\in c}Q_{j}}-\frac{n_{c}Q_{i}\,\partial\left(\sum_{z_{j}\in c}Q_{j}\right)}{\left(\sum_{z_{j}\in c}Q_{j}\right)^{2}}\right]. \tag{33}$$

Intuition. $V_{i}$ does not have any error as this is the difference between the performance with and without the cluster $c$ divided by a constant. Both the values can be directly computed from the model. Similarly, $d_{i}$ is the distance of the datum from the centroid of the cluster $c$ which can be calculated without any error.

Comparing Eq. (33) with change in Shapley value leads to the following inequality

$$\Delta\Phi_{i}\leq\frac{V_{c}/n_{c}}{1+V_{c}/n_{c}}\left[\frac{\delta_{R}n_{c}}{\sum_{z_{j}\in c}Q_{j}}+\frac{n_{c}Q_{i}(n_{c}\delta_{R})}{\left(\sum_{z_{j}\in c}Q_{j}\right)^{2}}\right],$$

where $\delta_{R}=\max\limits_{i}\delta_{R_{i}}$ is the maximum error of the regression model. This gives

$$\Delta\Phi_{i}\leq V_{c}\left[\frac{\delta_{R}}{\sum_{z_{j}\in c}Q_{j}}+\frac{n_{c}Q_{i}\delta_{R}}{\left(\sum_{z_{j}\in c}Q_{j}\right)^{2}}\right],$$

since $V_{c}\geq 0$ and $n_{c}\geq 1$ imply $\frac{1}{1+V_{c}/n_{c}}\leq 1$. From Eq. (29) and Eq. (30),

$$\Delta\Phi_{i}\leq\left(n_{c}\bar{\Phi}_{c}+n_{c}\epsilon/2\right)\left[\frac{\delta_{R}}{\sum_{z_{j}\in c}Q_{j}}+\frac{n_{c}Q_{i}\delta_{R}}{\left(\sum_{z_{j}\in c}Q_{j}\right)^{2}}\right].$$

Ignoring the terms that contain the product of $\epsilon$ and $\delta_{R}$, as both values are very small, we obtain the final difference between the original Shapley value and our proposed approximated value:

$$\Delta\Phi_{i}\approx\frac{n_{c}\bar{\Phi}_{c}\delta_{R}}{\sum_{z_{j}\in c}Q_{j}}+\frac{n_{c}^{2}\bar{\Phi}_{c}Q_{i}\delta_{R}}{\left(\sum_{z_{j}\in c}Q_{j}\right)^{2}}, \tag{34}$$

$\bar{\Phi}_{c}$ and $Q_{i}$ cannot be arbitrarily large, as they are the average Shapley value for a cluster and the change in performance due to a data point $z_{i}$, respectively. With increasing cluster size $n_{c}$, the corresponding predicted value $\sum_{z_{j}\in c}Q_{j}$ will increase. Thus, $n_{c}/\sum_{z_{j}\in c}Q_{j}$ cannot be very large. The $\delta_{R}$ error will be low for a moderately good regression model. Thus, our method estimates the Shapley value $\Phi_{i}$ with negligible error.
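As a numerical illustration of Eq. (34), the sketch below evaluates the approximate error for a given cluster. The function name and all input numbers are hypothetical and chosen only to show how the two terms behave. When every point in the cluster has the same predicted value $q$, the expression reduces to $2\bar{\Phi}_{c}\delta_{R}/q$, which does not depend on $n_{c}$.

```python
def shapley_error_estimate(phi_bar_c, q_values, i, delta_r):
    """Evaluate the right-hand side of Eq. (34) for point i of a cluster.

    phi_bar_c : average Shapley value of the cluster
    q_values  : regression predictions Q_j for all points in the cluster
    delta_r   : maximum regression error
    """
    n_c = len(q_values)
    q_sum = sum(q_values)
    if q_sum <= 0:
        raise ValueError("sum of predicted values must be positive")
    first = n_c * phi_bar_c * delta_r / q_sum
    second = n_c ** 2 * phi_bar_c * q_values[i] * delta_r / q_sum ** 2
    return first + second

# Hypothetical numbers: equal predictions q = 0.5 in clusters of different size.
small = shapley_error_estimate(0.1, [0.5] * 4, i=0, delta_r=0.01)
large = shapley_error_estimate(0.1, [0.5] * 400, i=0, delta_r=0.01)
print(abs(small - large) < 1e-12)  # True: the estimate does not grow with n_c
```

With these inputs both calls give approximately 0.004, matching $2\bar{\Phi}_{c}\delta_{R}/q = 2\cdot 0.1\cdot 0.01/0.5$.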