License: CC BY-NC-SA 4.0
arXiv:2402.09288v3 [cs.LG] 18 Mar 2024

EcoVal: An Efficient Data Valuation Framework for Machine Learning

Ayush K Tarun*
RespAI Lab, India
ayushtarun210@gmail.com
Vikram S Chundawat*
RespAI Lab, India
vikram2000b@gmail.com
Murari Mandal†
RespAI Lab, KIIT Bhubaneswar, India
murari.mandalfcs@kiit.ac.in
Hong Ming Tan
NUS Business School
National University of Singapore
thm@nus.edu.sg
Bowei Chen
Adam Smith Business School
University of Glasgow
bowei.chen@glasgow.ac.uk
Mohan Kankanhalli
School of Computing
National University of Singapore
mohan@comp.nus.edu.sg
Abstract

Quantifying the value of data within a machine learning workflow can play a pivotal role in making more strategic decisions in machine learning initiatives. The existing Shapley value based frameworks for data valuation in machine learning are computationally expensive, as they require a considerable amount of repeated model training to obtain the Shapley value. In this paper, we introduce an efficient data valuation framework, EcoVal, to estimate the value of data for machine learning models in a fast and practical manner. Instead of working directly with individual data samples, we determine the value of a cluster of similar data points. This value is then propagated among all the member points of the cluster. We show that the overall data value can be determined by estimating the intrinsic and extrinsic value of each data point. This is enabled by formulating the performance of a model as a production function, a concept popularly used to estimate the amount of output based on factors like labor and capital in a traditional free economic market. We provide a formal proof of our valuation technique and elucidate the principles and mechanisms that enable its accelerated performance. We demonstrate the real-world applicability of our method by showcasing its effectiveness for both in-distribution and out-of-sample data. This work addresses one of the core challenges of efficient data valuation at scale in machine learning models.

* These authors contributed equally to this work. † Corresponding author.

1 Introduction

Data valuation is a pivotal concern in modern machine learning (ML) and data analytics, where the quality and worth of data have profound implications for decision-making, model performance, and data marketplaces. Quantifying the worth of data plays an important role in data pricing and regulation compliance [1, 2], removing low-value/noisy data from the training set [3, 4], and incentivizing data sharing by personal data monetization [5, 6, 7, 8]. In an ML framework, the quality of data determines the effectiveness of the final model. Therefore, identifying high and low value data through data valuation would yield significant benefits for a wide range of machine learning applications.

Background. In recent studies, the Shapley value [9], a concept from cooperative game theory, has been frequently used for data valuation in supervised ML [6, 5, 7]. It offers the desirable property of equitable reward allocation. Data Shapley and its extensions [6, 7, 10, 11] have empirically shown the effectiveness of Shapley value based valuation on a fixed dataset as well as on a particular data distribution, allowing for out-of-time data valuation. The value of a data point in ML relies on its individual contribution to the model's performance and its relationship with other data points utilized during training. The presence of similar data in the training set can dilute the significance of individual points. To account for these interactions, data Shapley methods evaluate the contribution of each point by determining how its absence affects the overall performance of the model. This process usually involves repeatedly training the model with the selective exclusion of certain instances or subsets, thereby identifying those with the most substantial impact. The impact is measured by observing the change in the performance score of the ML model. However, this incurs a high computational cost, typically requiring on the order of $O(n^2)$ model training runs in current methodologies, where $n$ is the total number of data points in the dataset.

Motivation. While offering insightful analyses of data point significance and alleviating the poor discrimination of data quality in leave-one-out (LOO) error methods, existing data Shapley based frameworks [6, 7, 12, 11] suffer from a high computational cost. The large number of repeated training runs that these methods require results in inefficient use of time and resources. Furthermore, this inefficiency translates to an increased carbon footprint due to the energy requirements of training, thereby exacerbating climate change concerns [13]. The development of scalable algorithms capable of handling extensive datasets is essential for practical use of data valuation in real-world applications.

Our Contribution. We adopt a two-step approach in which the valuation is first performed at the cluster level and the value is then divided among the cluster members. Similar data points are represented by a cluster, which significantly reduces the number of data points to deal with during the training phase of the valuation process. At the cluster level, we can use a simple LOO error for valuation since there is minimal possibility (almost zero) of a similar datum being found in other clusters. The difficulty, however, is to divide the value of each cluster among the cluster members. To address this issue, a novel approach is proposed based on production functions in economics. Our two-step approach aims to significantly speed up the valuation process in comparison to the Truncated Monte Carlo (TMC) Shapley.

In this paper, we introduce a novel framework based on Leave Cluster Out (LCO) and production functions for data valuation in machine learning. The framework is computationally efficient, with theoretical and empirical verifications. The following are the key contributions of our work:

Novel Framework: The intuition behind our framework is that we find a group of similar items and estimate this cluster’s marginal contribution. As similar data items are bound to have similar values, we extend this principle to estimate cluster-level value through Leave Cluster Out (LCO).

We introduce a production function formulation representing the relation between the data and its utility in a model. We show that this formulation can be used to estimate the value of individual data based on the value of each cluster.

Computational Efficiency: We estimate the intrinsic and extrinsic value of each data point to determine the individual data value. By checking only the marginal distribution of the representative data point of a cluster, we substantially reduce the overhead of creating multiple subsets containing similar data points. Our approach is scalable to large datasets without being limited by the presence of similar data points in the dataset.

Theoretical Proof: We provide a theoretical proof of our data valuation method. We also show that the valuation obtained by our method has a negligible error margin when compared with the vanilla Shapley value approximation method.

Empirical Evaluation: We conduct experiments with machine learning models on MNIST, CIFAR10, and CIFAR100. We compare the value rankings of our method with the existing state-of-the-art data valuation approaches, namely Data Shapley [6], LOO error, and Distributional Shapley [7], and observe similar or better performance with a significant speed-up in the data valuation process.

2 Related Work

Literature Review of Shapley Value. The Shapley value, as formalized in [14], establishes the axiomatic properties and demonstrates its unique ability to fairly allocate gains from cooperation among players. This seminal contribution laid the theoretical groundwork for subsequent developments in cooperative game theory [15, 16, 17, 18]. The Shapley value has been extensively used for applications in economics [19, 20, 21], management science [22, 23], and online advertising [24]. In machine learning, it has been utilized to address the challenges in pricing ML training data, feature selection, and ML explainability. [25, 26] proposed to employ Shapley value properties for feature selection. [27, 28] use the Shapley value in a market mechanism to price training data and match buyers to sellers in data marketplace design. [29] introduced the SHAP framework, leveraging Shapley values to provide interpretable explanations for machine learning models. Other works have also explored its utility in explaining black-box model predictions [30, 31, 32].

Data Valuation in ML. Recently, the subfield of data valuation in ML models has attracted significant attention, and the existing works have shown promising outcomes. Data Shapley [6, 5] proposed to use the Shapley value from cooperative game theory for the valuation of training data. KNN Shapley [33] improved the efficiency of Data Shapley by using a k-nearest neighbors model. Distributional Shapley [7] expanded the scope of valuation to the underlying data distribution instead of only considering the data points. Beta Shapley [11] relaxes the efficiency axiom in Data Shapley and reports the utility of data valuation in detecting mislabeled images in the training data. Data Banzhaf [34] proposes to estimate the Banzhaf value to improve results on noisy label detection. Several works have attempted to improve the efficiency of the Shapley value computation through approximation techniques [10]. Apart from this, other aspects of data value have been studied in [8, 35, 36, 37, 38]. However, approximating the Shapley value remains a computationally expensive process, making it difficult to adapt for large models and datasets. The main goal of this work is to develop an alternative, efficient data valuation framework to overcome this problem.

Literature Review of Production Functions. [39] offers a detailed outline of the evolution and econometrics of the production function. Aggregate production functions are used in macroeconomics to represent the relationship between the total output of an economy (GDP) and the inputs used to produce that output. These inputs typically include capital ($K$), labor ($L$), and sometimes other factors like technology or natural resources [40, 41]. The simplest production function used in economics is the Cobb-Douglas production function introduced by [42]. [43] identifies all multi-factor production functions with given elasticity of output and given elasticity of production. Production functions have been used in various domains, including health, education, and energy, to name a few [44, 45, 46]. In our study, we adopt the concept of a production function and adapt it for data valuation. This approach draws inspiration from foundational works and recent advancements in the field. [47] develops a theoretical framework that applies the production function to the economics of data, particularly employing data as an input for training machine learning models. Moreover, [48] highlights the role of data as information aimed at reducing forecast errors, which hints at a production function characterized by bounded returns to data. In our paper, we align with these perspectives and further the discourse by specifically focusing on the application of the production function concept in the valuation of data.

3 Preliminaries

Let an ML model $M$, intended for a task $T$, be trained on a dataset $B$ of size $m$. Let $U$ denote the performance metric and $U^T$ denote the performance obtained on task $T$. The overall performance $U$ is achieved after training for a sufficient number of epochs $e$. Here, a sufficient number of epochs means $|U_{e+i+1} - U_{e+i}| < \gamma$ for all $i \geq 0$, where $\gamma$ is an arbitrarily small value. It should be noted that $\gamma$ arises due to the randomness within the learning algorithm and not further training. The value of a data point is denoted by $\Phi$.
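In practice, the stopping criterion above can only be checked on a finite training history. The following is a minimal sketch, not the authors' procedure: the helper name `has_plateaued` and the `window` parameter are hypothetical, and a finite window only approximates the "for all $i \geq 0$" condition.

```python
from typing import Sequence


def has_plateaued(scores: Sequence[float], gamma: float, window: int = 3) -> bool:
    """Return True if the last `window` successive score changes are all below gamma.

    `scores` holds the performance U after each epoch. The finite `window`
    is an assumption; the definition in the text quantifies over all i >= 0.
    """
    if len(scores) < window + 1:
        return False  # not enough history to judge
    tail = scores[-(window + 1):]
    return all(abs(later - earlier) < gamma for earlier, later in zip(tail, tail[1:]))


history = [0.50, 0.70, 0.80, 0.801, 0.8015, 0.8012]  # made-up scores for illustration
converged = has_plateaued(history, gamma=0.01)
```

A larger `window` makes the check more conservative at the cost of more training epochs.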

Leave-One-Out (LOO) Error. The LOO error computes the value of a datum $z$ based on the increase in performance obtained by adding it to the training set:

$$\Phi_{LOO}(z; U, B) = U(B) - U(B \setminus \{z\}). \tag{1}$$

The LOO error struggles to differentiate data quality when similar data samples exist in the dataset. For example, if each sample has a duplicate copy in the dataset, the LOO error will return a value of $0$ for all of the samples. The Shapley value overcomes this limitation by checking the marginal contribution over many subsets of the dataset.
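The duplicate example can be reproduced on a toy problem. This sketch assumes a hypothetical `coverage_utility` (the number of distinct feature values in a subset) as a cheap stand-in for the score of a trained model; it is not the metric used in the paper.

```python
from typing import Callable, Sequence

Point = tuple[int, str]  # (identifier, feature value); a hypothetical toy representation


def coverage_utility(subset: Sequence[Point]) -> float:
    """Toy stand-in for U: the number of distinct feature values in the subset."""
    return float(len({feature for _, feature in subset}))


def loo_values(
    dataset: Sequence[Point], utility: Callable[[Sequence[Point]], float]
) -> dict[Point, float]:
    """Eq. (1): the LOO value of z is U(B) minus U(B without z), for every z in B."""
    full_score = utility(dataset)
    return {z: full_score - utility([p for p in dataset if p != z]) for z in dataset}


duplicated = [(0, "a"), (1, "a"), (2, "b"), (3, "b")]  # every sample has a duplicate
unique = [(0, "a"), (1, "b")]  # no duplicates

loo_duplicated = loo_values(duplicated, coverage_utility)
loo_unique = loo_values(unique, coverage_utility)
```

Under this toy utility, removing one copy never changes the score while its duplicate remains, so every entry of `loo_duplicated` is zero, whereas each point in `unique` receives a value of one.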

Shapley Value. The Shapley value [6] measures the value of a data point $z$ as the weighted average of the performance increase when $z$ is added to different subsets of the dataset $B$:

$$\Phi_s(z; U, B) = \frac{1}{m} \sum_{k=1}^{m} \frac{1}{\binom{m-1}{k-1}} \sum_{S \subseteq B \setminus \{z\}} \Delta(z; U, S), \tag{2}$$

where $|S| = k - 1$ for $k \in N$ and $\Delta(z; U, S) = U(S \cup \{z\}) - U(S)$. Thus, the data Shapley value is the weighted average of the marginal contribution $\Delta(z; U, S)$. It satisfies the following Shapley value axioms:

  • Dummy Player: If $U(S \cup \{z\}) = U(S) + e$ for all $S \subseteq B \setminus \{z\}$ and some $e \in \mathbb{R}$, then $\Phi(z; U, B) = e$.

  • Symmetry: If $U(S \cup \{z\}) = U(S \cup \{z'\})$ for all $S \subseteq B \setminus \{z, z'\}$, then $\Phi(z; U, B) = \Phi(z'; U, B)$.

  • Linearity: $\Phi(z; \alpha_1 U_1 + \alpha_2 U_2, B) = \alpha_1 \Phi(z; U_1, B) + \alpha_2 \Phi(z; U_2, B)$ for $\alpha_1, \alpha_2 \in \mathbb{R}$.

  • Efficiency: $\sum_{z \in N} \Phi(z; U, B) = \Phi(U, B)$.

Further details regarding the interpretation of the above axioms in the context of machine learning can be found in [6] and [5].
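For very small datasets, Eq. (2) can be evaluated exactly by enumerating all subsets. The sketch below does so for a toy problem and checks the symmetry and efficiency axioms numerically. The utility `coverage_utility` is a hypothetical stand-in for a trained model's score (with $U(\emptyset) = 0$), and the enumeration is exponential in $m$, so this is an illustration rather than a practical estimator.

```python
from itertools import combinations
from math import comb, isclose
from typing import Callable, Sequence

Point = tuple[int, str]  # (identifier, feature value); a hypothetical toy representation


def coverage_utility(subset: Sequence[Point]) -> float:
    """Toy stand-in for U: the number of distinct feature values in the subset."""
    return float(len({feature for _, feature in subset}))


def exact_shapley(
    dataset: Sequence[Point], utility: Callable[[Sequence[Point]], float]
) -> dict[Point, float]:
    """Evaluate Eq. (2) by brute force over all subsets S of B without z."""
    m = len(dataset)
    values: dict[Point, float] = {}
    for z in dataset:
        rest = [p for p in dataset if p != z]
        total = 0.0
        for k in range(1, m + 1):
            weight = 1.0 / comb(m - 1, k - 1)  # one weight per subset size k - 1
            for subset in combinations(rest, k - 1):
                s = list(subset)
                total += weight * (utility(s + [z]) - utility(s))  # marginal contribution
        values[z] = total / m
    return values


duplicated = [(0, "a"), (1, "a"), (2, "b"), (3, "b")]
shapley = exact_shapley(duplicated, coverage_utility)

# Efficiency: the values sum to U(B) because U of the empty set is 0 here.
efficiency_holds = isclose(sum(shapley.values()), coverage_utility(duplicated))
```

Unlike the LOO error, each duplicated point receives a non-zero value here (0.5 under this toy utility), and duplicates receive equal values, as the symmetry axiom requires.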

Production Function. In economics, a production function expresses the relationship between the specific quantities and combinations of different inputs a company uses and the amount of output it produces. Commonly used production functions include Linear, Leontief, Cobb–Douglas [49, 50], CES, and CRESH [51], each varying in their assumptions for the input and the output. The widespread usage of the Cobb-Douglas production function is attributed to its simplicity and adaptability. It assumes homogeneity of inputs and this principle is consistent with many machine learning setups.

Let $P(g)$ denote the production over a set of goods $g = (g_1, g_2, \ldots, g_n)$. The Cobb-Douglas production function is defined as

$$P(g) = A \prod_{i=1}^{n} g_i^{x_i}, \tag{3}$$

where $x_i$ is the elasticity parameter for good $i$, and $A$ is the total factor productivity or the quality factor. If the inputs are just labor $L$ and capital $K$, the production function is then

$$P = A L^{x} K^{y}. \tag{4}$$

It should be noted that the Cobb-Douglas production function also exhibits diminishing returns with respect to both labor and capital. The Law of Diminishing Returns [52] states that as the amount of a single factor of production is incrementally increased, the marginal output of a production process decreases. This property is analogous to how additional data points have diminishing effects on a machine learning model's performance. We therefore adapt the formulation of production functions in our proposed method to efficiently distribute the value of a cluster among its data members.
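The diminishing-returns behavior of Eq. (4) can be seen numerically. In this minimal sketch, the parameter values ($A = 1$, $x = y = 0.5$, $K = 4$) are assumptions chosen for illustration; diminishing marginal output in labor requires an exponent $x$ between 0 and 1.

```python
def cobb_douglas(labor: float, capital: float, A: float = 1.0,
                 x: float = 0.5, y: float = 0.5) -> float:
    """Eq. (4): P = A * L^x * K^y, with illustrative default parameters."""
    return A * labor**x * capital**y


capital = 4.0  # held fixed while labor grows

# Marginal output of one extra unit of labor, for L = 1, ..., 5.
marginal_outputs = [
    cobb_douglas(labor + 1, capital) - cobb_douglas(labor, capital)
    for labor in range(1, 6)
]

# Each additional unit of labor adds less output than the previous one.
is_diminishing = all(a > b for a, b in zip(marginal_outputs, marginal_outputs[1:]))
```

In the ML analogy drawn in the text, `labor` plays the role of the amount of data and `capital` the role of model capacity.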

4 Proposed Method

A two-stage approach is proposed for efficient data valuation. First, data points are clustered together based on shared characteristics. Then, a leave cluster out (LCO) technique is applied to estimate the value of each cluster. This cluster value is then distributed among its members to obtain the preliminary individual data valuations. In the following, we delve into the building blocks of the proposed method and discuss its properties compared to the original Shapley methods.

4.1 Leave-Cluster-Out

Cluster analysis is first performed on the given data, and the marginal contribution of a cluster $c$ can be expressed as

$$V_c = U(B) - U(B \setminus c). \tag{5}$$

The simple LOO error may provide an underestimated view of the true impact of specific data points, especially when similar data points remain in the dataset even after removal. Data Shapley alleviates this issue but suffers from high computational cost. By organizing data points into clusters based on their similarity, we ensure that when an entire cluster is removed, there are no closely-related points to mask the effect of its absence in other clusters. Consequently, this leads to a more precise assessment of the cluster’s marginal contribution, effectively approximating its value. Furthermore, this clustering approach significantly reduces the number of model training iterations needed in comparison to Data Shapley since evaluations are conducted at the cluster level instead of for each individual data point. Once we have obtained cluster-level valuations, the subsequent step involves efficiently approximating the values of individual data points within each cluster.
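Eq. (5) can be sketched on a toy problem as follows. The cluster assignment `cluster_of` is assumed to be given (this section does not fix a clustering algorithm), and `coverage_utility` is a hypothetical stand-in for retraining and scoring a model on the remaining data.

```python
from typing import Callable, Mapping, Sequence

Point = tuple[int, str]  # (identifier, feature value); a hypothetical toy representation


def coverage_utility(subset: Sequence[Point]) -> float:
    """Toy stand-in for U: the number of distinct feature values in the subset."""
    return float(len({feature for _, feature in subset}))


def leave_cluster_out(
    dataset: Sequence[Point],
    cluster_of: Mapping[Point, str],
    utility: Callable[[Sequence[Point]], float],
) -> dict[str, float]:
    """Eq. (5): V_c is U(B) minus U(B without cluster c), for every cluster c."""
    full_score = utility(dataset)  # one evaluation on the full dataset B
    cluster_ids = sorted(set(cluster_of.values()))
    return {
        c: full_score - utility([p for p in dataset if cluster_of[p] != c])
        for c in cluster_ids  # one further evaluation per cluster
    }


dataset = [(0, "a"), (1, "a"), (2, "b"), (3, "b"), (4, "b")]
cluster_of = {p: p[1] for p in dataset}  # assumed clustering: group by feature value
cluster_values = leave_cluster_out(dataset, cluster_of, coverage_utility)
```

With duplicates grouped into the same cluster, removing a cluster removes all similar points at once, so each cluster receives a non-zero value here even though the point-level LOO error would be zero. The number of utility evaluations is the number of clusters plus one.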

4.2 Value Propagation within a Cluster

Production Function for ML. We adapt the Cobb-Douglas production function to approximate the data value for ML. In this context, we can draw an analogy: the labor $L$ corresponds to the available data points for the model; the learning capacity or the number of parameters in the model represents the capital $K$; and the final output is the performance $U^T$ obtained on the test set. As both data quantity and model complexity exhibit diminishing returns, the Cobb-Douglas production function can be leveraged to effectively model learning performance. Therefore, we propose to approximate the model's performance after $e$ epochs as

$$U^T(S, N) = A f(S) h^T(N), \tag{6}$$

where $f(S)$ quantifies the informational utility of the dataset $S$ to the predictive efficacy $U$ of the model, $T$ denotes the task, and $h^T(N)$ represents the effect of the model capacity, which depends on $N$, the number of parameters of the model.

Then, for a new point $z$, the performance change $\Delta U$ in the model incurred by the small increase ($\Delta S = \{z\}$) in $S$ can be computed by

$$\begin{aligned}
\Delta U^T(S, N) &= A f(S + \Delta S) h^T(N) - A f(S) h^T(N) \\
&= A \left[ f(S + \Delta S) - f(S) \right] h^T(N) \\
&= A \left[ \frac{f(S + \Delta S) - f(S)}{o(z)} \right] h^T(N)\, o(z).
\end{aligned} \tag{7}$$

To better understand Eq. (7), let us consider $f$ as a smooth function of $x$ as specified in Eq. (6), i.e., $U^T(x, N) = A f(x) h^T(N)$. Thus, a minor change in $x$ leads to a change in $U^T$, which can be approximated by $A f'(x) h^T(N) \Delta x$. This allows us to interpret the expression enclosed in square brackets in Eq. (7) as effectively serving as the derivative of $f$ with respect to the set $S$, especially when considering incremental changes to $S$.

Also, in Eq. (7), $o(z)$ serves as an indicator of how a single data point enhances the model's overall performance and is a proxy for the $\Delta x$ discussed above. Therefore, the difference $f(S + \Delta S) - f(S)$ captures the marginal impact on the model's performance when dataset $S$ is augmented by a new data point. Analogous to the concept of derivatives in calculus, this difference, when normalized by the contribution $o(z)$ of the individual point, can be interpreted as the "rate of change" of $f$ upon the addition of a new data point. This rate is contingent on both the existing dataset $S$ and the new data point being added. That is,

$$U(S \cup \{z\}) - U(S) = \alpha^T(z) \beta(z, S), \tag{8}$$

where

$$\begin{aligned}
\alpha^T(z) &= A h^T(N)\, o(z), \\
\beta(z, S) &= \frac{f(S + \Delta S) - f(S)}{o(z)}.
\end{aligned}$$

Substituting the above into Eq. (2) then gives

\begin{align}
\Phi_s(z; U^T, B) &= \frac{1}{m} \sum_{k=1}^{m} \frac{1}{\binom{m-1}{k-1}} \sum_{\substack{S \subset B \setminus \{z\} \\ |S| = k-1}} \alpha^T(z) \beta(z, S) \tag{11} \\
&= \alpha^T(z) \beta^*(z, B), \tag{12}
\end{align}

where

$$\beta^*(z, B) = \frac{1}{m} \sum_{k=1}^{m} \frac{1}{\binom{m-1}{k-1}} \sum_{\substack{S \subset B \setminus \{z\} \\ |S| = k-1}} \beta(z, S). \tag{15}$$
Proposition 1

(Production Function Based Valuation for ML). Let $\alpha^T(z)$ denote the intrinsic value of a datum $z$, i.e., $\alpha^T(z)$ depends only on the characteristics of $z$. The interaction of $z$ with the rest of the data points in $B$ is captured by $\beta^*(z, B)$. From the equitable properties of data valuation in [6], we postulate that for every datum $z$ having an intrinsic value $\alpha^T(z)$, $\beta^*(z, B)$ acts as a multiplier or extrinsic factor that decreases the value of $z$ if similar data points are present in the dataset. Similarly, it increases the data value if $z$ is a unique datum. Then the data valuation can be performed as follows:

$$\Phi(z; U^T, B) = \alpha^T(z) \beta^*(z, B). \tag{16}$$

To simplify notation, we denote $\alpha^T(z)$ by $\alpha(z)$, and $\Phi(z; U^T, B)$ by $\Phi(z; U, B)$ for the rest of the discussion, since $T$ is invariant.
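The role of $\beta^*(z, B)$ as a multiplier can be illustrated on a toy dataset by evaluating Eq. (15) by brute force and then applying Eq. (16). Both the interaction term `beta` (1 when $S$ contains no point similar to $z$, 0 otherwise) and the intrinsic values in `ALPHA` are hypothetical choices made for this sketch; the paper does not prescribe them.

```python
from itertools import combinations
from math import comb
from typing import Sequence

Point = tuple[int, str]  # (identifier, feature value); a hypothetical toy representation

# Hypothetical intrinsic values alpha(z), keyed by feature value (illustration only).
ALPHA = {"a": 0.6, "b": 0.3}


def beta(z: Point, subset: Sequence[Point]) -> float:
    """Hypothetical beta(z, S): 1 if S holds no point similar to z, else 0."""
    return 0.0 if any(feature == z[1] for _, feature in subset) else 1.0


def beta_star(z: Point, dataset: Sequence[Point]) -> float:
    """Eq. (15): Shapley-weighted average of beta(z, S) over subsets S of B without z."""
    m = len(dataset)
    rest = [p for p in dataset if p != z]
    total = 0.0
    for k in range(1, m + 1):
        weight = 1.0 / comb(m - 1, k - 1)
        total += weight * sum(beta(z, s) for s in combinations(rest, k - 1))
    return total / m


dataset = [(0, "a"), (1, "a"), (2, "a"), (3, "b")]  # three similar points and one unique point
extrinsic = {z: beta_star(z, dataset) for z in dataset}
values = {z: ALPHA[z[1]] * extrinsic[z] for z in dataset}  # Eq. (16)
```

Under these assumptions, each of the three similar points gets an extrinsic factor of 1/3 while the unique point keeps a factor of 1, so the unique point ends up with the higher value (0.3 versus 0.2) despite its lower intrinsic value.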

Fast Data Valuation. Based on the above setup, we propose an efficient data valuation method that also works as an efficient proxy for Distributional Shapley [7] to predict the valuation of unseen data points in the distribution. The existing Data Shapley adheres to two fundamental axioms [12]: symmetry and efficiency. Symmetry states that points $z$ and $z'$ that contribute equally to the model's performance, i.e., $U(S \cup \{z\}) = U(S \cup \{z'\})$ for all $S \subseteq B \setminus \{z, z'\}$, should have the same value. Efficiency, on the other hand, ensures that the aggregate value of all data points aligns with the overall performance achieved after training on the entire dataset.

Proposition 2

(Fast Data Valuation of Cluster Data Members) The symmetry and efficiency properties, when applied to a specific cluster, imply that the data points within a cluster, which are characterized by similar features, will likely possess similar values, and that a cluster's value can be accurately represented as the sum of its constituent data points' valuations.

Let $V_{c}$ ($=\Phi_{c}$) be the value of cluster $c$. The initial value assigned to any data point $z_{i}$ within this cluster is

$$V_{i}=V_{c}/n_{c}, \tag{17}$$

where $n_{c}$ is the number of data points in cluster $c$. Using this cluster-level assignment of the initial data value, we estimate the actual data value based on Eq. (16) as

$$V_{i}^{*}=\alpha_{i}\beta_{i}^{*}. \tag{18}$$

Estimating $\alpha$ and $\beta^{*}$. Assuming each cluster contains an equal number of data points, the distribution of similar and dissimilar samples encountered by each datum becomes roughly uniform. This results in a near-constant extrinsic factor $\beta^{*}(z,B)$ across all data points. Thus, the values of these data points are directly proportional to $\alpha(z_{i})$. We use $Q_{i}$ to denote the value of an individual datum, to differentiate it from the value $V_{i}$ that is initialized from the cluster value in Eq. (17).

Theorem 4.1

For data point $z_{i}$, assuming there is no error in $\beta_{i}^{*}$, its adjusted value $V_{i}^{\Delta\alpha_{i}}$ is

$$V_{i}^{\Delta\alpha_{i}}=(\alpha_{i}+\Delta\alpha_{i})\beta_{i}^{*}=\Gamma_{\alpha_{i}}\alpha_{i}\beta_{i}^{*}, \tag{19}$$

where $\Gamma_{\alpha_{i}}$ is an adjustment factor for $\alpha_{i}$:

$$\Gamma_{\alpha_{i}}=1+\frac{Q_{i}}{\sum_{z_{j}\in c}Q_{j}}V_{c}. \tag{20}$$
Corollary 4.1.1

When all data points in a cluster are exactly the same, the adjustment factor should equal $1$, so that the value of each point in $c$ remains $V_{i}$. However, the above formulation of $\Gamma_{\alpha_{i}}$ yields $1+V_{c}/n_{c}$ in that case, since $Q_{i}$ and $Q_{j}$ are equal for any $i$, $j$. Thus, we normalize $\Gamma_{\alpha_{i}}$ as follows:

$$\Gamma_{\alpha_{i}}=\frac{1}{1+V_{c}/n_{c}}\left(1+\frac{Q_{i}}{\sum_{z_{j}\in c}Q_{j}}V_{c}\right). \tag{21}$$
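The normalized factor in Eq. (21) can be evaluated directly. The sketch below is a minimal illustration, assuming the per-point scores $Q_{j}$ of one cluster are already available and have a non-zero sum; the function name `adjustment_factor` and all numbers are hypothetical.

```python
def adjustment_factor(scores, cluster_value):
    """Normalized adjustment factor of Eq. (21) for every point of one cluster.

    scores: per-point scores Q_j of the cluster (assumed to have a non-zero sum).
    cluster_value: the cluster value V_c.
    """
    n_c = len(scores)
    total = sum(scores)
    norm = 1.0 + cluster_value / n_c  # normalizing denominator 1 + V_c / n_c
    return [(1.0 + (q / total) * cluster_value) / norm for q in scores]

# Identical points: every factor should come out as 1.
identical = adjustment_factor([0.2, 0.2, 0.2, 0.2], cluster_value=0.08)

# Unequal scores: points with larger Q_i get a factor above 1, others below 1.
mixed = adjustment_factor([0.1, 0.2, 0.3, 0.4], cluster_value=0.08)
```

For identical scores every factor evaluates to 1, as the corollary requires. Because the shares $Q_{i}/\sum_{j}Q_{j}$ sum to one, the factors of a cluster average to 1 in this sketch, so the adjusted values $V_{i}\Gamma_{\alpha_{i}}$ still add up to $V_{c}$.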

Similar to $\alpha_{i}$, we find the adjustment factor for $\beta_{i}^{*}$, i.e., $\Gamma_{\beta_{i}^{*}}$. The factor $\beta_{i}^{*}$ measures the interaction of $z_{i}$ with all other data points in $B$. As all data points similar to $z_{i}$ belong to the same cluster, $\beta_{i}^{*}$ is only affected by the other members of $z_{i}$'s cluster. We use the distance between $z_{i}$ and the cluster centroid as a measure of its belongingness to the cluster, i.e., its similarity to the other points in the cluster.

Theorem 4.2

For data point $z_{i}$, assuming no error in $\alpha_{i}$, its adjusted value $V_{i}^{\Delta\beta_{i}^{*}}$ is

$$V_{i}^{\Delta\beta_{i}^{*}}=\alpha_{i}(\beta_{i}^{*}+\Delta\beta_{i}^{*})=\Gamma_{\beta_{i}^{*}}\alpha_{i}\beta_{i}^{*}, \tag{22}$$

where $d_{i}$ is the distance of $z_{i}$ from the cluster centroid and $\Gamma_{\beta_{i}^{*}}$ is the adjustment factor

$$\Gamma_{\beta_{i}^{*}}=\frac{1}{1+V_{c}/n_{c}}\left(1+\frac{d_{i}}{\sum_{z_{j}\in c}d_{j}}V_{c}\right). \tag{23}$$

Production Function based Data Value Estimation. The final approximate value $\hat{\Phi}_{i}$ of the data point is

$$\hat{\Phi}_{i}=(\alpha_{i}+\Delta\alpha_{i})(\beta_{i}^{*}+\Delta\beta_{i}^{*}). \tag{24}$$

Ignoring the second-order term $\Delta\alpha_{i}\Delta\beta_{i}^{*}$ then gives

$$\hat{\Phi}_{i}\approx(\alpha_{i}+\Delta\alpha_{i})\beta_{i}^{*}+\alpha_{i}(\beta_{i}^{*}+\Delta\beta_{i}^{*})-\alpha_{i}\beta_{i}^{*}. \tag{25}$$
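The step from Eq. (24) to Eq. (25) can be spelled out by expanding the product and dropping the second-order term. The block below only restates those two equations in the same symbols and introduces nothing new.

```latex
\begin{align*}
\hat{\Phi}_{i}
  &= (\alpha_{i}+\Delta\alpha_{i})(\beta_{i}^{*}+\Delta\beta_{i}^{*}) \\
  &= \alpha_{i}\beta_{i}^{*} + \Delta\alpha_{i}\,\beta_{i}^{*}
     + \alpha_{i}\,\Delta\beta_{i}^{*} + \Delta\alpha_{i}\,\Delta\beta_{i}^{*} \\
  &\approx \alpha_{i}\beta_{i}^{*} + \Delta\alpha_{i}\,\beta_{i}^{*}
     + \alpha_{i}\,\Delta\beta_{i}^{*} \\
  &= (\alpha_{i}+\Delta\alpha_{i})\beta_{i}^{*}
     + \alpha_{i}(\beta_{i}^{*}+\Delta\beta_{i}^{*}) - \alpha_{i}\beta_{i}^{*}.
\end{align*}
```

The last line groups the terms so that each bracket matches one of the single-factor corrections, which is what allows the two adjusted values to be substituted separately.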

Substituting Eq. (18), Eq. (19), and Eq. (22), we obtain

$$\begin{aligned}
\hat{\Phi}_{i} &= V_{i}^{\Delta\alpha_{i}}+V_{i}^{\Delta\beta_{i}^{*}}-V_{i} \\
&= V_{i}\left(\Gamma_{\alpha_{i}}+\Gamma_{\beta_{i}^{*}}-1\right) \\
&= V_{i}\left[\frac{1}{1+V_{c}/n_{c}}\left(1+\frac{Q_{i}}{\sum_{z_{j}\in c}Q_{j}}V_{c}\right)+\frac{1}{1+V_{c}/n_{c}}\left(1+\frac{d_{i}}{\sum_{z_{j}\in c}d_{j}}V_{c}\right)-1\right].
\end{aligned} \tag{26}$$

For the reader’s convenience, Algorithm 1 outlines the implementation steps of the EcoVal efficient data valuation framework.
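A minimal Python sketch of the per-point part of Algorithm 1 (cluster valuation, the two correction terms, and the final value) is given below. It is an illustration under stated assumptions and not the reference implementation: the clustering step, TMC-Shapley, and the regression model $R$ are not implemented, so cluster labels and predicted scores are passed in as inputs; all given points are treated as the training set $B$; and the function names, the toy utility, and the equal-share fallback for a zero denominator are hypothetical choices of this sketch.

```python
from math import dist

def _share(x, total, n):
    # Fraction x / total; falls back to an equal share when total is zero
    # (an assumption of this sketch, e.g. when all points coincide).
    return x / total if total else 1.0 / n

def ecoval_values(embeddings, labels, q_scores, utility):
    """embeddings: feature vectors E(x_i); labels: cluster id per point;
    q_scores: predicted scores q_i; utility: maps a list of indices to U."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)

    everyone = list(range(len(labels)))
    full_utility = utility(everyone)
    values = [0.0] * len(labels)
    for members in clusters.values():
        n_c = len(members)
        member_set = set(members)
        rest = [i for i in everyone if i not in member_set]
        v_c = full_utility - utility(rest)  # cluster value: U(B) - U(B \ c)
        v_i = v_c / n_c                     # initial per-point value, Eq. (17)

        # Centroid of the cluster and distance d_i of each member to it.
        dims = range(len(embeddings[members[0]]))
        centroid = [sum(embeddings[i][k] for i in members) / n_c for k in dims]
        d = {i: dist(embeddings[i], centroid) for i in members}
        q_sum = sum(q_scores[i] for i in members)
        d_sum = sum(d.values())

        norm = 1.0 + v_c / n_c
        for i in members:
            gamma_alpha = (1.0 + _share(q_scores[i], q_sum, n_c) * v_c) / norm
            gamma_beta = (1.0 + _share(d[i], d_sum, n_c) * v_c) / norm
            values[i] = v_i * (gamma_alpha + gamma_beta - 1.0)  # Eq. (26)
    return values

# Toy data: two clusters in a 2-D embedding space.
embeddings = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0), (10.0, 2.0)]
labels = [0, 0, 1, 1, 1]
q_scores = [0.1, 0.3, 0.2, 0.2, 0.2]

def toy_utility(indices):
    # Hypothetical utility: each represented cluster adds a fixed amount.
    present = {labels[i] for i in indices}
    return (0.5 if 0 in present else 0.0) + (0.3 if 1 in present else 0.0)

values = ecoval_values(embeddings, labels, q_scores, toy_utility)
```

In this toy run, the per-point values within each cluster still add up to the cluster value, the point with the larger predicted score in cluster 0 receives the larger value, and the point sitting exactly on the centroid of cluster 1 receives the smallest value in that cluster.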

4.3 Discussion: Comparison with Original Shapley

Let $E(z)$ denote an appropriate embedding of a data point $z$, obtained from a machine learning model or from the pre-final layer of a deep learning model. We extend the notion of Lipschitz stability of Data Shapley introduced in [7] to estimate the difference in value between data points. We use the proximity of the embeddings $E(z)$ as a proxy for closeness in the underlying data distribution and formalize this in the following theorem.

Theorem 4.3

For any $z_{j}$, $z_{k}$, if $\|E(z_{j})-E(z_{k})\|<\epsilon$, then $|\Phi(z_{j})-\Phi(z_{k})|\leq\epsilon_{1}$ for very small $\epsilon,\epsilon_{1}\geq 0$.

From the principle of clustering, a datum $z_{j}$ belongs to cluster $c$ if

$$\|E(z_{j})-E(z_{k})\|\leq\epsilon,\quad\forall z_{k}\in c, \tag{27}$$

then for this cluster

$$|\Phi(z_{j})-\Phi(z_{k})|\leq\epsilon_{1},\quad\forall z_{k},z_{j}\in c. \tag{28}$$
Algorithm 1 EcoVal Data Valuation
$M(\cdot;\psi)$: fully trained model
$B$: training dataset
$B_{D}$: set of available points from the underlying distribution of $B$
$M_{-n}(x;\psi)\leftarrow$ embedding of data $x$ obtained from the $n^{\text{th}}$ last layer of the model
Let $E(x)=M_{-n}(x;\psi)$
Let $A_{c}$ be a clustering algorithm; then $(x_{i},c_{j})\leftarrow A_{c}(B_{D})\ \forall x_{i}\in B_{D}$, where $c_{j}\in C$ is the cluster associated with $x_{i}$ and $C$ is the set of all clusters
Find the valuation at cluster level:
    $V_{c_{j}}=U(B)-U(B\setminus c_{j})\ \forall c_{j}\in C$
Initialize the value $V_{i}$ for each cluster member $x_{i}$:
    $V_{i}=V_{c_{j}}/n_{c_{j}}$, where $n_{c_{j}}$ is the number of elements in the cluster $c_{j}$ to which $x_{i}$ belongs
Initialize: $D\leftarrow[\,]$
for $c_{j}\in C$ do
    Sample $X_{j}=\{x_{1}^{j},x_{2}^{j},\dots,x_{n_{c}}^{j}\}$ from $c_{j}$
    $D\leftarrow D\cup X_{j}$
end for
Run TMC-Shapley [6]:
    $(x_{k},v_{TMC_{k}})\leftarrow TMC(U^{T},D)\ \forall x_{k}\in D$
Train a regression model $R$ on the sampled data $\{(x_{1},v_{TMC_{1}}),(x_{2},v_{TMC_{2}}),\dots,(x_{|D|},v_{TMC_{|D|}})\}$
for $c_{j}\in C$ do
    $(x_{i}^{j},q_{i}^{j})\leftarrow R(x_{i}^{j})\ \forall x_{i}^{j}\in c_{j}$
    Let $\bar{x}_{c_{j}}$ be the centroid of the cluster $c_{j}$
    $(x_{i}^{j},d_{i}^{j})\leftarrow \mathrm{distance}(x_{i}^{j},\bar{x}_{c_{j}})\ \forall x_{i}^{j}\in c_{j}$
end for
for $x_{i}\in B$ do
    Find the correction term for $\alpha$:
    $\Gamma_{\alpha_{i}}=\frac{1}{1+V_{c_{j}}/n_{c_{j}}}\left(1+\frac{q_{i}^{j}}{\sum_{z_{k}\in c_{j}}q_{k}^{j}}V_{c_{j}}\right)$
    Find the correction term for $\beta_{i}^{*}$:
    $\Gamma_{\beta_{i}^{*}}=\frac{1}{1+V_{c_{j}}/n_{c_{j}}}\left(1+\frac{d_{i}^{j}}{\sum_{z_{k}\in c_{j}}d_{k}^{j}}V_{c_{j}}\right)$
    Final valuation $=V_{i}\cdot(\Gamma_{\alpha_{i}}+\Gamma_{\beta_{i}^{*}}-1)$
end for

This means that all Shapley values within the cluster lie in an $\epsilon_{1}$ interval. Therefore, $\Phi(z_{j})$ for any $z_{j}\in c$ can be expressed as

$$\Phi(z_{j})=\bar{\Phi}_{c}+\delta(z_{j}), \tag{29}$$

where $\delta(z_{j})\leq\epsilon/2$ and $\bar{\Phi}_{c}$ lies somewhere in the $\epsilon_{1}$ interval. By the Shapley axioms, the value $V_{c}$ of this cluster is

$$V_{c}=\sum_{z_{j}\in c}\Phi(z_{j})=n_{c}\bar{\Phi}_{c}+\sum_{z_{j}\in c}\delta(z_{j}). \tag{30}$$
Theorem 4.4

The difference between the original Shapley value and our proposed approximated data value is

ΔΦiΔsubscriptΦ𝑖\displaystyle\Delta\Phi_{i}roman_Δ roman_Φ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ncΦ¯cδRzjcQj+nc2Φ¯cQiδR(zjcQj)2.absentsubscript𝑛𝑐subscript¯Φ𝑐subscript𝛿𝑅subscriptsubscript𝑧𝑗𝑐subscript𝑄𝑗superscriptsubscript𝑛𝑐2subscript¯Φ𝑐subscript𝑄𝑖subscript𝛿𝑅superscriptsubscriptsubscript𝑧𝑗𝑐subscript𝑄𝑗2\displaystyle\approx\frac{n_{c}\bar{\Phi}_{c}\delta_{R}}{\sum_{z_{j}\in c}Q_{j% }}+\frac{n_{c}^{2}\bar{\Phi}_{c}Q_{i}\delta_{R}}{(\sum_{z_{j}\in c}Q_{j})^{2}}.≈ divide start_ARG italic_n start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT over¯ start_ARG roman_Φ end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ italic_c end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG + divide start_ARG italic_n start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT over¯ start_ARG roman_Φ end_ARG start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT end_ARG start_ARG ( ∑ start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ italic_c end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG . (31)

Both the average Shapley value within a cluster, $\bar{\Phi}_c$, and the individual point contribution, $Q_i$, are intrinsically limited in magnitude, so both remain within a bounded range. As the cluster size $n_c$ increases, the predicted aggregate value $\sum_{z_j \in c} Q_j$ grows proportionately, which restricts the growth of $\frac{n_c}{\sum_{z_j \in c} Q_j}$. Additionally, a moderately accurate regression model ensures a low error $\delta_R$. Therefore, our method produces Shapley value estimates $\Phi_i$ with a small margin of error.
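To get a feel for the size of the two terms in Eq. (31), the right-hand side can be evaluated numerically. The sketch below is only illustrative: the function name and all input values (cluster size, average Shapley value, predicted values, regression error) are hypothetical and not taken from the experiments.

```python
def shapley_error_estimate(n_c, phi_bar_c, q_i, q_sum, delta_r):
    """Evaluate the right-hand side of Eq. (31) for one data point.

    n_c       -- cluster size
    phi_bar_c -- average Shapley value within the cluster
    q_i       -- predicted value Q_i of the point of interest
    q_sum     -- sum of predicted values Q_j over the cluster
    delta_r   -- maximum error of the regression model
    """
    if q_sum == 0:
        raise ValueError("sum of predicted values must be non-zero")
    first_term = n_c * phi_bar_c * delta_r / q_sum
    second_term = (n_c ** 2) * phi_bar_c * q_i * delta_r / (q_sum ** 2)
    return first_term + second_term


# Hypothetical cluster: 100 points, each with predicted value 0.01.
n_c = 100
q = [0.01] * n_c
err = shapley_error_estimate(
    n_c, phi_bar_c=0.01, q_i=q[0], q_sum=sum(q), delta_r=0.001
)
print(f"estimated deviation: {err:.6f}")
```

Because both terms are linear in $\delta_R$, halving the regression error halves the estimated deviation, which matches the qualitative argument above.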

The detailed proof of Theorem 4.4 is provided in the Appendix.

5 Experiments

We show the broad effectiveness of the proposed valuation framework and its general applicability to machine learning models through empirical evidence. We estimate the value of data for machine learning models on the MNIST, CIFAR10, and CIFAR100 datasets. We compare our method with Data Shapley [6] and Distributional Shapley [7].

Experiment Settings. Following common practice in previous works, we extract the features from the last layer of a pre-trained network and apply Shapley valuation to this embedding vector. We sample a small subset of 200 samples from the original training data and run the baseline methods TMC-Shapley (Data Shapley) and Distributional Shapley on it. A further 2,000 samples are used for testing and as the holdout set for the Shapley calculation. We set aside 10,000 samples that are never seen by the model or by any valuation method; we call this the out-of-sample (OOS) set. The remaining samples serve as the data distribution and are exposed to Distributional Shapley, and to our method during the clustering step and the $\alpha$ correction step. We use Gaussian Mixture Models (GMM) for clustering. Our proposed method works for both in-distribution and OOS samples. As Data Shapley only works for in-distribution samples, we compare our results with Distributional Shapley for out-of-sample data.
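The partition described above can be sketched as follows. The split sizes (200, 2,000, 10,000) come from the text; the pool size of 60,000, the uniform random shuffle, the seed, and the function name are assumptions made for illustration, since the sampling procedure is not specified. Feature extraction and GMM clustering are not shown.

```python
import random


def split_indices(n_total, n_valuation=200, n_holdout=2_000, n_oos=10_000, seed=0):
    """Partition sample indices into the four disjoint sets used in the experiments.

    The shuffle-based assignment is an assumption; only the set sizes are
    taken from the experiment description.
    """
    if n_valuation + n_holdout + n_oos > n_total:
        raise ValueError("not enough samples for the requested split sizes")
    rng = random.Random(seed)
    idx = list(range(n_total))
    rng.shuffle(idx)
    a = n_valuation
    b = a + n_holdout
    c = b + n_oos
    return {
        "valuation": idx[:a],      # subset valued by the baseline methods
        "holdout": idx[a:b],       # test / holdout set for the Shapley calculation
        "oos": idx[b:c],           # out-of-sample set, never seen during valuation
        "distribution": idx[c:],   # remaining samples, used as the data distribution
    }


# Hypothetical pool of 60,000 samples.
splits = split_indices(60_000)
print({name: len(ids) for name, ids in splits.items()})
```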

Figure 1: Computation cost in terms of the number of training iterations required for a given dataset size. We compare EcoVal with TMC-Shapley (also known as Data Shapley), Distributional Shapley, and a lightweight version of EcoVal. Our method requires a substantially lower number of training iterations for data valuation.

5.1 Comparative Analysis of the Computational Time

The Data Shapley approximation method TMC-Shapley [6] converges in approximately $3\lvert B\rvert$ (or $3 \times m$ in Eq. 2) Monte Carlo samples. Each Monte Carlo sample is a random permutation of the data points in the training set. The marginal contribution of a data point $z$ in a given permutation is the performance difference between the model trained on the data points that precede it, say $S$, and the model trained on $S \cup \{z\}$. Points are added sequentially, so a single Monte Carlo sample requires $\lvert B\rvert$ training runs. The total number of training runs is therefore of the order $O(\lvert B\rvert^2)$. The time complexity of Distributional Shapley [7] is similar, with $T$ runs over different subsets $S$ of the underlying data distribution needed to obtain an unbiased estimate. This brings the number of training runs of Distributional Shapley to $O(T \cdot \lvert B\rvert^2)$.
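The permutation sampling described above, and the resulting count of training runs, can be illustrated with a minimal sketch. This is plain Monte Carlo permutation sampling without the truncation step of TMC-Shapley, and the `utility` function is a toy stand-in for "train a model on this subset and measure its performance"; all names and values are hypothetical.

```python
import random


def permutation_shapley(points, utility, num_permutations, seed=0):
    """Estimate Shapley values by averaging marginal contributions over
    random permutations. Returns the estimates and the number of utility
    evaluations (each one stands in for a training run)."""
    rng = random.Random(seed)
    values = {z: 0.0 for z in points}
    evaluations = 0
    for _ in range(num_permutations):
        perm = list(points)
        rng.shuffle(perm)
        prefix = []
        prev_score = utility(prefix)  # baseline score of the empty set
        for z in perm:
            prefix.append(z)
            score = utility(prefix)   # one "training run" on S ∪ {z}
            evaluations += 1
            values[z] += score - prev_score
            prev_score = score
    for z in values:
        values[z] /= num_permutations
    return values, evaluations


# Toy additive utility: each point contributes a fixed amount.
weights = {"a": 0.5, "b": 0.25, "c": 0.125}


def toy_utility(subset):
    return sum(weights[z] for z in subset)


points = list(weights)
values, evaluations = permutation_shapley(
    points, toy_utility, num_permutations=3 * len(points)
)
print(values, evaluations)
```

With $3\lvert B\rvert$ permutations and $\lvert B\rvert$ evaluations per permutation, the sketch performs $3\lvert B\rvert^2$ evaluations, which is the quadratic growth discussed above. For an additive utility the estimates coincide with the per-point weights.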

Our method performs clustering, which takes less time than training a machine learning or deep learning model in most real-world scenarios. This is a one-time effort, so its complexity is of the order $O(1)$. Estimating the value of each cluster requires $O(p)$ training runs. In addition, our method runs Data Shapley on a curated subset of size $p$ containing an equal number of points from each cluster, which takes $O(p^2)$ time. The size $p$ of this subset is much smaller than $\lvert B\rvert$. The total number of training runs required is of the order $O(1) + O(p^2) + O(p)$. Compared to the existing Data Shapley based methods, EcoVal is significantly faster, as shown in Figure 1. The benefit of EcoVal becomes more evident as the dataset size increases. Our method without the $\alpha$ correction is even faster, with negligible loss in valuation quality.
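A back-of-the-envelope comparison of these orders of growth can be written down directly. The constants below are assumptions chosen for illustration: a factor of 3 permutations per point (borrowed from the TMC-Shapley convergence heuristic), exactly $p$ runs for the cluster values, and the clustering cost ignored because it involves no model training. The dataset size, subset size, and $T$ are hypothetical.

```python
def tmc_training_runs(dataset_size, permutations_per_point=3):
    """Approximate training runs for TMC-Shapley: about 3|B| permutations
    with |B| sequential training runs each."""
    return permutations_per_point * dataset_size * dataset_size


def distributional_training_runs(dataset_size, num_repeats, permutations_per_point=3):
    """Approximate training runs for Distributional Shapley, assuming T
    repeats of a TMC-like procedure."""
    return num_repeats * tmc_training_runs(dataset_size, permutations_per_point)


def ecoval_training_runs(subset_size, permutations_per_point=3):
    """Approximate training runs for the clustering-based scheme:
    Data Shapley on a curated subset of size p, plus p runs for cluster values."""
    shapley_on_subset = permutations_per_point * subset_size * subset_size
    cluster_value_runs = subset_size
    return shapley_on_subset + cluster_value_runs


# Hypothetical sizes: |B| = 10,000 points, curated subset p = 200, T = 5.
B, p, T = 10_000, 200, 5
print("TMC-Shapley:          ", tmc_training_runs(B))
print("Distributional Shapley:", distributional_training_runs(B, T))
print("Clustering-based:      ", ecoval_training_runs(p))
```

Under these assumed constants the gap widens quadratically with $\lvert B\rvert$ as long as $p$ stays fixed, which is the qualitative trend the complexity argument predicts.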

Figure 2: Accuracy difference with respect to the % of data points added or removed. We add or remove the highest-valued data points first and then the lower-valued ones. The top, middle, and bottom rows show the results for MNIST, CIFAR10, and CIFAR100, respectively, with in-distribution valuation. EcoVal gives comparable or better performance than Data Shapley and Distributional Shapley.
Figure 3: Data valuation on out-of-sample data (left to right: MNIST, CIFAR10, CIFAR100). Our EcoVal method outperforms Distributional Shapley, showing a steeper performance drop as a larger % of the most valuable data is removed.

5.2 Data Point Addition and Removal Experiments

We evaluate the data valuation methods by running the data point addition and removal experiments proposed in [6]. For a given model and dataset, the data points are added in order of predicted value, from highest to lowest, and the model is retrained after each addition. Similarly, in a second experiment we remove the samples with the highest values and observe the drop in performance. The impact of removing and adding high-value data points helps us measure the effectiveness of data valuation techniques. We compare our results with the state-of-the-art Data Shapley and Distributional Shapley valuation methods.

Removing Most Valued Data Points. We predict the values of the data points using each valuation method and measure the drop in model performance when the most valued data points of each method are removed. If a method identifies the truly valuable points, removing them causes a larger drop in performance. So, for the removal of the most valued points, the method that results in the larger performance drop is the better valuation method.

Adding Most Valued Data Points. This experiment is the converse of the previous one: we add the most valued data points to the training set and observe the increase in performance. A larger increase on adding the top data points indicates a better valuation method. Figure 2 shows the performance drop and increase upon removing and adding the most valued points, respectively. It can be observed that the performance drop of EcoVal is slightly smaller than that of Data Shapley but significantly larger than that of Distributional Shapley, which is the desired behavior. Similar patterns can be observed in the data addition graphs.

Removing Most Valuable Data Points from Out-of-sample Set. As discussed earlier, EcoVal supports data valuation for out-of-sample data as well, which is otherwise supported only by Distributional Shapley. Therefore, we compare the OOS valuation results of these two methods. Figure 3 shows the performance drop caused by removing the most valuable points from an out-of-sample set of size $10{,}000$. It can be observed that the performance drop of EcoVal is much larger than that of Distributional Shapley. The steep drop in performance after removing the most valuable data points implies better precision of data valuation in our EcoVal framework.

5.3 Effect of the Adjustment Terms in EcoVal

We observe the effect of removing the adjustment terms $\Gamma_{\beta_i^{*}}$, $\Gamma_{\alpha_i}$, or both from the EcoVal framework and show the results in Figure 4. The full EcoVal framework with both the $\alpha$ and $\beta$ terms performs best in general. Eliminating one of the adjustment terms deteriorates the quality of the valuation by a small margin. Removing both corrections significantly impacts the quality of the obtained valuations. This is particularly visible in the initial phase of adding the most significant data points. It should be noted that eliminating $\Gamma_{\alpha_i}$ affects the valuation quality only marginally, but completely removes the need for model training, giving an even more efficient version of our valuation method.

Figure 4: Effect of adding different adjustment terms (see Section 4.2). EcoVal: the full proposed method; EcoVal_no_alpha: removal of the $\alpha$ adjustment term; EcoVal_no_beta: removal of the $\beta$ adjustment term; EcoVal_no_adjustment: EcoVal without any adjustment terms, i.e. the mean of the cluster value is used as the data value.

6 Conclusion

This work presents a focused study on improving the speed of data valuation in machine learning models. We develop an efficient data valuation method that is fast and practical for working with large datasets. Our method works for both in-distribution and out-of-sample data. The proposed EcoVal data valuation framework shows comparable and sometimes even better results than the existing approaches for in-distribution data. For out-of-sample data points, our method significantly outperforms competing methods, thereby establishing a new state-of-the-art. This demonstrates our method's utility in a data market, where new data points analogous to our out-of-sample set are generated continuously. Our valuation also shows a negligible error margin relative to the vanilla Shapley value approximation. Together, these points make the proposed method a robust and scalable approach to estimating data value across a variety of machine learning models.

References

  • [1] Tarun Wadhwa. Economic impact and feasibility of data dividends, 2020.
  • [2] Jian Pei. Data pricing – from economics to data science. Association for Computing Machinery, 2020.
  • [3] Siyi Tang, Amirata Ghorbani, Rikiya Yamashita, Sameer Rehman, Jared A Dunnmon, James Zou, and Daniel L Rubin. Data valuation for medical imaging using shapley value and application to a large-scale chest x-ray dataset. Scientific reports, 11(1):8366, 2021.
  • [4] Bojan Karlaš, David Dao, Matteo Interlandi, Bo Li, Sebastian Schelter, Wentao Wu, and Ce Zhang. Data debugging with shapley importance over end-to-end machine learning pipelines. arXiv preprint arXiv:2204.11131, 2022.
  • [5] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. Towards efficient data valuation based on the shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1167–1176. PMLR, 2019.
  • [6] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International conference on machine learning, pages 2242–2251. PMLR, 2019.
  • [7] Amirata Ghorbani, Michael Kim, and James Zou. A distributional framework for data valuation. In International Conference on Machine Learning, pages 3535–3544. PMLR, 2020.
  • [8] Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Understanding predictions with data and data with predictions. In International Conference on Machine Learning, pages 9525–9587. PMLR, 2022.
  • [9] LS Shapley. A value for n-person games. In Contributions to the Theory of Games (AM-28), Volume II, pages 307–318. Princeton University Press, 1953.
  • [10] Yongchan Kwon, Manuel A Rivas, and James Zou. Efficient computation and analysis of distributional shapley values. In International Conference on Artificial Intelligence and Statistics, pages 793–801. PMLR, 2021.
  • [11] Yongchan Kwon and James Zou. Beta shapley: a unified and noise-reduced data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pages 8780–8802. PMLR, 2022.
  • [12] Jiachen T Wang and Ruoxi Jia. Data banzhaf: A robust data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pages 6388–6421. PMLR, 2023.
  • [13] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700, 2019.
  • [14] Lloyd S Shapley et al. A value for n-person games. 1953.
  • [15] Ehud Kalai and Dov Samet. On weighted shapley values. International journal of game theory, 16:205–222, 1987.
  • [16] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of game theory, 28:547–565, 1999.
  • [17] Roger B Myerson. Graphs and cooperation in games. Mathematics of operations research, 2(3):225–229, 1977.
  • [18] Robert J Aumann and Lloyd S Shapley. Values of non-atomic games. Princeton University Press, 2015.
  • [19] Faruk Gul. Bargaining foundations of shapley value. Econometrica: Journal of the Econometric Society, pages 81–95, 1989.
  • [20] Hervé Moulin. An application of the shapley value to fair division with money. Econometrica: Journal of the Econometric Society, pages 1331–1349, 1992.
  • [21] Alvin E Roth and Robert E Verrecchia. The shapley value as applied to cost allocation: a reinterpretation. Journal of Accounting Research, pages 295–303, 1979.
  • [22] Eda Kemahlıoğlu-Ziya and John J Bartholdi III. Centralizing inventory in supply chains by using shapley value to allocate the profits. Manufacturing & Service Operations Management, 13(2):146–162, 2011.
  • [23] Pradeep Dubey, Abraham Neyman, and Robert James Weber. Value theory without efficiency. Mathematics of Operations Research, 6(1):122–128, 1981.
  • [24] Raghav Singal, Omar Besbes, Antoine Desir, Vineet Goyal, and Garud Iyengar. Shapley meets uniform: An axiomatic framework for attribution in online advertising. In The World Wide Web Conference, pages 1713–1723, 2019.
  • [25] Shay Cohen, Eytan Ruppin, and Gideon Dror. Feature selection based on the shapley value. In Proceedings of the 19th international joint conference on Artificial intelligence, pages 665–670, 2005.
  • [26] Mohammad Zaeri-Amirani, Fatemeh Afghah, and Sajad Mousavi. A feature selection method based on shapley value to false alarm reduction in icus a genetic-algorithm approach. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 319–323. IEEE, 2018.
  • [27] Anish Agarwal, Munther Dahleh, and Tuhin Sarkar. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 701–726, 2019.
  • [28] Raul Castro Fernandez, Pranav Subramaniam, and Michael J Franklin. Data market platforms: Trading data assets to solve data problems. Proceedings of the VLDB Endowment, 13(11).
  • [29] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
  • [30] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. L-shapley and c-shapley: Efficient model interpretation for structured data. In International Conference on Learning Representations, 2018.
  • [31] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pages 9269–9278. PMLR, 2020.
  • [32] Amirata Ghorbani and James Y Zou. Neuron shapley: Discovering the responsible neurons. Advances in neural information processing systems, 33:5922–5932, 2020.
  • [33] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, and Dawn Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12(11).
  • [34] Jiachen T. Wang and Ruoxi Jia. Data banzhaf: A robust data valuation framework for machine learning. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 6388–6421. PMLR, 25–27 Apr 2023.
  • [35] Ki Nohyun, Hoyong Choi, and Hye Won Chung. Data valuation without training of a model. In The Eleventh International Conference on Learning Representations, 2022.
  • [36] Ian Covert, Scott Lundberg, and Su-In Lee. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209):1–90, 2021.
  • [37] Ian Covert and Su-In Lee. Improving kernelshap: Practical shapley value estimation using linear regression. In International Conference on Artificial Intelligence and Statistics, pages 3457–3465. PMLR, 2021.
  • [38] Rui Wang, Xiaoqian Wang, and David I Inouye. Shapley explanation networks. In International Conference on Learning Representations, 2020.
  • [39] Sudhanshu K Mishra. A brief history of production functions. Available at SSRN 1020577, 2007.
  • [40] Robert J Barro. Macroeconomics. MIT Press, 1997.
  • [41] Ronald William Shephard. Theory of cost and production functions. Princeton University Press, 2015.
  • [42] Kenneth J Arrow, Hollis B Chenery, Bagicha S Minhas, and Robert M Solow. Capital-labor substitution and economic efficiency. The review of Economics and Statistics, pages 225–250, 1961.
  • [43] Guennadi A Khatskevich and Andrei F Pranevich. Production functions with given elasticities of output and production. 2018.
  • [44] Eric A Hanushek. Education production functions. In The economics of education, pages 161–170. Elsevier, 2020.
  • [45] Grant Allan, Michelle Gilmartin, Peter McGregor, Karen Turner, and J Kim Swales. Economics of energy efficiency. In International Handbook on the Economics of Energy. Edward Elgar Publishing, 2009.
  • [46] David E Bloom, David Canning, and Jaypee Sevilla. The effect of health on economic growth: a production function approach. World development, 32(1):1–13, 2004.
  • [47] Charles I Jones and Christopher Tonetti. Nonrivalry and the economics of data. American Economic Review, 110(9):2819–2858, 2020.
  • [48] Maryam Farboodi and Laura Veldkamp. A model of the data economy. Technical report, National Bureau of Economic Research, 2021.
  • [49] CW Cobb. A theory of production. American Economic Review, 18:139–165, 1928.
  • [50] Lawrence Blume, Steven Durlauf, and Lawrence E Blume. The new Palgrave dictionary of economics. Palgrave Macmillan, 2008.
  • [51] Robin C Sickles and Valentin Zelenyuk. Measurement of productivity and efficiency. Cambridge University Press, 2019.
  • [52] Ronald W Shephard and Rolf Färe. The law of diminishing returns. In Production Theory: Proceedings of an International Seminar Held at the University at Karlsruhe May–July 1973, pages 287–318. Springer, 1974.

Appendix

Appendix A Proof of Theorem 4.4

Let an equal number of samples from each cluster be used to run TMC Shapley. Then the intrinsic value of a datum is independent of the proportion and bias in the data distribution. If $n_s$ such samples exist, the value is divided among these $n_s$ samples. From the Shapley axioms, the Data Shapley value at the current stage of TMC becomes approximately $\frac{\alpha_i}{n_s}$.

For the rest of the samples in the TMC, we train a regression model $R$ to predict $\frac{\alpha_i}{n_s}$ for an input datum. If the predicted Shapley value for any $z_i \in c$ is $Q_i$, then, assuming that the TMC Shapley algorithm introduces no error, we have

$$Q_i = \frac{\alpha_i}{n_s} + \delta_{R_i}, \tag{32}$$

where $\delta_{R_i}$ is the error introduced by the regression model. $Q_i$ denotes $b \cdot \alpha_i$, where $b$ is some constant. We use this to obtain the adjustment factor for $\alpha_i$ in Eq. (19). Assuming $V_i$ and $d_i$ do not have any error, we differentiate Eq. (26) with respect to $z$:

$$\frac{\partial \Phi_i}{\partial z} = \frac{1}{\partial z}\left(\frac{V_i}{1 + V_c/n_c}\right)\left[\frac{n_c\,\partial Q_i}{\sum_{z_j \in c} Q_j} - \frac{n_c Q_i\,\partial\left(\sum_{z_j \in c} Q_j\right)}{\left(\sum_{z_j \in c} Q_j\right)^2}\right]. \tag{33}$$

Intuition. $V_i$ does not have any error, as it is the difference between the performance with and without the cluster $c$, divided by a constant. Both values can be computed directly from the model. Similarly, $d_i$ is the distance of the datum from the centroid of the cluster $c$, which can be calculated without any error.
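The bracketed term in Eq. (33) comes from the quotient rule, and the bound derived from it replaces each differential by a worst-case regression error. A short sketch of that step follows, under assumptions that are ours rather than stated in the text: the cluster's predicted values sum to a positive number, $Q_i \geq 0$, and every per-point regression error is bounded by $\delta_R$ in magnitude. The shorthand $S_c$ is introduced here only for readability.

```latex
\begin{align*}
% Shorthand (introduced here): total predicted value of cluster c.
S_c &= \sum_{z_j \in c} Q_j \\
% Quotient rule applied to n_c Q_i / S_c, as in the bracket of Eq. (33).
\partial\!\left(\frac{n_c Q_i}{S_c}\right)
  &= \frac{n_c\,\partial Q_i}{S_c} - \frac{n_c Q_i\,\partial S_c}{S_c^{2}} \\
% Assumed worst-case perturbations caused by the regression model.
|\partial Q_i| &\le \delta_R, \qquad
|\partial S_c| \le \sum_{z_j \in c} |\partial Q_j| \le n_c\,\delta_R \\
% Triangle inequality: the minus sign becomes a plus in the bound.
\left|\partial\!\left(\frac{n_c Q_i}{S_c}\right)\right|
  &\le \frac{n_c\,\delta_R}{S_c} + \frac{n_c Q_i\,(n_c\,\delta_R)}{S_c^{2}}
\end{align*}
```

This is why the difference inside the bracket of Eq. (33) appears as a sum once the errors are bounded.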

Comparing Eq. (33) with change in Shapley value leads to the following inequality

$$\Delta\Phi_i \leq \frac{V_c/n_c}{1 + V_c/n_c}\left[\frac{\delta_R n_c}{\sum_{z_j \in c} Q_j} + \frac{n_c Q_i (n_c \delta_R)}{\left(\sum_{z_j \in c} Q_j\right)^2}\right],$$
where $\delta_R = \max\limits_i \delta_{R_i}$ is the maximum error of the regression model. Since $V_c \geq 0$ and $n_c \geq 1$, we have $\frac{1}{1 + V_c/n_c} \leq 1$, and therefore
$$\Delta\Phi_i \leq V_c\left[\frac{\delta_R}{\sum_{z_j \in c} Q_j} + \frac{n_c Q_i \delta_R}{\left(\sum_{z_j \in c} Q_j\right)^2}\right].$$
From Eq. 29 and Eq. 30,
$$\Delta\Phi_i \leq \left(n_c \bar{\Phi}_c + n_c \epsilon/2\right)\left[\frac{\delta_R}{\sum_{z_j \in c} Q_j} + \frac{n_c Q_i \delta_R}{\left(\sum_{z_j \in c} Q_j\right)^2}\right].$$
Ignoring the terms that involve products of $\epsilon$ and $\delta_R$, as these values are very small, we get the final difference between the original Shapley value and our proposed approximated value:
$$\Delta\Phi_i \approx \frac{n_c \bar{\Phi}_c \delta_R}{\sum_{z_j \in c} Q_j} + \frac{n_c^2 \bar{\Phi}_c Q_i \delta_R}{\left(\sum_{z_j \in c} Q_j\right)^2}. \tag{34}$$

$\bar{\Phi}_c$ and $Q_i$ cannot be arbitrarily large, as they are the average Shapley value of a cluster and the change in performance due to a data point $z_i$, respectively. With increasing cluster size $n_c$, the corresponding predicted value $\sum_{z_j \in c} Q_j$ also increases. Thus, $n_c / \sum_{z_j \in c} Q_j$ cannot be very large. The error $\delta_R$ will be low for a moderately good regression model. Thus, our method estimates the Shapley value $\Phi_i$ with negligible error.