Statistical matching and uncertainty analysis in combining household income and expenditure data

  • Original Paper
  • Statistical Methods & Applications

Abstract

Among the goals of statistical matching, a very important one is the estimation of the joint distribution of variables not jointly observed in a sample survey but separately available from independent sample surveys. The absence of joint information on the variables of interest leads to uncertainty about the data generating model, since the available sample information is unable to discriminate among a set of plausible joint distributions. This paper gives a short review of the concept of uncertainty in statistical matching under logical constraints, and of how to measure uncertainty for continuous variables. The notion of matching error is related to an appropriate measure of uncertainty, and a criterion for selecting matching variables, namely choosing the variables that minimize this uncertainty measure, is introduced. Finally, a method to choose a plausible joint distribution for the variables of interest via the iterative proportional fitting algorithm is described. The proposed methodology is then applied to household income and expenditure data when extra-sample information regarding the average propensity to consume is available. This leads to a reconstructed complete dataset where each record includes measures of income and expenditure.

References

  • Attanasio O, Pistaferri L (2014) Consumption inequality over the last half century: some evidence using the new PSID consumption measure. Am Econ Rev 104:122–126

  • Battistin E, Miniaci R, Weber G (2003) What do we learn from recall consumption data? J Hum Resour 38(2):354–385

  • Bishop YM, Fienberg SE, Holland PW (1975) Discrete multivariate analysis. Springer, New York

  • Blundell R, Pistaferri L, Preston I (2008) Consumption inequality and partial insurance. Am Econ Rev 98:1887–1921

  • Brewer KRW (1979) A class of robust designs for large-scale surveys. J Am Stat Assoc 74:911–915

  • Browning M, Collado MD (1996) Assessing the effectiveness of saving incentives. J Econ Perspect 10(4):73–90

  • Browning M, Collado MD (2001) The response of expenditures to anticipated income changes: panel data estimates. Am Econ Rev 91(3):681–692

  • Browning M, Crossley TF, Winter J (2014) The measurement of household consumption expenditures. Annu Rev Econ 6(1):475–501

  • Brozzi A, Capotorti A, Vantaggi B (2012) Incoherence correction strategies in statistical matching. Int J Approx Reason 53:1124–1136

  • Caballero RJ (1990) Consumption puzzles and precautionary savings. J Monet Econ 25(1):113–136

  • Cifaldi G, Neri A (2013) Asking income and consumption questions in the same survey: What are the risks? Bank of Italy, Economic Research and International Relations Area. Economic working papers, 908

  • Conti PL (2014) On the estimation of the distribution function of a finite population under high entropy sampling designs, with applications. Sankhya B 76(2):234–259

  • Conti PL, Marella D, Scanu M (2012) Uncertainty analysis in statistical matching. J Off Stat 28:69–88

  • Conti PL, Marella D, Scanu M (2013) Uncertainty analysis for statistical matching of ordered categorical variables. Comput Stat Data Anal 68:311–325

  • Conti PL, Marella D, Scanu M (2016a) How far from identifiability? A systematic overview of the statistical matching problem in a non-parametric framework. Commun Stat Theory Methods. doi:10.1080/036109261010005

  • Conti PL, Marella D, Scanu M (2016b) Statistical matching analysis for complex survey data with applications. J Am Stat Assoc. doi:10.1080/01621459.2015.1112803

  • Donatiello G, D’Orazio M, Frattarola D, Rizzi A, Scanu M, Spaziani M (2016) The role of the conditional independence assumption in statistically matching income and consumption. Stat J IAOS 32:667–675

  • D’Orazio M, Di Zio M, Scanu M (2006a) Statistical matching: theory and practice. Wiley, Chichester

  • D’Orazio M, Di Zio M, Scanu M (2006b) Statistical matching for categorical data: displaying uncertainty and using logical constraints. J Off Stat 22:137–157

  • D’Orazio M, Di Zio M, Scanu M (2009) Uncertainty intervals for nonidentifiable parameters in statistical matching. In: 57th Session of the International Statistical Institute, Durban, South Africa, Aug 2009

  • D’Orazio M, Di Zio M, Scanu M (2015) The use of uncertainty to choose the matching variables in statistical matching. NTTS 2015 new techniques and technologies for statistics and exchange of technology and know-how, Brussels 10–12 Mar 2015

  • Goodman LA (1968) The analysis of cross-classified data: independence, quasi-independence, and interaction in contingency tables with or without missing cells. J Am Stat Assoc 63:1091–1131

  • Guiso L, Jappelli T, Terlizzese D (1992) Earnings uncertainty and precautionary saving. J Monet Econ 30(2):307–337

  • Kennickell A, Lusardi A (2004) Disentangling the importance of the precautionary saving motive. NBER working papers series, 10888, 1–64

  • Little RJA (1983) Estimating a finite population mean from unequal probability samples. J Am Stat Assoc 78:596–604

  • Okner BA (1972) Constructing a new database from existing microdata sets: the 1966 merge file. Ann Econ Soc Meas 1:325–342

  • Palumbo MG (1999) Uncertain medical expenses and precautionary saving near the end of the life cycle. Rev Econ Stud 66(2):395–421

  • Pfeffermann D (1993) The role of sampling weights when modeling survey data. Int Stat Rev 61:317–337

  • Rässler S (2002) Statistical matching: a frequentist theory, practical applications and alternative Bayesian approaches. Springer, New York

  • Renssen RH (1998) Use of statistical matching techniques in calibration estimation. Surv Methodol 24:171–183

  • Rodgers WL (1984) An evaluation of statistical matching. J Bus Econ Stat 2:91–102

  • Rubin DB (1986) Statistical matching using file concatenation with adjusted weights and multiple imputations. J Bus Econ Stat 4:87–94

  • Sims CA (1972) Comments and rejoinder (On Okner (1972)). Ann Econ Soc Meas 1:343–345, 355–357

  • Singh AC, Mantel H, Kinack M, Rowe G (1993) Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption. Surv Methodol 19:59–79

  • Skinner J (1987) A superior measure of consumption from the panel study of income dynamics. Econ Lett 23:213–216

  • Tedeschi S, Pisano E (2013) Data fusion between Bank of Italy-SHIW and ISTAT-HBS. MPRA Paper. RePEc:pra:mprapa:51253

  • Vantaggi B (2008) Statistical matching of multiple sources: a look through coherence. Int J Approx Reason 49:701–711

  • Wu C (2004) Combining information from multiple surveys through the empirical likelihood method. Can J Stat 32:112

Corresponding author

Correspondence to Daniela Marella.

Appendix

Proof of Proposition 1

Taking into account that

  1. \( K^x_{N +} (y, \, z) \le \min ( F_N (y \vert x), \, G_N (z \vert x))\);

  2. \( K^x_{N -} (y, \, z) \ge \max (0, \, F_N (y \vert x) + G_N (z \vert x) - 1)\);

it is not difficult to see that

$$\begin{aligned} \varDelta^{x}(F_{N},G_{N}) &\le \int_{R^{2}} \left\{ \min ( F_N (y \vert x), \, G_N (z \vert x)) - \max (0, \, F_N (y \vert x) + G_N (z \vert x) - 1) \right\} \, dF_{N}(y \vert x) \, dG_{N}(z \vert x) \\ &\approx \int_{0}^{1} \int_{0}^{1} \left\{ \min (u, \, v) - \max (0, \, u+v-1) \right\} \, du \, dv \\ &= \frac{1}{6}. \end{aligned} \tag{23}$$

In other words, the maximal value of the conditional measure of uncertainty (9) is essentially \(1/6 \approx 0.167\). As an easy consequence of Proposition 1, the unconditional uncertainty measure computed as in (10) also has maximal value \(1/6\). \(\square\)
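The approximation step in (23) can be checked numerically. The sketch below is an illustration only and not part of the original proof. It assumes that \(F_N(\cdot \vert x)\) and \(G_N(\cdot \vert x)\) each have N jumps of equal size 1/N (no ties), so that the integral with respect to \(dF_N \, dG_N\) becomes a double sum over the grid \(\{1/N, \dots, 1\}^2\). The function name `frechet_width_sum` is hypothetical. Under this assumption the double sum equals \((1 - 1/N^2)/6\), a closed form worked out for this illustration, which tends to 1/6 as N grows.

```python
import numpy as np


def frechet_width_sum(n: int) -> float:
    """Double sum of min(u, v) - max(0, u + v - 1) over u, v in {1/n, ..., 1},
    with weight 1/n^2 per grid point (two empirical d.f.s with n equal jumps)."""
    grid = np.arange(1, n + 1) / n           # values of F_N and G_N at their jump points
    u, v = np.meshgrid(grid, grid, indexing="ij")
    upper = np.minimum(u, v)                 # Frechet upper bound
    lower = np.maximum(0.0, u + v - 1.0)     # Frechet lower bound
    return float(np.sum(upper - lower) / n**2)


# Compare the discrete sum with the closed form (1 - 1/n^2) / 6 and with 1/6.
for n in (10, 100, 1000):
    print(n, frechet_width_sum(n), (1 - 1 / n**2) / 6, 1 / 6)
```

The gap between the discrete sum and 1/6 is of order \(1/N^2\) in this setting, which is why the bound is described as "essentially" 1/6 for realistic population sizes.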

Proof of Proposition 2

The following two statements hold:

$$\begin{aligned} \widehat{\varDelta}^{*x}_H \mathop{\rightarrow}\limits^{p} \varDelta^x (F_N, \, G_N) \quad \text{as } k \rightarrow \infty \end{aligned} \tag{24}$$
$$\begin{aligned} \widehat{\varDelta}^{*}_H \mathop{\rightarrow}\limits^{p} \varDelta (F_N, \, G_N) \quad \text{as } k \rightarrow \infty \end{aligned} \tag{25}$$

Asymptotic analysis requires specifying how the sample sizes \(n_A\), \(n_B\) and the population size N go to infinity. As in Brewer (1979) (cf. also Little (1983)), this is done as follows:

  1. k replicates of the original population are formed.

  2. From each replicate, an independent sample \(\varvec{s}_A\) (\(\varvec{s}_B\)) of size \(n_A\) (\(n_B\)) is selected, according to the sampling design \(P_A\) (\(P_B\)). Using the notation introduced above, let \(D_{i,A}^j\) (\(D_{i,B}^j\)) be a Bernoulli r.v. taking the value 1 if unit i is included in the sample drawn from the jth replicate of the population (\(j=1, \, \dots, \, k\)) according to the sampling design \(P_A\) (\(P_B\)), and the value 0 otherwise.

  3. The k populations are aggregated to a population of size \(N^* = kN\). We will denote by \(F_{N^*} ( y \, \vert x)\), \(G_{N^*} ( z \, \vert x)\), \(p_{N^*} (x)\) the conditional p.d.f.s of Y and Z given \(X=x\) and the proportion of units such that \(X=x\), respectively.

  4. The k samples drawn with the sampling design \(P_A\) (\(P_B\)) are aggregated to a sample \(\varvec{s}^*_A\) (\(\varvec{s}^*_B\)) of \(n^*_A = k n_A\) (\(n^*_B = k n_B\)) units.

  5. The quantities \(F_{N^*} ( y \, \vert x)\), \(G_{N^*} ( z \, \vert x)\), \(p_{N^*} (x)\) are estimated by their Hájek estimators, as defined in Sect. 3.1, and based on \(n^*_A\) and \(n^*_B\) sample units. Such estimates are denoted by \(\widehat{F}_{H}^{*} ( y \, \vert x)\), \(\widehat{G}_{H}^{*} ( z \, \vert x)\), \(\widehat{p}_{H}^{*} (x)\), respectively. Then, the uncertainty measures are estimated accordingly. We will denote by \(\widehat{\varDelta }^{*x}_H\) (\(\widehat{\varDelta }^{*}_H\)) the estimate of the conditional (unconditional) measure of uncertainty.

  6. k is allowed to tend to infinity.\(\square \)

First of all, it is immediate that

$$\begin{aligned} F_{N^*} ( y \, \vert x) = F_{N} ( y \, \vert x), \;\; G_{N^*} ( z \, \vert x) = G_{N} ( z \, \vert x), \;\; p_{N^*} (x) = p_{N} (x). \end{aligned}$$

In the second place, from

$$\begin{aligned} \widehat{F}_{H}^{*} ( y \, \vert x) = \frac{\sum _{i=1}^{N} \left\{ \frac{1}{k} \sum _{j=1}^{k} \frac{D_{i,A}^j}{\pi _{i,A}} \right\} I_{(y_i \le y)} I_{(x_i =x)}}{\sum _{i=1}^{N} \left\{ \frac{1}{k} \sum _{j=1}^{k} \frac{D_{i,A}^j}{\pi _{i,A}} \right\} I_{(x_i =x)}} \end{aligned}$$

and from the fact that, by the law of large numbers, the average

$$\begin{aligned} \frac{1}{k} \sum _{j=1}^{k} \frac{D_{i,A}^j}{\pi _{i,A}} \end{aligned} \tag{26}$$

converges in probability to 1 as k goes to infinity (the \(D_{i,A}^j\) are independent across replicates, each with expectation \(\pi _{i,A}\)), it follows that \(\widehat{F}_{H}^{*} ( y \, \vert x) \) converges in probability to \(F_{N} ( y \, \vert x) \) as k tends to infinity, for each x and uniformly in y. In the same way, \(\widehat{G}_{H}^{*} ( z \, \vert x) \) converges in probability to \(G_{N} ( z \, \vert x) \) as k tends to infinity, for each x and uniformly in z. Since the functional \(\varDelta ^x (F_N , \, G_N )\) is continuous in the sup-norm, (24) is proved. In the same way, (25) can be proved.
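The replication argument can be illustrated with a small simulation. The sketch below is not the authors' code and rests on several assumptions: it works within a single cell \(X = x\), it uses Poisson sampling (independent Bernoulli inclusion with probability \(\pi_{i,A}\)) as a stand-in for the unspecified design \(P_A\), so the sample size is random rather than fixed, and the population values, inclusion probabilities and function names (`hajek_cdf`, `replicated_estimate`) are all hypothetical. It shows the average in (26) approaching 1 for every unit, and the aggregated Hájek estimate approaching \(F_N(y \vert x)\), as k grows.

```python
import numpy as np


def hajek_cdf(y_vals, incl_counts, pi, y):
    """Hajek-type estimate of F(y): weighted share of units with y_i <= y,
    where unit i has weight (number of replicates in which it is sampled) / pi_i."""
    w = incl_counts / pi
    return float(np.sum(w * (y_vals <= y)) / np.sum(w))


def replicated_estimate(y_vals, pi, k, y, rng):
    """Draw k independent Poisson samples (one per population replicate), aggregate
    them, and return the Hajek estimate of F(y) and max_i |(1/k) sum_j D_i^j / pi_i - 1|."""
    # d[j, i] is True when unit i is sampled in replicate j (assumed Poisson sampling).
    d = rng.random((k, y_vals.size)) < pi
    counts = d.sum(axis=0)
    avg_weight = counts / (k * pi)           # the quantity in (26), one value per unit i
    return hajek_cdf(y_vals, counts, pi, y), float(np.max(np.abs(avg_weight - 1.0)))


rng = np.random.default_rng(0)
N = 200
y_pop = rng.lognormal(size=N)                # hypothetical Y values within the cell X = x
pi = np.clip(0.3 * y_pop / y_pop.mean(), 0.05, 0.9)   # unequal inclusion probabilities
y0 = float(np.median(y_pop))
true_F = float(np.mean(y_pop <= y0))         # population d.f. F_N(y0 | x)

# Columns: k, Hajek estimate, absolute error, largest deviation of (26) from 1.
for k in (1, 10, 100, 1000, 10000):
    est, dev = replicated_estimate(y_pop, pi, k, y0, rng)
    print(k, round(est, 4), round(abs(est - true_F), 4), round(dev, 4))
```

The factor 1/k in the estimator of \(\widehat{F}_{H}^{*}\) cancels between numerator and denominator, which is why the code uses inclusion counts directly as weights.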

Cite this article

Conti, P.L., Marella, D. & Neri, A. Statistical matching and uncertainty analysis in combining household income and expenditure data. Stat Methods Appl 26, 485–505 (2017). https://doi.org/10.1007/s10260-016-0374-7

  • DOI: https://doi.org/10.1007/s10260-016-0374-7
