Session 4-5: Scalable Algorithms
Swarna Reddy
Algorithms for Data Science

Chapter 3. Scalable Algorithms and Associative Statistics
3.1 Introduction
Suppose that the data set of concern is so massively large in volume that it
must be divided and processed as subsets by separate host computers. In this
situation, the host computers are connected by a network. A host computer is
often called a node in recognition of its role as a member of the network. Since
the computational workload has been distributed across the network, the
scheme is referred to as a distributed computing environment. Our interest
lies in the algorithmic component of the distributed computing solution, and
specifically the situation in which each node executes the same algorithm but
on a subset of the data. When all nodes have completed their computations,
the results from each node are combined. This is a good strategy if it works:
only one algorithm is needed and everything can be automated. If the nodes
are garden-variety computing units, there’s no obvious limit to the amount
of data that can be handled.
The final result should not depend on how the data is divided into subsets.
To be more precise, recall that a partition of a set A is a collection of disjoint
subsets A1 , . . . , An such that A = ∪i Ai . Now, consider divisions of the data
that are partitions (hence no observation or record is included in more than
one data set). An algorithm is scalable if the results are identical for all
possible partitions of the data. If the scalability condition is met and the data
volume increases, then only the number of subsets and nodes needs to be
increased; the only limitations are hardware and financial cost. On the other
hand, if the algorithm yields different results depending on the partition,
then we should determine the partition that yields the best solution. The
question of what is the best solution is ambiguous without criteria to judge
what’s best. Intuition argues that the best solution is that obtained from a
single execution of the algorithm using all the data. Under that premise, we
turn to scalable algorithms.
The term scalable is used because the scale, or volume, of the data does not
limit the functionality of the algorithm. Scalable data reduction algorithms
have been encountered earlier in the form of data mappings. The uses of data
mappings were limited to elementary operations such as computing totals and
building lists. To progress to more sophisticated analyses, we need to be able
to compute a variety of statistics using scalable algorithms, for example, the
least squares estimator of a parameter vector β or a correlation matrix.
Not all statistics can be computed using a scalable algorithm. Statistics
that can be computed using scalable algorithms are referred to as associative.
The defining characteristic of an associative statistic is that when a data set
is partitioned into a collection of disjoint subsets and the associative statistic
is computed from each subset, the statistics can be combined or aggregated
to yield the same value as would be obtained from the complete data set. If
the function of the algorithm is to compute an associative statistic, then the
algorithm is scalable.
To make these ideas more concrete, the next section discusses an exam-
ple involving the Centers for Disease Control and Prevention’s BRFSS data
sets. Then, we discuss scalable algorithms for descriptive analytics. A tuto-
rial guides the reader through the mechanics of summarizing the distribution
of a quantitative variable via a scalable algorithm. Scalable algorithms for
computing correlation matrices and the least squares estimator of the lin-
ear regression parameter vector β are the subject of two additional tutorials.
This first look at predictive analytics demonstrates the importance of scalable
algorithms and associative statistics in data science.
3.2 Example: Obesity in the United States

1 We discussed the BRFSS data briefly in Chap. 1, Sect. 1.2.
From s(D), we can estimate the population mean using the sample mean μ̂ = s1/s2. The median is an example of a statistic that is not associative. For example, if D = {1, 2, 3, 4, 100}, D1 = {1, 2, 3, 4}, and D2 = {100}, then median(D1) = 2.5 and median(D2) = 100, and there is no method of combining median(D1) and median(D2) to arrive at median(D) = 3 that succeeds in the general case.
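To make the contrast concrete, here is a small Python check (a sketch using the sets above):

import statistics

D = [1, 2, 3, 4, 100]
D1, D2 = [1, 2, 3, 4], [100]

# s(D) = (sum, count) is associative: component-wise addition over the
# partition reproduces the statistic of the complete data set
total, count = sum(D1) + sum(D2), len(D1) + len(D2)
print(total / count == statistics.mean(D))            # True

# the subset medians carry too little information to recover median(D) = 3
print(statistics.median(D1), statistics.median(D2))   # 2.5 100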
3.4 Univariate Observations

The associative statistic

$$s(D) = \Big[\textstyle\sum_i x_i \;\; \sum_i x_i^2 \;\; n\Big]^T \tag{3.2}$$

may be used to estimate the mean and variance. Let s(D) = [s1 s2 s3]ᵀ. Then, the estimators are

$$\hat{\mu} = \frac{s_1}{s_3}, \qquad \hat{\sigma}^2 = \frac{s_2}{s_3} - \Big(\frac{s_1}{s_3}\Big)^2. \tag{3.3}$$
Thus, a scalable algorithm for computing estimators of the mean and variance
computes the associative statistic s(D) and then the estimates according to
Eq. (3.3). If the volume of D is too large for a single host, then D can be parti-
tioned as D1 , . . . , Dr and the subsets distributed to r network nodes. At node
j, s(Dj) is computed. When all nodes have completed their respective tasks
and returned their statistic to a single host, we compute s(D) = s(D1) + · · · + s(Dr),
followed by μ̂ and σ̂² using formula (3.3).
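A sketch of the whole scheme in Python (the data and the four-way partition are invented for illustration):

import numpy as np

def s(D):
    # the associative statistic of Eq. (3.2): [sum, sum of squares, count]
    x = np.asarray(D, dtype=float)
    return np.array([x.sum(), (x**2).sum(), x.size])

data = np.random.default_rng(1).normal(25, 5, 10_000)
parts = np.array_split(data, 4)           # stand-ins for the r network nodes

s1, s2, s3 = sum(s(Dj) for Dj in parts)   # s(D) = s(D1) + ... + s(Dr)
muHat = s1 / s3                           # Eq. (3.3)
sigma2Hat = s2 / s3 - (s1 / s3)**2
print(muHat, sigma2Hat)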
Let’s consider a common situation. The data set D is too large to be stored
in memory but it’s not so large that it cannot be stored on a single computer.
These data may be a complete set or one subset of a larger partitioned set. In
any case, the data at hand D is still too large to be read at once into mem-
ory. The memory problem may be solved using two algorithmic approaches
sufficiently general to be applied to a wide range of problems. For simplicity
and concreteness, we describe the algorithms for computing the associative
statistic given in Eq. (3.2).
The first approach processes the data file sequentially, one observation at a time. As each observation xj is read, the components of s = [s1 s2 s3]ᵀ are updated in place:

s1 + xj → s1
s2 + xj² → s2    (3.5)
s3 + 1 → s3.
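For instance, a sketch of the sequential version, assuming a hypothetical file data.txt with one numeric value per line:

s1 = s2 = s3 = 0.0
with open('data.txt') as f:
    for line in f:
        x = float(line)
        s1 += x        # the updates of Eq. (3.5); memory use stays constant
        s2 += x**2
        s3 += 1

muHat = s1 / s3
sigma2Hat = s2 / s3 - muHat**2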
3.4.1 Histograms
2 Right-skewness is common when a variable is bounded below, as is the case with body mass index, since no one may have a body mass index less than or equal to zero.
[Fig. 3.1: Histograms of body mass index (horizontal axis, 10-50) against relative frequency (vertical axis, 0.00-0.06) for two samples. The first sample was collected in the years 2000 through 2003 and the second sample was collected in the years 2011 through 2014. All data were collected by the U.S. Centers for Disease Control and Prevention, Behavioral Risk Factor Surveillance System.]
The number of intervals is h. The first interval is b1 = [l1 , u1 ] and for i > 1,
the ith interval is bi = (li , ui ]. The second element of the pair, pi , is the
relative frequency of observations belonging to the interval. The term pi is
often used as an estimator of the proportion of the population belonging to
bi . The interval bi+1 takes its lower bound from the previous upper bound,
i.e., li+1 = ui . We show the intervals as open on the left and closed on the
right but there’s no reason not to define the intervals to be closed on the
left and open on the right instead. In any case, equal-length intervals are
formed by dividing the range into h segments. Usually, every observation in
the data set belongs to the base of the histogram, [l1 , uh ]. Admittedly, the
stipulation that all observations are included in the histogram sometimes is
a disadvantage as it may make it difficult to see some of the finer details.
Suppose that the data set D is massively large and cannot reside in mem-
ory. A scalable algorithm is needed to construct the histogram H, and formu-
lating the algorithm requires a precise description of the process of building
H. In brief, H is computed by counting the numbers of observations falling in
each interval. The intervals must be constructed before counting, and hence,
the data set must be processed twice, both times using a scalable algorithm.
The first pass determines the number and width of the intervals. The second
pass counts the number of observations falling in each interval.
In essence, the first algorithm maps the data set D to a set of intervals, or bins,
B = {b1, . . . , bh}, and we may write D ↦ B.
The second algorithm maps D and B to a dictionary C in which the keys
are the intervals and the values are the counts of observations falling into a
particular interval, and we may write (D, B) ↦ C.
Computationally, the second algorithm fills the dictionary C by counting
the number of observations belonging to each interval. Determining which
interval an observation belongs to amounts to testing whether the observa-
tion value belongs to bi , for i = 1, . . . , h.3 When the interval is found that
contains the observation, the count of observations belonging to the interval
is incremented by one, and the next observation is processed. After all ob-
servations are processed, the relative frequency of observations belonging to
each interval is computed by dividing the interval count by n. The dictionary
may then be rendered as a visual.
If the algorithm is to be scalable, then the statistics from which H is built
must be associative. The key to associativity and scalability is that a single
set of intervals is used to form the histogram base.
As described above, the first pass through D computes the two-element
vector s(D) = [min(D) max(D)]T . We’ll argue that s(D) is associative by
supposing that D1 , D2 , . . . , Dr are a partition of D. Let
s(Dj) = [min(Dj)  max(Dj)]ᵀ,   j = 1, . . . , r.   (3.7)
Let s1 (D) = min(D) and s2 (D) = max(D) so that s(D) = [s1 (D) s2 (D)]T .
Then,
3 Of course, once it's been determined that the observation belongs to an interval, there's no need to test any other intervals.
s1(D) = min(D)
      = min(D1 ∪ · · · ∪ Dr)
      = min{min(D1), . . . , min(Dr)}   (3.8)
      = min{s1(D1), . . . , s1(Dr)}.
Similarly, max(D) = max{max(D1 ), . . . , max(Dr )}. Changing the notation,
we see that s2 (D) = max{s2 (D1 ),. . . , s2 (Dr )}. Since s(D) can be computed
from s(D1 ), . . ., s(Dr ), the statistic s is associative.
The range and intervals of the histogram depend entirely on the associative
statistic s and the choice of h. Given h, the same set of intervals is created
no matter how D is partitioned. Thus, a scalable algorithm that computes B
is feasible.
The second algorithm fills the dictionary C = {(b1 , c1 ), . . . , (bh , ch )} by
determining cj , the count of observations in the data set D that belong to
bj, for j = 1, . . . , h. If D has been partitioned as r subsets, then B and
the subset Dj are used to construct the jth set of counts. Mathematically,
(Dj, B) ↦ Cj. Then, the sets C1, . . . , Cr are aggregated by adding the counts
across sets for each interval.
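A sketch of the aggregation step, assuming each Cj is a dictionary keyed by the common set of intervals B:

def aggregate(countDicts):
    # add the counts across sets for each interval
    C = dict.fromkeys(countDicts[0], 0)
    for Cj in countDicts:
        for interval, count in Cj.items():
            C[interval] += count
    return C

C1 = {(10, 11): 4, (11, 12): 7}
C2 = {(10, 11): 1, (11, 12): 2}
print(aggregate([C1, C2]))   # {(10, 11): 5, (11, 12): 9}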
3.5 Functions
The next tutorial instructs the reader to create a user-defined function. Rea-
sons for using functions are to make a program easier to understand and to
avoid repeatedly programming the same operation. A function consists of a
code segment that is executed by a one-line instruction such as y = f(x). The
code segment should have a clearly defined purpose and generally, the code
will be used more than once, say, in more than one location in a program, or
in multiple programs, or in a for loop.
The function definition is located not within the main program but at
the top of the program or in a separate file. However, it’s useful to write
the code that belongs in a function in the main program. Writing the code
in the location where the function will be called allows the programmer to
access the values of the variables within the function code for debugging.
Once the code segment is included in a function, the variables become local
and are undefined outside the function and cannot be inspected. When the
code segment executes correctly, then move the code out of the main program
and into a function.
The function is initialized by the keyword def and is ended (usually) by the
return statement. For example, the function product computes the product
of the two arguments x and y passed to the function:
def product(x,y):
xy = x*y
return xy
A variant prints the product instead of returning it:

def product(x,y):
    print(x*y)

Note that xy will be undefined outside of the function because it's not returned.
The function must be compiled by the Python interpreter before the func-
tion is called. Therefore, it should be positioned at the top of the file; alter-
natively, put it in a file with a py extension, say functions.py, that contains
a collection of functions. The function product is imported in the main pro-
gram using the instruction
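The import instruction itself is not shown here; presumably it is along these lines (assuming functions.py resides in the working directory or on the module search path):

from functions import product

print(product(3, 4))   # 12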
3.6 Tutorial: Histogram Construction

The objective of this tutorial ostensibly is to construct Fig. 3.1. In the pro-
cess of building the histograms, the reader will estimate the distribution of
body mass index of adult U.S. residents for two periods separated by approx-
imately 10 years. The tutorial also expands on the use of dictionaries for data
reduction and exposes the reader to weighted means. The data sets used in
the tutorial originate from the Centers for Disease Control and Prevention
(CDC) and are remarkable for the amount of largely untapped information
contained within.
The data were collected in the course of the Behavioral Risk Factor Surveil-
lance System’s (BRFSS) annual survey of U.S. adults. Telephone interviews
of a sample of adults were conducted in which a wide range of questions
on health and health-related behavior were asked of the respondents. The
respondents were selected by sampling land-line and cell phone numbers,
calling the numbers, and asking for interviews. Land-line and cell phone
numbers were sampled using different protocols and so the sampled indi-
viduals did not have the same probability of being sampled. Consequently,
certain sub-populations, say, young land-line owners, may be over- or under-
represented in the sample. Disregarding the sampling design may lead to
biased estimates of population parameters. To adjust for unequal sampling
probabilities, the CDC has attached a sampling weight to each observation.
The sampling weights may be incorporated into an estimator to correct for
the unequal sampling probabilities.
1. Create a directory for storing the data files. We recommend that you keep
the files related to this book in a single directory with sub-directories for
each chapter. Some data files and some functions will be used in more
than one chapter so it’s convenient to build a sub-directory to save these
data files and a sub-directory to contain the functions.4 The following
directory structure is suggested:
Algorithms
    DataMaps
        PythonScripts
        RScripts
        Data
    ScalableAlgorithms
        PythonScripts
        RScripts
        Data
    Data
        LLCP2014.ASC
        LLCP2013.ASC
    ModuleDir
        functions.py
Table 3.1 BRFSS data file names and sub-string positions of body mass index, sampling weight, and gender. Positions are one-indexed

                        Body mass index   Sampling weight   Gender
Year  File              Start   End       Start   End       Field
2000  cdbrfs00asc.ZIP    862     864       832     841       174
2001  cdbrfs01asc.ZIP    725     730       686     695       139
2002  cdbrfs02asc.ZIP    933     936       822     831       149
2003  CDBRFS03.ZIP       854     857       745     754       161
2011  LLCP2011.ZIP      1533    1536      1475    1484       151
2012  LLCP2012.ZIP      1644    1647      1449    1458       141
2013  LLCP2013.ZIP      2192    2195      1953    1962       178
2014  LLCP2014.ZIP      2247    2250      2007    2016       178
4 Chapter 7 uses these BRFSS data files in all of the tutorials.
3. If you want to look at the contents of one of the data files, open a Linux
terminal and submit a command of the form head -n 1 LLCP2014.ASC (or page
through the file with less) from the directory containing the unzipped file.
Each record in a BRFSS data file is a character string without delimiters
to identify specific variables. The file format is fixed-width, im-
plying that variables are located according to an established and un-
changing position in the string (recall that the record is a character
string). Consequently, variables are extracted as substrings from each
record. Regrettably, string or field position depends on year and the
field positions must be determined anew each time a different year is
processed. Table 3.1 contains the field positions for several variables.
The field positions are exactly as presented in the BRFSS codebooks.
The codebooks describe the variables and field positions. For example,
https://www.cdc.gov/brfss/annual_data/2014/pdf/codebook14_llcp.pdf
is the year 2014 codebook.
Table 3.1 field positions are one-indexed. When one-indexing is used,
the first character in the string s is s[1]. Python uses zero-indexing to
reference characters in a string,5 so we will have to adjust the values in
Table 3.1 accordingly.

5 The first character of a string s in Python is s[0].
4. Create a Python script. The first code segment creates a dictionary that
contains the field positions of body mass index and sampling weight.
We’ll create a dictionary of dictionaries. The outer dictionary name is
fieldDict and the keys of this dictionary are years, though we use only
the last two digits of the year rather than all four. The first instruction
in the following code segment initializes fieldDict. The keys are defined
when the dictionary is initialized.
The values of fieldDict are dictionaries in which the keys are the
variable names and the values of these inner dictionaries are pairs iden-
tifying the first and last field positions of the variable. The dictionaries
for year 2000 (fieldDict[0]) and 2001 (fieldDict[1]) are shown in the two lines
following the initialization of fieldDict:
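The instructions did not survive in this copy; a plausible reconstruction from Table 3.1 and the description above (dict.fromkeys defines the keys at initialization):

fieldDict = dict.fromkeys([0, 1, 2, 3, 11, 12, 13, 14])
fieldDict[0] = {'bmi': (862, 864), 'weight': (832, 841)}
fieldDict[1] = {'bmi': (725, 730), 'weight': (686, 695)}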
5. Using the information in Table 3.1, add the remaining inner dictionary
entries to fieldDict, that is, for the years 2002, 2003, 2011, 2012, 2013,
and 2014. Check your code by printing the contents of fieldDict and
comparing to Table 3.1. Iterate over year and print:
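The print loop is not shown; presumably:

for year in fieldDict:
    print(year, fieldDict[year])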
6. We’ll use fieldDict in the several other tutorials and add additional
variables and years. To keep our scripts short and easy to read, move the
code segment that builds fieldDict to a function. For the moment, put
the function definition at the top of your Python script.
def fieldDictBuild():
    fieldDict = {}
    fieldDict[0] = {'bmi':(862,864),'weight':(832,841)}
    ...
    fieldDict[14] = {'bmi':(2247,2250),'weight':(2007,2016)}
    return fieldDict
fieldDict = functions.fieldDictBuild()
However, before this function call will execute successfully, the functions
module must be imported using the instruction
import os, sys
parentDir = r'/home/Algorithms/'
if parentDir not in set(sys.path):
    sys.path.append(parentDir)
print(sys.path)
from ModuleDir import functions
dir(functions)
Your path, (parentDir), may be different.6 Notice that the path to the
directory omits the name of the directory containing the function. Adding
r in front of the path instructs Python to read backslashes literally. You’ll
probably need this if you’re working in a Windows environment.
When you modify functions.py, it will have to be reloaded by the
interpreter for the changes to take effect. You have to instruct the inter-
preter to reload it or else you have to restart the interpreter.7 You can
reload a module using a function from a library. If you’re using Python
3.4 or later, then import the library importlib using the instruction
import importlib. The instructions are
import importlib
importlib.reload(functions)   # Python 3.4 or above

import imp
imp.reload(functions)         # Python 3.0-3.3; in Python 2, reload() is built in
6 It will not begin with /home/... if your operating system is Windows.
7 In Spyder, close the console, thereby killing the kernel, and start a new console to restart the interpreter.
After reloading, check the contents of the module functions with the in-
struction dir(functions). The dir(functions) will list all of the func-
tions that are available including a number of built-in functions.8
10. The function fieldDictBuild will build fieldDict when called so:

fieldDict = functions.fieldDictBuild()

The field positions are then retrieved inside the file-processing loop of instruction 11, within an exception handler:

for filename in fileList:   # path and fileList as defined earlier
    try:
        shortYear = int(filename[6:8])
        fields = fieldDict[shortYear]
        sWt, eWt = fields['weight']
        sBMI, eBMI = fields['bmi']
        file = path + filename
        print(file, sWt, eWt, sBMI, eBMI)
    except(ValueError, KeyError):
        pass
The field positions of the sampling weight and body mass index are ex-
tracted in the middle block of three instructions. The first instruction
extracts the fields dictionary from fieldDict using the two-digit year
as the key. Then, the starting and ending positions of the variables are
extracted. The starting and ending positions are the field positions trans-
lated from the BRFSS codebook. The BRFSS codebook for a specific year
8 Execute functions.py if your function in functions.py is not compiling despite calling the reload function.
can be found on the same webpage as the data file.9 The codebook lists
the field positions using one-indexing. One-indexing identifies the first
field in the string as column 1. However, Python uses zero-indexing for
strings and so we will have to adjust when extracting the values.
12. The following code segment processes each data file as the program iter-
ates over the list of files. The code must be nested within the try branch
of the exception handler (instruction 11). Insert the code segment after
the print statement above.
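The segment itself is not reproduced in this copy; a minimal sketch consistent with the description that follows:

with open(file, 'r') as f:
    for record in f:
        # the instructions that process each record are nested here
        ...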
In Python 3, the instruction with open forces the file to close when the
for loop is interrupted or when the end of the file is reached. We do not
need the instruction f.close(). Hence, all instructions that are to be
carried out while the file is being read must be nested below the with
open instruction.
13. In the for loop of instruction 12, extract body mass index and sampling
weight from each record by slicing. If a string named record is sliced using
the instruction record[a:b], then the result is a substring consisting of
the items in fields a, a + 1, . . . , b − 1. Note that the character in field b is
not included in the slice. Convert the sampling weight string to a float
using the one-indexed field positions sWt and eWt. Also extract the string
containing the body mass index value using sBMI and eBMI extracted in
instruction 11.
weight = float(record[sWt-1:eWt])
bmiString = record[sBMI-1:eBMI]
9 The codebook contains a wealth of information about the data and data file structure.
bmi = 0
if shortYear == 0 and bmiString != '999':
    bmi = .1*float(bmiString)
if shortYear == 1 and bmiString != '999999':
    bmi = .0001*float(bmiString)
if 2 <= shortYear <= 10 and bmiString != '9999':
    bmi = .01*float(bmiString)
if shortYear > 10 and bmiString != '    ':
    bmi = .01*float(bmiString)
print(bmiString, bmi)
The length of the blank string must be the same as the length of
bmiString.
15. When the conversion of the string containing body mass index to the
decimal expression works correctly, then turn it into a function by placing
the declaration shown below before the code segment. Indent the code segment
and add the instruction return bmi at the end of the code segment.
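The declaration is not shown in this copy; judging from the call in instruction 16 and the variables used, it is presumably

def convertBMI(bmiString, shortYear):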
16. Add the instruction to call the function after the function definition. Run
the program. If it is successful, then move the definition of convertBMI to
functions.py. The function will not be available until the functions module
is recompiled, so execute the script functions.py. Call the function using the
instruction shown below.
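Both call instructions are missing here; plausible versions are

bmi = convertBMI(bmiString, shortYear)             # definition in the main program
bmi = functions.convertBMI(bmiString, shortYear)   # after moving it to functions.py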
17. Go back to the beginning of the program and set up a dictionary to con-
tain the histograms. One histogram will be created for each year, and
each histogram will be represented by a dictionary that uses intervals as
keys and sums of sampling weights as values. (Ordinarily, the value would
be a count of observations that fell between the lower and upper bounds
of the interval). Each histogram interval is a tuple in which the tuple ele-
ments are the lower and upper bounds of the interval. The value is a float
since it is a sum of sampling weights. The set of intervals corresponding
to a histogram is created once as a list using list comprehension:
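The instruction itself is missing from this copy; presumably:

intervals = [(l, l + 1) for l in range(10, 75)]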
Place this instruction before the for loop that iterates over fileList.
The first and last keys are (10,11) and (74,75). It will become appar-
ent momentarily that the histogram spans the half-open interval (10, 75]
because of the way we test whether an observation falls in an interval. A
few individuals have a body mass index outside this range. We will ignore
these individuals since there are too few of them to affect the histogram
shape and including them interferes with readability.
18. Build a dictionary of histograms in which the keys are years and the
values are dictionaries, as sketched below.
The value associated with the key year is a dictionary. The keys of these
inner dictionaries are the histogram intervals for the year. The values of
the inner dictionary are initialized as 0.
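A plausible construction, assuming the years of Table 3.1:

years = [2000, 2001, 2002, 2003, 2011, 2012, 2013, 2014]
histDict = {year: dict.fromkeys(intervals, 0) for year in years}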
19. Return to the for loop that processes fileList. We’ll fill the histogram
corresponding to the data file or equivalently, the year, as the file is pro-
cessed. The first step is to identify the histogram to be filled by adding the
instruction histogram = histDict[year] immediately after extracting
the field positions for weight and body mass index (instruction 11).
20. We will assume that a ValueError has not been thrown and thus body
mass index and sampling weight have been extracted successfully from
the record. Increment the sum of the weights for the histogram interval
that contains the value of body mass index. The for loop below iter-
ates over each interval in histogram. The lower bound of the interval is
interval[0] and the upper bound is interval[1].
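The loop is missing here; a reconstruction consistent with the description:

for interval in histogram:
    if interval[0] < bmi <= interval[1]:
        histogram[interval] += weight   # sampling weights, not counts
        break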
The break instruction terminates the for loop when the correct interval
has been found.
This code segment must be located inside the for loop initiated by
for record in f so that it executes every time bmiString is converted
to the float bmi. Indentation should be the same as the statement
if shortYear > 10 and bmiString != '    ':.
When the end of the file has been reached, histogram will con-
tain the sum of the weights shown in the numerator of Eq. (3.12).
The dictionary histDict[year] will also have been filled since we
set histogram = histDict[year] and the result of this instruction
is that histDict[year] and histogram reference the same location
in memory. You can test this by executing print(id(histogram),
id(histDict[year])). The function id reveals the unique identifier of
the object.10
21. It may be of interest to count the number of body mass index values
that are outside the interval (10, 75]. Initialize two variables before the
files are processed by setting them equal to zero. Give them the names
outCounter and n. Add the following instructions and indent them so
that they execute immediately after the code segment shown in instruc-
tion 20.
n += 1
outCounter += int(bmi < 10 or bmi > 75)
if n % 10000 == 0:
    print(year, n, outCounter)
The annual histograms will later be collapsed into two decade histograms; initialize the containers for that step:

decadeWts = [0]*2
decades = [0, 1]
decadeDict = {}
for decade in decades:
    decadeDict[decade] = dict.fromkeys(intervals, 0)
10 It's informative to submit the instruction a = b = 1 at the console. Then, submit a = 2 and print the value of b. The moral of this lesson is to be careful when you set two variables equal.
Since the for loop is to execute for each year, it must be aligned with
the instruction histogram = histDict[year].
25. We may now scale the decade histograms to contain the estimated pro-
portions of the population with body mass index falling in a particular
interval.
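The scaling code is not shown; a sketch, assuming decadeWts holds the accumulated total sampling weight for each decade:

for decade in decades:
    histogram = decadeDict[decade]   # two names for the same dictionary
    for interval in histogram:
        histogram[interval] /= decadeWts[decade]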
Again, we’ve used the fact that assigning one variable equal to another
variable only generates two names for the same variable (or memory
location).
26. We will use the Python module matplotlib to graph the histograms for
the two decades. In preparation for plotting, import the plotting function
pyplot from matplotlib and create a list x containing the midpoints of
the histogram intervals. Also create a list y containing the estimated pro-
portions associated with each interval. Exclude from x and y the intervals
beyond 50 since relatively few individuals have a body mass index greater
than 50.
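A sketch of the plotting code (labels and styling are choices, not prescriptions):

from matplotlib import pyplot as plt

x = [(l + u) / 2 for (l, u) in intervals if u <= 50]
for decade in decades:
    y = [decadeDict[decade][(l, u)] for (l, u) in intervals if u <= 50]
    plt.plot(x, y, label=['First', 'Second'][decade])
plt.xlabel('Body mass index')
plt.ylabel('Relative frequency')
plt.legend()
plt.show()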
28. Compute the percent change relative to the earlier decade, say
100(π̂2010 − π̂2000)/π̂2000 %. Does this statistic support the assertion that an
epidemic is occurring?11

11 We think so.
3.6.1 Synopsis
12 NASDAQ is the abbreviation for the National Association of Securities Dealers Automated Quotations system.
3.7 Multivariate Data
The diagonal elements σ1², . . . , σp² of Σ are the variances of the univariate
variables. The standard deviation σj = E[(Xj − μj)²]^{1/2} may be interpreted as the
average (absolute) difference between the mean µj and the realized values of
the variable. The off-diagonal elements of Σ are referred to as the covariances.
The principal use of the covariances is to describe the strength and direction
of association between two variables. Specifically, the population correlation
coefficient
ρjk = σjk/(σj σk) = ρkj   (3.15)
quantifies the strength and direction of linear association between the jth
and kth random variables. The correlation coefficient is bounded by −1 and
1, and values of ρjk near 1 or −1 indicate that the two variables are nearly
linear functions of each other. Problem 3.6 asks the reader to verify this
statement. When ρjk is positive, the variables are positively associated and
as the values of one variable increase, the second also tends to increase. If
ρjk < 0, the variables are negatively associated and as the values of one
variable increase, the second tends to decrease. Values of ρjk near 0 indicate
that the relationship, if any, is not linear.
The correlation matrix is

$$\rho_{p \times p} = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{bmatrix}.$$
3.7.2 Estimators
We will not use the traditional moment estimator of σᵢ² but instead use
an estimator that is nearly equivalent and not as messy for developing the
estimator as a function of an associative statistic. Specifically, we use n as
the denominator instead of the traditional n − 1. The choice of n simplifies
computations but introduces a downward bias in the estimators. When n is
not small, but instead is at least 100, the bias is negligible. In any case, the
variance estimator is
"
(xi,j − xj )2
$j2 = i
σ
n
= n−1
&" 2 2
' (3.16)
i xi,j − nxj
"
= n−1 i x2i,j − x2j .
Our estimator σ̂jk also differs from the traditional moment estimator of σjk
by using n as the denominator in place of n − 1, for the previous reason:
there's no practical difference and the mathematics are simpler.
The variance matrix Σ is estimated by

$$\widehat{\Sigma} = \begin{bmatrix} \hat{\sigma}_1^2 & \cdots & \hat{\sigma}_{1p} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{p1} & \cdots & \hat{\sigma}_p^2 \end{bmatrix} = \begin{bmatrix} n^{-1}\sum_i x_{i,1}^2 - \bar{x}_1^2 & \cdots & n^{-1}\sum_i x_{i,1}x_{i,p} - \bar{x}_1\bar{x}_p \\ \vdots & \ddots & \vdots \\ n^{-1}\sum_i x_{i,p}x_{i,1} - \bar{x}_p\bar{x}_1 & \cdots & n^{-1}\sum_i x_{i,p}^2 - \bar{x}_p^2 \end{bmatrix}. \tag{3.18}$$
13 The inner product of a vector w with itself is the scalar wᵀw (Chap. 1, Sect. 1.10.1).
Equivalently, in matrix form,

$$\widehat{\Sigma} = n^{-1}M - \bar{\mathbf{x}}\bar{\mathbf{x}}^T, \tag{3.19}$$

where

$$M_{p \times p} = \begin{bmatrix} \sum_i x_{i,1}^2 & \sum_i x_{i,1}x_{i,2} & \cdots & \sum_i x_{i,1}x_{i,p} \\ \sum_i x_{i,2}x_{i,1} & \sum_i x_{i,2}^2 & \cdots & \sum_i x_{i,2}x_{i,p} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_i x_{i,p}x_{i,1} & \sum_i x_{i,p}x_{i,2} & \cdots & \sum_i x_{i,p}^2 \end{bmatrix}.$$
The correlation matrix ρ is estimated by R = [rjk], where

$$r_{jk} = \frac{n^{-1}\sum_i (x_{i,j} - \bar{x}_j)(x_{i,k} - \bar{x}_k)}{\hat{\sigma}_j \hat{\sigma}_k}. \tag{3.20}$$
The matrix R is symmetric because rjk = rkj for each 1 ≤ j, k ≤ p. Furthermore, R can be computed from Σ̂ and the diagonal matrix D whose diagonal entries are σ̂1⁻¹, . . . , σ̂p⁻¹ via

R = D Σ̂ D.   (3.21)
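A compact NumPy illustration of Eqs. (3.19) and (3.21) on invented stand-in data:

import numpy as np

X = np.random.default_rng(0).random((1000, 3))   # one row per observation
n = X.shape[0]

M = X.T @ X                                  # moment matrix (sums of products)
xbar = X.mean(axis=0).reshape(-1, 1)
SigmaHat = M / n - xbar @ xbar.T             # Eq. (3.19)

D = np.diag(1 / np.sqrt(np.diag(SigmaHat)))  # reciprocals of the sigma-hats
R = D @ SigmaHat @ D                         # Eq. (3.21)
print(np.round(R, 3))                        # ones on the diagonal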
3.7.4 Synopsis
If income is associated with obesity, then educational level also may be associated with obesity. We may as well
include education in our investigation. The BRFSS asks survey respondents
to report household income, or more precisely to identify an income bracket
containing their income. Respondents also report their highest attained level
of education, body weight, and height. The CDC computes body mass index
from the reported weight and height of the respondents.
The clinical definition of obesity is a body mass index greater than or equal to 30 kg/m². From the data analytic standpoint, obesity is a binary variable and not suited for computing correlation coefficients.14 Furthermore, it's derived from a quantitative variable, body mass index. Body mass index generally is a better variable from an analytical standpoint than obesity. One need only note that a person is classified as obese whether their body mass index is 30 or 60 kg/m²; the consequences of excessive body fat differ substantially between the two levels of body mass index, yet the binary indicator of obesity does not reflect the differences. Our investigation will use body mass index and examine associations by computing the (pair-wise) correlation coefficients between body mass index, income level, and education level of respondents using BRFSS data.

14 Pearson's correlation coefficient is a measure of linear association. Linear association is meaningful when the variables are quantitative or ordinal.
Income level is recorded in the BRFSS databases as an ordinal variable. An
ordinal variable has properties of both quantitative and categorical variables.
The values x and y of an ordinal variable are unambiguously ordered so
that there’s no debate of whether x < y or vice versa. On the other hand, the
practical importance of numerical difference x−y may depend on x and y. For
example, there are eight values of income: 1, 2, . . . , 8, each of which identifies
a range of annual household incomes. A value of 1 identifies incomes less than
$10,000, a value of 2 identifies incomes between $10,000 and $15,000, and
so on. It’s unclear whether a one-level difference has the same meaning at the
upper and lower range of the income variable. Education is also an ordinal
variable ranging from 1 to 6, with 1 identifying the respondent as never
having attended school and 6 representing 4 or more years of college. Values
of income and education outside these ranges denote a refusal to answer the
question or inability to answer the question. We will ignore such records.
Rather than computing each correlation coefficient associated with the
three pairs of variables one at a time, a single correlation matrix containing
all of the coefficients will be computed. The BRFSS data files for the years
2011, 2012, 2013, and 2014 will suffice for the investigation. We build on the
tutorial of Sect. 3.6. The tutorials of Sect. 3.10 and of Chaps. 7 and 8 build
on this tutorial, so as a practical matter, you should plan on re-using your
code.
3.8 Tutorial: Computing the Correlation Matrix

1. Create a Python script and import the necessary Python modules:
import sys
import os
import importlib
import numpy as np

parentDir = '/home/Algorithms/'
if parentDir not in set(sys.path):
    sys.path.append(parentDir)
print(sys.path)
from ModuleDir import functions
importlib.reload(functions)
dir(functions)
Table 3.2 BRFSS data file names and field locations of the income, education, and age variables

                    Income        Education   Age
Year  File          Start   End   Field       Start   End
2011  LLCP2011.ZIP   124    125   122         1518    1519
2012  LLCP2012.ZIP   116    117   114         1629    1630
2013  LLCP2013.ZIP   152    153   150         2177    2178
2014  LLCP2014.ZIP   152    153   150         2232    2233
3. Create the field dictionary using the function fieldDictBuild that was
built in the tutorial of Sect. 3.6.
fieldDict = functions.fieldDictBuild()
4. Read all the files in the directory containing the data files. You may be
able to use your code from the Tutorial of Sect. 3.6 for this purpose. Ignore
any files from a year before 2011 and any other files not listed in Table 3.2
by intentionally creating an exception of the type ZeroDivisionError.
path = r'../Data/'
fileList = os.listdir(path)
for filename in fileList:
    try:
        shortYear = int(filename[6:8])
        if shortYear < 11:
            1/0
        year = 2000 + shortYear
        file = path + filename
        fields = fieldDict[shortYear]
    except(ValueError, ZeroDivisionError):
        pass
This code segment executes immediately after extracting the field posi-
tions of the three variables.
7. Extract the income variable from record and test whether the income
string is missing. Income is missing if the field consists of two blanks, i.e.,
'  '. If this is the case, then assign the integer 9 to income. We'll use the
value 9 as a flag indicating that the record should be ignored.

incomeString = record[sInc-1:eInc]
if incomeString != '  ':
    income = int(incomeString)
else:
    income = 9
8. The next tutorial also uses income, and so to reduce effort, copy the
code segment and create a function for processing the income string. The
function and its call can be set up as

def getIncome(incomeString):
    if incomeString != '  ':
        income = int(incomeString)
    else:
        income = 9
    return income

income = functions.getIncome(record[sInc-1:eInc])
13. If the three variables, income, body mass index, and education, have
been successfully converted to integers or floats, then create a vector w
containing values of the variables.

if education < 9 and income < 9 and 0 < bmi < 99:
    w = np.matrix([1, income, bmi, education]).T

The matrix A is initialized as a q × q matrix of zeros before the data are processed (q = 4 accounts for the constant 1):

q = 4
A = np.zeros(shape=(q, q))
15. After computing w, add the outer product of w with itself to the aug-
mented matrix A:
A += w*w.T
16. Instead of waiting until all of the data has been processed to compute
the correlation matrix, we’ll compute it whenever the number of valid
observations is a multiple of 10,000 since it takes some time to process
all of the data. The following code segment computes the mean vector x̄
and extracts the moment matrix M from A:

if n % 10000 == 0:
    M = A[1:,1:]
    mean = np.matrix(A[1:,0]/n).T
Note that the Numpy function diag has been used in two different ways.
In the first application, np.diag(SigmaHat) extracted the diagonal el-
ements as a vector s because diag was passed a matrix. In the second
application, diag was passed a vector containing the reciprocals of the
standard deviation estimates σ $i−1 , i = 1, . . . , p, and the function inserted
the vector as the diagonal of a p × p matrix of zeros.
19. The product of two conformable Numpy matrices A and B is computed by
the instruction A*B. Compute the correlation matrix R = DΣ̂D, as sketched below.
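A sketch of the computations described above (SigmaHat and D are the names used in the text; s is a guess):

SigmaHat = M/n - mean*mean.T       # Eq. (3.19)
s = np.sqrt(np.diag(SigmaHat))     # standard deviation estimates
D = np.diag(1/s)                   # diagonal matrix of reciprocals
R = D*SigmaHat*D                   # Eq. (3.21)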
20. Verify that your code works correctly by computing R after the first 10,000
observations have been processed. You may use the instruction sys.exit()
to stop execution immediately after computing R. The diagonal elements
of R must be equal to 1, and the off-diagonal elements must be between
−1 and 1; if not, then there is an error in the code.

If your correlation matrix does not conform to these constraints,
then it may be helpful to check the calculations of Σ̂ and D using R.
First, initialize a matrix to contain the data using the instruction X =
np.zeros(shape = (10000,3)). Store the data as it is processed. Because X is zero-indexed, subtract one from n:

if n <= 10000:
    X[n-1,:] = np.ravel(w)[1:]   # drop the leading 1 from w
21. Write the data to a file using the following instruction when n has reached
10,000:
np.savetxt('../X.txt', X, fmt='%5.2f')
The file will be space delimited. The field width is at least 5 and there
are two places to the right of the decimal point for each number. Using
R, read the data into a data frame using the instruction
X = read.table('../X.txt')
Table 3.3 The sample correlation matrix between income, body mass index, and education computed from BRFSS data files from years 2011, 2012, 2013, and 2014, n = 1,587,668

                 Income   Body mass index   Education
Income             1         −.086            .457
Body mass index  −.086         1             −.097
Education         .457       −.097             1
There’s a very weak association between body mass index and income
since the correlation coefficient between the variables is −.086.
3.8.1 Conclusion
A correlation coefficient rij between −.3 and .3 indicates weak or little linear
association between variables i and j. Moderate levels of association are in-
dicated by .3 ≤ |rij | ≤ .7, and strong linear association is indicated by values
greater than .7 in magnitude. Based on Table 3.3, it is concluded that there
is little association between body mass index and income and between body
mass index and education. The negative association indicates that there is a
tendency for higher levels of income and education to be associated with lower
levels of body mass index. Considering the complex relationship between diet,
behavior, genetics, and body mass, the results are not unexpected. The data
do not directly measure these variables, but we might expect that income and
education would reflect these variables. In hindsight, these variables are in-
adequate proxies for diet quality and nutritional knowledge. We’re not ready
to give up on these data though. Let us investigate a related question with
the data: to what extent do income, education, and body mass index jointly
explain variation in a person’s perception of their health? The next section
introduces a method of investigating the question.
3.9 Introduction to Linear Regression

Our interest lies in the relationship between a target variable and a set of
predictor variables. The target has a preeminent role in the analysis whereas
the predictor variables are in a sense, subordinate to the target. The linear
regression model describes the target value as two terms: one describing the
expected value of the target as a function of the predictors and a second being
a random term that cannot be explained by the predictors. In statistical
terms, the target variable is a random variable Y and the predictor vector
x is considered to be a vector of known values. The model is Y = f (x) + ε.
The term f (x) is unknown and needs to be estimated, and the term ε is
a random variable, commonly assumed to be normal in distribution, and
basically, a nuisance. A recurrent theme of regression analysis is minimizing
the importance of ε by finding a good form for the model f (x). We need not
be concerned with the distribution of ε at this point.
For example, Y may represent an individual’s health measured as a score
between 1 and 6, and x may represent concomitant measurements on variables
thought to be predictive of overall health. To investigate the relationship or
to predict an unobserved target value using observed predictor variables, a
linear model is adopted of the expected value of Y . We may think of observing
many individuals, all with the same set of predictor variable values. Their
health scores will vary about some mean value, and the linear regression
model describes that mean as a linear function of the predictor variables. It
is hoped that the magnitudes of the random terms are small so that Y ≈ f (x).
If this is the case, then the target can be predicted from the linear model with
some degree of accuracy.
The linear regression model specifies that E(Y |x), the expected value, or
mean of Y , is

E(Y | x) = β0 + β1x1 + · · · + βpxp.

The predictor vector consists of the constant 1 and the predictor variable
values; hence, x = [1 x1 · · · xp]ᵀ. The parameter vector is
β = [β0 β1 · · · βp]ᵀ, of length q = p + 1.
Minimizing S(β) with respect to β is the least squares criterion. The objec-
tive function S(·) can be expressed in matrix form by stacking the predictor
vectors as rows of the matrix

$$X_{n \times q} = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix},$$

where each row xᵢᵀ is 1 × q.
E(Y|X) = Xβ,   (3.26)

where X is n × q and β is q × 1.
Exercise 3.3 guides the reader through the derivation. The solution to the
normal equations, and therefore the least squares estimator of β, is
β̂ = (XᵀX)⁻¹Xᵀy.   (3.29)
Not only are the matrices A and z small, but they can be computed without
holding all of the data in memory. To understand why A = XT X (for-
mula (3.22)), let xk denote the n × 1 column vector of observations on the
kth predictor variable, k = 1, 2, . . . , p, and 1 denote an n × 1 vector of ones.
Then, XᵀX can be expressed as a matrix consisting of the q² inner products
since

$$X^T X = \begin{bmatrix} \mathbf{1}^T \\ x_1^T \\ \vdots \\ x_p^T \end{bmatrix} \begin{bmatrix} \mathbf{1} & x_1 & \cdots & x_p \end{bmatrix} = \begin{bmatrix} \mathbf{1}^T\mathbf{1} & \mathbf{1}^T x_1 & \cdots & \mathbf{1}^T x_p \\ x_1^T\mathbf{1} & x_1^T x_1 & \cdots & x_1^T x_p \\ \vdots & \vdots & \ddots & \vdots \\ x_p^T\mathbf{1} & x_p^T x_1 & \cdots & x_p^T x_p \end{bmatrix} \tag{3.31}$$

$$= \begin{bmatrix} n & \sum_i x_{i,1} & \cdots & \sum_i x_{i,p} \\ \sum_i x_{i,1} & \sum_i x_{i,1}^2 & \cdots & \sum_i x_{i,1}x_{i,p} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_i x_{i,p} & \sum_i x_{i,p}x_{i,1} & \cdots & \sum_i x_{i,p}^2 \end{bmatrix} = \sum_{i=1}^{n} x_i x_i^T.$$
The vector xᵢᵀ is the ith row vector of X. (We're using the index i for row
vectors and k for column vectors.) Both A and z can be computed by iterating
over observation pairs (x1, y1), . . ., (xn, yn). It's useful to have one further
expression for β̂ that explicitly shows the predictor vectors:

$$\hat{\beta} = \Big(\sum_{i=1}^{n} x_i x_i^T\Big)^{-1} \sum_{i=1}^{n} x_i y_i.$$
Let t(D) = (A, z) denote the pair of statistics. It’s also associative since
A and z are both associative. A scalable algorithm for computing t(D) is
obtained by a minor modification of the algorithm for computing the corre-
lation matrix. As before, the algorithm iterates over the observations in the
data set D = {(y1 , x1 ), . . . , (yn , xn )}. The preliminary steps are:
1. Initialize A as a q × q matrix of zeros.
2. Initialize z as a q-vector of zeros.
Process D sequentially and carry out the following steps:
1. From the ith pair of D, extract yi, xi,1, . . . , xi,p and construct the q × 1
vector xi according to

{xi,1 · · · xi,p} → [1 xi,1 · · · xi,p]ᵀ = xi.

2. Update A and z:

A + xi xiᵀ → A   and   z + yi xi → z.
3. If the data has been partitioned as sets D1 , . . . , Dr , then compute
A1 , . . . , Ar and z1 , . . . , zr as described above. Upon completion, com-
pute
$$A = \sum_{j=1}^{r} A_j, \qquad z = \sum_{j=1}^{r} z_j. \tag{3.34}$$
4. Compute β̂ = A⁻¹z.
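The four steps, sketched with invented data (the true parameter vector is [1, 2, −1]ᵀ, which β̂ should approximately recover):

import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 2
Xraw = rng.normal(size=(n, p))
y = 1 + 2*Xraw[:, 0] - Xraw[:, 1] + rng.normal(scale=0.1, size=n)

q = p + 1
A = np.zeros((q, q))               # initialize A as a q x q matrix of zeros
z = np.zeros(q)                    # initialize z as a q-vector of zeros

for i in range(n):                 # process D sequentially
    x = np.concatenate(([1.0], Xraw[i]))
    A += np.outer(x, x)            # A + x_i x_i^T -> A
    z += y[i] * x                  # z + y_i x_i -> z

betaHat = np.linalg.solve(A, z)    # step 4
print(betaHat)                     # approximately [1, 2, -1]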
where ŷᵢ = xᵢᵀβ̂ is the ith fitted value. Except in special circumstances, it's
difficult to interpret σ̂²reg without reference to σ̂². This difficulty motivates a
relative measure of model fit, the adjusted coefficient of determination

$$R^2_{\mathrm{adjusted}} = \frac{\hat{\sigma}^2 - \hat{\sigma}^2_{\mathrm{reg}}}{\hat{\sigma}^2}. \tag{3.36}$$
Turning now to the matter of computing R²adjusted, formula (3.36) is a simple
function of the terms σ̂² and σ̂²reg. Equation (3.35) suggests that the data
must be processed twice to compute the terms: the first time to compute ȳ
and β̂, and the second time to compute the sum of the squared deviations.
However, we saw earlier (formula (3.16)) that σ̂² may be computed using a
single pass through the data since σ̂² = n⁻¹Σ yᵢ² − ȳ². The regression mean
square error also can be computed without a second pass through the data
since

$$\hat{\sigma}^2_{\mathrm{reg}} = \frac{\sum_i y_i^2 - z^T\hat{\beta}}{n - q}, \tag{3.37}$$

where

$$z_{q \times 1} = \sum_{i=1}^{n} x_i y_i.$$
$$R^2_{\mathrm{adjusted}} = \frac{\hat{\sigma}^2 - \hat{\sigma}^2_{\mathrm{reg}}}{\hat{\sigma}^2} \approx \frac{\big(n^{-1}\sum_i y_i^2 - \bar{y}^2\big) - n^{-1}\big(\sum_i y_i^2 - z^T\hat{\beta}\big)}{\hat{\sigma}^2} = \frac{n^{-1}z^T\hat{\beta} - \bar{y}^2}{\hat{\sigma}^2}. \tag{3.38}$$
The alternative to computing σ̂²reg as described above utilizes a second
pass through the data file. The sole purpose is to compute the sum of the
squared prediction errors

$$\sum_{i=1}^{n} (y_i - x_i^T\hat{\beta})^2 = (n - q)\,\hat{\sigma}^2_{\mathrm{reg}}. \tag{3.39}$$
3.10 Tutorial: Computing β̂

For a long time it has been argued that income is related to health [16,
17]. The evidence supporting this contention is clouded by the difficulty of
quantifying a condition as complex as health. The problem is also confounded
with confidentiality issues that arise with potentially sensitive data on the
health of individuals. In this tutorial, we investigate the question by exploring
data from the Behavioral Risk Factor Surveillance System Survey (BRFSS).
Specifically, we build a regression model of a very simple measure of health,
the answer to the question would you say that in general your health is: The
question is multiple choice and the possible answers are shown in Table 3.4.
Table 3.4 Possible answers and codes to the question would you say that in general your health is:

Code   Descriptor
1      Excellent
2      Very good
3      Good
4      Fair
5      Poor
7      Don't know or not sure
9      Refused
Blank  Not asked or missing
Table 3.5 BRFSS data file names and field positions of the general health variable. The general health variable contains a respondent's assessment of their health. Table 3.4 describes the general health codes and the meaning of each

Year  File          Position
2011  LLCP2011.ZIP  73
2012  LLCP2012.ZIP  73
2013  LLCP2013.ZIP  80
2014  LLCP2014.ZIP  80
This tutorial builds on the correlation tutorial of Sect. 3.8. The instructions
that follow assume that the reader has written a script that computes a
correlation matrix for some of the BRFSS variables. If you do not have a
script, then you should go through the tutorial and create it.
1. Load modules and functions that were used in the previous tutorial:

import sys
import os
import importlib
import numpy as np

parentDir = '/home/Algorithms/'
if parentDir not in set(sys.path):
    sys.path.append(parentDir)
print(sys.path)
from ModuleDir import functions
importlib.reload(functions)
dir(functions)

fieldDict = functions.fieldDictBuild()
4. Create a list of files in your data directory and iterate through the list.
Process the files from the years 2011, 2012, 2013, and 2014.
path = r'../Data/'
fileList = os.listdir(path)
print(fileList)
n = 0
for filename in fileList:
    try:
        shortYear = int(filename[6:8])
        if shortYear < 11:
            1/0
        year = 2000 + shortYear
        file = path + filename
        print(filename)
        fields = fieldDict[shortYear]
    except(ValueError, ZeroDivisionError):
        pass
Since there is an exception handler in the code to skip files that are
not data files, we also use the exception handler to skip data from years
preceding 2011. The last statement in the code segment retrieves the field
positions for the current year.
5. Extract the field positions for education, income, and body mass index
using the functions that you created in the previous tutorial. Get the field
position fGH for general health.

fEduc = fields['education']
sInc, eInc = fields['income']
sBMI, eBMI = fields['bmi']
fGH = fields['genhlth']
6. Process the data file by iterating over records. Extract the predictor variable values:
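A sketch of the segment, reusing getIncome and convertBMI from the earlier tutorials (getEducation is a hypothetical companion function):

with open(file, 'r') as f:
    for record in f:
        income = functions.getIncome(record[sInc-1:eInc])
        education = functions.getEducation(record[fEduc-1])
        bmi = functions.convertBMI(record[sBMI-1:eBMI], shortYear)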
This code segment executes within the for loop that iterates over
fileList. Therefore, the for record in f statement must have the
same level of indentation as the preceding statements in the try branch.
def getHlth(HlthString):
    ...
    return genHlth

y = functions.getHlth(record[fGH-1])
Check that the function getHlth works correctly by printing the return
value y. When the function works correctly, move it to the functions
module. Delete the code segment from the main program. Call getHlth
after computing bmi.
10. If the extracted values are all valid, then form the predictor vector x from
education, income, and body mass index. The vector x is created as a
4×1 Numpy matrix so that we may easily compute the outer product xxT .
If one or more extracted values are not valid, then a ZeroDivisionError
exception is created and program flow is directed to the next record.
try:
    if education < 9 and income < 9 and 0 < bmi < 99 and y != -1:
        x = np.matrix([1, income, education, bmi]).T
    else:
        1/0
except(ZeroDivisionError):
    pass
11. The next set of operations build the matrices A and z from x and y.
But before computing the matrices A and z, it is necessary to initialize them:
q = 4
A = np.zeros(shape = (q, q))
z = np.matrix(np.zeros(shape = (q, 1)))
sumSqrs = 0
n = 0

The variable sumSqrs will contain Σᵢ yᵢ² (needed to compute σ̂²).
12. Returning to the inner for loop, as each record is processed, update
A and z. This code follows the test for valid values of the target and
predictor variables.
A += x*x.T
z += x*y
sumSqrs += y**2
n += 1
These instructions are executed only if the test for valid values is True
(instruction 10). Thus, the code segment follows immediately after up-
dating A.
13. After processing a large number of observations, compute

β̂ = A⁻¹z.   (3.40)

You can compute A⁻¹ using the Numpy function linalg.inv(), say,
invA = np.linalg.inv(A), and then compute betaHat = invA*z. However,
from a numerical standpoint it's preferable not to compute A⁻¹ but
instead solve the linear system of equations Aβ = z for β using an LU-factorization
optimized for symmetric matrices.15 The solution to the
system will be β̂. Compute β̂ after processing successive sets of 10,000
observations:
if n % 10000 == 0 and n != 0:
    b = np.linalg.solve(A, z)
    print('\t'.join([str(float(bi)) for bi in b]))
15 The LU-factorization method is faster and more accurate than computing the inverse and then multiplying.
Compute

$$R^2_{\mathrm{adjusted}} = \frac{\hat{\sigma}^2 - \hat{\sigma}^2_{\mathrm{reg}}}{\hat{\sigma}^2} \tag{3.41}$$

using the variance estimates

$$\hat{\sigma}^2 = n^{-1}\sum_i y_i^2 - \bar{y}^2 \quad \text{and} \quad \hat{\sigma}^2_{\mathrm{reg}} = n^{-1}\Big(\sum_i y_i^2 - z^T\hat{\beta}\Big). \tag{3.42}$$

All terms necessary to compute R²adjusted have been computed except for
ȳ = n⁻¹Σᵢ yᵢ. The sum Σᵢ yᵢ is stored in row zero of z. Hence, σ̂² is
computed using the instruction
ybar = z[0]/n
varEst = sumSqrs/n - ybar**2
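A possible completion for σ̂²reg and R²adjusted per Eq. (3.42); the names regVarEst and rSqAdj are invented:

regVarEst = (sumSqrs - float(z.T.dot(b))) / n   # z and b inner-multiplied with .dot
rSqAdj = (varEst - regVarEst) / varEst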
The .dot operator is used to compute the inner product of Numpy ar-
rays z and b. Since the Numpy function linalg.solve does not return
a matrix (it returns an array), the matrix multiplication operator * will
not multiply z and b correctly. You can determine the type of an object
using the function type().
16. Print the value of R²adjusted and β̂ whenever n mod 10,000 = 0 and n > 0
by putting the following code within the if branch of the conditional
statement of instruction 13.
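The code is not shown here; one possibility, using the names introduced above (converting to float before rounding, per footnote 16):

print(n, round(float(rSqAdj), 3))
print('\t'.join([str(round(float(bi), 3)) for bi in b]))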
3.10.1 Conclusion
16 Numpy matrices and arrays cannot be rounded even if they are of length 1 or 1 × 1 in dimension.
17 Interpretation of regression coefficients is discussed at length in Chap. 6.
3.11 Exercises
3.11.1 Conceptual
3.1. Suppose that s is a statistic and that for any two sets D1 and D2 that
form a partition of the set D, s(D1 ∪ D2 ) = s(D1 ) + s(D2 ). Suppose that
r > 2 and D1 , D2 , . . . , Dr is also a partition of D. Argue that s(D) = s(D1 )+
s(D2 ) + · · · + s(Dr ).
3.2. Show that the statistic t(D) = (A, z) (Eqs. (3.22) and (3.32)) is an
associative statistic. Recall that pairs are added component-wise; hence,
(x1 , y1 ) + (x2 , y2 ) = (x1 + x2 , y1 + y2 ).
3.3. (Requires multivariable calculus.) Show that the least squares estimator
is the solution to the normal equations (3.28). Specifically, minimize the
objective function S(β) (Eq. (3.25)) with respect to β: differentiate S(β)
with respect to β, set the q × 1 vector of partial derivatives equal to 0, and
solve for β. The solution is the least squares estimator.
3.4. Recall that σ̂² can be written as

σ̂² = Σᵢ (xᵢ − μ̂)² / n,

and consider the variance estimator presented in Eq. (3.16).
(a) Verify that Eq. (3.3) is correct; that is, verify that μ̂ and σ̂² can be
computed from s1, s2, and s3 according to the formulae.
(b) Conventional statistical texts advocate using the sample variance

s² = Σᵢ (xᵢ − μ̂)² / (n − 1)

to estimate σ² because it is unbiased. In fact, σ̂² is a biased estimator of
σ². Show that the difference between σ̂² and s² tends to 0 as n → ∞.
(c) Assuming that j and X have been constructed as Numpy matrices, give a
one-line Python instruction that will compute x̄ᵀ.
(d) Note the similarity between Eq. (3.43) and the least squares estimator of
β. What can you deduce about the sample mean as an estimator?
3.6. Suppose that X2 = aX1 + b, where X1 is a random variable with finite
mean and variance and a ≠ 0 and b are real numbers. Show that the
population correlation coefficient satisfies ρ12 = 1 if a > 0 and ρ12 = −1 if a < 0.
3.7. Show that

R = D Σ̂ D.

Then, prove that β̂ᵀXᵀXβ̂ = zᵀβ̂.
3.11.2 Computational
3.9. The standard deviation of the ith parameter estimator β̂i is estimated
by the square root of the ith diagonal element of the matrix

$$\widehat{\mathrm{var}}(\hat{\beta}) = \hat{\sigma}^2_{\mathrm{reg}}\,(X^T X)^{-1}. \tag{3.45}$$
website https://www.cdc.gov/brfss/annual_data/2014/pdf/codebook14_llcp.pdf.
Males are identified in the BRFSS data files by the value 1 and females are
identified by 2. Let xj denote the gender of the jth respondent.
Compute two estimates: the conventional sample proportion, and a weighted
proportion using the BRFSS sampling weights. The two estimators are
"n
p1 = n−1 j=1 IF (xj )
"n
j=1 wj IF (xj )
and p2 = "n , (3.47)
j=1 wj
where IF (xj ) takes on the value 1 if the jth sampled individual is female and 0
otherwise, wj is the BRFSS sampling weight assigned to the jth observation,
and n is the number of observations. Compare to published estimates of the
proportion of females in the adult U.S. population. The U.S. Census Bureau
reports that in 2010, 50.8% of the population were female.