HW1 Updated Sep9 2014
Homework 1
Note: Only a subset of questions will be graded. However, you are required to submit solutions to
all eight questions (1 - 8). Practice question solutions should not be turned in for grading. You are
encouraged to try them out on your own.
For questions specific to this homework please contact Xi Chen and Guruprasad Nayak.
Q1.
Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative
(nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one
interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
1. Intensity of a pixel in an 8-bit Grayscale image (http://en.wikipedia.org/wiki/Grayscale)
2. Frequency of light incident on Earth's surface (http://en.wikipedia.org/wiki/Frequency#Light)
3. Brightness as measured by people's judgments
4. HTML color codes (for example, Yellow is encoded as FFFF00. http://html-colorcodes.info/color-names/)
5. Year, as measured on the Gregorian calendar
Q2.
Decide which of the similarity measures described in Chapter 2 would be most appropriate for the
following situations and why.
1. Suppose two of your friends are numismatists (collecting coins from different countries as a
hobby). You also have coins from various countries. You want to decide which friend has the most
similar collection to you. Hint: You can represent each collection as a vector of length 196 of the
official independent countries of the world, where the corresponding entry denotes the number of
coins collected for that country. State any assumption you make about the coin collections.
2. Suppose you measure the temperatures every day at a particular location (e.g., the airport) in 10
widely distributed major cities across the country. Similarity is to be computed between the
temperatures of these cities today and the same day of last month. (You have a vector of all 10 cities'
temperatures for today and a vector of all 10 cities' temperatures for the same day of last month.)
3. A nutritionist wants to measure the dissimilarity of you and your friend based on the following
attributes: your height (in meters), weight (in pounds), and your dietary requirement (in calories).
Note that the feature set includes only continuous features.
Q3.
Mapping fire activity is an important problem for supporting climate and carbon cycle studies as
well as forest management. Automated approaches to address this problem use reflectance data
collected from satellites orbiting the Earth to learn a classifier that can differentiate burned areas on
the ground from the unburned ones. Consider the output of two such classification algorithms.
The input to the algorithm is the reflectance in 7 wavelength ranges for each 1km by 1km region in
a 1200km by 1200km region (tile), i.e., the data is a matrix with 1200*1200=1,440,000 rows and
7 columns, having the reflectance values for each of the 1,440,000 pixels in each of the 7
wavelength regions.
The algorithms output a probability of fire activity for each pixel on the day the satellite imagery
was taken. This probability is also part of the data.
1. We are interested in knowing if the two algorithms agree on the presence or absence of fire at a
given location (pixel), i.e., a True/False answer at each location. How would you pre-process the given
data to answer this?
2. The consensus of the two algorithms on the fire activity in the given region (the whole image)
can be quantified by counting the number of such agreements and disagreements with respect to the
predictions of burned/unburned. Considering the output of each algorithm on the whole region as a
vector of length 1,440,000 (one value for each pixel) and doing the preprocessing in step 1, suggest
a suitable similarity measure to quantify the consensus between the two outputs. Note that
consensus between the outputs at a location implies that they both agree that it is a positive (burned
location) or they both call it a negative (unburned location). (For this question, assume that the
number of fire and non-fire pixels is roughly equal.)
3. Typically, the number of fire pixels is much less in comparison to the number of non-fire pixels
(<0.1%). Does this have any effect on the similarity measure suggested in part 2? Explain.
4. Given that the data in hand is spatial in nature, how would you address the issue of missing
reflectance data for a certain pixel?
Q4
For each of the following vectors, x and y, calculate the values of the indicated similarity or
distance measures:
1. x = (1 1 0 0 0), y = (0 0 0 1 1) Jaccard, Cosine, Euclidean, Correlation, mutual information
2. x = (0 1 2 4 5 3), y = (5 6 7 9 10 8) Cosine, Euclidean, Correlation
3. x = (0 1 2 4 5 3), y = (5 5 5 5 5 5) Cosine, Euclidean, Correlation
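As a reference for the measures named above, here is a minimal Python sketch of Jaccard, cosine, Euclidean distance, and (Pearson) correlation. It assumes the standard textbook definitions, which are presumed to match Chapter 2; the function names and the vectors u and v are illustrative and are not the vectors from the question. Mutual information is left out because it requires an additional choice of how to estimate the joint distribution.

```python
import math

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: f11 / (f01 + f10 + f11)."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    denom = f11 + mismatches
    # Returning 0.0 for two all-zero vectors is a convention assumed here.
    return f11 / denom if denom else 0.0

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    """Pearson correlation; raises ZeroDivisionError if a vector has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

# Illustrative binary vectors (not taken from the homework).
u = [1, 0, 1, 1]
v = [1, 1, 0, 1]
print(jaccard(u, v), cosine(u, v), euclidean(u, v), correlation(u, v))
```

Note that correlation is undefined when either vector is constant, so the sketch lets the division by zero surface rather than returning a made-up value.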
Q6
Both the L1 distance and correlation are widely used to compare two time series. The most
appropriate measure depends on the specific problem or domain requirements. Decide the proper
measure for each of the following scenarios in climate data analysis:
1. We want to compare the temperatures of two cities for 100 days with respect to their levels.
2. We want to compare the temperatures of two cities for 100 days with respect to the trends (up
and down) that occur in temperature during the 100 days.
Q7.
Data reduction (sampling, dimensionality reduction, or selecting a subset of features) is necessary
or useful for a wide variety of reasons, but can be problematic if information necessary to the
analysis is lost in the process. The following questions explore several issues at a conceptual level.
1. Assume the property of interest is the rate at which a particular event occurs, i.e.,
rate = number of times a particular type of event occurs / total number of all events.
a) If the event occurs at a rate of 0.001, i.e., 0.1% of the time, then what problems, if any, would
you encounter in trying to estimate the rate from a single sample of size 100?
b) If the event occurs at a rate of 0.50, i.e., 50% of the time, then what problems, if any, would you
encounter in trying to estimate the rate from a single sample of size 100?
2. You are given a data set of 10,000,000 time series, each of which records the temperature of the
Earth at a particular location on the surface of the Earth daily for 10 years. The locations are
arranged in a regular grid that covers the surface of the Earth. (Details of the exact nature of the
grid are unimportant. The important fact is that each point has neighbors to the left and right, up
and down.) Note that temperature displays considerable autocorrelation, i.e., the temperature at a
given location and time is similar to that of nearby locations and times. The size of the data needs
to be reduced so that you can apply your favorite data analysis algorithm. Both aggregation and
sampling could be used to reduce the amount of data.
a). If you use aggregation, would you aggregate over location or time or both?
b). How would you use the spatial and temporal autocorrelation of temperature to guide you in
aggregating the data?
c). If you use sampling, would you sample over location or time or both?
d). Would you prefer aggregation or sampling or both? (You can argue any of these as long as you
support your answer.)
Q8.
1. You are given a set of m objects that is divided into K groups, where the ith group is of size mi. If
the goal is to obtain a sample of size n<m, what is the difference between the following two
sampling schemes? (Assume sampling with replacement.)
(a) We randomly select n * mi / m elements from each group.
(b) We randomly select n elements from the data set, without regard for the group to which an
object belongs.
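The two sampling schemes in part 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, groups are represented as plain lists, sampling is with replacement (as the question assumes), and n * mi / m is rounded to the nearest integer because it need not be whole.

```python
import random

def sample_per_group(groups, n, rng):
    """Scheme (a): draw about n * m_i / m elements from each group, with replacement."""
    m = sum(len(g) for g in groups)
    sample = []
    for g in groups:
        k = round(n * len(g) / m)  # rounding is an assumption of this sketch
        sample.extend(rng.choices(g, k=k))
    return sample

def sample_pooled(groups, n, rng):
    """Scheme (b): draw n elements from the pooled data, ignoring group membership."""
    pooled = [obj for g in groups for obj in g]
    return rng.choices(pooled, k=n)

# Illustrative data: three groups of sizes 50, 30, and 20 (m = 100).
groups = [[("A", i) for i in range(50)],
          [("B", i) for i in range(30)],
          [("C", i) for i in range(20)]]
rng = random.Random(0)

sample_a = sample_per_group(groups, 10, rng)
sample_b = sample_pooled(groups, 10, rng)

def group_counts(sample):
    """Count how many sampled objects came from each group label."""
    counts = {}
    for label, _ in sample:
        counts[label] = counts.get(label, 0) + 1
    return counts

print(group_counts(sample_a))  # fixed by construction
print(group_counts(sample_b))  # varies with the random draw
```

Running both functions repeatedly with different seeds and comparing the per-group counts is one way to explore the question empirically.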
2. Consider the problem of learning a classifier for forest fire mapping as described in question 3.
The classification algorithm needs to be trained to recognize the signatures of burned pixels (and
how they are different from non-burned ones). Typically, there is a huge imbalance in the number
of fire (<0.1% of the total area) and non-fire pixels in any given region. Keeping this in mind,
answer the following questions:
(a) What would happen if one were to randomly sample some locations (pixels) for training the
classifier ignoring the presence/absence of fire at the location during sampling?
(b) How would stratified sampling help here?
Practice questions:
Q1
Consider a data set where the objects are images from a weather satellite and each image consists
of one million pixels. (Assume that each pixel consists of a real value representing the brightness.
Also, assume that the images are snapshots of different areas and do not represent images of the
same area at successive intervals in time.) The data can be represented as record data, where each
image is a record (object) and each pixel is an attribute.
1. Is there any spatial autocorrelation? Explain.
2. An image often has missing values for scattered pixels. (A pixel is missing, but those around it
are not.) Which of the three techniques for handling missing values (p. 40-41) would be the most
appropriate for this situation and why?
3. Some images are missing large blocks of pixels. Which of the three techniques for handling
missing values (p. 40-41) would be the most appropriate for this situation and why?
4. Would any of the following proximity measures (correlation, cosine, or Euclidean) be suitable for
computing similarity/distance among images? Explain.
Q2
Consider a document-term matrix, where tfij is the frequency of the ith word (term) in the jth
document and m is the number of documents. Consider the variable transformation that is defined
by
tf'ij = tfij * log(m / dfi)    (1)
where dfi is the number of documents in which the ith term appears and is known as the document
frequency of the term. This transformation is known as the inverse document frequency
transformation.
1) What is the effect of this transformation if a term occurs in one document? In every document?
2) What might be the purpose of this transformation?
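The transformation in equation (1) can be sketched in Python as below. The sketch assumes a natural logarithm (the handout does not state the base), stores the matrix as a list of rows with one row per term, and assigns weight 0 to a term that appears in no document; the function name and the example matrix are illustrative only.

```python
import math

def idf_transform(tf):
    """Apply tf_ij * log(m / df_i) to a term-by-document frequency matrix.

    tf[i][j] is the frequency of term i in document j.
    """
    m = len(tf[0])  # number of documents
    transformed = []
    for row in tf:
        df = sum(1 for f in row if f > 0)  # document frequency of this term
        # A term occurring in no document gets weight 0 (assumption of this sketch).
        weight = math.log(m / df) if df else 0.0
        transformed.append([f * weight for f in row])
    return transformed

# Illustrative matrix: 2 terms, 4 documents.
tf = [[2, 0, 1, 0],   # term 0 appears in 2 of 4 documents
      [1, 3, 0, 2]]   # term 1 appears in 3 of 4 documents
result = idf_transform(tf)
print(result)
```

Trying matrices in which a term appears in exactly one document, or in all of them, is a quick way to check an answer to part 1.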
Q3
Justify your answers.
1. Is the Jaccard coefficient for two binary strings (i.e., string of 0s and 1s) always greater than or
equal to their cosine similarity?
2. The cosine measure can range between [-1,1]. Give an example of a type of data for which the
cosine measure will always be non-negative.
Q4
Consider the following distance measure:
d(x, y) = |max(x) - max(y)|
where x and y are real-valued vectors.
1. Let x = [3, 4, 2, 5, 1] and y = [8, 6, 9, 3, 0]. Compute their distance using the above
measure.
2. State whether the above measure has the following properties.
i. Positive definiteness
ii. Symmetry
iii. Triangle Inequality
iv. Is d (x, y) a metric?
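For experimenting with the properties listed above, the measure can be written as a short Python function. This sketch assumes the measure is the absolute difference between the largest components of the two vectors; the function name and the vectors p and q are illustrative and are not the ones from part 1.

```python
def max_distance(x, y):
    """Absolute difference between the largest entries of x and y."""
    return abs(max(x) - max(y))

# Illustrative vectors (not taken from the homework).
p = [1, 7, 2]
q = [4, 3, 9]
print(max_distance(p, q))  # |7 - 9| = 2
```

Evaluating the function on a few hand-picked pairs, including distinct vectors that share the same maximum, can help test each property before arguing it formally.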
Q5
The population for a clinical study has 500 Asian, 1000 Hispanic and 500 Native American people.
What is a good way of sampling this population to ensure that the distribution of various
subpopulations is maintained if only 100 samples have to be chosen? Give the distribution of the
various sub-populations in the final sample.
Q6
Suppose you want to analyze the blood-pressure data collected for 100 Intensive Care Unit patients,
measured every hour over a period of a month. However, many of the patients have missing
values at some time points. Among the three techniques for missing value estimation discussed
in the book (Page 41): (i) eliminating data objects, (ii) estimating missing values, and (iii)
designing a robust algorithm, which one would you prefer and why? Explain briefly.
Q7
Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative
(nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one
interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
a. Age as measured by whether the age in years is less than or equal to 30 (value = 0) or greater than 30 (value = 1).
b. Speed of a vehicle measured in mph.
c. Intensity of rain as indicated using the values: no rain, intermittent rain, incessant rain.
d. Different flavors of ice cream.
Q8
A group of biologists conducts a field study to evaluate the relative occurrence of different types of
birds at a number of locations. Their data is collected in a table where the rows correspond to
locations, the columns correspond to different species of birds, and the (i,j)th entry is the number
of birds of the jth species at the ith location. The following table indicates the representation, but is
intended only for illustration.
Location/Species   Blue Jay   Crow   Robin   Sparrow
Alabama   30   50
Arkansas   15   50
Wisconsin   20   10
Answer the following questions based on the above information.
a. To which type of data described in Chapter 2 is this data most similar?
b. Describe the attribute type.
c. What proximity measure would you use if you wanted to find areas that were similar in terms of
the percentage of birds of each species? Explain.
d. Suppose that you only care about the presence or absence of a bird species at a location. How
would you transform the data and to which type of data set in Chapter 2 would this correspond?
e. What types of pairs of locations (represented as pairs of rows in the data matrix) would yield the
same similarity score even after performing the transformation proposed in (d)? Also provide the
similarity measure used.
Q9
Label each of the following similarity measures as good or bad for finding similarity in
document-term data. Provide a one-line justification for each answer you provide.
a. Correlation
b. Cosine
c. Euclidean
Q10
For each of the following questions, give a True / False answer and a one sentence justification.
a. There are cases in which Euclidean distance may not be symmetric, i.e., d(x,y) ≠ d(y,x).
b. Quantitative variables are always continuous.
c. If the correlation of two attributes is -0.95, then they are completely unrelated.
d. Let a1, a2, and a3 be three attributes. If correlation of (a1, a2) = 0.5 and correlation of
(a2, a3) = 0.5, then correlation of (a1, a3) = 0.5.
e. The Hamming distance between two binary strings (i.e., strings of 0s and 1s) will never exceed
the Euclidean distance.
Q11
There is a group of n female students, and another group of n male students. Two n-dimensional
vectors A and B record the heights of the two groups of students respectively. Consider the variable
transformation that is defined by
A = (A - mean(A)) / std(A)    (1)
B = (B - mean(B)) / std(B)    (2)
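The transformation in equations (1) and (2) is a standardization of each vector. A minimal Python sketch is given below; it assumes std denotes the sample standard deviation (dividing by n - 1), which the handout does not specify, and the function name and height values are made up for illustration.

```python
import statistics

def standardize(values):
    """Return (v - mean) / std for each v, using the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1); an assumption of this sketch
    return [(v - mean) / std for v in values]

# Illustrative heights in centimeters (not from the homework).
heights = [160.0, 165.0, 170.0]
z = standardize(heights)
print(z)  # [-1.0, 0.0, 1.0]
```

Using statistics.pstdev instead would apply the population standard deviation; either choice rescales the vector but does not change the ordering of its entries.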