IDS 2nd Unit Notes

notes

Uploaded by

Naineni Shiny
Copyright: © All Rights Reserved

Sub Code / Sub Name: – Introduction to Data Science

Unit: 02 Lecture No.: Date: 31-10-2022


Topic Name: Data Types and Statistical Description.
Name of the Faculty: Mahesh Hundekar

We need to get the data ready. This involves having a closer look at attributes and data values. Real-
world data are typically noisy, enormous in volume (often several gigabytes or more), and may originate
from a hodgepodge of heterogeneous sources. Knowledge about your data is useful for data
preprocessing, the first major task of the data mining process.

You will want to know the following:

• What are the types of attributes or fields that make up your data?
• What kind of values does each attribute have?
• Which attributes are discrete, and which are continuous-valued?
• What do the data look like?
• How are the values distributed?
• Are there ways we can visualize the data to get a better sense of it all?
• Can we spot any outliers?
• Can we measure the similarity of some data objects with respect to others?

Various attribute types: nominal attributes, binary attributes, ordinal attributes, and numeric attributes.

Basic statistical descriptions can be used to learn more about each attribute’s values. Given a
temperature attribute, for example, we can determine its mean (average value), median (middle value),
and mode (most common value). These are measures of central tendency, which give us an idea of the
“middle” or center of a distribution.

Data Objects and Attribute Types:

Data sets are made up of data objects. A data object represents an entity—in a sales database, the
objects may be customers, store items, and sales; in a medical database, the objects may be patients; in
a university database, the objects may be students, professors, and courses. Data objects are typically
described by attributes. Data objects can also be referred to as samples, examples, instances, data
points, or objects. If the data objects are stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns correspond to the attributes. In this section,
we define attributes and look at the various attribute types.

What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute,
dimension, feature, and variable are often used interchangeably.

The term dimension is commonly used in data warehousing. Machine learning literature tends to use
the term feature, while statisticians prefer the term variable. Data mining and database professionals
commonly use the term attribute.

Attributes describing a customer object can include, for example, customer ID, name, and address.
Observed values for a given attribute are known as observations. A set of attributes used to describe a
given object is called an attribute vector (or feature vector). The distribution of data involving one
attribute (or variable) is called univariate. A bivariate distribution involves two attributes, and so on.

The type of an attribute is determined by the set of possible values—nominal, binary, ordinal, or
numeric—the attribute can have.

Nominal Attributes:

Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things.
Each value represents some kind of category, code, or state, and so nominal attributes are also referred
to as categorical. The values do not have any meaningful order. In computer science, the values are also
known as enumerations.

Example, Nominal attributes: Suppose that hair color and marital status are two attributes describing
person objects. In our application, possible values for hair color are black, brown, blond, red, auburn,
gray, and white. The attribute marital status can take on the values single, married, divorced, and
widowed. Both hair color and marital status are nominal attributes. Another example of a nominal
attribute is occupation, with the values teacher, dentist, programmer, farmer, and so on. Although we
said that the values of a nominal attribute are symbols or “names of things,” it is possible to represent
such symbols or “names” with numbers. With hair color, for instance, we can assign a code of 0 for
black, 1 for brown, and so on. Another example is customer ID, with possible values that are all numeric.
However, in such cases, the numbers are not intended to be used quantitatively. That is, mathematical
operations on values of nominal attributes are not meaningful. It makes no sense to subtract one
customer ID number from another, unlike, say, subtracting an age value from another (where age is a
numeric attribute). Even though a nominal attribute may have integers as values, it is not considered a
numeric attribute because the integers are not meant to be used quantitatively.

Binary Attributes:

A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically
means that the attribute is absent, and 1 means that it is present. Binary attributes are referred to as
Boolean if the two states correspond to true and false.

Example, Binary attributes: Given the attribute smoker describing a patient object, 1 indicates that the
patient smokes, while 0 indicates that the patient does not. Similarly, suppose the patient undergoes a
medical test that has two possible outcomes. The attribute medical test is binary, where a value of 1
means the result of the test for the patient is positive, while 0 means the result is negative. A binary
attribute is symmetric if both of its states are equally valuable and carry the same weight; that is, there
is no preference on which outcome should be coded as 0 or 1. One such example could be the attribute
gender having the states male and female. A binary attribute is asymmetric if the outcomes of the states
are not equally important, such as the positive and negative outcomes of a medical test for HIV. By
convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV
positive) and the other by 0 (e.g., HIV negative).

Ordinal Attributes:

An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among
them, but the magnitude between successive values is not known.

Example, Ordinal attributes: Ordinal attributes are useful for registering subjective assessments of
qualities that cannot be measured objectively; thus ordinal attributes are often used in surveys for
ratings. In one survey, participants were asked to rate how satisfied they were as customers. Customer
satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2:
neutral, 3: satisfied, and 4: very satisfied.

Note that nominal, binary, and ordinal attributes are qualitative. That is, they describe a feature of an
object without giving an actual size or quantity. The values of such qualitative attributes are typically
words representing categories. If integers are used, they represent computer codes for the categories,
as opposed to measurable quantities (e.g., 0 for small drink size, 1 for medium, and 2 for large).
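
The distinction between nominal, binary, and ordinal attributes can be sketched in code. The following is
only an illustration, assuming Python's standard enum module; the class and function names are
hypothetical and the category lists are shortened from the examples above.

```python
from enum import Enum, IntEnum


class HairColor(Enum):
    """Nominal attribute: the integer codes are labels only (no order, no arithmetic)."""
    BLACK = 0
    BROWN = 1
    BLOND = 2


class Satisfaction(IntEnum):
    """Ordinal attribute: codes carry a ranking, but the gaps between ranks are unknown."""
    VERY_DISSATISFIED = 0
    SOMEWHAT_DISSATISFIED = 1
    NEUTRAL = 2
    SATISFIED = 3
    VERY_SATISFIED = 4


def same_category(a: HairColor, b: HairColor) -> bool:
    # Equality is the only comparison that makes sense for nominal values.
    return a is b


def at_least(rating: Satisfaction, threshold: Satisfaction) -> bool:
    # Order comparisons are meaningful for ordinal values.
    return rating >= threshold


# Binary attribute: True (1) means present, False (0) means absent.
smoker = True

print(same_category(HairColor.BLACK, HairColor.BROWN))         # False
print(at_least(Satisfaction.SATISFIED, Satisfaction.NEUTRAL))  # True
```

Python will still let us subtract two Satisfaction codes because they are integers underneath; the result
simply has no quantitative meaning, which is the point made in the text.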

Numeric Attributes:

A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real
values. Numeric attributes can be interval-scaled or ratio-scaled.

Interval-Scaled Attributes: Interval-scaled attributes are measured on a scale of equal-size units. The
values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to
providing a ranking of values, such attributes allow us to compare and quantify the difference between
values.

Example, Interval-scaled attributes: A temperature attribute is interval-scaled. Suppose that we have
the outdoor temperature value for a number of different days, where each day is an object. By ordering
the values, we obtain a ranking of the objects with respect to temperature. In addition, we can quantify
the difference between values. For example, a temperature of 20 °C is five degrees higher than a
temperature of 15 °C. Calendar dates are another example. For instance, the years 2002 and 2010 are
eight years apart.

Ratio-Scaled Attributes: A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That
is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another
value. In addition, the values are ordered, and we can also compute the difference between values, as
well as the mean, median, and mode.

Example, Ratio-scaled attributes: Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K)
temperature scale has what is considered a true zero-point (0 K = −273.15 °C): It is the point at which the
particles that comprise matter have zero kinetic energy. Other examples of ratio-scaled attributes
include count attributes such as years of experience (e.g., the objects are employees) and number of
words (e.g., the objects are documents). Additional examples include attributes to measure weight,
height, and latitude and longitude coordinates (e.g., when clustering houses).
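
To see why ratios are meaningful only on a ratio scale, compare two temperatures in Celsius and in
Kelvin. This is a small sketch with made-up readings of 10 °C and 20 °C; the helper name is hypothetical.

```python
def celsius_to_kelvin(celsius: float) -> float:
    # 0 K corresponds to -273.15 degrees Celsius.
    return celsius + 273.15


t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = t2_c - t1_c

# This ratio is NOT meaningful: 0 degrees Celsius is an arbitrary zero-point.
naive_ratio = t2_c / t1_c

# On the Kelvin scale the ratio is meaningful because zero is a true zero.
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(difference, naive_ratio, round(kelvin_ratio, 4))
```

The Celsius values suggest that the second day is "twice as warm", but on the Kelvin scale it is only about
3.5% warmer.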

Discrete versus Continuous Attributes: We have organized attributes into nominal, binary, ordinal, and
numeric types. There are many ways to organize attribute types. The types are not mutually exclusive.

Classification algorithms developed from the field of machine learning often talk of attributes as being
either discrete or continuous. Each type may be processed differently.

A discrete attribute has a finite or countably infinite set of values, which may or may not be represented
as integers. The attributes hair color, smoker, medical test, and drink size each have a finite number of
values, and so are discrete. Note that discrete attributes may have numeric values, such as 0 and 1 for
binary attributes or the values 0 to 110 for the attribute age. An attribute is countably infinite if the set
of possible values is infinite but the values can be put in a one-to-one correspondence with natural
numbers. For example, the attribute customer ID is countably infinite. The number of customers can
grow to infinity, but in reality, the actual set of values is countable (where the values can be put in one-
to-one correspondence with the set of integers). Zip codes are another example.

If an attribute is not discrete, it is continuous. Continuous attributes are typically represented as real
(floating-point) values.

Basic Statistical Descriptions of Data

Basic statistical descriptions can be used to identify properties of the data and highlight which data
values should be treated as noise or outliers. We start with measures of central tendency, which
measure the location of the middle or center of a data distribution. Intuitively speaking, given an
attribute, where do most of its values fall? In particular, we discuss the mean, median, mode, and
midrange. In addition to assessing the central tendency of our data set, we also would like to have an
idea of the dispersion of the data. That is, how are the data spread out? The most common data
dispersion measures are the range, quartiles, and interquartile range; the five-number summary and
boxplots; and the variance and standard deviation of the data.

Measuring the Central Tendency: Mean, Median, and Mode

In this section, we look at various ways to measure the central tendency of data. Suppose that we have
some attribute X, like salary, which has been recorded for a set of objects. Let x1,x2,...,xN be the set of N
observed values or observations for X. Here, these values may also be referred to as the data set (for X).
If we were to plot the observations for salary, where would most of the values fall? This gives us an idea
of the central tendency of the data. Measures of central tendency include the mean, median, mode, and
midrange.

The most common and effective numeric measure of the “center” of a set of data is the (arithmetic)
mean. Let x1,x2,...,xN be a set of N values or observations, such as for some numeric attribute X, like
salary. The mean of this set of values is

\[ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}. \qquad (2.1) \]

Example, Mean. Suppose we have the following values for salary (in thousands of dollars), shown in
increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Using Eq. (2.1), we have

\[ \bar{x} = \frac{30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110}{12} = \frac{696}{12} = 58. \]

Thus, the mean salary is $58,000.

Sometimes, each value xi in a set may be associated with a weight wi for i = 1,...,N. The weights reflect
the significance, importance, or occurrence frequency attached to their respective values. In this case,
we can compute

\[ \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}. \]

This is called the weighted arithmetic mean or the weighted average.
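
The mean and the weighted mean can be sketched as follows, using the salary values from the example
above (in thousands of dollars). The function names are illustrative, and using occurrence counts as
weights is just one possible choice of weights.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def mean(values):
    # Arithmetic mean: sum of the values divided by their count.
    return sum(values) / len(values)


def weighted_mean(values, weights):
    # Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Distinct salary values, weighted by how often each one occurs.
distinct = [30, 36, 47, 50, 52, 56, 60, 63, 70, 110]
counts = [1, 1, 1, 1, 2, 1, 1, 1, 2, 1]

print(mean(salaries))                   # 58.0
print(weighted_mean(distinct, counts))  # 58.0
```

With occurrence frequencies as weights, the weighted mean of the distinct values equals the plain mean
of the full list.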


Although the mean is the single most useful quantity for describing a data set, it is not always the best
way of measuring the center of the data. A major problem with the mean is its sensitivity to extreme
(e.g., outlier) values. Even a small number of extreme values can corrupt the mean. For example, the
mean salary at a company may be substantially pushed up by that of a few highly paid managers.
Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very low scores.
To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean,
which is the mean obtained after chopping off values at the high and low extremes. For example, we can
sort the values observed for salary and remove the top and bottom 2% before computing the mean. We
should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of
valuable information.
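
A trimmed mean can be sketched as below. The text does not say how to round the number of values
dropped at each end, so this sketch rounds down (an assumption). The function name is hypothetical.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values at each end of the sorted list."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of values removed from EACH end (rounded down; an assumption).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(trimmed_mean(salaries, 0.0))  # no trimming: same as the ordinary mean
print(trimmed_mean(salaries, 0.1))  # drops 30 and 110 before averaging
```

With only 12 values, trimming 2% at each end removes nothing, so a 10% trim is used here to make the
effect of dropping the extreme value 110 visible.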

For skewed (asymmetric) data, a better measure of the center of data is the median, which is the middle
value in a set of ordered data values. It is the value that separates the higher half of a data set from the
lower half. In probability and statistics, the median generally applies to numeric data; however, we may
extend the concept to ordinal data. Suppose that a given data set of N values for an attribute X is sorted
in increasing order. If N is odd, then the median is the middle value of the ordered set. If N is even, then
the median is not unique; it is the two middlemost values and any value in between. If X is a numeric
attribute in this case, by convention, the median is taken as the average of the two middlemost values.

Example, Median. Let’s find the median of the salary data from the mean example above. The data are
already sorted in increasing order. There is an even number of observations (i.e., 12); therefore, the
median is not unique. It can be any value within the two middlemost values of 52 and 56 (that is, within
the sixth and seventh values in the list). By convention, we assign the average of the two middlemost
values as the median; that is, (52 + 56)/2 = 108/2 = 54. Thus, the median is $54,000. Suppose that we had
only the first 11 values in the list. Given an odd number of values, the median is the middlemost value.
This is the sixth value in this list, which has a value of $52,000.
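
The odd and even cases can be written out directly. This is a minimal sketch that follows the convention
described above (average of the two middlemost values when N is even); the function name is illustrative.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the single middle value.
        return ordered[mid]
    # Even N: by convention, the average of the two middlemost values.
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(median(salaries))       # 54.0
print(median(salaries[:11]))  # 52
```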

The median is expensive to compute when we have a large number of observations. For numeric
attributes, however, we can easily approximate the value. Assume that data are grouped in intervals
according to their xi data values and that the frequency (i.e., number of data values) of each interval is
known. For example, employees may be grouped according to their annual salary in intervals such as
$10,000–20,000, $20,000–30,000, and so on. Let the interval that contains the median frequency be the
median interval. We can approximate the median of the entire data set (e.g., the median salary) by
interpolation using the formula

\[ \text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \text{width}, \]

where L1 is the lower boundary of the median interval, N is the number of values in the entire
data set, (Σ freq)l is the sum of the frequencies of all of the intervals that are lower than the
median interval, freqmedian is the frequency of the median interval, and width is the width of the median
interval.
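
The interpolation can be sketched as follows. The grouped salary counts are made up purely for
illustration, and the function name and the (lower boundary, width, frequency) tuple layout are
assumptions of this sketch.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data.

    `intervals` is a list of (lower_boundary, width, frequency) tuples,
    sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # sum of frequencies of the intervals below the current one
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median interval not found")


# Hypothetical grouped salaries (in thousands of dollars).
grouped = [
    (10, 10, 200),  # 10-20: 200 employees
    (20, 10, 450),  # 20-30: 450 employees
    (30, 10, 300),  # 30-40: 300 employees
    (40, 10, 50),   # 40-50: 50 employees
]

print(round(approximate_median(grouped), 2))
```

Here N = 1000, the median interval is 20–30 with 200 values below it, and the estimate is
20 + ((500 − 200)/450) × 10, or about 26.67.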

The mode is another measure of central tendency. The mode for a set of data is the value that occurs
most frequently in the set. Therefore, it can be determined for qualitative and quantitative attributes. It
is possible for the greatest frequency to correspond to several different values, which results in more
than one mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal. In general, a data set with two or more modes is multimodal. At the other extreme, if each
data value occurs only once, then there is no mode.

Example, Mode. The data from the example above are bimodal. The two modes are $52,000 and $70,000.

For unimodal numeric data that are moderately skewed (asymmetrical), we have the following empirical
relation:

mean − mode ≈ 3 × (mean − median).

This implies that the mode for unimodal frequency curves that are moderately skewed can easily be
approximated if the mean and median values are known.

The midrange can also be used to assess the central tendency of a numeric data set. It is the average of
the largest and smallest values in the set. This measure is easy to compute using the SQL aggregate
functions, max() and min().

Example, Midrange. The midrange of the data from the example above is (30,000 + 110,000)/2 = $70,000.
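
Both the mode and the midrange are short to compute. The sketch below uses the salary values from
the examples above (in thousands of dollars) and returns an empty list when every value occurs only
once, matching the "no mode" case in the text. The function names are illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the greatest frequency."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once: there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


def midrange(values):
    # Average of the smallest and largest values.
    return (min(values) + max(values)) / 2


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(modes(salaries))     # [52, 70]
print(midrange(salaries))  # 70.0
```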

In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode
are all at the same center value, as shown in Figure 2.1(a).

Data in most real applications are not symmetric. They may instead be either positively skewed, where
the mode occurs at a value that is smaller than the median (Figure 2.1b), or negatively skewed, where
the mode occurs at a value greater than the median (Figure 2.1c).

Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range

We now look at measures to assess the dispersion or spread of numeric data. The measures include
range, quantiles, quartiles, percentiles, and the interquartile range. The five-number summary, which
can be displayed as a boxplot, is useful in identifying outliers. Variance and standard deviation also
indicate the spread of a data distribution.

Range, Quartiles, and Interquartile Range

To start off, let’s study the range, quantiles, quartiles, percentiles, and the interquartile range as
measures of data dispersion. Let x1,x2,...,xN be a set of observations for some numeric attribute, X. The
range of the set is the difference between the largest (max()) and smallest (min()) values. Suppose that
the data for attribute X are sorted in increasing numeric order. Imagine that we can pick certain data
points so as to split the data distribution into equal-size consecutive sets, as in Figure 2.2. These data
points are called quantiles. Quantiles are points taken at regular intervals of a data distribution, dividing
it into essentially equal-size consecutive sets. (We say “essentially” because there may not be data values
of X that divide the data into exactly equal-sized subsets. For readability, we will refer to them as equal.)
The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are
less than x and at most (q − k)/q of the data values are more than x, where k is an integer such that 0 < k
< q. There are q − 1 q-quantiles.

The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It
corresponds to the median. The 4-quantiles are the three data points that split the data distribution into
four equal parts; each part represents one-fourth of the data distribution. They are more commonly
referred to as quartiles. The 100-quantiles are more commonly referred to as percentiles; they divide
the data distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are
the most widely used forms of quantiles.

The quartiles give an indication of a distribution’s center, spread, and shape. The first quartile, denoted
by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is
the 75th percentile—it cuts off the lowest 75% (or highest 25%) of the data. The second quartile is the
50th percentile. As the median, it gives the center of the data distribution. The distance between the
first and third quartiles is a simple measure of spread that gives the range covered by the middle half of
the data. This distance is called the interquartile range (IQR) and is defined as IQR = Q3 − Q1.
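
Quartiles and the IQR can be computed with Python's standard statistics module (Python 3.8 or later).
Note that several conventions exist for placing quantiles between data points. The sketch below relies on
the default "exclusive" interpolation of statistics.quantiles, which is an assumption of this example
and not something the notes prescribe; other conventions can give slightly different quartiles.

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 asks for the three cut points that split the data into four parts.
# The default method ("exclusive") interpolates between neighbouring values.
q1, q2, q3 = statistics.quantiles(salaries, n=4)

iqr = q3 - q1
data_range = max(salaries) - min(salaries)

print(q1, q2, q3)  # 47.75 54.0 68.25
print(iqr)         # 20.5
print(data_range)  # 80
```

The second quartile agrees with the median found earlier ($54,000).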

Five-Number Summary, Boxplots, and Outliers

The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and
the smallest and largest individual observations, written in the order of Minimum, Q1, Median, Q3,
Maximum. Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-
number summary as follows: Typically, the ends of the box are at the quartiles so that the box length is
the interquartile range. The median is marked by a line within the box. Two lines (called whiskers)
outside the box extend to the smallest (Minimum) and largest (Maximum) observations.

Example, Boxplot. Figure 2.3 shows boxplots for unit price data for items sold at four branches of
AllElectronics during a given time period. For branch 1, we see that the median price of items sold is $80,
Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted individually,
as their values of 175 and 202 lie more than 1.5 × IQR above Q3 (the IQR here is 40, so the cutoff is
100 + 1.5 × 40 = 160). Boxplots can be computed in O(n log n) time. Approximate boxplots can be computed
in linear or sublinear time depending on the quality guarantee required.
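
The five-number summary and the 1.5 × IQR check can be sketched as follows. The rule used here
(flag values more than 1.5 × IQR below Q1 or above Q3) is a common rule of thumb and is stated as an
assumption; the function names are hypothetical, and the quartiles again use the default interpolation of
statistics.quantiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, q2, q3, ordered[-1]


def outlier_fences(q1, q3, factor=1.5):
    """Cutoffs beyond which a value is treated as a suspected outlier."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


# Branch 1 from the boxplot example: Q1 = 60, Q3 = 100, so IQR = 40.
low, high = outlier_fences(60, 100)
suspected = [v for v in (175, 202) if v < low or v > high]

print(low, high)  # 0.0 160.0
print(suspected)  # [175, 202]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(salaries))
```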

Variance and Standard Deviation

Variance and standard deviation are measures of data dispersion. They indicate how spread out a data
distribution is. A low standard deviation means that the data observations tend to be very close to the
mean, while a high standard deviation indicates that the data are spread out over a large range of
values.

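The variance of N observations, x1,x2,...,xN, for a numeric attribute X is

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right) - \bar{x}^2, \]

where x̄ is the mean value of the observations. The standard deviation, σ, of the observations is the
square root of the variance, σ².

Example, Variance and standard deviation. In the earlier mean example, we found x̄ = $58,000 using
Eq. (2.1) for the mean. To determine the variance and standard deviation of the data from that example
(in thousands of dollars), we set N = 12 and use the variance equation above to obtain

\[ \sigma^2 = \frac{1}{12} \left( 30^2 + 36^2 + 47^2 + \cdots + 110^2 \right) - 58^2 \approx 379.17, \]

\[ \sigma \approx \sqrt{379.17} \approx 19.47. \]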
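
The variance and standard deviation of the salary data can be checked numerically. This sketch computes
the variance with both forms of the formula (the definition and the "mean of squares minus squared
mean" shortcut) and divides by N, as in the text. The function names are illustrative.

```python
import math

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def variance_by_definition(values):
    # (1/N) * sum((x_i - mean)^2)
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_by_shortcut(values):
    # ((1/N) * sum(x_i^2)) - mean^2
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2


var = variance_by_definition(salaries)
std = math.sqrt(var)

print(round(var, 2))  # 379.17
print(round(std, 2))  # 19.47
print(abs(var - variance_by_shortcut(salaries)) < 1e-9)  # True
```

The shortcut form needs only running totals of the values and their squares, so it can be computed in a
single pass over the data.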

The basic properties of the standard deviation, σ, as a measure of spread are as follows:

• σ measures spread about the mean and should be considered only when the mean is chosen as
the measure of center.
• σ = 0 only when there is no spread, that is, when all observations have the same value.
Otherwise, σ > 0.

Importantly, an observation is unlikely to be more than several standard deviations away from the
mean. Mathematically, using Chebyshev’s inequality, it can be shown that at least (1 − 1/k²) × 100% of
the observations are no more than k standard deviations from the mean. Therefore, the standard
deviation is a good indicator of the spread of a data set.
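
The Chebyshev bound can be compared with what actually happens in a data set. The sketch below
uses the salary data from the earlier examples and the population standard deviation; the function
names are hypothetical.

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def chebyshev_lower_bound(k):
    """Guaranteed minimum fraction of observations within k standard deviations."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the bound is trivial (zero or negative), so clamp it at 0.
    return max(0.0, 1 - 1 / k ** 2)


def fraction_within(values, k):
    """Observed fraction of values within k standard deviations of the mean."""
    mean = sum(values) / len(values)
    std = statistics.pstdev(values)  # divides by N, as in the text
    inside = sum(1 for x in values if abs(x - mean) <= k * std)
    return inside / len(values)


print(chebyshev_lower_bound(2))      # 0.75
print(fraction_within(salaries, 2))  # 11 of the 12 salaries
```

For k = 2 the inequality guarantees at least 75%; in the salary data, only the value 110 lies more than
two standard deviations from the mean, so the observed fraction is well above the bound.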

The computation of the variance and standard deviation is scalable in large databases.

Graphic Displays of Basic Statistical Descriptions of Data

These include quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are
helpful for the visual inspection of data, which is useful for data preprocessing. The first three of these
show univariate distributions (i.e., data for one attribute), while scatter plots show bivariate
distributions (i.e., involving two attributes).

Quantile Plot
A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it
displays all of the data for the given attribute (allowing the user to assess both the overall behavior and
unusual occurrences). Second, it plots quantile information (see the discussion of quantiles above). Let
xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation and xN is
the largest for some ordinal or numeric attribute X. Each observation, xi, is paired with a fraction, fi,
which indicates that approximately fi × 100% of the data are below the value, xi. We say “approximately”
because there may not be a value with exactly a fraction, fi, of the data below xi. Note that the 0.25
quantile corresponds to quartile Q1, the 0.50 quantile is the median, and the 0.75 quantile is Q3.

Let

\[ f_i = \frac{i - 0.5}{N}. \]

On a quantile plot, xi is graphed against fi.
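
The points of a quantile plot can be generated as below, pairing each sorted observation xi with
fi = (i − 0.5)/N. Only the coordinates are computed; drawing them is left to whatever plotting tool is at
hand. The function name is hypothetical.

```python
def quantile_plot_points(values):
    """Return a list of (f_i, x_i) pairs with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
points = quantile_plot_points(salaries)

for f, x in points:
    print(f"{f:.3f}  {x}")
```

The fractions increase in equal steps of 1/N and never reach exactly 0 or 1.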

Quantile–Quantile Plot

A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the
corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view
whether there is a shift in going from one distribution to another. Suppose that we have two sets of
observations for the attribute or variable unit price, taken from two different branch locations. Let
x1,...,xN be the data from the first branch, and y1,..., yM be the data from the second, where each data
set is sorted in increasing order. If M = N (i.e., the number of points in each set is the same), then we
simply plot yi against xi , where yi and xi are both (i − 0.5)/N quantiles of their respective data sets. If M
< N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q
plot. Here, yi is the (i − 0.5)/M quantile of the y data, which is plotted against the (i − 0.5)/M
quantile of the x data.

Histograms

Histograms (or frequency histograms) are at least a century old and are widely used. “Histos” means
pole or mast, and “gram” means chart, so a histogram is a chart of poles. Plotting histograms is a
graphical method for summarizing the distribution of a given attribute, X. If X is nominal, such as
automobile model or item type, then a pole or vertical bar is drawn for each known value of X. The
height of the bar indicates the frequency (i.e., count) of that X value. The resulting graph is more
commonly known as a bar chart.

If X is numeric, the term histogram is preferred. The range of values for X is partitioned into disjoint
consecutive subranges. The subranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X. The range of a bucket is known as the width. Typically, the buckets are of equal width.
For example, a price attribute with a value range of $1 to $200 (rounded up to the nearest dollar) can be
partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn with
a height that represents the total count of items observed within the subrange. Histograms and
partitioning rules are further discussed in Chapter 3 on data reduction.
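
Equal-width bucketing can be sketched as follows, using the $1 to $200 price range and a bucket width
of 20 from the example above. The price list is made up for illustration, whole-dollar values are
assumed, and the function name is hypothetical.

```python
def equal_width_histogram(values, low, high, width):
    """Count integer values in consecutive buckets of equal width.

    Buckets are [low, low + width - 1], [low + width, low + 2*width - 1], ...
    Returns a list of ((start, end), count) pairs.
    """
    bucket_count = (high - low) // width + 1
    counts = [0] * bucket_count
    for v in values:
        if not low <= v <= high:
            raise ValueError(f"value {v} is outside the range {low}..{high}")
        counts[(v - low) // width] += 1
    return [
        ((low + i * width, min(low + (i + 1) * width - 1, high)), c)
        for i, c in enumerate(counts)
    ]


# Hypothetical unit prices in whole dollars.
prices = [5, 18, 20, 21, 40, 41, 150, 200]

for (start, end), count in equal_width_histogram(prices, 1, 200, 20):
    print(f"{start:>3}-{end:<3} {'#' * count}")
```

Each printed row corresponds to one bar of the histogram, with the bar height given by the count.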

Scatter Plots and Data Correlation

A scatter plot is one of the most effective graphical methods for determining if there appears to be a
relationship, pattern, or trend between two numeric attributes. To construct a scatter plot, each pair of
values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.

The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points
and outliers, or to explore the possibility of correlation relationships. Two attributes, X and Y, are
correlated if one attribute implies the other. Correlations can be positive, negative, or null
(uncorrelated). Figure 2.8 shows examples of positive and negative correlations between two attributes.

If the pattern of plotted points slopes from lower left to upper right, this means that the values of X
increase as the values of Y increase, suggesting a positive correlation (Figure 2.8a). If the pattern of
plotted points slopes from upper left to lower right, the values of X increase as the values of Y decrease,
suggesting a negative correlation (Figure 2.8b). A line of best fit can be drawn to study the correlation
between the variables. Figure 2.9 shows three cases for which there is no correlation relationship
between the two attributes in each of the given data sets.
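
The direction suggested by a scatter plot can also be summarized with a number. The notes do not
name a specific measure; the sketch below uses Pearson's correlation coefficient, a standard choice that
is brought in here as an assumption. The data points are made up, and the function name is illustrative.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson's r: +1 for a perfect rising line, -1 for a perfect falling line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (spread_x * spread_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # slopes from lower left to upper right
falling = [10, 8, 6, 4, 2]  # slopes from upper left to lower right

print(round(pearson_correlation(x, rising), 3))   # 1.0
print(round(pearson_correlation(x, falling), 3))  # -1.0
```

A value near zero would correspond to the uncorrelated case, although Pearson's r only captures linear
relationships.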

In conclusion, basic data descriptions (e.g., measures of central tendency and measures of dispersion)
and graphic statistical displays (e.g., quantile plots, histograms, and scatter plots) provide valuable
insight into the overall behavior of your data. By helping to identify noise and outliers, they are
especially useful for data cleaning.
