Data Analysis
EXPLORATORY ANALYSIS
Exploratory data analysis:
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their
main characteristics, often with visual methods. A statistical model can be used or not, but primarily
EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Exploratory data analysis is a concept developed by John Tukey (1977) that consists of a new
perspective on statistics. Tukey's idea was that in traditional statistics the data was not being explored
graphically; it was just being used to test hypotheses. The first attempt to develop a tool was made at
Stanford; the project was called PRIM-9. The tool was able to visualize data in nine dimensions and
therefore provide a multivariate perspective of the data.
Today, exploratory data analysis is a must and has been included in the big data analytics life
cycle. The ability to find insight and communicate it effectively in an organization is fueled
by strong EDA capabilities.
Based on Tukey's ideas, Bell Labs developed the S programming language in order to provide an
interactive interface for doing statistics. The idea of S was to provide extensive graphical capabilities
with an easy-to-use language. In today's world, in the context of Big Data, R, which is based on
the S programming language, is the most popular software for analytics.
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data,
and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is
different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions
required for model fitting and hypothesis testing, handling missing values, and making
transformations of variables as needed. EDA encompasses IDA.
Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing
(confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses
to test. In particular, he held that confusing the two types of analyses and employing them on the same
set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by
the data.
Many EDA techniques have been adopted into data mining, as well as into big data analytics. They are
also being taught to young students as a way to introduce them to statistical thinking.
DESCRIPTIVE ANALYSIS:
Descriptive statistics are used to describe the basic features of the data in a study. They provide simple
summaries about the sample and the measures. Together with simple graphics analysis, they form the
basis of virtually every quantitative analysis of data.
Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics you
are simply describing what is or what the data shows. With inferential statistics, you are trying to reach
conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try
to infer from the sample data what the population might think. Or, we use inferential statistics to make
judgments of the probability that an observed difference between groups is a dependable one or one that
might have happened by chance in this study. Thus, we use inferential statistics to make inferences from
our data to more general conditions; we use descriptive statistics simply to describe what's going on in
our data.
Descriptive statistics are used to present quantitative descriptions in a manageable form. In a research
study we may have lots of measures, or we may measure a large number of people on any measure.
Descriptive statistics help us to simplify large amounts of data in a sensible way. Each descriptive
statistic reduces lots of data into a simpler summary. For instance, consider a simple number used to
summarize how well a batter is performing in baseball, the batting average. This single number is simply
the number of hits divided by the number of times at bat (reported to three significant digits). A batter
who is hitting .333 is getting a hit one time in every three at bats. One batting .250 is hitting one time in
four. The single number describes a large number of discrete events. Or, consider the scourge of many
students, the Grade Point Average (GPA). This single number describes the general performance of a
student across a potentially wide range of course experiences.
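Both summaries boil down to simple arithmetic. The short sketch below shows the two computations; the hit counts, grades, and credit hours are made-up numbers for illustration, not figures from the text.

# Minimal sketch (Python): the two summary statistics described above.
# The hit counts and the grade/credit-hour pairs are hypothetical examples.
hits, at_bats = 50, 150
batting_average = round(hits / at_bats, 3)    # reported to three significant digits
print(batting_average)                        # 0.333 -> a hit one time in every three at bats

# GPA as a credit-weighted average of grade points (assuming a 4.0 scale)
grades = [(4.0, 3), (3.0, 4), (2.0, 3)]       # (grade points, credit hours)
gpa = sum(g * c for g, c in grades) / sum(c for _, c in grades)
print(round(gpa, 2))                          # 3.0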
Every time you try to describe a large set of observations with a single indicator you run the risk of
distorting the original data or losing important detail. The batting average doesn't tell you whether the
batter is hitting home runs or singles. It doesn't tell whether she's been in a slump or on a streak. The
GPA doesn't tell you whether the student was in difficult courses or easy ones, or whether they were
courses in their major field or in other disciplines. Even given these limitations, descriptive statistics
provide a powerful summary that may enable comparisons across people or other units.
Univariate Analysis:
Univariate analysis involves the examination across cases of one variable at a time. There are three
major characteristics of a single variable that we tend to look at:
the distribution
the central tendency
the dispersion
In most situations, we would describe all three of these characteristics for each of the variables in our
study.
The Distribution: The distribution is a summary of the frequency of individual values or ranges of
values for a variable. The simplest distribution would list every value of a variable and the number of
persons who had each value. For instance, a typical way to describe the distribution of college students
is by year in college, listing the number or percent of students at each of the four years. Or, we describe
gender by listing the number or percent of males and females. In these cases the variable has few
enough values that we can list each one and summarize how many sample cases had the value. But what
do we do for a variable like income or GPA? With these variables there can be a large number of
possible values, with relatively few people having each one. In this case, we group the raw scores into
categories according to ranges of values. For instance, we might look at GPA according to the letter
grade ranges. Or, we might group income into four or five ranges of income values.
Frequency distribution table.
One of the most common ways to describe a single variable is with a frequency distribution. Depending
on the particular variable, all of the data values may be represented, or you may group the values into
categories first (e.g., with age, price, or temperature variables, it would usually not be sensible to
determine the frequencies for each value; rather, the values are grouped into ranges and the frequencies
determined). Frequency distributions can be depicted in two ways, as a table or as a graph. Table 1
shows an age frequency distribution with five categories of age ranges defined. The same frequency
distribution can be depicted in a graph as shown in Figure 1. This type of graph is often referred to as a
histogram or bar chart.
Frequency distribution bar chart.
Distributions may also be displayed using percentages. For example, you could use percentages to
describe the:
percentage of people in different income levels
percentage of people in different age ranges
percentage of people in different ranges of standardized test scores
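As a rough illustration of building such a table programmatically, the sketch below groups hypothetical ages into the kind of ranges described above and reports both counts and percentages; the ages and the bin edges are assumptions, not the values behind Table 1.

# Minimal sketch (Python/pandas): a frequency distribution with grouped age ranges.
# The ages and bin edges are hypothetical, not the data of Table 1.
import pandas as pd

ages = pd.Series([22, 25, 31, 34, 35, 41, 42, 43, 55, 58, 62, 67])
bins = [20, 30, 40, 50, 60, 70]                       # five age-range categories
labels = ["20-29", "30-39", "40-49", "50-59", "60-69"]

groups = pd.cut(ages, bins=bins, labels=labels, right=False)
freq = groups.value_counts().sort_index()                       # counts per range
pct = groups.value_counts(normalize=True).sort_index() * 100    # percentages per range

print(pd.DataFrame({"frequency": freq, "percent": pct.round(1)}))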
Central Tendency: The central tendency of a distribution is an estimate of the "center" of a distribution
of values. There are three major types of estimates of central tendency:
Mean
Median
Mode
The Mean or average is probably the most commonly used method of describing central tendency. To
compute the mean, all you do is add up all the values and divide by the number of values. For example,
the mean or average quiz score is determined by summing all the scores and dividing by the number of
students taking the exam. For example, consider the test score values:
15, 20, 21, 20, 36, 15, 25, 15
The Median is the score found at the exact middle of the set of values. One way to compute the median
is to list all scores in numerical order, and then locate the score in the center of the sample. For example,
if there are 500 scores in the list, score #250 would be the median. If we order the 8 scores shown above,
we would get:
15, 15, 15, 20, 20, 21, 25, 36
There are 8 scores and score #4 and #5 represent the halfway point. Since both of these scores are 20,
the median is 20. If the two middle scores had different values, you would have to interpolate to
determine the median.
The Mode is the most frequently occurring value in the set of scores. To determine the mode, you might
again order the scores as shown above, and then count each one. The most frequently occurring value is
the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there
is more than one modal value. For instance, in a bimodal distribution there are two values that occur
most frequently.
Notice that for the same set of 8 scores we got three different values -- 20.875, 20, and 15 -- for the
mean, median and mode respectively. If the distribution is truly normal (i.e., bell-shaped), the mean,
median and mode are all equal to each other.
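The three estimates for the example scores can be checked in a few lines with Python's standard statistics module, shown below as a quick verification of the values quoted above.

# Quick check (Python): mean, median, and mode of the example test scores.
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(statistics.mean(scores))    # 20.875
print(statistics.median(scores))  # 20  (average of the two middle scores, both 20)
print(statistics.mode(scores))    # 15  (occurs three times)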
Dispersion: Dispersion refers to the spread of the values around the central tendency. There are two
common measures of dispersion, the range and the standard deviation. The range is simply the highest
value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the
range is 36 - 15 = 21.
The Standard Deviation is a more accurate and detailed estimate of dispersion because an outlier can
greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands
apart from the rest of the values). The Standard Deviation shows the relation that a set of scores has to
the mean of the sample. Again let's take the set of scores:
15, 20, 21, 20, 36, 15, 25, 15
To compute the standard deviation, we first find the distance between each value and the mean.
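A small sketch following that same recipe is shown below. It uses the sample standard deviation (dividing by n-1), which is one common convention and an assumption here, since the text does not specify the divisor.

# Minimal sketch (Python): range and (sample) standard deviation of the example scores.
# The n-1 divisor is an assumption; the text does not say which convention it uses.
import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]

value_range = max(scores) - min(scores)                  # 36 - 15 = 21
mean = sum(scores) / len(scores)                         # 20.875
squared_distances = [(x - mean) ** 2 for x in scores]    # distance from the mean, squared
sample_std = math.sqrt(sum(squared_distances) / (len(scores) - 1))

print(value_range)            # 21
print(round(sample_std, 2))   # about 7.08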
Comparative analysis:
Comparative analysis is also known as comparison analysis. Use comparison analysis to measure the
financial relationships between variables over two or more reporting periods. Businesses use
comparative analysis as a way to identify their competitive positions and operating results over a defined
period. Larger organizations may often have the resources to perform financial comparative analysis
monthly or quarterly, but it is recommended to perform an annual financial comparison analysis at a
minimum.
Financial Comparatives:
Financial statements outline the financial comparatives, which are the variables defining operating
activities, investing activities and financing activities for a company. Analysts assess company financial
statements using percentages, ratios and amounts when making a financial comparative analysis. This
information is the business intelligence decision makers use for determining future business decisions.
A financial comparison analysis may also be performed to determine company profitability and
stability. For example, management of a new venture may make a financial comparison analysis
periodically to evaluate company performance. Determining losses prematurely and redefining
processes in a shorter period is favorable compared to unforeseen annual losses.
Comparative Format:
The comparative format for comparative analysis in accounting is a side-by-side view of the financial
comparatives in the financial statements. Comparative analysis accounting identifies an organization's
financial performance. For example, income statements identify financial comparables such as company
income, expenses, and profit over a period of time. A comparison analysis report identifies where a
business meets or exceeds budgets. Potential lenders will also utilize this information to determine a
company's credit limit.
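A simple way to picture this side-by-side view is to put two reporting periods next to each other and compute the change between them. The sketch below does this for a few hypothetical income-statement lines; the line items and figures are invented purely to illustrate the format.

# Minimal sketch (Python/pandas): side-by-side financial comparatives for two periods.
# The line items and amounts are hypothetical, purely to illustrate the format.
import pandas as pd

statement = pd.DataFrame(
    {"2022": [120_000, 80_000, 40_000],
     "2023": [150_000, 95_000, 55_000]},
    index=["Income", "Expenses", "Profit"],
)

statement["change"] = statement["2023"] - statement["2022"]
statement["pct_change"] = (statement["change"] / statement["2022"] * 100).round(1)

print(statement)   # each row compares the two reporting periods side by side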
CLUSTERING:
Cluster Analysis
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped
in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember
While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes and helps
single out useful features that distinguish different groups.
Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer base, and they can
characterize their customer groups based on the purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house type,
value, and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to observe characteristics of each cluster.
The following points throw light on why clustering is required in data mining -
Scalability - We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes - Algorithms should be capable of being applied to
any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting
clusters of arbitrary shape. It should not be bounded to only distance measures that tend to find
spherical clusters of small sizes.
Ability to deal with noisy data - Databases contain noisy, missing or erroneous data. Some algorithms
are sensitive to such data and may lead to poor quality clusters.
Interpretability - The clustering results should be interpretable, comprehensible, and usable.
Clustering Methods:
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of
the data. Each partition will represent a cluster, and k < n. It means that it will classify the data into k
groups, which satisfy the following requirements -
• Each group contains at least one object.
• Each object must belong to exactly one group.
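A widely used partitioning algorithm is k-means. The sketch below runs scikit-learn's KMeans on a tiny made-up data set; the library choice, the data points, and k=2 are assumptions for illustration, not something the text prescribes.

# Minimal sketch (Python/scikit-learn): partitioning 'n' objects into k clusters with k-means.
# The data points and k=2 are hypothetical; any partitioning algorithm could be substituted.
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # each object belongs to exactly one group
print(kmeans.cluster_centers_)  # one center per cluster, so each group is non-empty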
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify
hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two
approaches here -
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a
separate group. It keeps on merging the objects or groups that are close to one another. It keeps on
doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the
same cluster. In each continuing iteration, a cluster is split up into smaller clusters. It is done until each
object is in one cluster or the termination condition holds. This method is rigid, i.e., once a merging or
splitting is done, it can never be undone.
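As an illustration of these hierarchical methods, the sketch below runs an agglomerative (bottom-up) clustering with SciPy on a few made-up points; the library, the 'ward' linkage criterion, and the cut into two clusters are assumptions chosen only to show the idea.

# Minimal sketch (Python/SciPy): agglomerative (bottom-up) hierarchical clustering.
# The points, the 'ward' linkage, and the cut into 2 clusters are hypothetical choices.
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

Z = linkage(X, method="ward")                     # repeatedly merges the closest groups
labels = fcluster(Z, t=2, criterion="maxclust")   # termination condition: at most 2 clusters

print(labels)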
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given cluster
as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a
given cluster, the radius of a given cluster has to contain at least a minimum number of points.
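DBSCAN is the best-known algorithm of this family. The sketch below applies scikit-learn's DBSCAN with a neighborhood radius (eps) and a minimum number of points per neighborhood; the data, library choice, and parameter values are hypothetical.

# Minimal sketch (Python/scikit-learn): density-based clustering with DBSCAN.
# eps (neighborhood radius) and min_samples (minimum points) are hypothetical settings.
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1.0, 2.0], [1.2, 1.9], [1.1, 2.1],
              [8.0, 8.0], [8.2, 8.1], [25.0, 80.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)

print(db.labels_)   # points in dense regions share a label; -1 marks noise/outliers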
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells that
form a grid structure.
The major advantage of this method is fast processing time; it is dependent only on the number of cells
in each dimension of the quantized space and not on the number of data objects.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user- or application-oriented
constraints. A constraint refers to the user expectation or the properties of the desired clustering results.
Constraints provide an interactive way of communicating with the clustering process. Constraints can
be specified by the user or by the application requirement.
Model-based Method
In this method, a model is hypothesized for each cluster to find the best fit of the data for a given model.
This method locates the clusters by clustering the density function. It reflects the spatial distribution of
the data points.
ASSOCIATION RULE MINING:
A rule such as "customers who buy item A also tend to buy item B" can be understood as a retail store's
association rule for targeting its customers better. If such a rule is the result of thorough analysis of some
data sets, it can be used not only to improve customer service but also to improve the company's revenue.
Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns.
Then, depending on the following two parameters, the important relationships are observed:
Support: Support indicates how frequently the if/then relationship appears in the database.
Confidence: Confidence tells about the number of times these relationships have been found to
be true.
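To make the two parameters concrete, the sketch below computes support and confidence for a single rule over a handful of hypothetical shopping baskets; the items and transactions are invented for illustration.

# Minimal sketch (Python): support and confidence of the rule "if diapers then beer".
# The transactions are hypothetical baskets, not real data.
transactions = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"bread", "milk"},
]

antecedent, consequent = {"diapers"}, {"beer"}

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
antecedent_only = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)     # how often the if/then pair appears at all
confidence = both / antecedent_only    # how often "then" holds when "if" holds

print(support)      # 0.4  (2 of 5 baskets contain both items)
print(confidence)   # about 0.67  (2 of the 3 diaper baskets also contain beer)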
So, in a given transaction with multiple items, Association Rule Mining primarily tries to find the rules
that govern how or why such products/items are often bought together. For example, peanut butter and
jelly are frequently purchased together because a lot of people like to make PB&J sandwiches.
Association Rule Mining is sometimes referred to as "Market Basket Analysis", as it was the first
application area of association mining. The aim is to discover associations of items occurring together
more often than you'd expect from randomly sampling all the possibilities. The classic anecdote is that
of Beer and Diapers. The story goes like this: young American men who go to the stores on Fridays to
buy diapers have a predisposition to grab a bottle of beer too. However unrelated and vague that may
sound to us laymen, association rule mining shows us how and why!
However, as surprising as it may seem, the figures tell us that 80% (=6000/7500) of the people who buy
diapers also buy beer.
2. Medical Diagnosis:
Association rules in medical diagnosis can be useful for assisting physicians in curing patients.
Diagnosis is not an easy process and has scope for errors, which may result in unreliable end-results.
Using relational association rule mining, we can identify the probability of the occurrence of an illness
concerning various factors and symptoms. Further, using learning techniques, this interface can be
extended by adding new symptoms and defining relationships between the new signs and the
corresponding diseases.
3. Census Data:
Every government has tonnes of census data. This data can be used to plan efficient public
services (education, health, transport) as well as help public businesses (for setting up new factories,
shopping malls, and even marketing particular products). This application of association rule mining and
data mining has immense potential in supporting sound public policy and bringing forth an efficient
functioning of a democratic society.
4. Protein Sequence:
Proteins are sequences made up of twenty types of amino acids. Each protein bears a unique 3D
structure which depends on the sequence of these amino acids. A slight change in the sequence can
cause a change in structure which might change the functioning of the protein. This dependency of the
protein functioning on its amino acid sequence has been a subject of great research. Earlier it was
thought that these sequences are random, but now it's believed that they aren't. Nitin Gupta, Nitin
Mangal, Kamal Tiwari, and Pabitra Mitra have deciphered the nature of associations between different
amino acids that are present in a protein. Knowledge and understanding of these association rules will
come in extremely helpful during the synthesis of artificial proteins.
Hypothesis Generation:
In a nutshell, hypothesis generation is what helps you come up with new ideas for what you need to
change. Sure, you can do this by sitting around in a room and brainstorming new features, but reaching
out and learning from your users is a much faster way of getting the right data.
Imagine you were building a product to help people buy shoes online. Hypothesis generation might
include things like:
Talking to people who buy shoes online to explore what their problems are
Talking to people who don't buy shoes online to understand why
Watching people attempt to buy shoes both online and offline in order to understand what their
problems really are rather than what they tell you they are
Watching people use your product to figure out if you've done anything particularly confusing
that is keeping them from buying shoes from you
At some point, you need to observe people or talk to people in order to understand them better.
However, you can sometimes use data mining or other metrics analysis to begin to generate a
hypothesis. For example, you might look at your registration flow and notice a severe drop off halfway
through. This might give you a clue that you have some sort of user problem halfway through your
registration process that you might want to look into with some qualitative research.
Hypothesis Validation:
Hypothesis validation is different. In this case, you already have an idea of what is wrong, and you have
an idea of how you might possibly fix it. You now have to go out and do some research to figure out if
your assumptions and decisions were correct. For our fictional shoe-buying product, hypothesis
validation might look something like:
Hypothesis:
Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is
working upon. It may or may not be true.
For example, if you are asked to build a credit risk model to identify which customers are likely to lapse
and which are not, these can be a possible set of hypotheses:
much. Why? Because, even if you understand the distribution of all 500 variables, you would need to
understand their correlation and a lot of other information, which can take a huge amount of time. This
strategy is typically known as boiling the ocean. So, you don't know exactly what you are looking for
and you are exploring every possible variable and relationship in the hope of using them all - very
difficult and time consuming.
Approach 2: Hypothesis-driven analysis
In this case, you list down a comprehensive set of analyses first - basically whatever comes to your mind.
Next, you see which of these variables are readily available or can be collected. Now, this list should
give you a set of smaller, specific individual pieces of analysis to work on. For example, instead of
understanding all 500 variables first, you check whether the bureau provides the number of past defaults
or not and use it in your analysis. This saves a lot of time and effort, and if you work through the
hypotheses in order of their expected importance, you will be able to finish the analysis in a fraction of
the time.
If you have read through the examples closely, the benefit of the hypothesis-driven approach should be
pretty clear. You can further read the books "The McKinsey Way" and "The Pyramid Principle" to gain
more insight into this process.
MODULE IV - VISUALIZATION-1
Data Visualization:
In order to understand data, it is often useful to visualize it. Normally in Big Data applications, the
interest lies in finding insight rather than just making beautiful plots. The following are examples of
different approaches to understanding data using plots.
To start analyzing the flights data, we can start by checking if there are correlations between numeric
variables.
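A minimal sketch of that first step is shown below, assuming the flights data is available as a CSV file; the file name "flights.csv" and the use of pandas are assumptions, since the text does not specify a file or a library.

# Minimal sketch (Python/pandas): correlations between the numeric variables of a flights table.
# "flights.csv" is an assumed placeholder file name, not one referenced by the text.
import pandas as pd

flights = pd.read_csv("flights.csv")

numeric = flights.select_dtypes(include="number")   # keep only the numeric variables
corr = numeric.corr()                               # pairwise correlation matrix

print(corr.round(2))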
Visualization or visualisation is any technique for creating images, diagrams, or animations to
communicate a message. Visualization through visual imagery has been an effective way to
communicate both abstract and concrete ideas since the dawn of humanity. Examples from history