Introduction To Analytics
By
Dr. Kingshuk Srivastava
Changing Life in the Digital Age
Transformation is Critical…
72% are vulnerable to disruption within three years
Why?….Suddenly!!..
External Threats
• Born-on-digital companies that steal market share or rewrite customer expectations
• New business models that reinvent our industry and change the game altogether
• Estimated 274,000 worldwide startups each day
The shift to a Data-Driven Organization
Data
Data Decision Science
Efficiency
Modernization
Monetization
Uses of Data
What is Data Science?
• Companies are increasingly using business analytics to understand their data and shape their business goals
[Figure: data science diagram. Recoverable labels: Data Engineering; Research; Computer Science; Math & Stats; Mathematics; Machine Learning; Computational Intelligence; Scripting, SQL, Python, R, Scala; Data Pipelines; Big Data / Apache Spark; Domains (sector-specific specializations); Analytics; Graph Analytics]
Types of Data
The V’s of Big Data
Data Collection Techniques
• Observations
• Tests
• Surveys
• Document analysis (the research literature)
Quantitative Methods
y = f(x)
Which is which here?
Key Factors for High-Quality Experimental Design
Data should not be contaminated by poor measurement or errors
in procedure.
• Not accurate, but precise
• Neither accurate nor precise
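The accuracy/precision distinction can be made concrete with a small sketch. It assumes that accuracy is judged by the bias of the mean reading against a known reference value, and precision by the spread (sample standard deviation) of repeated readings. The function name, the reference value, and the readings are made up for illustration.

```python
from statistics import mean, stdev

def describe_measurements(readings, true_value):
    """Return (bias, spread) for repeated readings of a known quantity."""
    bias = mean(readings) - true_value  # systematic error: small |bias| means accurate
    spread = stdev(readings)            # random error: small spread means precise
    return bias, spread

TRUE_VALUE = 10.0  # hypothetical reference value

precise_not_accurate = [12.0, 12.1, 11.9, 12.0]  # tight cluster, off target
neither = [7.0, 13.5, 9.0, 14.5]                 # scattered and off target

for name, readings in [("not accurate but precise", precise_not_accurate),
                       ("neither accurate nor precise", neither)]:
    bias, spread = describe_measurements(readings, TRUE_VALUE)
    print(f"{name}: bias={bias:+.2f}, spread={spread:.2f}")
```

A large bias with a small spread points to a procedural or calibration error; a large spread points to poor measurement.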
Sampling
Sampling is the problem of accurately acquiring the necessary
data in order to form a representative view of the problem.
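One way to keep a sample representative is stratified sampling, which draws the same fraction from each subgroup. The slide does not name a specific technique, so this is only an illustrative sketch; `stratified_sample`, the `segment` field, and the population are assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every group so group proportions are preserved."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one record per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 urban and 20 rural respondents.
population = [{"id": i, "segment": "urban" if i < 80 else "rural"} for i in range(100)]
sample = stratified_sample(population, key=lambda r: r["segment"], fraction=0.1)
print(len(sample), sum(r["segment"] == "rural" for r in sample))
```

A plain random sample of 10 could miss the rural group entirely; the stratified draw keeps the 80/20 split.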
Major Tasks in Data Preprocessing
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data reduction
  • Dimensionality reduction
  • Numerosity reduction
  • Data compression
• Data transformation and data discretization
  • Normalization
  • Concept hierarchy generation
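As an illustration of the normalization step, here is a minimal sketch of two common schemes, min-max scaling and z-score standardization. The slide does not say which scheme is meant, so both are shown as assumptions; the function names and income values are made up.

```python
from statistics import mean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [12000, 35000, 58000, 98000]  # made-up attribute values
print(min_max_normalize(incomes))
print(z_score_normalize(incomes))
```

Min-max scaling keeps the original ordering within a fixed range but is sensitive to outliers; z-scores are often preferred when the true minimum and maximum are unknown.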
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., Occupation=“ ” (missing data)
  • Noisy: containing noise, errors, or outliers
    • e.g., Salary=“−10” (an error)
  • Inconsistent: containing discrepancies in codes or names, e.g.,
    • Age=“42”, Birthday=“03/07/2010”
    • Was rating “1, 2, 3”, now rating “A, B, C”
    • Discrepancy between duplicate records
  • Intentional (e.g., disguised missing data)
    • Jan. 1 as everyone’s birthday?
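The categories above can be checked mechanically. The sketch below flags the slide's own examples in one record; the field names, the `find_problems` helper, the reference date, and the reading of "03/07/2010" as 3 July 2010 are all assumptions made for illustration.

```python
from datetime import date

def find_problems(record, as_of):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not (record.get("occupation") or "").strip():
        problems.append("incomplete: occupation missing")
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: negative salary")
    birthday = record.get("birthday")
    if birthday is not None:
        # Age implied by the birthday on the reference date.
        had_birthday = (as_of.month, as_of.day) >= (birthday.month, birthday.day)
        implied_age = as_of.year - birthday.year - (0 if had_birthday else 1)
        if record.get("age") != implied_age:
            problems.append("inconsistent: age does not match birthday")
        if (birthday.month, birthday.day) == (1, 1):
            problems.append("suspicious: Jan. 1 may be a disguised missing value")
    return problems

AS_OF = date(2024, 1, 1)  # hypothetical reference date
dirty = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 7, 3)}
for problem in find_problems(dirty, AS_OF):
    print(problem)
```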
Incomplete (Missing) Data
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
  • a global constant, e.g., “unknown” (a new class?!)
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, such as a Bayesian formula or decision tree
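Two of the automatic strategies, the attribute mean and the class-wise attribute mean, can be sketched as follows. The `impute_with_mean` helper, the use of `None` to mark a missing value, and the sample rows are assumptions; the sketch also assumes at least one known value exists for the attribute.

```python
from collections import defaultdict
from statistics import mean

def impute_with_mean(rows, attribute, class_attribute=None):
    """Return copies of rows with missing (None) values of `attribute` filled in.

    With class_attribute set, the mean of rows in the same class is used,
    falling back to the overall mean when that class has no known values.
    """
    known = [r for r in rows if r[attribute] is not None]
    overall = mean(r[attribute] for r in known)
    by_class = defaultdict(list)
    if class_attribute is not None:
        for r in known:
            by_class[r[class_attribute]].append(r[attribute])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r[attribute] is None:
            same_class = by_class.get(r.get(class_attribute)) if class_attribute else None
            r[attribute] = mean(same_class) if same_class else overall
        filled.append(r)
    return filled

rows = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": 90, "risk": "high"},
    {"income": None, "risk": "low"},
]
print(impute_with_mean(rows, "income")[3]["income"])          # overall mean
print(impute_with_mean(rows, "income", "risk")[3]["income"])  # mean of the "low" class
```

The class-wise fill uses only the two "low" rows, so it is not pulled upward by the "high" row, which is why the slide calls it the smarter option.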
Noisy Data
How to Handle Noisy Data?
• Binning
  • first sort the data and partition it into (equal-frequency) bins
  • then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • smooth by fitting the data to regression functions
• Clustering
  • detect and remove outliers
• Combined computer and human inspection
  • detect suspicious values and have a human check them (e.g., deal with possible outliers)
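The binning approach can be sketched in a few lines: sort, split into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The function names and the price list are illustrative assumptions, and the sketch assumes there are at least as many values as bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # spread any remainder
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up sample data
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```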
Data Cleaning as a Process
• Data discrepancy detection
  • Use metadata (e.g., domain, range, dependency, distribution)
  • Check field overloading
  • Check the uniqueness rule, consecutive rule, and null rule
  • Use commercial tools
    • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    • Data auditing: analyze the data to discover rules and relationships and detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
  • Data migration tools: allow transformations to be specified
  • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
  • Iterative and interactive (e.g., Potter’s Wheel)
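The three rule checks used in discrepancy detection can be sketched as small functions. This assumes the usual reading of the rules: uniqueness means no value repeats, consecutive means no gaps between the lowest and highest value, and the null rule defines which markers stand for a missing value. The function names, the marker set, and the sample values are hypothetical.

```python
def check_unique(values):
    """Uniqueness rule: return the values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)

def check_consecutive(values):
    """Consecutive rule: return integers missing between the min and max."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

NULL_MARKERS = {"", "?", "N/A", "null"}  # assumed markers for "no value"

def check_null(values, null_markers=NULL_MARKERS):
    """Null rule: return positions holding None or a recognized null marker."""
    return [i for i, v in enumerate(values)
            if v is None or str(v).strip() in null_markers]

invoice_ids = [1001, 1002, 1002, 1005]             # made-up key column
postal_codes = ["248007", "?", "", None, "110001"]  # made-up attribute column

print(check_unique(invoice_ids))       # [1002]
print(check_consecutive(invoice_ids))  # [1003, 1004]
print(check_null(postal_codes))        # [1, 2, 3]
```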
THANK YOU