1 - Introduction to Data Science

The document provides an overview of an introduction to data science course. It includes a disclaimer acknowledging that content has been obtained from various sources and modified for course requirements. The course structure lists 9 modules covering topics such as data analytics, the data science process, data visualization, and ethics. It also lists textbooks and reference books for the course. The course platform is Python/Jupyter Notebook/Google Colab, and datasets will be chosen as appropriate.


DSECL ZG523

Introduction to Data Science


BITS Pilani, Pilani Campus
Dr. Vijayalakshmi
Disclaimer and Acknowledgement

• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the course


COURSE STRUCTURE

M1 Introduction to Data Science
M2 Data Analytics
M3 Data Science Process
M4 Data Science Teams
M5 Data and Data Models
M6 Data Wrangling and Feature Engineering
M7 Data Visualization
M8 Storytelling with Data
M9 Ethics for Data Science
TEXT AND REFERENCE BOOKS

TEXT BOOKS
T1 Introducing Data Science by Cielen, Meysman and Ali
T2 Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic; Wiley
T3 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar

REFERENCE BOOKS
R1 The Art of Data Science by Roger D. Peng and Elizabeth Matsui
R2 Ethics and Data Science by DJ Patil, Hilary Mason, Mike Loukides
R3 Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas
R4 KDD, SEMMA and CRISP-DM: A Parallel Overview, Ana Azevedo and M.F. Santos, IADS-DM, 2008
CANVAS

The most relevant and up-to-date info is on Canvas:

Handout
Schedule for Webinar, Quiz, and Assignments
Session Slide Deck
Demo Lab Sheets
Quiz-I, Quiz-II
Mini-project

The video recordings will be available in Impartus.
PLATFORM / DATASET

Platform
• Python / Jupyter Notebook / Google Colab
Dataset
• Datasets as we deem appropriate
Webinar
• As per schedule


Module 1
• Fundamentals of Data Science
  - Why Data Science
  - Defining Data Science
  - Data Science Process
• Real-world applications
• Data Science vs BI
• Data Science vs Statistics
• Roles and responsibilities of a Data Scientist
• Software Engineering for Data Science
• Data Scientist's Toolbox
• Data Science Challenges


Fundamentals of Data Science


Why Data Science?

• Let's look into this question from the following perspectives:


Data Science as a field
Various data science job roles
Market revenue
Skills

Source: https://analyticsindiamag.com/why-a-career-in-data-science-should-be-your-prime-focus-in-2020/


Why Data Science?

• Data Science as a field


"Data Science is the sexiest job in the 21st century"
 ‐‐IBM
Data Science is one of the fastest growing fields in the world
In 2019, 2.9 million data science job openings were required globally
According to the U.S. Bureau of Labor Statistics, 11.5 million new jobs
will be created by the year 2026
Even with COVID‐19 situation, and the amount of shortage in talent, there
might not be a dip in data science as a career option

Source: https://analyticsindiamag.com/why‐a ‐career‐in‐data ‐science ‐should ‐be ‐your ‐prime ‐focus ‐in ‐
2020/

BITS Pilani, Pilani Campus


Why Data Science?

• The Hottest Job Roles and Trends in Data Science 2020
  - The rise of data science as a career choice in 2020 will also see the rise of its various job roles:
    - Data Engineer
    - Data Administrator
    - Machine Learning Engineer
    - Statistician
    - Data and Analytics Manager
  - In India, the average salary of a data scientist as of January 2020 is ₹10L/yr, which is pretty attractive (Glassdoor, 2020)

Source: https://analyticsindiamag.com/why-a-career-in-data-science-should-be-your-prime-focus-in-2020/


Why Data Science?

• Market Revenue
  - In recent years, with the shift toward analytics and data science, market revenue has increased
  - The analytics, data science, and big data industry in India generated about $2.71 billion annually in 2018, and the revenue grew to $3.03 billion in 2019
  - This 2019 figure is expected to nearly double by 2025

Source: https://analyticsindiamag.com/why-a-career-in-data-science-should-be-your-prime-focus-in-2020/


Most Wanted Data Science Skills in 2019
• Survey Details
  - LinkedIn data scientist job openings
  - Survey conducted in the US in April 2019
  - 24,697 respondents
  - 76.13 percent of data scientist job openings on LinkedIn required knowledge of the programming language Python
  - Wider industry data may vary

Source: https://softwarestrategiesblog.com/2019/06/13/how-to-get-your-data-scientist-career-started/


Data Science Defined


Data Science Defined
• There is no consistent definition of what constitutes data science
• "Data Science is a study of data."
• "Data Science is an art of uncovering useful patterns, connections, insights, and trends that are hiding behind the data"
• "Data Science helps to translate data into a story. The storytelling helps in uncovering insights. The insights in turn help in making decisions or strategic choices"
• "Data Science involves extracting meaningful insights from any data"
  - Requires a major effort of preparing, cleaning, scrubbing, or standardizing the data
  - Algorithms are then applied to crunch the pre-processed data
  - This process is iterative and requires analysts' awareness of best practices
  - Automation of tasks allows us to focus on the most important aspect of data science:
    - Interpreting the results of the analysis in order to make decisions


Data Science Defined

• Data science is an inter-disciplinary practice that draws from
  - Data engineering, statistics, data mining, machine learning, and predictive analytics
• Similar to operations research, data science focuses on
  - Implementing data-driven models and managing their outcomes
• To uncover nuggets of wisdom and actionable insights, data science combines the
  - Data-driven approach of statistical data analysis
  - Computational power of the computer
  - Programming acumen of a computer scientist
  - Domain-specific business intelligence

Source: Data Science Using Python and R – Chantal Larose & Daniel Larose


Artificial Intelligence, Machine Learning, and Data Science

• Artificial Intelligence
  - AI involves making machines capable of mimicking human behavior, particularly cognitive functions
  - E.g., facial recognition, automated driving, sorting mail based on postal code
• Machine Learning
  - Considered a sub-field of, or one of the tools of, AI
  - Involves providing machines with the capability of learning from experience
  - Experience for machines comes in the form of data
  - Data that is used to teach machines is called training data
• Data Science
  - Data science is the application of machine learning, artificial intelligence, and other quantitative fields like statistics, visualization, and mathematics

Source: Data Science, 2nd Edition, by Bala Deshpande and Vijay Kotu


Data Science Process


Various Stages of Data Science Process

Source: Simplilearn - https://www.youtube.com/watch?v=X3paOmcrTjQ
Phases of Data Science Process

1. Problem Description – Clearly describe the project objectives; create a problem statement that can be solved using data science
2. Data Preparation – Data cleaning/preparation; this is the more labor-intensive phase
3. Exploratory Data Analysis – Gain insights into the data through graphical exploration
4. Setup – Partition the data; balance the data, if needed; establish a baseline model
5. Modeling – The coding of the data science process; apply various algorithms to uncover hidden relationships
6. Evaluation – Determine whether your models are any good; select the best performing model from a set
7. Deployment – Deploy the model for real-world problems
Source: Data Science Using Python & R by Chantal Larose and Daniel Larose
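
To make the phases concrete, the sketch below walks through them in Python (the course platform). It is illustrative only and not from the original deck: scikit-learn's bundled breast-cancer dataset stands in for a real problem, and logistic regression stands in for "various algorithms".

# A minimal sketch of the phases (assumes scikit-learn, with pandas
# installed for the as_frame option); every choice here is a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Problem description: predict whether a tumor is malignant or benign.
data = load_breast_cancer(as_frame=True)

# 2-3. Data preparation / exploratory data analysis: summarize the data.
print(data.frame.describe().T.head())

# 4. Setup: partition the data and establish a baseline model.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# 5. Modeling: apply an algorithm to uncover hidden relationships.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# 6. Evaluation: are the models any good? Compare against the baseline.
print("baseline:", accuracy_score(y_test, baseline.predict(X_test)))
print("model:   ", accuracy_score(y_test, model.predict(X_test)))

# 7. Deployment would wrap model.predict behind an application interface.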



Key Features and Motivations of Data Science

• Extracting Meaningful Patterns
• Building Representative Models
• Combination of Statistics, Machine Learning, and Computing
• Learning Algorithms
• Associated Fields
  - Descriptive Statistics
  - Exploratory Visualization
  - Dimensional Slicing
  - Hypothesis Testing
  - Data Engineering
  - Business Intelligence

Source: Data Science, 2nd Edition, by Bala Deshpande and Vijay Kotu


Applications of Data Science


Data Science Use Cases
Real-World Applications

Source: https://data-flair.training/blogs/data-science-use-cases/


Facebook – Social Analytics



Facebook – Social Analytics

• Text Analysis
  - Uses a home-grown tool called "DeepText" to analyze words from user posts and extract meaning from them
  - E.g., extract people's interests and align photographs with texts

• Targeted Advertising
  - Uses deep learning for targeted advertising
  - Forms clusters of users based on their preferences and runs advertisements that appeal to them

Source: https://data-flair.training/blogs/data-science-use-cases/


Facial Recognition
Facebook's facial recognition software identified people in pictures uploaded to Facebook and suggested that users tag these people in the photos, thereby linking them to the tagged person's profile.


Amazon – Improving E-Commerce Experience

• Personalized recommendation
  - Heavily relies on predictive analytics (a personalized recommender system) to increase customer satisfaction
  - Analyzes the purchase history of customers, other customer suggestions, and user ratings to recommend products

Source: https://data-flair.training/blogs/data-science-use-cases/
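
As a toy illustration of the idea (not Amazon's actual system, which is proprietary), here is a minimal item-based collaborative filter: score the products a user has not bought by their similarity to what the user rated. The ratings matrix is fabricated.

import numpy as np

# Rows = users, columns = products, values = ratings (0 = not rated).
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [0, 1, 5, 4],
                    [1, 0, 4, 5]], dtype=float)

# Cosine similarity between product columns.
norms = np.linalg.norm(ratings, axis=0)
sim = (ratings.T @ ratings) / np.outer(norms, norms)

def recommend(user, k=1):
    """Rank unrated products by similarity-weighted ratings of this user."""
    rated = ratings[user] > 0
    scores = sim[:, rated] @ ratings[user, rated]
    scores[rated] = -np.inf            # exclude items already rated
    return np.argsort(scores)[::-1][:k]

print(recommend(user=0))               # -> [2], the only unrated product here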



Amazon – Improving E-Commerce Experience
• Anticipatory shipping model
  - Uses big data to predict the products that are most likely to be purchased by its users
  - Analyzes customer purchase patterns and keeps products that customers may want in the future in the nearest warehouse
• Price discounts
  - Using parameters such as user activity, order history, prices offered by competitors, product availability, etc., Amazon provides discounts on popular items and earns profits on less popular items
• Fraud Detection
  - Amazon has its own novel ways and algorithms to detect fraudulent sellers and purchases
• Improving Packaging Efficiency
  - Amazon optimizes the packaging of products in warehouses and increases the efficiency of packaging lines through data collected from workers

Source: https://data-flair.training/blogs/data-science-use-cases/


Uber – Improving Rider Experience

• Uber is a popular smartphone application that allows a user to book a cab
• Uber maintains a large database of drivers, customers, and several other records
• Makes extensive use of Big Data and crowdsourcing to derive insights and provide the best services to its customers
• Dynamic pricing
  - The concept is rooted in Big Data and data science: fares are calculated based on various parameters
  - Uber matches the customer profile with the most suitable driver
  - Charges the customer based on the time it takes to cover the distance rather than the distance itself
  - The time of travel is calculated using algorithms that make use of data related to traffic density and weather conditions
  - When demand is higher (more riders) than supply (fewer drivers), the price of the ride goes up
  - When the demand for Uber rides is low, Uber charges a lower rate

Source: https://data-flair.training/blogs/data-science-use-cases/
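
A hedged sketch of the demand/supply idea behind such pricing follows; the base fare, per-minute rate, and surge cap are invented parameters, and the real algorithm uses many more signals (traffic, weather, etc.).

def fare(base_fare, minutes, per_minute, riders, drivers, cap=3.0):
    """Time-based fare with a surge multiplier when demand exceeds supply."""
    surge = min(cap, max(1.0, riders / max(drivers, 1)))
    return (base_fare + minutes * per_minute) * surge

print(fare(50, 20, 8, riders=120, drivers=40))  # demand 3x supply: capped surge
print(fare(50, 20, 8, riders=30, drivers=60))   # slack demand: no surge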



Bank of America and Other Banks – Improving Customer Experience
• Fraud detection
  - Banks use data science and predictive analytics to detect fraud in payments, insurance, credit cards, and customer information
  - Banks employ data scientists to use their quantitative knowledge, applying algorithms like association, clustering, forecasting, and classification
• Risk modeling
  - Banks use data science for risk modeling to regulate financial activities
  - E.g., credit risk, operational risk, market risk, liquidity risk

Source: https://data-flair.training/blogs/data-science-use-cases/
Bank of America and Other Banks – Improving Customer Experience
• Erica – a virtual financial assistant (BoA)
  - Considered one of the finest innovations in the finance domain
  - Serves as a customer advisor to over 45 million users around the world
  - Makes use of speech recognition, a technological advancement in the field of Data Science, to take customer inputs
  - https://promo.bankofamerica.com/erica/

Source: https://data-flair.training/blogs/data-science-use-cases/


Bank of America and Other Banks – Improving Customer Experience

• Customer segmentation
  - Using various data-mining techniques, banks are able to segment their customers into high-value and low-value segments
  - Data scientists make use of clustering, logistic regression, decision trees, etc., to help banks understand the Customer Lifetime Value (CLV) of customers and group them into the appropriate segments

Source: https://data-flair.training/blogs/data-science-use-cases/
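
A minimal sketch of the segmentation step with k-means follows; the balance/transaction features are synthetic, whereas a real bank would cluster engineered CLV features.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Columns: average balance, monthly transactions (two synthetic groups).
X = np.vstack([rng.normal([2000, 10], [500, 3], size=(100, 2)),
               rng.normal([20000, 40], [4000, 8], size=(100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
for c in range(2):
    print(f"segment {c}: mean balance = {X[labels == c, 0].mean():.0f}")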



Airbnb

• Providing better search results

• Providing ideal lodgings and localities
  - Airbnb uses knowledge graphs where the user's preferences are matched with various parameters to provide ideal lodgings and localities

Source: https://data-flair.training/blogs/data-science-use-cases/


Real-World Applications
Airbnb
• Detecting bounce rates
  - Airbnb makes use of demographic analytics to analyze bounce rates from its websites
  - In 2014, Airbnb found that users in certain Asian countries had a high bounce rate when visiting the home page
  - Airbnb discovered that users would click the "neighborhood link", browse photos, and never come back to make a booking
  - To mitigate this issue, Airbnb released a different version in those countries that replaced neighborhood links with the top travel destinations in China, Japan, Korea, and Singapore
  - This resulted in a 10% improvement in the lift rate for those users

Source: https://data-flair.training/blogs/data-science-use-cases/


Spotify

• Providing a better music streaming experience
  - Spotify uses Data Science and leverages big data to provide personalized music recommendations
  - It has over 100 million users
  - Uses over 600 GB of daily data generated by users to build its algorithms and boost the user experience

Source: https://data-flair.training/blogs/data-science-use-cases/


Spotify – Improving Experience for Artists

Spotify for Artists application
  - Spotify for Artists is the way to pitch new songs to some of the world's most followed playlists
  - Fans help artists reach their goals, whether that's finding the first listener or getting nominated for Best New Artist
  - Provides the necessary tools for artists to build a fan base

Source: https://data-flair.training/blogs/data-science-use-cases/


Real-World Applications
Spotify
• Others
  - Spotify uses data science to gain insights about which universities had the highest percentage of party playlists and which ones spent the most time on them
  - "Spotify Insights" publishes information about ongoing trends in music
  - Spotify's Niland, an API-based product, uses machine learning to provide better searches and recommendations to its users
  - Spotify analyzes the listening habits of its users to predict the Grammy Award winners
  - In the year 2013, Spotify made 4 correct predictions out of 6

Source: https://data-flair.training/blogs/data-science-use-cases/


Real World Examples
• Anomaly detection
  - Fraud, disease, crime, etc.
• Automation and decision-making
  - Background checks, credit worthiness, etc.
• Classification
  - In an email server, this could mean classifying emails as "important" or "junk" (see the sketch after this list)
• Forecasting
  - Sales, revenue, and customer retention
• Pattern detection
  - Weather patterns, financial market patterns, etc.
• Recognition
  - Facial, voice, text, etc.
• Recommendations
  - Based on learned preferences, recommendation engines can refer you to movies, restaurants, and books you may like
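
The email-classification sketch referenced above, on four toy messages (a production spam filter would need far more data and features):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting agenda attached",
          "free money click here", "project status report"]
labels = ["junk", "important", "junk", "important"]

# Bag-of-words features + naive Bayes: a classic minimal text classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(emails, labels)
print(clf.predict(["claim your free prize", "status of the project"]))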
Real-World Applications
Real World Examples
• Sales and Marketing
  - Google AdSense collects data from internet users so relevant commercial messages can be matched to the person browsing the internet
  - MaxPoint (http://maxpoint.com/us) is another example of real-time personalized advertising
• Customer Relationship Management
  - Commercial companies in almost every industry use data science to gain insights into their customers, processes, staff, competition, and products
  - Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings

Source: Introducing Data Science - Cielen, Meysman and Ali


Real-World Applications
• Human Resources
  - Human resource professionals use people analytics and text mining to screen candidates, monitor the mood of employees, and study informal networks among coworkers
  - People analytics is the central theme of the book Moneyball: The Art of Winning an Unfair Game
    - The traditional scouting process for American baseball was random, and replacing it with correlated signals changed everything
    - Relying on statistics allowed them to hire the right players and pit them against the opponents where they had the biggest advantage
  - Video

Source: Introducing Data Science - Cielen, Meysman and Ali


Real-World Applications
• Finance
  - Financial institutions use data science to:
    - Predict stock markets,
    - Determine the risk of lending money, and
    - Learn how to attract new clients for their services
  - As of 2016, at least 50% of trades worldwide are performed automatically by machines, based on algorithms developed by quants with the help of big data and data science techniques
  - Data scientists who work on trading algorithms are often called quants

Source: Introducing Data Science - Cielen, Meysman and Ali


Real-World Applications
• Non-Governmental Organizations (NGOs)
  - NGOs are also big-time into data science
  - They use it to raise money and defend their causes
  - The World Wildlife Fund (WWF), for instance, employs data scientists to increase the effectiveness of their fundraising efforts
  - Many data scientists devote part of their time to helping NGOs, because NGOs often lack the resources to collect data and employ data scientists
  - DataKind is one such data scientist group that devotes its time to the benefit of mankind

Source: Introducing Data Science - Cielen, Meysman and Ali


Real-World Applications
• Education
  - Universities use data science in their research but also to enhance the study experience of their students
  - The rise of massive open online courses (MOOCs) produces a lot of data
    - This data allows universities to study how this type of learning can complement traditional classes
  - MOOCs are an invaluable asset if you want to become a data scientist and big data professional:
    - Coursera, Udacity, and edX
  - The big data and data science landscape changes quickly, and MOOCs allow you to stay up to date by following courses from top universities

Source: Introducing Data Science - Cielen, Meysman and Ali


Real-World Applications

Source: https://data-flair.training/blogs/data-science-applications/


Roles and Responsibilities of a Data Scientist
Who is a data scientist?


Google defines a data scientist as
• A person employed to analyse and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision making


Example: Customer Exit Rate at a Bank

Source: https://www.simplilearn.com/tutorials/data-science-tutorial/introduction-to-data-science
Responsibilities of a data scientist

• Identify the data analytics problems that can bring more business to the organization
• Discover solutions after working on the data
• Understand all the critical datasets and variables
• Collect structured and unstructured data from reliable sources
• Work on unstructured data, such as images and videos
• Analyze data to find hidden patterns and insights
• Clean data by handling missing values and outliers to improve accuracy
• Apply different models and algorithms to find business solutions
• Communicate the insights to clients with the help of visualization tools


Skillset required for a data scientist
1. Critical Thinking: A Data Scientist must have the critical thinking ability to analyze the facts before reaching a conclusion
2. Analytical Thinking: A Data Scientist must think analytically to solve business problems
3. Programming: A Data Scientist must know at least one programming language to implement the right algorithms. They must be comfortable writing code in languages such as Python, R, and SQL
4. Computer Science: A Data Scientist must be able to apply different principles of Computer Science, including software engineering, database systems, Artificial Intelligence, and numerical analysis
5. Interpersonal Skills: A Data Scientist must have excellent communication skills to interact with different audiences across the organization
6. Business Intuition: A Data Scientist must be able to communicate with clients to understand the problem


DATA SCIENCE VS. BUSINESS INTELLIGENCE


DATA SCIENCE VS. BUSINESS INTELLIGENCE

               Data Science                          Business Intelligence
Perspective    Looking forward                       Looking backward
Analysis       Predictive, explorative               Descriptive, comparative
Data           New data, new analysis;               Same data, same analysis;
               listens to data; distributed          speaks for data; warehoused
Scope          Specific to business question         Unlimited
Expertise      Data scientist                        Business analyst
Deliverable    Insight or story                      Table or report
Applicability  Future, correction for influences     Historic, confounding factors


DATA SCIENTIST VS. BUSINESS ANALYST


DATA SCIENCE VS. STATISTICS

                    Data Science                       Statistics
Type of problem     Semi-structured or unstructured    Well structured
Inference model     Explicit inference                 No inference
Analysis objective  Need not be well formed            Well formed objective
Type of analysis    Explorative                        Confirmative
Data collection     Not linked to the objective        Collected based on the objective
Size of dataset     Large, heterogeneous               Small, homogeneous


Thank you



DSECL ZG523
Data Analytics

BITS Pilani, Pilani Campus
Dr. Vijayalakshmi


Data Analytics Module Topics
• Defining Analytics
• Types of data analytics
  - Descriptive, Diagnostic
  - Predictive, Prescriptive
• Data Analytics – methodologies
  - CRISP-DM Methodology
  - SEMMA
  - Big Data Life Cycle
  - SMAM
• Analytics Capacity Building
• Challenges in Data-driven decision making


What is Analytics?


Defining Analytics
What is Analytics?

• What route should I follow?
• How to retain customers?
• What home should I buy?
• What treatment should I recommend?
• What stock should I purchase?


Defining Analytics
What is Analytics? – Dictionary meaning
• "Analytics" can be used as a noun or a verb
  - Is Analytics the "name" of your department, or do you actually "do" Analytics?
  - Analytics as a noun: "He is the HOD of the Business Analytics department!"
  - Analytics as a verb: "She applied Analytics to the data to find patterns"
• Let's look at some dictionary meanings: Cambridge, Collins

Source: http://the-data-guy.blogspot.com/2016/10/is-analytics-noun-or-


Defining Analytics
What is Analytics? – Dictionary meaning: Oxford, Dictionary.com


Defining Analytics
What is Analytics? – From the Internet
• Analytics is the process of discovering, interpreting, and communicating significant patterns in data. Analytics helps us see insights and meaningful data that we might not otherwise detect.
    -- Oracle
• Analytics uses data and math to answer business questions, discover relationships, predict unknown outcomes and automate decisions. This diverse field of computer science is used to find meaningful patterns in data and uncover new knowledge based on applied mathematics, statistics, predictive modeling and machine learning techniques.
    -- www.sas.com
• Analytics is the systematic computational analysis of data or statistics. It is used for the discovery, interpretation, and communication of meaningful patterns in data. It also entails applying data patterns towards effective decision-making.
    -- Wikipedia
• Analytics is the scientific process of discovering and communicating the meaningful patterns which can be found in data. It is concerned with turning raw data into insight for making better decisions. It relies on the application of statistics, computer programming, and operations research in order to quantify and gain insight into the meanings of data.
    -- Techopedia
• "Information is the oil of the 21st century, and analytics is the combustion engine."
    -- Peter Sondergaard, Gartner Research

Source: Internet
Defining Analytics
What is Analytics?
• Analytics is the process of extracting and creating information from raw data by using techniques such as:
  - filtering, processing, categorizing, condensing and contextualizing the data
• Analytics is a broad term that encompasses the processes, technologies, frameworks, and algorithms to extract meaningful insights from data
• The information thus obtained is then used to infer knowledge about the system and/or its users and its operations, to make the systems smarter and more efficient

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data



Characteristics of Big Data (5Vs): Volume
• "There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days."
    -- Eric Schmidt, Executive Chairman of Google
• Why is there an increase in volume?
  - The data generated by modern IT, industrial, healthcare, Internet of Things, and other systems is growing exponentially
  - The cost of storage and processing architectures is going down
  - There is a growing need to extract valuable insights from the data to improve business processes, efficiency and service to consumers
• What is considered big?
  - There is no fixed threshold for the volume of data to be considered big data
  - However, the term big data is used for massive-scale data that is difficult to store, manage, and process using traditional databases and data processing architectures
  - Specialized tools and frameworks are required to store, process and analyze such data, for example:
    - Hadoop Ecosystem, NoSQL Databases, Programming Languages (Python, R), etc.

[Figure: the 5Vs of Big Data – Volume, Velocity, Variety, Veracity, Value]

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data (5Vs): Velocity
• Velocity is the primary reason for the exponential growth of data in a short span of time
• Velocity of data refers to how fast the data is generated
• Data generated by certain sources can arrive at very high velocities, for example:
  - social media data or sensor data
• Some applications have strict deadlines for data analysis (such as trading or online fraud detection), and the data needs to be analyzed in real-time
• Specialized tools are required to ingest such high-velocity data into the big data infrastructure and analyze the data in real-time

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data (5Vs): Variety & Veracity
• Variety
  - Variety refers to the different forms of the data
  - Big data comes in different forms such as structured, unstructured or semi-structured
    - E.g., text data, image, audio, video and sensor data
  - Big data systems need to be flexible enough to handle such variety of data
• Veracity
  - Veracity refers to how accurate the data is
  - To extract value from the data, the data needs to be cleaned to remove noise
  - Data-driven applications can reap the benefits of big data only when the data is meaningful and accurate
  - Therefore, cleansing of data is important so that incorrect and faulty data can be filtered out

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data (5Vs): Value
• Value of data refers to the usefulness of the data for the intended purpose
• The end goal of any big data analytics system is to extract value from the data
• The value of the data is also related to the veracity or accuracy of the data
• For some applications, value also depends on how fast we are able to process the data

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Analytics Goals: The Driver
• What drives the choice of technologies, algorithms, and frameworks used for analytics?
  - The goals of the analytic task at hand:
    - To predict something
    - To find patterns in the data
    - To find relationships in the data

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti


Analytics Terminology: Some references
• https://marketing.adobe.com/resources/help/en_US/reference/glossary.html
• https://analyticstraining.com/analytics-terminology/
• https://blog.hubspot.com/marketing/hubspot-google-analytics-glossary

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti


Types of analytics



There are 4 types of data analytics

• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics


Types of Analytics
Types of analytics according to the objective

Source: https://blogs.gartner.com/jason-mcnellis/2019/11/05/youre-likely-investing-lot-marketing-analytics-getting-right-insights/


Descriptive Analytics

Answers the question of what happened.

Summarizes past data, usually in the form of dashboards.
Provides insights into the past.
Also known as statistical analysis.
Uses raw data from multiple data sources.


Descriptive Analytics Example



Descriptive Analytics

Types of Descriptive Analysis


• Measures of Frequency
• Measures of Central Tendency
• Measures of Dispersion
• Measures of Position
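
A minimal pandas sketch computing one measure of each kind on a made-up column:

import pandas as pd

age = pd.Series([23, 25, 25, 31, 35, 35, 35, 42, 50], name="age")

print(age.value_counts())                           # frequency
print(age.mean(), age.median(), age.mode()[0])      # central tendency
print(age.std(), age.max() - age.min())             # dispersion
print(age.quantile([0.25, 0.5, 0.75]))              # position (quartiles)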



Diagnostic Analytics

Answers the question of why something happened.

Gives in-depth insights into data.
Identifies relationships between data and identifies patterns of behavior.


Diagnostic Analytics Examples

1. Examining Market Demand


2. Explaining Customer Behavior
3. Identifying Technology Issues
4. Improving Company Culture



Diagnostic Analytics

Pattern recognition to identify patterns.
Linear / logistic regression to identify relationships.
Neural networks.
Deep learning techniques.
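
For instance, a hedged sketch of diagnosing "why did sales move?": correlations hint at relationships and a least-squares fit quantifies them (the column names and effect sizes below are fabricated):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"ad_spend": rng.normal(100, 20, n),
                   "price": rng.normal(50, 5, n)})
# Synthetic ground truth: sales rise with ad spend and fall with price.
df["sales"] = 500 + 3 * df["ad_spend"] - 6 * df["price"] + rng.normal(0, 30, n)

print(df.corr()["sales"])                     # which factors move with sales?
coef, *_ = np.linalg.lstsq(                   # quantify the relationships
    np.column_stack([np.ones(n), df["ad_spend"], df["price"]]),
    df["sales"].to_numpy(), rcond=None)
print("intercept, ad_spend, price effects:", coef.round(2))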



Predictive Analytics

Answers the question of what is likely to happen.

Predicts future trends.
Being able to predict allows one to make better decisions.
Analysis is based on machine or deep learning.
The accuracy of the forecast or prediction depends highly on data quality and the stability of the situation.


Predictive Analytics Example



Predictive Analytics

Techniques / Algorithms:
• Regression
• Classification
• ML algorithms like linear regression, logistic regression, SVM
• Deep learning techniques
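
A minimal forecasting sketch with linear regression on synthetic monthly sales (illustrative numbers only):

import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 25).reshape(-1, 1)             # two years of history
sales = 100 + 5 * months.ravel() \
        + np.random.default_rng(2).normal(0, 8, 24)  # trend + noise

model = LinearRegression().fit(months, sales)
future = np.arange(25, 28).reshape(-1, 1)
print(model.predict(future).round(1))                # next-quarter forecast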



Prescriptive Analytics

Answers the question of what should be done.

Data-driven decision making and corrective actions.
Prescribes what action to take to eliminate a future problem or take full advantage of a promising trend.
Needs historical internal data and external information such as trends.
Analysis is based on machine or deep learning and business rules.
Uses AI to improve decision making.


Prescriptive Analytics – Examples

• Marketing and sales


• Transportation industry
• Financial markets



Cognitive Analytics – What Don’t I Know?

https://www.10xds.com/blog/cognitive-analytics-to-reinvent-business/
Cognitive Analytics

The next level of analytics.

Human cognition is based on context and reasoning.
Cognitive systems mimic how humans reason and process.
Cognitive systems analyze information and draw inferences using probability.
They continuously learn from data and reprogram themselves.
According to one source:
"The essential distinction between cognitive platforms and artificial intelligence systems is that you want an AI to do something for you. A cognitive platform is something you turn to for collaboration or for advice."

https://interestingengineering.com/cognitive-computing-more-human-than-artificial-intelligence
Cognitive Analytics
Involves semantics, AI, machine learning, deep learning, natural language processing, and neural networks.
Simulates the human thought process to learn from the data and extract the hidden patterns from data.
Uses all types of data (audio, video, text, images) in the analytics process.
Although this is the top tier of analytics maturity, Cognitive Analytics can be used at the prior levels.
According to Jean-Francois Puget:
"It extends the analytics journey to areas that were unreachable with more classical analytics techniques like business intelligence, statistics, and operations research."

https://www.ecapitaladvisors.com/blog/analytics-maturity/
https://www.xenonstack.com/insights/what-is-cognitive-analytics/
Descriptive Analytics – Example #1

Problem Statement:
"The market research team at Aqua Analytics Pvt. Ltd is assigned a task to identify the profile of a typical customer for a digital fitness band that is offered by Titanic Corp. The market research team decides to investigate whether there are differences across the usage patterns and product lines with respect to customer characteristics."

Data captured:
  - Gender
  - Age (in years)
  - Education (in years)
  - Relationship Status (Single or Partnered)
  - Annual Household Income
  - Average number of times customer tracks activity each week
  - Number of miles customer expects to walk each week
  - Self-rated fitness on a scale 1-5, where 1 is poor shape and 5 is excellent
  - Model of the product purchased – IQ75, MZ65, DX87

https://medium.com/@ashishpahwa7/first-case-study-in-descriptive-analytics-a744140c39a4
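
A sketch of how such a profile could be summarized per product line with pandas; the rows below are fabricated stand-ins for the captured fields, not the actual survey data:

import pandas as pd

df = pd.DataFrame({
    "model": ["IQ75", "IQ75", "MZ65", "MZ65", "DX87", "DX87"],
    "age": [25, 28, 34, 40, 30, 45],
    "income": [35000, 42000, 55000, 60000, 80000, 95000],
    "fitness": [3, 3, 3, 2, 5, 4],  # self-rated, 1 (poor) to 5 (excellent)
})

# "Typical customer" per model: summarize characteristics by product line.
print(df.groupby("model")[["age", "income", "fitness"]].agg(["mean", "median"]))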


Diagnostic Analytics – Example #1

Problem Statement:
"During the 1980s, General Electric was selling different products to its customers, such as light bulbs, jet engines, windmills, and other related products. They also separately sold parts and services: they would sell you a certain product, you would use it until it needed repair, either because of normal wear and tear or because it broke, and you would come back to GE, and GE would sell you parts and services to fix it. The model for GE was focused on how much GE was selling, in sales of operational equipment and in sales of parts and services. And what does GE need to do to drive up those sales?"

https://medium.com/parrotai/understand-data-analytics-framework-with-a-case-study-in-the-business-world-15bfb421028d
Diagnostic Analytics – Example #1

https://www.sganalytics.com/blog/change-management-analytics-adoption/
Predictive Analytics – Example #1
• Google launched Google Flu Trends (GFT) to collect predictive analytics regarding outbreaks of flu. It's a great example of big data analytics in action.
• So, did Google manage to predict influenza activity in real-time by aggregating search engine queries and adopting predictive analytics?
• Even with a wealth of big data on search queries, GFT overestimated the prevalence of flu by over 50% in 2011-2012 and 2012-2013.
• Google matched the search engine terms used by people in different regions of the world. When these queries were compared with traditional flu surveillance systems, Google found that the predictive analytics of the flu season pointed towards a correlation with higher search engine traffic for flu-related queries.
Predictive Analytics – Example #1

https://www.slideshare.net/VasileiosLampos/usergenerated-content-collective-and-personalised-inference-tasks
Predictive Analytics – Example #2
Colleen Jones applied predictive analytics to FootSmart (a niche online catalog retailer) on a content marketing product. It was called the FootSmart Health Resource Center (FHRC), and it consisted of articles, diagrams, quizzes and the like.
On analyzing the data around increased search engine visibility, FHRC was found to help FootSmart reach more of the right kind of target customers.
They were receiving more traffic, primarily consisting of people who cared about foot health conditions and their treatments.
FootSmart decided to push more content at FHRC and also improve its merchandising of the product.
The result of such informed data-driven decision making? A 36% increase in weekly sales.

https://www.footsmart.com/pages/health-resource-center
Predictive Analytics – Example #3

Predictive Policing (self-study)

https://www.brennancenter.org/our-work/research-reports/predictive-policing-explained
https://www.youtube.com/watch?v=YxvyeaL7NEM


Prescriptive Analytics – Example #1
A health insurance company analyzes its data and determines that many of its diabetic patients also suffer from retinopathy.

With this information, the provider can now use predictive analytics to get an idea of how many more ophthalmology claims it might receive during the next year.

Then, using prescriptive analytics, the company can look at scenarios where the reimbursement costs for ophthalmology increase, decrease, or hold steady. These scenarios then allow them to make an informed decision about how to proceed in a way that's both cost-effective and beneficial to their customers.

Analyzing data on patients, treatments, appointments, surgeries, and even radiologic techniques can ensure hospitals are properly staffed, doctors are devising tests and treatments based on probability rather than gut instinct, and the facility can save costs on everything from medical supplies to transport fees to food budgets.


Prescriptive Analytics – Example #2

Whenever you go to Amazon, the site recommends dozens and dozens of products to you. These are based not only on your previous shopping history (reactive), but also on what you've searched for online, what other people who've shopped for the same things have purchased, and about a million other factors (proactive).
Amazon and other large retailers take descriptive, diagnostic, and predictive data and then run it through a prescriptive analytics system to find products that you have a higher chance of buying.
Every bit of data is broken down and examined with the end goal of helping the company suggest products you may not have even known you wanted.

https://accent-technologies.com/2020/06/18/examples-of-prescriptive-analytics/
Healthcare Case Study


Healthcare Analytics – Case Study
Healthcare Practices
• Private Practice
• Group Practice
• Large HMOs
• Hospital Based
• Locum Tenens

Source: https://www.micromd.com/blogmd/5-types-medical-practices/
Image Source: https://technologyadvice.com/blog/healthcare/best-medical-practice-management-software/


Healthcare Analytics – Case Study
Healthcare Practice
• Private Practice
  - A physician practices alone without any partners and typically with minimal support staff
  - Ideally works for physicians who wish to own and manage their own practice
  - Benefits
    - Individual freedom, closer relationships with patients, and the ability to set their own practice's growth pattern
  - Drawback
    - Longer work hours, financial extremes, and a greater amount of business risk

Source: https://www.micromd.com/blogmd/5-types-medical-practices/
Image Source: https://www.mediqfinancial.com.au/blog/the-essential-guide-to-starting-your-own-medical-practice/


Healthcare Analytics – Case Study
Healthcare Practice
• Group Practice
  - A group practice involves two or more physicians who provide medical care within the same facility
  - They utilize the same personnel and divide the income in a manner previously agreed upon by the group
  - Group practices may consist of providers from a single specialty or multiple specialties
  - Benefits
    - Shorter work hours, built-in on-call coverage, and access to more working capital
    - All of these factors can lead to less stress
  - Drawback
    - Less individual freedom, limits on the ability to rapidly grow income, and the need for a consensus on business decisions

Source: https://www.micromd.com/blogmd/5-types-medical-practices/
Image Source: https://www.physicianleaders.org/news/selling-medical-practice-you-need-exit-strategy


Healthcare Analytics – Case Study
Healthcare Practice
• Large HMOs (Health Maintenance Organizations)
  - Provide health insurance coverage for a monthly or annual fee
  - An HMO limits member coverage to medical care provided through a network of doctors and other healthcare providers who are under contract with the HMO
  - An HMO employs providers to care for its members and beneficiaries
  - The goal of HMOs is to decrease medical costs for those consumers
  - Examples:
    - The Kaiser Foundation Health Plan in California
    - The Health Insurance Plan of Greater New York
    - Group Health Cooperative of Puget Sound

Source: https://www.micromd.com/blogmd/5-types-medical-practices/


Healthcare Analytics – Case Study
Healthcare Practice
• Large HMOs (Health Maintenance Organizations)
  - Benefits
    - A more stable work life and regular hours for providers (physicians)
    - Less paperwork, fewer regulatory responsibilities, and a regular salary along with bonus opportunities
    - These bonuses are based on productivity or patient satisfaction
  - Drawback
    - The main drawback for physicians working for an HMO is the lack of autonomy
    - HMOs require physicians to follow their guidelines in providing care

Source: https://www.micromd.com/blogmd/5-types-medical-practices/


Healthcare Analytics – Case Study
Healthcare Practice
• Hospital Based
  - In hospital-based work, physicians earn a predictable income, have a regular patient base, and a solid referral network
  - Physicians who are employed by a hospital will either work in a hospital-owned practice or in a department of the hospital itself
  - Benefits
    - A regular work schedule, low to no business and legal risk, and a steady flow of income
  - Drawback
    - Lack of physician autonomy
    - Employee constraints and the expectation that physicians become involved in hospital committee work

Source: https://www.micromd.com/blogmd/5-types-medical-practices/


Healthcare Analytics – Case Study
Healthcare Practice
• Locum Tenens
  - Locum tenens is derived from the Latin phrase for "to hold the place of"
  - In locum tenens, physicians relocate to areas hurting for healthcare professionals
  - These positions offer temporary employment and may offer higher pay than more permanent employment situations
  - Benefits
    - Physicians working in locum tenens scenarios enjoy the benefits of variety and the ability to experience numerous types of practices and geographic locations
    - They also enjoy schedule flexibility and lower living costs
  - Drawback
    - The possibility that benefits are not included, and a potential lack of steady work
    - Physicians need to regularly uproot their families

Source: https://www.micromd.com/blogmd/5-types-medical-practices/


Healthcare Analytics – Case Study
Healthcare Scenario – Sources of Data
• Common sources of practice data existing within the practice:
  - Revenue Cycle Management System (billing for a procedure, processing denials, collecting payments)
  - Electronic Health Records (EHRs) (providing services to a patient)
  - Scheduling & Information System (scheduling an appointment)
  - Survey Results
  - Peer Review System
  - Other sources
• Analytics provides a platform to convert that data into actionable information for the practice
• As reimbursement shifts to a value-based care model, it is critical to have insight into both clinical and business metrics to prepare for the future

Source: https://integratedmp.com/healthcare-practice-analytics-101-numbers-charts-dashboards-oh-my/


Healthcare Analytics – Case Study
Healthcare Scenario – Integrated Medical Partners (IMP)
• https://www.youtube.com/watch?v=olpuyn6kemg
• IMP
  - Primarily offers Revenue Cycle Management (RCM) and Advanced Analytics, among others
  - Helps maximize collections for independent physician practices
  - Provides practice performance analytics
  - Offers tailored solutions to partner with physicians and hospitals so that they can improve patient care, enhance compliance, improve operational efficiency, and increase profitability

Source: https://integratedmp.com/healthcare-practice-analytics-101-numbers-charts-dashboards-oh-my/


Healthcare Analytics – Case Study
Healthcare Scenario – Denial of Claims
• Consider that the healthcare practice has seen an increase in denied claims over the past several months
• Increased denials impact the financial performance of the organization negatively
• Therefore, the company needs tools to help identify and resolve the root cause of the denials
  - This trend needs to be reversed as soon as possible, but how?
• The different types of analytics provide the necessary insights into the data and the cause of the denial increase

Source: https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/


Healthcare Analytics – Case Study
Healthcare Scenario – Descriptive Analytics
• Descriptive – what happened?
  - Descriptive analytics will tell you what is happening in the practice
  - In this example, there has been an increase in the number of denied claims over the past several months
  - Further research identifying a trend revealed that the increase in denials is specific to a particular denial code
  - What is this denial code?
    - This denial code is for a referring provider (doctor) who is not enrolled in the Medicare Provider Enrollment, Chain, and Ownership System (PECOS)
• Now, the question is why this provider is not enrolled in PECOS (a sketch of surfacing such a trend follows below)

Source: https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
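
A sketch of how descriptive analytics could surface such a trend from claims data (fabricated records; the field and code names are illustrative):

import pandas as pd

claims = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar", "Mar"],
    "denial_code": ["CO-16", "N264", "N264", "CO-16", "N264", "N264", "N264"],
})

# Count denials per code per month; a growing row flags the suspect code.
counts = claims.pivot_table(index="denial_code", columns="month",
                            aggfunc="size", fill_value=0)
print(counts[["Jan", "Feb", "Mar"]])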



Healthcare Analytics – Case Study
Healthcare Scenario – Diagnostic Analytics
• Diagnostic – why did it happen?
  - The descriptive analytics has identified an increase in denials specific to a referring provider
  - The next step is to diagnose why this change occurred
    - Identify why the referring provider is not enrolled in PECOS
  - So, we utilize diagnostic analytics to understand why this change occurred
  - Looking for changes in referring provider patterns at the time the increase in PECOS denials began would help to identify new referring providers
    - It will also identify whether these new referring providers are enrolled in PECOS
  - For the purpose of this example, assume we identify one referring provider, new to referring to your practice, who is not enrolled in PECOS

Source: https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/


Healthcare Analytics – Case Study
Healthcare Scenario – Predictive Analytics
• Predictive – what will happen?
  - Predictive analytics allows us to learn from historical trends to predict what will happen in the future
  - Utilizing descriptive and diagnostic analytics, we can determine the historical referral pattern of the new referring provider not enrolled in PECOS
  - Assuming this provider remains not enrolled in PECOS and refers a steady stream of patients to the organization,
    - predictive analytics will tell us the expected denials associated with these claims
  - The resulting impact of this situation is identified

Source: https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/


Healthcare Analytics – Case Study
Healthcare Scenario – Prescriptive Analytics
• Prescriptive – what should I do?
  - Prescriptive analytics assists in determining the best course of action from the information gathered from descriptive, diagnostic and predictive analytics
  - In this scenario, the best course of action would be to contact the new referring provider, express appreciation for the new referrals, but also present evidence that the provider's referrals are resulting in denied claims due to PECOS enrollment
  - By working with this provider to become enrolled in PECOS, denied claims can now be rebilled
  - In addition, future claims are less likely to be denied due to the provider not being certified/eligible to be paid for the procedure/service on the claim date of service
  - The key to prescriptive analytics is implementing a solution that will prevent the same breakdown from occurring again in the future
  - By reducing the likelihood of claim denials, the overall financial health of the physician practice benefits

Source: https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/


Data Analytics Tools



Data Analytics Applications in Different Fields
• Data Analytics in Finance
• Data Analytics in Healthcare
• Data Analytics in Marketing
• Data Analytics in HR
• Data Analytics in IoT
• Data Analytics for Business



Types of analytics according to the type of data
1 Text analytics
2 Real-time data analytics
3 Multimedia analytics
4 Geo analytics



Healthcare Analytics – Case Study

Self-study:
https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
https://www.youtube.com/watch?v=olpuyn6kemg



References:
Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti



THANK YOU



DSECL ZG523
Introduction to Data Science
Dr. Vijayalakshmi
TABLE OF CONTENTS
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-Cycle
7 Challenges in Data-Driven Decision Making
DATA ANALYTICS

Data Analytics is defined as a process of cleaning, transforming, and modeling data to discover useful information for business decision-making.
DATA ANALYTICS METHODOLOGIES
• A methodology is a set of guiding principles and processes used to plan, manage, and execute projects.
• It helps data analysts to reduce risks, avoid duplication of effort, and ultimately increase the impact of the project.

Use a standard methodology to ensure a good outcome:
1 CRISP-DM
2 SEMMA
3 SMAM
4 Big Data Life-cycle
NEED FOR A METHODOLOGY
• Framework for recording experience
• Allows projects to be replicated
• Aid to project planning and management
• "Comfort factor" for new adopters
• Demonstrates maturity of data mining
• Reduces dependency on "stars"
• Encourages best practices and helps to obtain better results
D AT A A n a l y t i c s
METHODOLOGY
10 Questions the process aims to answer
Problem to Approach
1 What is the problem that you are trying to solve?
2 Are there available solutions to similar problems?
Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
TABLE OF CONTENTS
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-cycle
7 Challenges in Data Driven Decision-Making
CRISP-DM
CRISP-DM Phases
 Cross Industry Standard Process for Data Mining
 People realized they needed a process to define data mining steps applicable across any industry, such as Retail, E-Commerce, Healthcare, etc.
 Conceived by Daimler-Benz and Integral Solutions Ltd in 1996
 6 high-level phases
 Iterative approach to the development of analytical models
CRISP-DM PHASES
1. Business understanding – What does the business need?
• Understand project objectives and requirements.
• Based on domain knowledge and business strategies.
2. Data understanding – What data do we have / need? Is it clean?
• Initial data collection and familiarization.
• Identify data quality issues.
• Identify initial obvious results.
3. Data preparation – How do we organize the data for modeling?
• Record and attribute selection.
• Data cleansing.
CRISP-DM PHASES
4. Modeling – What modeling techniques should we apply?
• Run the data mining tools.
5. Evaluation – Which model best meets the business objectives?
• Determine if results meet business objectives.
• Identify business issues that should have been addressed earlier.
6. Deployment – How do stakeholders access the results?
• Put the resulting models into practice.
• Set up for continuous mining of the data.
CRISP-DM PHASES AND TASKS
[Figure: the CRISP-DM phases and their generic tasks]
WHY CRISP-DM?
• CRISP-DM provides a uniform framework for planning and managing a project.
• The methodology is cost-effective: the processes for carrying out routine data mining tasks are well established across industry.
• CRISP-DM encourages best practices and allows projects to be replicated.
• Being a cross-industry standard, CRISP-DM can be implemented in any Data Science project irrespective of its domain.
TABLE OF CONTENTS
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-cycle
7 Challenges in Data Driven Decision-Making
SEMMA
• SAS Institute developed SEMMA as the process for data mining.
• 5 stages – Sample, Explore, Modify, Model, Assess
• Used to solve a wide range of business
problems, including fraud identification,
customer retention and turnover, database
marketing, customer loyalty, bankruptcy
forecasting, market segmentation, as well as
risk, affinity, and portfolio analysis.
SEMMA
• SEMMA is not a data mining methodology but rather a logical organization of the
functional tool set of SAS Enterprise Miner for carrying out the core tasks of data
mining.
• Enterprise Miner is a Data Mining Software to create predictive and descriptive
models for large volumes of data.
• Enterprise Miner can be used as part of any iterative data mining methodology
adopted by the client. Naturally steps such as formulating a well defined business or
research problem and assembling quality representative data sources are critical to
the overall success of any data mining project.
• SEMMA is focused on the model development aspects of data mining.
• SEMMA overlaps with Data Preparation, Modelling and Evaluation phases of CRISP-
DM
SEMMA STAGES
1. Sample
• Sampling the data by extracting a portion of a large data set big enough to contain the
significant information, yet small enough to manipulate quickly.
• Partitioning the data to create training and test samples.
• Identifying dependent and independent variables influencing the process.
2. Explore
• Exploration of the data by searching for unanticipated trends and anomalies in order to
gain understanding and ideas.
• Perform Univariate analysis (single variable) and multivariate analysis (relationships)
3. Modify
• Modification of the data by creating, selecting, and transforming the variables to focus
the model selection process.
SEMMA STAGES
4. Model
• Apply a variety of data mining techniques to produce a predicted model [ML, Deep Learning, Transfer Learning]
5. Assess
• Assessing the data by evaluating the usefulness and reliability of the findings from the data mining process, and estimating how well it performs.
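As a rough illustration of these five stages in code, the following is a minimal sketch with pandas and scikit-learn; the file name, column names and model choice are assumptions for illustration, not the official SAS tooling.

# Minimal SEMMA-style sketch (assumed file/column names)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("patients.csv")                   # hypothetical dataset

# Sample: extract a manageable portion and partition into train/test sets
sample = df.sample(frac=0.2, random_state=42)
train, test = [p.copy() for p in train_test_split(sample, test_size=0.3, random_state=42)]

# Explore: univariate statistics and pairwise correlations
print(train.describe())
print(train.corr(numeric_only=True))

# Modify: create/transform variables (example: binary flag from a numeric column)
for part in (train, test):
    part["is_senior"] = (part["age"] >= 60).astype(int)   # assumed 'age' column

# Model: fit a classifier on the predictor variables
features = ["age", "is_senior"]                    # assumed predictors
model = DecisionTreeClassifier(random_state=42)
model.fit(train[features], train["readmitted"])    # assumed target column

# Assess: evaluate usefulness/reliability on the held-out partition
print(accuracy_score(test["readmitted"], model.predict(test[features])))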
Advantages and Disadvantages
Advantages:
• Focuses only on the “model development aspects of data mining”
• Useful in most machine learning projects where the data comes from a single data source
Ex: Pima Indian Diabetes Dataset [Predict Diabetes], Titanic Dataset [Predict Passenger Survival] from Kaggle
Disadvantages:
• Does not take into account the business understanding of a problem
• Disregards Data Collection and Processing from different data sources

https://www.diva-portal.org/smash/get/diva2:1250897/FULLTEXT01.pdf
TABLE OF CONTENTS
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-cycle
7 Challenges in Data Driven Decision-Making
SMAM (Standard Methodology for Analytical Models)
http://www.datascienceassn.org/content/standard-methodology-analytical-models
SMAM PHASES
Phase – Description
• Use-case identification – Selection of the ideal approach from a list of candidates
• Model requirements gathering – Understanding the conditions required for the model to function
• Data preparation – Getting the data ready for the modeling
• Modeling experiments – Scientific experimentation to solve the business question
• Insight creation – Visualization and dash-boarding to provide insight
• Proof of Value: ROI – Running the model in a small-scale setting to prove the value
• Operationalization – Embedding the analytical model in operational systems
• Model life-cycle – Governance around model lifetime and refresh
SMAM Phases
Phase I - Use Case Identification
• Brainstorming of Business / Management / SMEs (Domain) / IT (Data Scientist)
teams
• Discussion revolves around:
• Business Needs
• Expert inputs on the domain
• Data Availability
• Analytical Model Complexity – time and effort
• Outcome: Selected Use Case and roadmap for next phases
SMAM Phases
Phase II – Model Requirements Gathering
• Involved parties include Business / End-users / Data Scientists / IT
• Preparation of Model Requirement Document
• Business requirements
• IT requirements
• End user requirements
• Scoring requirements
• Data requirements
• Analytical model requirements
SMAM Phases
Phase III – Data Preparation
• Involved parties include IT / Data Administrators / DBA / Data Modelers and Data
Scientists
• Discussion on:
• Data Access
• Data Location
• Data Understanding
• Data Validation
• Data format [prepared by DBAs and consumed by Data Scientist]
• The process is agile; the data scientist tries out various approaches on smaller sets and then may ask IT/DBAs to perform the required transformations at scale.
SMAM Phases
Phase IV – Modeling Experiments
• Data Scientist:
• Creates testable hypothesis
• Model features
• Creates analytical model
• Evaluates the analytical model
SMAM Phases
Phase V – Insight Creation
• Data Scientist:
• Analytical reporting [Inference] and Operational reporting [Prediction]
• Visualization and Dashboards
• Provide business usable insights
SMAM Phases
Phase VI – Proof of Value: ROI
• Quality of the analytical model is observed [Ex: Accuracy of the model is >90%]
• Analytical model is applied to new data and outcomes are measured to verify if
financially viable [for small POC].
• If ROI is positive for POC:
• Set up full-scale experiment with control groups
• Measure the model effectiveness
• Compute ROI and success criteria
• Involve Finance department / IT / End-users and Data Scientists in this phase
SMAM Phases
Phase VII – Operationalization
• Data Scientist works with IT department to create repeatable experimentation of
the model; hand-over process of the model
• IT prepare the Operational environment
• Audit structure
• Integration with existing / legacy applications
• Possible software development as Mobile / Web App for end-user usage
SMAM Phases
Phase VIII – Model Lifecycle
• Involves maintenance of the analytical model in-view of changing customer needs
• Two types of model changes:
a. Model Refresh – Model is trained with more recent data, leaving the model
structurally untouched
b. Model Upgrade – Initiated by availability of new data sources and a
business request to improve model performance.
• Involved parties include the operational team, IT team, Data Scientists, DBAs and end-users
TABLE OF CONTENTS
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-cycle
7 Challenges in Data Driven Decision-Making
BIG DATA ANALYTICS LIFE CYCLE
• Big data differs from traditional data primarily due to volume, velocity, variety, veracity and value
• A step-by-step methodology is required to acquire, process, analyze and visualize the big data
Book: Big Data Fundamentals – Concepts, Drivers & Techniques
https://www.informit.com/articles/article.aspx?p=2473128&seqNum=11
B I G D ATA A N A LY T I C S L I F E
C YC L E
Stage I : Business Case Evaluation
• Create a well-defined business case and get approval
• Identify KPIs that define the assessment criteria, to make business goals SMART
(specific, measurable, attainable, relevant, timely)
• Business case must qualify as a ‘big data’ problem – volume, velocity, variety,
veracity, value
• Outcome: Budget requirements, identify software (tools), hardware, training
requirements
BIG DATA ANALYTICS LIFE CYCLE
Stage II : Data Identification
• Identify the datasets required for the project and their sources
• Guideline: Identify as many sources as possible, which help gain insights
• Sources can be internal / external to the enterprise
• Internal – Data marts, Data warehouses or operational systems
• External – Data within Blogs, websites etc.
BIG DATA ANALYTICS LIFE CYCLE
Stage III : Data Acquisition and Filtering
• Data is gathered from all sources identified in the previous phase
• Data filtering is performed to remove corrupted / noisy data
• Corrupt – records with missing / nonsensical values / invalid data types
• Create metadata, helps in data provenance, accuracy and quality
• Dataset size & structure
• Source information
• Date and time of creation
• Language specific information
BIG DATA ANALYTICS LIFE CYCLE
Stage IV : Data Extraction
• Extract disparate data and transform it into a format that the underlying Big Data solution can use for the purpose of the data analysis.
[Figures: extraction of Latitude and Longitude from JSON; User Id and Comments extracted from an XML document]
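As a small illustration of this stage, the sketch below flattens location fields out of nested JSON records into tabular form with pandas; the record structure is an assumption for illustration.

# Sketch: extracting latitude/longitude from nested JSON into a table
import pandas as pd

records = [
    {"id": 1, "location": {"lat": 12.97, "lon": 77.59}, "text": "tweet one"},
    {"id": 2, "location": {"lat": 28.61, "lon": 77.21}, "text": "tweet two"},
]

# json_normalize flattens nested fields into columns like "location.lat"
df = pd.json_normalize(records)
df = df.rename(columns={"location.lat": "latitude", "location.lon": "longitude"})
print(df[["id", "latitude", "longitude"]])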
BIG DATA ANALYTICS LIFE CYCLE
Stage V : Data Validation and Cleansing
• Big data may receive redundant data across sources
• Redundancy can be used to interconnect datasets and fill in missing values
• The first value in Dataset B is validated against its corresponding value in Dataset A.
• The second value in Dataset B is not validated against its corresponding value in Dataset A.
• If a value is missing, it is inserted from Dataset A.
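A minimal pandas sketch of this validate-and-fill step, assuming two datasets share a common "id" key (all names and values are illustrative):

# Sketch: using redundancy across sources to validate and fill missing values
import pandas as pd

a = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 250.0, 80.0]})   # Dataset A
b = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 260.0, None]})   # Dataset B

merged = b.merge(a, on="id", suffixes=("_b", "_a"))

# Validate: flag values in B that disagree with their counterpart in A
merged["valid"] = merged["amount_b"] == merged["amount_a"]

# Fill: where B is missing a value, insert it from A
merged["amount_b"] = merged["amount_b"].fillna(merged["amount_a"])
print(merged)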
BIG DATA ANALYTICS LIFE CYCLE
Stage VI : Data Aggregation and Representation
• Integrating multiple datasets together to arrive at a unified view
• Involves joining datasets based on common fields such as ID or Date
• Semantic standardization (e.g., “Surname” and “Last name” – the same value labeled differently in different datasets)
• Represent using standard data format (row-oriented database)
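For example, a join on a shared key plus renaming a semantically identical column might look like this pandas sketch (all table and column names are assumptions):

# Sketch: aggregating two datasets into a unified view on a common key
import pandas as pd

claims = pd.DataFrame({"patient_id": [1, 2], "claim_amount": [500, 900]})
members = pd.DataFrame({"patient_id": [1, 2], "Surname": ["Rao", "Iyer"]})

# Semantic standardization: "Surname" and "last_name" carry the same meaning
members = members.rename(columns={"Surname": "last_name"})

# Join on the common field to arrive at a unified view
unified = claims.merge(members, on="patient_id", how="inner")
print(unified)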
BIG DATA ANALYTICS LIFE CYCLE
Stage VII : Data Analysis
• Perform EDA (Exploratory Data Analysis)
• Apply Analytics: Descriptive, Diagnostic, Predictive or Prescriptive
BIG DATA ANALYTICS LIFE CYCLE
Stage VIII : Data Visualization
• Use tools to graphically visualize and communicate the insights to business users
• Present Dashboards
• Excel, Tableau, Power BI etc.
BIG DATA ANALYTICS LIFE CYCLE
Stage IX : Utilization of Analysis Results
• Determining how and where the processed analysis data can be leveraged
• Results can be:
• Fed as input to enterprise systems (Customer analysis result fed into
OTT platform to assist recommendation)
• Refine the business process (Ex: Consolidate transportation routes as
part of supply chain process)
• Generate alerts (Send notification to users via Email or SMS about
impending events)
BIG DATA ANALYTICS LIFE CYCLE
CASE STUDY: Background
• Company X is an Insurance Company that deals with health and home insurance
• The company has a ‘Claim Management System’ which contains the claim data,
incident photographs and claim notes
• The company wants to invest in Big Data Analytics to “detect fraudulent claims in the
building sector”
• Let us see how the company uses the ‘Big Data Analytics’ Lifecycle to achieve the
objective of ‘detecting fraudulent claims in the building sector’

* Building Insurance is a type of Home insurance that covers the structure of the house from any kinds of danger or risks
Case Study: Detect Fraudulent Claims
Phase I: Business Case Evaluation
• Use case is important as it leads to decrease in monetary loss for Company X
• It covers ‘opportunistic fraud’ such as lying and exaggeration which covers
majority of insurance claim cases.
• KPI for success is set as – ‘reduction in fraudulent claims by 15%.’
• Regarding budget allocation and Infrastructure upgrade, Company X decides to
leverage Open Source Big Data Solution – Hadoop Ecosystem.
Case Study: Detect Fraudulent Claims
Phase II: Data Identification
• Internal datasets: Policy data, insurance application documents, claims data,
incident photographs, emails
• External datasets: Social Media Data (Twitter Feeds), Weather reports,
Geographical data (GIS), and census data.
• The claim data consists of historical claim data consisting of multiple fields where
one of the fields specifies if the claim was fraudulent or legitimate.
Case Study: Detect Fraudulent Claims
Phase III: Data Acquisition and Filtering
• Policy data obtained from Policy Administration System
• Claim data from Claims Management System
• Call center agent notes and emails from CRM system
• Social Media Data (Twitter Feeds), Weather reports, Geographical data (GIS),
and census data are obtained from third party vendors.
• To ensure provenance, each dataset is attached metadata such as dataset
name, source, size, format, acquired date and number of records.
• Batch filtering jobs to remove corrupt records in external datasets.
Case Study: Detect Fraudulent Claims
Phase IV: Data Extraction
• Tweets dataset is in JSON format: User Id, Timestamp and Tweet Text is
extracted into tabular form.
• Weather dataset is in XML format: Timestamp, Temperature Forecast, Wind
Speed Forecast, Wind Direction Forecast, Snow Forecast and Flood Forecast
parameters extracted into tabular form.
Case Study: Detect Fraudulent Claims
Phase V: Data Validation and Cleaning
• Check the extracted fields from Twitter and Weather datasets for typographical
errors, incorrect data, data type validation and range validation
Case Study: Detect Fraudulent Claims
Phase VI: Data Aggregation and Representation
• For meaningful analysis of data, join together policy data, claim data, call center
agent notes in a single dataset that is tabular, where each field can be
referenced through a user query.
• Resulting dataset is stored in NoSQL database.
Case Study: Detect Fraudulent Claims
Phase VII: Data Analysis
• Perform Exploratory Data Analysis
• This stage is repeated a number of times as the results generated after the first
pass are not conclusive enough to comprehend what makes a fraudulent claim
different from a legitimate claim.
• Machine learning models were developed using Naïve Bayes, Random Forest,
Decision Tree, Logistic Model Tree etc
• Metrics used: Accuracy, Precision, Recall, F-Measure, ROC
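To make these metrics concrete, below is a small scikit-learn sketch computing them for a hypothetical fraud classifier; the label and score vectors are made-up values, not case-study results.

# Sketch: computing the evaluation metrics named above on assumed labels
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                   # 1 = fraudulent claim (assumed)
y_pred  = [0, 0, 1, 0, 0, 1, 1, 1]                   # hard yes/no predictions
y_score = [0.1, 0.3, 0.8, 0.4, 0.2, 0.9, 0.6, 0.7]   # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))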
Case Study: Detect Fraudulent Claims
Phase VIII: Data Visualization
• The team has discovered some interesting findings and now needs to convey the
results to the actuaries, underwriters and claim adjusters.
• Different visualization methods are used including bar and line graphs and scatter
plots.
• Scatter plots are used to analyze groups of fraudulent and legitimate claims in the light
of different factors, such as customer age, age of policy, number of claims made
and value of claim.
Case Study: Detect Fraudulent Claims
Phase IX: Utilization of Analysis Results
• The machine learning model was incorporated into the existing claim
processing system to flag fraudulent claims.
BIG DATA LIFE-CYCLE - II
Phase 1: Foundations
Phase 2: Acquisition
Phase 3: Preparation
Phase 4: Input and Access
Phase 5: Processing
Phase 6: Output and Interpretation
Phase 7: Storage
Phase 8: Integration
Phase 9: Analytics and Visualization
Phase 10: Consumption
Phase 11: Retention, Backup, and Archival
Phase 12: Destruction

PS: Some phases may overlap and can be done in parallel.
BIG DATA LIFE-CYCLE
Phase 1: Foundations
• Understanding and validating data requirements, solution scope, roles and
responsibilities, data infrastructure preparation, technical and non-technical
considerations, and understanding data rules in an organization.
Phase 2: Data Acquisition
• Data Acquisition refers to collecting data.
• Data sets can be obtained from various sources, both internal and external to the
business organizations.
• Data sources can be in:
• structured forms, such as data transferred from a data warehouse, a data mart, or various transaction systems.
• semi-structured forms, such as weblogs and system logs.
• unstructured forms, such as media files consisting of videos, audios, and pictures.
BIG DATA LIFE-CYCLE
Phase 3: Data Preparation
• Collected data (Raw Data) is rigorously checked for inconsistencies, errors, and
duplicates.
• Redundant, duplicated, incomplete, and incorrect data are removed.
• The objective is to have clean and useable data sets.
Phase 4: Data Input and Access
• Data input refers to sending data to planned target data repositories, systems, or
applications.
• Data can be stored in CRM (Customer Relationship Management) application, a data
lake or a data warehouse.
• Data access refers to accessing data using various methods.
• NoSQL is widely used to access big data.
BIG DATA LIFE-CYCLE
Phase 5: Data Processing
• Processing the raw form of data.
• Convert data into a readable format giving it the form and the context.
• Interpret the data using the selected data analytics tools such as Hadoop MapReduce, Hive,
Pig, and Spark SQL.
• Data processing also includes activities
• Data annotation – refers to labeling the data.
• Data integration – aims to combine data existing in different sources, and provide a unified view
of data to the data consumers.
• Data representation – refers to the way data is processed, transmitted, and stored.
• Data aggregation – aims to compile data from databases to combined data-sets to be used for
data processing.
BIG DATA LIFE-CYCLE
Phase 6: Data Output and Interpretation
• In the data output phase, the data is in a format which is ready for consumption by the
business users.
• Transform data into usable formats such as plain text, graphs, processed images, or
video files.
• This phase is also called data ingestion.
• Common Big Data ingestion tools are Sqoop, Flume, and Spark streaming.
• Interpreting the ingested data requires analyzing it and extracting information or meaning from it, to answer the questions related to the Big Data business solutions.
BIG DATA LIFE-CYCLE
Phase 7: Data Storage
• Store data in designed and designated storage units.
• Storage infrastructure can consist of storage area networks (SAN), network-attached
storage (NAS), or direct access storage (DAS) formats.
Phase 8: Data Integration
• Integration of stored data to different systems for various purposes.
• Integration of data lakes with a data warehouse or data marts.
Phase 9: Data Analytics and Visualization
• Integrated data can be useful and productive for data analytics and visualization.
• Business value is gained in this phase.
BIG DATA LIFE-CYCLE
Phase 10: Data Consumption
• Data is turned into information ready for consumption by the internal or external users,
including customers of the business organization.
• Data consumption requires architectural input for policies, rules, regulations, principles, and guidelines.
Phase 11: Retention, Backup, and Archival
• Use established data backup strategies, techniques, methods, and tools.
• Identify, document, and obtain approval for the retention, backup, and archival
decisions.
Phase 12: Data Destruction
• There may be regulatory requirements to destroy particular types of data after a certain amount of time.
• Confirm the destruction requirements with the data governance team in business
organizations.
TABLE OF CONTENTS
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 Big Data Life-cycle
5 SEMMA
6 SMAM
7 Challenges in Data Driven Decision-Making
CHALLENGES IN DATA DRIVEN DECISION-MAKING
1. Discrimination
• Algorithmic discrimination can come from various sources.
• Data used to train algorithms may have biases that lead to discriminatory decisions.
• Discrimination may arise from the use of a particular algorithm.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
1. Racism embedded in US healthcare
In October 2019, researchers found that an algorithm used on more than 200
million people in US hospitals to predict which patients would likely need extra
medical care heavily favoured white patients over black patients. While race
itself wasn’t a variable used in this algorithm, another variable highly correlated
to race was, which was healthcare cost history. The rationale was that cost
summarizes how many healthcare needs a particular person has. For various
reasons, black patients incurred lower healthcare costs than white patients with
the same conditions on average.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
2. Amazon’s hiring algorithm
Amazon is one of the largest tech giants in the world, so it is no surprise that it is a heavy user of machine learning and artificial intelligence. In 2015, Amazon realized that its algorithm used for hiring employees was biased against women. The reason was that the algorithm was trained on the resumes submitted over the past ten years, and since most of the applicants were men, it learned to favor men over women.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
2. Lack of transparency
• Transparency refers to the capacity to understand a computational model and therefore
contribute to the attribution of responsibility for consequences derived from its use.
• A model is transparent if a person can easily observe it and understand it.
• Three types of opacity (i.e. lack of transparency) in algorithmic decisions
• Intentional opacity
• Knowledge opacity
• Intrinsic opacity
CHALLENGES IN DATA DRIVEN DECISION-MAKING
3. Violation of privacy
• Misuse of users’ personal data and on data aggregation by entities such as data
brokers, may have direct implications for people’s privacy. [Google faced Lawsuit
for Privacy Violation in 2020]
4. Digital literacy
• Devote resources to digital and computer literacy programs, from children to the elderly.
• This enables society to make informed decisions about technologies it would otherwise not understand. [Cases of Cyberbullying among the Juvenile population]
5. Fuzzy responsibility
• As more and more decisions that affect millions of people are made automatically by algorithms, we must be clear about who is responsible for the consequences of these decisions. Transparency is often considered a fundamental factor in the clarity of attribution of responsibility.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
6. Lack of ethical frameworks
• Algorithmic data-based decision-making processes generate important ethical dilemmas
regarding what actions are appropriate in light of the inferences made by algorithms.
• It is therefore essential that decisions be made in accordance with a clearly defined and
accepted ethical framework.
• There is no single method for introducing ethical principles into algorithmic decision
processes.
7. Lack of diversity
• Data-based algorithms and artificial intelligence techniques for decision-making have
been developed by homogeneous groups of IT professionals.
• Ensure that teams are diverse in terms of areas of knowledge as well as demographic
factors
REFERENCES
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
https://medium.com/illumination-curated/big-data-lifecycle-management-629dfe16b78d
https://www.esadeknowledge.com/view/7-challenges-and-opportunities-in-data-based-decision-making-193560
THANK YOU
Big Data Analysis Life Cycle – Use Cases
• What is precision agriculture? Why is it a likely answer to climate change and food security?
  https://www.youtube.com/watch?v=581Kx8wzTMc&ab_channel=Inter‐AmericanDevelopmentBank
• Innovating for Agribusiness
  https://www.youtube.com/watch?v=C4W0qSQ6A8U
• How Big Data Can Solve Food Insecurity
  https://www.youtube.com/watch?v=4r_IxShUQuA&ab_channel=mSTARProject
• AI for AgriTech ecosystem in India – IBM Research
  https://www.youtube.com/watch?v=hhoLSI4bW_4&ab_channel=IBMIndia
• Bringing Artificial Intelligence to agriculture to boost crop yield
  https://www.youtube.com/watch?v=GSvT940uS_8&ab_channel=MicrosoftIndia
• Artificial intelligence could revolutionize farming industry
  https://www.youtube.com/watch?v=cw3flTRrPts
DSECL ZG523
Introduction to Data Science
Dr.Vijayalakshmi
TABLE OF CONTENTS
1 Data Science Process
2 Case Study
DATA SCIENCE PROCESS
10 Questions the process aims to answer
• Problem to Approach
1 What is the problem that you are trying to solve?
2 How can you use data to answer the questions? CRISP-DM approach
• Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
• Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
Source: CognitiveClass
DATA SCIENCE PROCESS
[Diagram: the iterative methodology cycle – Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback]
DATA SCIENCE PROCESS
From Problem to Approach
• Business Understanding
• Analytic Approach
From Requirements to Collection
• Data Requirements
• Data Collection
From Understanding to Preparation
• Data Understanding
• Data Preparation
From Modeling to Evaluation
• Modeling
• Evaluation
From Deployment to Feedback
• Deployment
• Feedback
TABLE OF CONTENTS
1 Data Science Process
2 Case Study
HOSPITAL READMISSIONS
Image Source: https://medium.com/nwamaka-imasogie/predicting-hospital-readmission-using-nlp-5f0fe6f1a705
HOSPITAL READMISSIONS – SCENARIO
• Hospital Readmission is a common problem in the healthcare sector, wherein a patient after discharge gets re-admitted to the hospital because of the following reasons:
• Medication errors
• Medication noncompliance by the patient
• Fall injuries
• Lack of timely follow-up care
• Inadequate Nutrition
• Inadequate discussion on palliative care [relief from suffering]
• Infection
• Failure to identify post-acute care needs etc.
• Hospital readmissions may bring a bad name to the hospital / treating doctor / support staff, and lead to increased length of stay and expenditure for the hospital and the patient. Hence, it is a critical issue that needs addressing.
HOSPITAL READMISSIONS – SCENARIO
There is a limited budget for providing healthcare to the public.
Hospital readmissions for re-occurring problems are considered a sign of failure in the healthcare system.
There is a dire need to properly address the patient condition prior to the initial patient
discharge.
American Healthcare Insurance Provider, Health care authorities in the region & IBM Data
Scientists:
• What is the best way to allocate these funds to maximize their use in providing
quality care?

Source: CognitiveClass
FROM PROBLEM TO APPROACH
[Diagram: the methodology cycle, highlighting Business Understanding and Analytic Approach]
BUSINESS PROBLEM TO ANALYTIC APPROACH
Every Data Science activity is time-bound and involves cost and resources.
• The need to understand and prioritize the business goal.
• The way stakeholder support influences a project.
• The importance of selecting the right model.
• When to use a predictive, descriptive, or classification model.
Source: CognitiveClass
1. BUSINESS UNDERSTANDING (CONCEPT)
What is the problem that you are trying to solve?
• Identify the goal.
• Identify and define the objectives that support the goal.
• Ask questions.
• Seek clarifications.
• Get the stakeholder buy-in and support.
Source: CognitiveClass
CASE STUDY - 1. BUSINESS UNDERSTANDING
Case study asks the following question:
• What is the best way to allocate the limited healthcare
budget to maximize its use in providing quality care?
Goals and Objectives
• Define the GOALS.
• Provide quality care without increasing cost.
• Define the OBJECTIVES.
• Review the process to identify inefficiencies.

Source: CognitiveClass
CASE STUDY - 1. BUSINESS UNDERSTANDING
Examining hospital readmissions [Insurance Company + Hospitals + Data Scientists]
• It was found that approximately 30% of individuals who finish rehab treatment would be
readmitted to a rehab center within one year.
• 50% would be readmitted within five years.
• After reviewing some records, it was found that patients with heart failure were high on
the list of readmissions.
Source: CognitiveClass
CASE STUDY - 1. BUSINESS UNDERSTANDING
Data scientists proposed and organized an on-site workshop.
It was found that a decision tree model can be applied to investigate this scenario
to determine the reason for this phenomenon. [Why – Diagnostic Analytics]
The business sponsor’s involvement throughout the project was critical because the sponsor had:
• Set the overall direction
• Remained committed and advised
• When required, got the necessary support

Source: CognitiveClass
CASE STUDY - 1. BUSINESS UNDERSTANDING
Finally, four business requirements were identified for whatever model would be built
• Case study question
• What is the best way to allocate the limited healthcare budget to maximize its use
in providing quality care?
• Business requirements
• To predict the risk of readmission. [Predictive Analytics]
• To predict readmission outcomes for those patients with Congestive Heart Failure.
• To understand the combination of events that led to the predicted outcome.
• To apply an easy-to-understand process to new patients, regarding their
readmission risk.

Source: CognitiveClass
2. ANALYTIC APPROACH (CONCEPT)
 Available data: Patient data, Readmissions data, CHF data, etc.
 How can we use data to answer the questions?
 Choose the analytic approach based on the type of question:
• Descriptive (current data) – What happened?
• Diagnostic (statistical analysis) – Why is this happening?
• Predictive (forecasting) – What if these trends continue? What will happen next?
• Prescriptive – How do we solve it?
2. ANALYTIC APPROACH (CONCEPT)
The analytic approach can be selected once a clear understanding of the question is
established:
• If the question is to determine probabilities of an action.
• Predictive model can be used
• If the question is to show relationships between variables
• Use a Descriptive model
• If the question requires a yes / no answer
• Classification approach to predicting a response is appropriate.

Source: CognitiveClass
ANALYTIC APPROACH - DECISION TREE (CONCEPT)
What is a Decision Tree?
1. An algorithm that represents a set of questions and decisions using a tree-like structure.
2. It provides a procedure to decide what questions to ask, which order to ask them in, and when to ask them, in order to predict the value of an outcome.
ANALYTIC APPROACH - DECISION TREE (CONCEPT)
[Figure: example decision tree]
CASE STUDY - 2. ANALYTIC APPROACH
A decision tree classification model was used
to identify the combination of conditions leading
to each patient’s outcome.
Examining the variables in each of the nodes along each path to a leaf led to a respective threshold value to split the tree, e.g., Age >= 60.
A decision tree classifier provides both the predicted outcome, as well as the likelihood of that outcome, based on the proportion of the dominant outcome, yes or no, in each group.
Source: CognitiveClass
CASE STUDY - 2. ANALYTIC APPROACH
For non-data scientists, a decision tree
classification model is easy to understand and
apply, to score new patients for their risk of
readmission.
CASE STUDY - 2. ANALYTIC APPROACH
Clinicians can readily see what conditions are
causing a patient to be scored as high-risk.
Multiple models can be built and applied at
various points during hospital stay.
This gives a moving picture of the patient’s risk


and how it is evolving with the various
treatments being applied.
For these reasons, the decision tree approach
was chosen for building the Congestive Heart
Failure (CHF) readmission model.

Source: CognitiveClass
FROM DATA REQUIREMENTS TO DATA COLLECTION
[Diagram: the methodology cycle, highlighting Data Requirements and Data Collection]
3. DATA REQUIREMENTS (CONCEPT)
If our goal is to make a ”Biryani” but we don’t have the right ingredients, then the success of making a good Biryani will be compromised.
If the ”recipe” is the problem to be solved, then the data are the ingredients.
The data scientist must ask the following questions:
• What are the data requirements?


• How to obtain or collect them?
• How to understand and use them?
• How to prepare the data to meet the desired outcome?
Based on the understanding of the problem and the analytic approach chosen, it is
important to define the data requirements.

Source: CognitiveClass
CASE STUDY - 3. DATA REQUIREMENTS
The analytic approach is decision tree classification, so data requirements should be defined accordingly.
This involves:
• Identify data content
• Identify data formats
• Identify data sources needed for the initial data collection.

Source: CognitiveClass
CASE STUDY - 3. DATA REQUIREMENTS
• Data requirements for the case study included selecting a suitable list of
patients from the health insurance providers' member base.
• In order to put together patient clinical histories, three criteria were identified for selecting the patient cohort:
1 A patient must be admitted as an in-patient within the health insurance provider’s service area. [Complete medical history]
2 Patient’s primary diagnosis should be CHF for one full year.
3 Prior to the primary admission for CHF, a patient must have had at least 6 months of continuous enrollment.
Source: CognitiveClass
CASE STUDY - 3. DATA REQUIREMENTS
Disqualifying conditions (outliers)
• CHF patients who have been diagnosed with other serious conditions [comorbidities] are excluded, because this may result in above-average rates of readmission and may therefore distort results.
Source: CognitiveClass
CASE STUDY - 3. DATA REQUIREMENTS
Defining the data
The content and format suitable for the decision tree classifier need to be defined.
Format
• Transactional format
• This model requires one record per patient.
• Columns of the record represent dependent and predictor variables.
Content
• To model the readmission outcome, data should represent all aspects of the patient’s
clinical history.
• This includes:
• Authorizations
• Primary, secondary and tertiary diagnoses,
• Procedures, prescriptions and other services provided during hospitalization or visits
by patients / doctors.
CASE STUDY - 3. DATA REQUIREMENTS
A given patient can have thousands of records that represent all their attributes.
The data analytics specialists collected the transaction records from patient records and created a set of new variables to represent that information.
This was a task for the data preparation phase, so it is important to anticipate the next phases.
Source: CognitiveClass
4. DATA COLLECTION (CONCEPT)
What occurs during data collection?
• Decide whether more data or less data is required.
• Revise data requirements if needed.
• Assess content, quality and initial insights of data.
• Identify gaps in data.
• How to extract, merge and archive data?
[Diagram: data collection considerations – Content, Quality, Extract, Merge, Archive]
Source: CognitiveClass
4. DATA COLLECTION (CONCEPT)
Once data collection is completed, the Data Scientist performs an assessment to
make sure he has all the required data.
As with the purchase of ingredients for making a meal, some ingredients may be out
of season and more difficult to obtain or cost more than originally planned.
At this stage, the data requirements are reviewed and a decision is made as to
whether more or less data is required for the collection.
The gaps in the data are identified and plans for filling or replacement must be
made.
Once this step is complete, essentially, the ingredients are now ready for washing and
cutting.
Source: CognitiveClass
4. DATA COLLECTION (CONCEPT)
The collected data is explored using descriptive statistics and visualization to assess
its content and quality.

CASE STUDY - 4. DATA COLLECTION
This case study required data about:
• Demographics, clinical and coverage information of patients, provider information, claims
records, as well as pharmaceutical and other information related to all the diagnoses of
the CHF patients.
Available data sources
• Corporate data warehouse
• Single source of medical, claims, eligibility,
provider, and member information.
• In-patient record system
• Claim patient system
• Disease management program information

Source: CognitiveClass
CASE STUDY - 4. DATA COLLECTION
This case study also required other data that was not available:
• Pharmaceutical records
• Information on drugs
This data source was not yet integrated with
the rest of the data sources.
In such situations,
• It is okay to postpone decisions about
unavailable data and to try to capture them
later.
• This can happen even after obtaining
intermediate results from predictive
modeling.
• If the results indicate that drug information may be important for a good model, you will
spend time trying to get it.
CASE STUDY - 4. DATA COLLECTION
Data Pre-processing and Merging Data
• Database administrators and
programmers often work together to
extract data from different sources and
then combine them.
• Redundant data can be deleted and made


available to the next level of methodology – the
”Data Understanding” phase.
• At this stage, scientists and analysts can
discuss ways to better manage their data
by automating certain database processes
to facilitate data collection
Next, we move on to understanding the data
Source: CognitiveClass
FROM DATA UNDERSTANDING TO DATA PREPARATION
[Diagram: the methodology cycle, highlighting Data Understanding and Data Preparation]
FROM DATA UNDERSTANDING TO DATA PREPARATION
• The importance of descriptive statistics.
• How to manage missing, invalid, or misleading data?
• The need to clean data and sometimes transform data.
• The consequences of bad data for the model.
• Data understanding is iterative: the more we study the data, the more we learn.
Source: CognitiveClass
5. DATA UNDERSTANDING (CONCEPTS)
This section of the methodology answers the question:
• Is the data you collected representative of the problem to be solved?
Descriptive statistics
• Univariate statistics
• Pairwise correlation
• Histograms
Assess data quality
• Missing value
• Invalid data
• Misleading data
From the data collected, we should understand the variables and their characteristics
using Exploratory Data Analysis and Descriptive Statistics.
Sometimes we may have to perform pre-processing operations on the data.
Source: CognitiveClass
DATA UNDERSTANDING
First, Univariate Statistics
• Basic statistics included univariate statistics
for each variable, such as:
• mean, median, minimum, maximum,
and standard deviation
Second, Pairwise Correlations


• Pairwise correlations were used to
determine the degree of correlation between
the variables.
• Variables that are highly correlated
means they are essentially redundant.
• This makes only one variable relevant for
the modeling.
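A quick pandas sketch of both checks; the DataFrame contents and column names are assumptions for illustration.

# Sketch: univariate statistics and pairwise correlations with pandas
import pandas as pd

df = pd.DataFrame({
    "age": [54, 61, 70, 66, 48],
    "length_of_stay": [3, 5, 8, 6, 2],
    "prior_admissions": [0, 1, 3, 2, 0],
})

# Univariate statistics: mean, std, min, max and quartiles per variable
print(df.describe())

# Pairwise correlations: highly correlated variables are essentially redundant,
# so only one of such a pair needs to enter the model
print(df.corr())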
Source: CognitiveClass
DATA UNDERSTANDING
Third, Histograms
• The histograms of the variables were examined to understand their distributions.
• Histograms are a good way to understand
how values or variables are distributed.
• They help to know what kind of data
preparation may be needed to make
the variable more useful in a model.
• For example:
• it provides a visual interpretation of numerical
data by showing the number of data points that
fall within a specified range of values.
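A minimal matplotlib sketch of such a histogram; the sample values are assumptions for illustration.

# Sketch: histogram of a numeric variable to inspect its distribution
import matplotlib.pyplot as plt

ages = [54, 61, 70, 66, 48, 72, 65, 59, 80, 44, 67, 71]
plt.hist(ages, bins=6, edgecolor="black")   # counts of values per range (bin)
plt.xlabel("Age")
plt.ylabel("Number of patients")
plt.title("Distribution of patient age")
plt.show()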
Source: CognitiveClass
DATA UNDERSTANDING
Looking at data quality
• Univariate statistics and histograms are also used to assess the quality of the data.
• On the basis of the data provided, some values can be recoded or deleted if necessary.
• E.g., if a particular variable has a lot of missing values, we may drop the variable from
the model.
Sometimes a missing value means ”no” or ”0” (zero), or sometimes simply ”we do not
know”.
A variable contains invalid or misleading
values.
• E.g., A numeric variable called ”age”
containing 0 to 100 and 999, where ”triple-9”
actually means ”missing”, will be treated as
a valid value unless we have corrected it.
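As a small illustration, recoding such a sentinel value before analysis might look like this pandas sketch (the column name and values are assumptions):

# Sketch: recoding the sentinel value 999 ("missing") in an 'age' column
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 72, 999, 51, 999, 60]})

# Treat 999 as missing rather than as a valid age
df["age"] = df["age"].replace(999, np.nan)

print(df["age"].describe())   # statistics now exclude the recoded values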
Source: CognitiveClass
CASE STUDY - 5. DATA UNDERSTANDING
Data understanding is an iterative process.
• Originally, the meaning of CHF admission was decided on the basis of a primary
diagnosis of CHF.
• However, preliminary data analysis and clinical experience revealed that CHF
admissions were also based on other diagnoses.
• The initial definition did not cover all cases of CHF admissions.
• They added secondary and tertiary diagnoses, and created a more complete definition
of CHF admission.
• This is one example of the iterative processes in the methodology.
• The more we work with the problem and the data, the more we learn and the more the
model can be adjusted, which ultimately leads to a better resolution of the problem.

Source: CognitiveClass
6. DATA PREPARATION (CONCEPT)
In a way, data preparation is like removing dirt and washing vegetables.
Compared to data collection and understanding, data preparation is the most time
consuming phase – 70% to 90% of overall project time.
Automating data collection and preparation can reduce this to 50%.
The data preparation phase of the methodology answers the question:
• What are the ways in which data is prepared?
• Address missing or invalid values
• Remove duplicates
• Format data properly
Transforming data
• Process of getting data into a state where it may be easier to work with.
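A compact pandas sketch of these cleaning and transforming steps; the records and column names are assumptions for illustration.

# Sketch: addressing missing/invalid values, removing duplicates, fixing formats
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "gender": ["M", "M", "f", None],
    "claim_date": ["2021-01-05", "2021-01-05", "2021-02-30", "2021-02-10"],
})

df = df.drop_duplicates()                            # remove duplicate records
df["gender"] = df["gender"].fillna("U").str.upper()  # fill missing, standardize case
df["claim_date"] = pd.to_datetime(df["claim_date"], errors="coerce")  # invalid dates become NaT
print(df)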
Feature Engineering
Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
Source: CognitiveClass
6. DATA PREPARATION (CONCEPT)
Feature Engineering
• Process of using domain knowledge of data to create features that make ML algorithms work.
• A feature is a characteristic or property that might help in solving a problem.
• Feature engineering is also part of data preparation.
• The features in the data are important for the predictive models and influence the desired results.
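For instance, a domain-informed feature might be a comorbidity flag derived from raw diagnosis codes, as in this hypothetical pandas sketch (codes and column names are assumptions):

# Sketch: engineering comorbidity flags from raw diagnosis codes
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "diagnosis_codes": [["I50", "E11"], ["I10"], ["I50"]],  # ICD-10-style codes
})

# Domain knowledge: E11 indicates diabetes, I10 indicates hypertension
df["has_diabetes"] = df["diagnosis_codes"].apply(lambda codes: int("E11" in codes))
df["has_hypertension"] = df["diagnosis_codes"].apply(lambda codes: int("I10" in codes))
print(df)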
Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
Data Scientists need clarification on domain terms for data preparation
1. Defining Congestive Heart Failure [from a Data Scientist perspective]
• In the case study, the first step in the data preparation stage was to actually define what CHF means.
• First, the set of diagnosis-related group codes needed to be identified, as CHF implies
certain kinds of fluid buildup.
• Data scientists also needed to consider that CHF is only one type of heart failure.
• Clinical guidance was needed to get the right codes for CHF.
• CHF occurs when the heart muscle does not pump blood as much as it should. This leads to
fluid build up in the lungs.

Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
2. Defining re-admission criteria for Congestive Heart Failure
• The next step involved defining the criteria for CHF readmissions.
• The timing of events needed to be evaluated in order to define whether a particular CHF admission was an initial event (called an index admission) or a CHF-related re-admission.
• Based on clinical expertise, a time period of 30 days following the discharge from the initial admission was set as the readmission window relevant for CHF patients.
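As an illustration, flagging 30-day readmissions from admission and discharge dates could look like this pandas sketch (the admissions table is an assumption):

# Sketch: flagging readmissions within 30 days of the prior discharge
import pandas as pd

adm = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "admit_date": pd.to_datetime(["2021-01-01", "2021-01-20", "2021-03-01", "2021-06-01"]),
    "discharge_date": pd.to_datetime(["2021-01-05", "2021-01-25", "2021-03-04", "2021-06-07"]),
})

adm = adm.sort_values(["patient_id", "admit_date"])
prev_discharge = adm.groupby("patient_id")["discharge_date"].shift(1)
gap = adm["admit_date"] - prev_discharge               # NaT for each index admission
adm["readmit_30d"] = gap <= pd.Timedelta(days=30)      # NaT comparisons yield False
print(adm)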
Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
Aggregating transactional records
• Next, the records that were in transactional format were aggregated.
• Transactional format means that the data included multiple records for each patient.
• Transactional records included claims submitted for physician, laboratory, hospital, and
clinical services.
• Also included were records describing all the diagnoses, procedures, prescriptions, and
other information about in-patients and out-patients.

Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
Aggregating data to patient level
• A given patient could have hundreds or even thousands of records, depending on their
clinical history.
• All the transactional records were aggregated to the patient level, yielding a single
record for each patient.
• This is required for the decision-tree classification method used for modeling.
• Many new columns were created representing the information in the transactions.
• E.g: Frequency and most recent visits to doctors, clinics and hospitals with diagnoses,
procedures, prescriptions, and so forth.
• Co-morbidities with CHF were also considered, such as:
• Diabetes, hypertension, and many other diseases and chronic conditions that could impact
the risk of re-admission for CHF.
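A rough pandas sketch of this patient-level roll-up; the transactions table and its fields are assumptions for illustration.

# Sketch: rolling up transaction-level records to one record per patient
import pandas as pd

tx = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_date": pd.to_datetime(["2021-01-03", "2021-02-11", "2021-04-02",
                                  "2021-03-15", "2021-05-20"]),
    "diagnosis": ["CHF", "diabetes", "CHF", "hypertension", "CHF"],
})

patient = tx.groupby("patient_id").agg(
    n_visits=("visit_date", "count"),
    last_visit=("visit_date", "max"),
    has_diabetes=("diagnosis", lambda s: int((s == "diabetes").any())),
).reset_index()
print(patient)   # one row per patient, as the decision tree classifier requires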
Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
• Do we need more data or less data?
• A literature review on CHF was also undertaken to see whether any important data elements were overlooked, such as co-morbidities that had not yet been accounted for.
• The literature review involved looping back to the data collection stage to add a few more indicators for conditions and procedures.

Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
Creating new variables
• Aggregating the transactional data at the patient level involved merging it with the other patient data, including demographic information such as age, gender, type of insurance, and so forth.
• The result was the creation of one table containing a single record per patient.
• Columns represent the attributes about the patient in his or her clinical history.
• These columns would be used as variables in the predictive modeling.

Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
Completing the data set
Here is a list of the variables that were ultimately used in building the model:
• Measures
• Gender, Age, Primary Diagnosis Related Group (DRG), Length of Stay, CHF Diagnosis Importance (primary, secondary, tertiary), Prior admissions, Line of business.
• Diagnosis Flags (Y/N)
• CHF, Atrial fibrillation, Pneumonia, Diabetes, Renal failure, Hypertension.
Dependent Variable
• CHF readmission within 30 days following discharge from CHF hospitalization
(Yes/No).

Source: CognitiveClass
CASE STUDY - 6. DATA PREPARATION
Creating training and testing datasets
• The data preparation stage resulted in a cohort of 2,343 patients.
• These patients met all of the criteria for this case study.
• The data (patient records) were then split into training and testing sets for building and validating the model, respectively.
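A minimal scikit-learn sketch of such a split; the cohort file and label column are assumptions, and a stratified split keeps the readmission rate similar in both sets.

# Sketch: splitting patient records into training and testing sets
import pandas as pd
from sklearn.model_selection import train_test_split

patients = pd.read_csv("chf_cohort.csv")      # hypothetical prepared cohort

train, test = train_test_split(
    patients,
    test_size=0.3,                            # 70% to build, 30% to validate
    stratify=patients["readmit_30d"],         # preserve the outcome ratio
    random_state=42,
)
print(len(train), len(test))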

Source: CognitiveClass
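A minimal sketch of this split using scikit-learn's train_test_split; the toy patient table and column names are hypothetical.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical patient-level table: predictors plus the dependent variable
# (readmitted within 30 days: yes = 1, no = 0).
df = pd.DataFrame({
    "age": [67, 72, 55, 80, 61, 74],
    "prior_admissions": [1, 3, 0, 2, 1, 4],
    "readmitted_30d": [0, 1, 0, 1, 0, 1],
})
X = df.drop(columns="readmitted_30d")
y = df["readmitted_30d"]

# Hold out a test set for validating the model built on the training set;
# stratify=y keeps the yes/no proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(len(X_train), "training rows,", len(X_test), "test rows")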
FROM DATA MODELING TO EVALUATION

[Methodology cycle: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Data Modeling → Evaluation → Deployment → Feedback]

The difference between descriptive and predictive models.
The role of training sets and test sets.
The importance of asking if the question has been answered.
Why diagnostic measure tools are needed.
The purpose of statistical significance tests [hypothesis testing].
That modeling and evaluation are iterative processes.

Source: CognitiveClass
7. DATA MODELING (CONCEPT)
In what way can the data be visualized to get to the answer that is required?
Modeling is based on the analytic approach.
Data modeling focuses on developing models that are either descriptive or predictive.
• Descriptive Models
  • What happened?
  • Use statistics.
• Predictive Models
  • What will happen?
  • Use machine learning.
  • Try to generate yes/no type outcomes.
• A training set is used for developing the predictive model.
• Training set
  • Contains historical data in which the outcomes are already known.
  • Acts like a gauge to determine if the model needs to be calibrated.
Source: CognitiveClass
7. DATA MODELING (CONCEPT)
• The data scientist will try different algorithms to ensure that the variables in play are actually required.
• The success of data compilation, preparation and modeling depends on the understanding of the problem and the analytical approach being taken.
• Like the quality of ingredients in cooking, the quality of data sets the stage for the outcome. If data quality is bad, the outcome will be bad.
• Constant refinement, adjustment, and tweaking within each step are essential to ensure a solid outcome.
• The end goal is to build a model that can answer the original question.

Source: CognitiveClass
7. DATA MODELING – Concept of Confusion Matrix

Since data modeling for the case study involves a confusion matrix, let us first understand the concept.
Source: CognitiveClass
CASE STUDY - 7. DATA MODELING
A decision tree to predict CHF readmission is built.
In this first model, the default relative misclassification cost of 1-to-1 is used.
The overall accuracy in classifying the yes and no outcomes was 85%.
This sounds good, but it represents only 45% of the "yes" outcomes.
• Meaning, when it's actually YES, the model predicted YES only 45% of the time.
The question is:
• How could the accuracy of the model be improved in predicting the yes outcome?
Source: CognitiveClass
CASE STUDY - 7. DATA MODELING
• There are many aspects to model building – one of those is parameter tuning to improve the model.
• With a prepared training set, the first decision tree classification model for CHF readmission can be built.
• We are looking for patients with high-risk readmission, so the outcome of interest will be CHF readmission equals "yes".
• For decision tree classification, the best parameter to adjust is the relative cost of misclassified yes and no outcomes (a scikit-learn sketch follows below).
Source: CognitiveClass
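scikit-learn's decision tree does not take a misclassification-cost matrix directly; a common stand-in for the relative-cost parameter is class_weight, which makes errors on the weighted class more expensive. A sketch on synthetic data (the case study's data is not public):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Synthetic imbalanced data standing in for the CHF table.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Weighting class 1 (readmission = yes) 9x or 4x plays the role of the
# 9-to-1 and 4-to-1 relative misclassification costs discussed below.
for w in [{0: 1, 1: 1}, {0: 1, 1: 9}, {0: 1, 1: 4}]:
    tree = DecisionTreeClassifier(class_weight=w, max_depth=5, random_state=0)
    tree.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, tree.predict(X_te)).ravel()
    print(w, "tn fp fn tp =", tn, fp, fn, tp)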
CASE STUDY - 7. DATA MODELING
Type I Error or False Positive
• When a true non-readmission is misclassified, and action is taken to reduce that patient's risk, the cost of that error is the wasted intervention.
Type II Error or False Negative
• When a true readmission is misclassified, and no action is taken to reduce that risk, the cost of this error is the readmission and all its attendant costs, plus the trauma to the patient.
The costs of the two different kinds of misclassification errors can be quite different.
• Adjust the relative weights of misclassifying the yes and no outcomes.
For decision tree classification, the best parameter to adjust is the relative cost of misclassified yes and no outcomes.

Source: CognitiveClass
CASE STUDY - 7. DATA MODELING
For the second model, the relative cost was set at 9-to-1.
• That is, the cost of a misclassified yes (a false negative) is set at nine times the cost of a misclassified no (a false positive).
• This is a very high ratio, but it gives more insight into the model's behavior.
This time the model correctly classified 97% of the YES outcomes, but at the expense of a very low accuracy on the NO outcomes, with an overall accuracy of only 49%.
This was clearly not a good model.
The problem with this outcome is the large number of false positives.
• A true non-readmission is misclassified as a readmission.
• This would recommend unnecessary and costly intervention for patients who would not have been re-admitted anyway.
CASE STUDY - 7. DATA MODELING
Try again to find a better balance between the yes and no accuracies.
For the third model, the relative cost was set at 4-to-1.
This time, the overall accuracy was 81%.
Yes accuracy was 68%. This is called sensitivity.
No accuracy was 85%. This is called specificity.
This is the optimum balance that can be obtained with a rather small training set, by adjusting the relative cost of misclassified yes and no outcomes.
In medical diagnosis:
• Test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate).
• Test specificity is the ability of the test to correctly identify those without the disease (true negative rate).
CONFUSION MATRIX
A confusion matrix is a table that is often used to evaluate the performance of a classification model (or "classifier").
It works on a set of test data for which the true values are known.
There are two possible predicted classes: "YES" and "NO".
If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.
• The classifier made a total of 165 predictions, i.e., 165 patients were being tested for the presence of that disease.
• Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.

N = 165     | Predicted: No | Predicted: Yes
Actual: No  | 50            | 10
Actual: Yes | 5             | 100
CONFUSION MATRIX
True positives (TP):
• The model predicted yes, and the patients have the disease.
True negatives (TN):
• The model predicted no, and the patients don't have the disease.
False positives (FP) / Type I error:
• The model predicted YES, but the patients don't actually have the disease.
False negatives (FN) / Type II error:
• The model predicted NO, but the patients actually have the disease.

N = 165     | Predicted: No | Predicted: Yes
Actual: No  | TN = 50       | FP = 10
Actual: Yes | FN = 5        | TP = 100
CONFUSION MATRIX
Term | Description | Calculation
Accuracy | Overall, how often is the classifier correct? | (TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate (Error Rate) | Overall, how often is it wrong? Equivalent to 1 minus Accuracy | (FP+FN)/total = (10+5)/165 = 0.09
True Positive Rate (Sensitivity or Recall) | When it's actually YES, how often does it predict YES? | TP/actual YES = 100/105 = 0.95
True Negative Rate (Specificity) | When it's actually NO, how often does it predict NO? Equivalent to 1 minus False Positive Rate | TN/actual NO = 50/60 = 0.83
CONFUSION MATRIX
Term | Description | Calculation
False Positive Rate (Type I Error) | When it's actually NO, how often does it predict YES? | FP/actual NO = 10/60 = 0.17
True Negative Rate (Specificity) | When it's actually NO, how often does it predict NO? Equivalent to 1 minus False Positive Rate | TN/actual NO = 50/60 = 0.83
Precision | When it predicts YES, how often is it correct? | TP/predicted YES = 100/110 = 0.91
Prevalence | How often does the YES condition actually occur in our sample? | Actual YES/total = 105/165 = 0.64
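These definitions can be checked directly in Python using the counts from the example above:

# Counts from the example confusion matrix.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN              # 165

accuracy = (TP + TN) / total           # 150/165 = 0.91
error_rate = (FP + FN) / total         # 15/165 = 0.09
sensitivity = TP / (TP + FN)           # TPR / recall = 100/105 = 0.95
specificity = TN / (TN + FP)           # TNR = 50/60 = 0.83
fpr = FP / (TN + FP)                   # 10/60 = 0.17
precision = TP / (TP + FP)             # 100/110 = 0.91
prevalence = (TP + FN) / total         # 105/165 = 0.64

print(round(accuracy, 2), round(sensitivity, 2),
      round(specificity, 2), round(precision, 2))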
8. EVALUATION (CONCEPT)
The quality of the developed model is assessed.
Before the model gets deployed, evaluate whether the model really answers the initial question.

8. EVALUATION (CONCEPT)
Two phases:
Diagnostic measure phase
• Ensures that the model works as intended.
• If the model is a predictive model:
  • A decision tree can be used to assess whether the response provided by the model matches the original design.
  • This allows areas where adjustments are required to be displayed.
• If the model is a descriptive model that evaluates relationships:
  • A set of tests with known results can be applied and the model refined as necessary.
Statistical significance phase
• Applied to the model to ensure that the data is being properly handled and interpreted within the model.
• This is designed to avoid unnecessary second-guessing when the answer is revealed.
CASE STUDY - 8. EVALUATION
One way is to find the optimal model through a diagnostic measure based on tuning one of the parameters in model building.
Specifically, we'll see how to tune the relative cost of misclassifying yes and no outcomes.
Four models were built with four different relative misclassification costs.
Each value of this model-building parameter increases the true positive rate (the accuracy in predicting yes) at the expense of lower accuracy in predicting no, that is, an increasing false-positive rate.
Source: CognitiveClass
CASE STUDY - 8. EVALUATION
Which model is best based on tuning this parameter?
Risk-reducing intervention – two scenarios:
• It cannot be applied to all CHF patients, because many of them would not have been readmitted anyway; that would not be cost-effective.
• The intervention itself would not be as effective in improving patient care if not enough high-risk CHF patients are targeted.
How do we determine which model was optimal?
• This can be done with the help of an ROC curve (receiver operating characteristic curve).
The ROC curve is a graph showing the performance of a classification model at all classification thresholds.
The ROC curve plots two parameters:
• True Positive Rate
• False Positive Rate
RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
ROC curves are used to show the connection/trade-off between clinical sensitivity and
specificity for every possible cut-off (threshold) for a test or a combination of tests.
The area under an ROC curve is a measure of the usefulness of a test in general.
• A greater area means a more useful test.
ROC curves are used in clinical biochemistry to choose the most appropriate cut-off for a
test.
The best cut-off has the highest true positive rate together with the lowest false
positive rate.
ROC curves were first employed in the study of discriminator systems for the detection of
radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor.
The initial research was motivated by the desire to determine how the US RADAR
”receiver operators” had missed the Japanese aircraft.
RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
• An ROC curve plots TPR vs. FPR at different classification thresholds.
• Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
• The optimal model is the one giving the maximum separation between the blue ROC curve and the red baseline.
• This curve quantifies how well a binary classification model performs, by classifying the yes and no outcomes as some discrimination criterion is varied.
• In this case, the criterion is a relative misclassification cost.
Source: CognitiveClass
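A minimal sketch of plotting an ROC curve with scikit-learn; the synthetic data and logistic-regression scorer are illustrative stand-ins for the case study's decision-tree models.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Score each test case, then compute TPR/FPR at every threshold.
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)

plt.plot(fpr, tpr, label="AUC = %.2f" % roc_auc_score(y_te, scores))
plt.plot([0, 1], [0, 1], "r--", label="baseline (random)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()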
CASE STUDY - 8. EVALUATION
We can see that model 3, with a relative misclassification cost of 4-to-1, is the best of the 4 models.
Source: CognitiveClass
FROM DEPLOYMENT TO FEEDBACK

[Methodology cycle: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Data Modeling → Evaluation → Deployment → Feedback]

The importance of stakeholder input.
To consider the scale of deployment.
The importance of incorporating feedback to refine the model.
The refined model must be redeployed.
This process should be repeated as often as necessary.

Source: CognitiveClass
9. DEPLOYMENT (CONCEPT)

• Making the model relevant and useful for the initial question involves getting the stakeholders familiar with the tool produced.
• Once the model is evaluated and approved by the stakeholders, it is deployed and put to the ultimate test.
• The model may be rolled out to a limited group of users or in a test environment, to build up confidence in applying the outcome for use across the board.

Source: CognitiveClass
CASE STUDY - 9. DEPLOYMENT
Understanding the results
• In preparation for model deployment, the next step was to assimilate the knowledge for the business group who would be designing and managing the intervention program to reduce readmission risk.
• In this scenario, the business people translated the model results so that the clinical staff could understand how to identify high-risk patients and design suitable intervention actions.
• The goal was to reduce the likelihood that these patients would be readmitted within 30 days after discharge.
• During the business requirements stage, the Intervention Program Director and her team had wanted an application that would provide automated, near real-time risk assessments of congestive heart failure.

CASE STUDY - 9. DEPLOYMENT
Gathering application requirements
• The application also had to be easy for clinical staff to use, preferably a browser-based application on a tablet that each staff member could carry around.
• The patient data was generated throughout the hospital stay. It would be automatically prepared in a format needed by the model, and each patient would be scored near the time of discharge.
• Clinicians would then have the most up-to-date risk assessment for each patient, helping them to select which patients to target for intervention after discharge.
• As part of solution deployment, the Intervention team would develop and deliver training for the clinical staff.

CASE STUDY - 9. DEPLOYMENT
Additional Requirements
• Processes for tracking and monitoring patients receiving the intervention would have to be developed in collaboration with IT developers and database administrators, so that the results could go through the feedback stage and the model could be refined over time.
Source: CognitiveClass
10. FEEDBACK (CONCEPT)
Feedback from users is used to refine the model.
Assess the model for performance and impact.
The value of the model will be dependent on successfully incorporating feedback and making adjustments for as long as the solution is required.
Throughout the Data Science Methodology, each step sets the stage for the next.
This makes the methodology cyclical, ensuring refinement at each stage.
Once the model has been evaluated and the data scientist trusts that it will work, it is deployed and undergoes the final test: its real use in real time in the field.
Source: CognitiveClass
CASE STUDY - 10. FEEDBACK
The feedback stage included these steps:
1. The review process would be defined and put into place, with overall responsibility for measuring the results of applying the model to the CHF patient population. Clinical management executives would have overall responsibility for the review process.
2. CHF patients receiving intervention would be tracked and their re-admission outcomes recorded.
3. The intervention would then be measured to determine how effective it was in reducing re-admissions.

Source: CognitiveClass
CASE STUDY - 10. FEEDBACK
For ethical reasons, CHF patients would not be split into control and treatment groups.
Instead, readmission rates would be compared before and after the implementation of the model to measure its impact.
After the deployment and feedback stages, the impact of the intervention program on re-admission rates would be reviewed after the first year of its implementation.
Then the model would be refined, based on all of the data compiled after model implementation and the knowledge gained throughout these stages.

Source: CognitiveClass
CASE STUDY - 10. FEEDBACK

Redeployment
The intervention actions and processes would be reviewed and very likely refined as well, based on the experience and knowledge gained through initial deployment and feedback.
Finally, the refined model and intervention actions would be redeployed, with the feedback process continued throughout the life of the intervention program.

Source: CognitiveClass
DATA SCIENCE PROCESS - SUMMARY
Learn the importance of
• Understanding the question
• Picking the most effective analytic approach
Learn to work with data (iterative stages)
• Determine the data requirements
• Collect the appropriate data
• Understand the data
• Prepare the data for modeling
Learn how to
• Evaluate and deploy the model
• Get feedback on it
• Use the feedback constructively so as to improve the model

DATA SCIENCE PROCESS - SUMMARY

Think like a data scientist (iterative stages):
• Forming a concrete business or research problem
• Collecting and analyzing data
• Building a model
• Understanding the feedback after model deployment
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE #4: DATA SCIENCE TEAMS
IDS Course Team
BITS Pilani
The instructor is gratefully acknowledging
the authors who made their course
materials freely available online.



TABLE OF CONTENTS

1 DATA SCIENCE TEAMS



DATA-DRIVEN DECISION MAKING
Use case: Airbnb
Experiment.
) Find ways to put data into new projects using an established Learn-Plan-Test-Measure process.
Democratize data.
) Scale a data science team to the whole company and even clients.
Measure the impact.
) Evaluate what part DS teams have in your decision-making process and give them credit for it.

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [1/6]

[1] Chief Analytics Officer / Chief Data Officer
) CAO, a "business translator," bridges the gap between data science and domain expertise, acting both as a visionary and a technical lead.
) Preferred skills: data science and analytics, programming skills, domain expertise, leadership and visionary abilities.
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [2/6]

[2] Data analyst
) The data analyst role implies proper data collection and interpretation activities.
) An analyst ensures that collected data is relevant and exhaustive while also interpreting the analytics results.
) Some companies (e.g. IBM or HP) may require data analysts to have visualization skills to convert alienating numbers into tangible insights through graphics.
) Preferred skills: R, Python, JavaScript, C/C++, SQL
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [3/6]
[3] Business analyst
) A business analyst basically realizes a CAO's functions but on the operational level.
) This implies converting business expectations into data analysis.
) If your core data scientist lacks domain expertise, a business analyst bridges this gulf.
) Preferred skills: data visualization, business intelligence, SQL.

[4] Data scientist
) A data scientist is a person who solves business tasks using machine learning and data mining techniques.
) The role can be narrowed down to data preparation and cleaning with further model training and evaluation.
) Preferred skills: R, SAS, Python, Matlab, SQL, NoSQL, Hive, Pig, Hadoop, Spark
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [4/6]
The job of a data scientist is often divided into two roles:

[4A] Machine Learning Engineer
) A machine learning engineer combines software engineering and modeling skills by determining which model to use and what data should be used for each model.
) Probability and statistics are also their forte.
) Training, monitoring, and maintaining a model.
) Preferred skills: R, Python, Scala, Julia, Java

[4B] Data Journalist
) Data journalists help make sense of data output by putting it in the right context.
) Articulating business problems and shaping analytics results into compelling stories.
) They present the idea to stakeholders and represent the data team with those unfamiliar with statistics.
) Preferred skills: SQL, Python, R, Scala, Carto, D3, Tableau
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [5/6]
[5] Data architect
) Working with Big Data.
) This role is critical to warehouse the data, define the database architecture, centralize data, and ensure integrity across different sources.
) Preferred skills: SQL, NoSQL, XML, Hive, Pig, Hadoop, Spark

[6] Data engineer
) Data engineers implement, test, and maintain the infrastructural components that data architects design.
) Realistically, the role of an engineer and the role of an architect can be combined in one person.
) Preferred skills: SQL, NoSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [6/6]

[7] Application/data visualization engineer
) This role is only necessary for a specialized data science model.
) An application engineer or other developers from front-end units will oversee end-user data visualization.
) Preferred skills: programming, JavaScript (for visualization), SQL, NoSQL.

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
DATA SCIENTIST [1/2]
Data scientists are responsible for discovering insights from massive amounts of structured and unstructured data to help shape or meet specific business needs and goals.
Role
) The main objective is to organize and analyze large amounts of data, often using software specifically designed for the task.
Responsibility
) The chief responsibility is data analysis, a process that begins with data collection and ends with business decisions made on the basis of the data scientist's final data analytics results.

h tt p s : / / w w w. c i o . c o m / a r ti c l e / 3 2 1 7 0 2 6 / w h a t - i s - a - d a ta - s c i e n ti s t- a - ke y - d a ta - a n a l y ti c s - r o l e - a n d - a - l u c r a ti v e - c a r e e r. h t m l
DATA SCIENTIST [2/2]
Stitch Fix's Michael Hochster defines two types of data scientists:
Type A stands for Analysis
) This person is a statistician who makes sense of data without necessarily having strong programming knowledge.
) Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc.
Type B stands for Building
) These folks use data in production.
) They're excellent software engineers with some statistics background who build recommendation systems, personalization use cases, etc.
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
DATA SCIENTIST REQUIREMENTS - INDUSTRY-WISE
Business
) Data analysis of business data can inform decisions around efficiency, inventory, production errors, customer loyalty and more.
E-commerce
) Improve customer service, find trends and develop services or products.
Finance
) Data on accounts, credit and debit transactions and similar financial data; also security and compliance, including fraud detection.
Government
) Inform decisions, support constituents and monitor overall satisfaction, security and compliance.
Science
) Collect, share and analyze data from experiments in a better way.
h tt p s : / / w w w. c i o . c o m / a r ti c l e / 3 2 1 7 0 2 6 / w h a t - i s - a - d a ta - s c i e n ti s t- a - ke y - d a ta - a n a l y ti c s - r o l e - a n d - a - l u c r a ti v e - c a r e e r. h t m l
DATA SCIENTIST REQUIREMENTS - INDUSTRY-WISE
Social networking
) targeted advertising, improve customer satisfaction, establish trends in location data
and enhance features and services.
) Ongoing data analysis of posts, tweets, blogs and other social media can
help businesses constantly improve their services.
Healthcare
) Electronic medical records require a dedication to big data, security and compliance.
) Improve health services and uncover trends that might go unnoticed otherwise.
Telecommunications
) All electronics collect data, and all that data needs to be stored, managed,
maintained and analyzed.
) Data scientists help companies to improve products and keep customers happy by
delivering the features they want.
h tt p s : / / w w w. c i o . c o m / a r ti c l e / 3 2 1 7 0 2 6 / w h a t - i s - a - d a ta - s c i e n ti s t- a - ke y - d a ta - a n a l y ti c s - r o l e - a n d - a - l u c r a ti v e - c a r e e r. h t m l
SKILL SET FOR A DATA SCIENTIST
PROGRAMMING: The most fundamental part of a data scientist's skill set. Programming improves your statistics skills, helps you "analyze large datasets" and gives you the ability to create your own tools.
QUANTITATIVE ANALYSIS: Improves your ability to run experimental analysis, scale your data strategy and implement machine learning.
PRODUCT INTUITION: Understanding products will help you perform quantitative analysis. It will also help you predict system behavior, establish metrics and improve debugging skills.
COMMUNICATION: Strong communication skills will help you "leverage all of the previous skills listed."
TEAMWORK: It requires being selfless, embracing feedback and sharing your knowledge with your team.

- William Chen, Data Science Manager at Quora
SKILL SET OF A DATA SCIENTIST
[Figure: skill set of a data scientist]
DATA SCIENCE TEAM BUILDING

Get to know each other for better communication.
Foster team cohesion and teamwork.
Encourage collaboration to boost team productivity and performance.
https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
ORGANISATION OF DATA SCIENCE TEAM
[1] Decentralized
) Data scientists report into specific business units (ex: Marketing) or functional units (ex: Product Recommendations) within a company.
) Resources are allocated only to projects within their silos, with no view of analytics activities or priorities outside their function or business unit.
) Analytics are scattered across the organization in different functions and business units.
) Little to no coordination.
) Drawback – can lead to isolated teams.
ORGANISATION OF DATA SCIENCE TEAM

[2] Functional
) Resource allocation is driven by a functional agenda rather than an enterprise agenda.
) Analysts are located in the functions where the most analytical activity takes place, but may also provide services to the rest of the corporation.
) Little coordination.

ORGANISATION OF DATA SCIENCE TEAM

[3] Consulting
) Resources are allocated based on availability, on a first-come, first-served basis, without necessarily aligning to enterprise objectives.
) Analysts work together in a central group but act as internal consultants who charge "clients" (business units) for their services.
) No centralized coordination.

ORGANISATION OF DATA SCIENCE TEAM
[4] Centralized
) Data scientists are members of a core group, reporting to a head of data science or analytics.
) Stronger ownership and management of resource allocation and project prioritization within a central pool.
) Analysts reside in a central group, where they serve a variety of functions and business units and work on diverse projects.
) Coordination by a central analytic unit.
) Challenge – hard to assess and meet demands for incoming data science projects (especially in smaller organizations).
ORGANISATION OF DATA SCIENCE TEAM

[5] Center of Excellence
) Better alignment of analytics initiatives and resource allocation to enterprise priorities, without operational involvement.
) Analysts are allocated to units throughout the organization and their activities are coordinated by a central entity.
) Flexible model with the right balance of centralized and distributed coordination.
ORGANISATION OF DATA SCIENCE TEAM

[6] Federated
) Same as the "Center of Excellence" model, with need-based operational involvement to provide SME support.
) A centralized group of advanced analysts is strategically deployed to enterprise-wide initiatives.
) Flexible model with the right balance of centralized and distributed coordination.
Building an Analytics-Driven Organization, Accenture
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
INTRODUCTION TO DATA SCIENCE
MODULE #3: DATA SCIENCE PROPOSAL
IDS Course Team
BITS Pilani
TABLE OF CONTENTS

1 DATA SCIENCE PROPOSAL



WHAT IS A DATA SCIENCE PROPOSAL?
As a Data Scientist, there are occasions when proposals need to be written for data science projects.
Origin of proposals at Microsoft:
A. Business-led Proposal
• Business teams come with requirements
• Ex: Product Engineering Team on how to prioritize customer feedback screening
B. Data science-led Innovation
• From the Data Science team
• Ex: How to maximize customer satisfaction for Azure
C. Data science-led Systemic Solutions
• What is the impact of 'x' on business
• Ex: 'x' can be a marketing campaign or a new service launch
https://medium.com/data-science-at-microsoft/managing-a-data-science-project-87945ff79483



QUESTIONNAIRE TO PREPARE PROPOSAL
1. What is the business problem we are trying to solve?
2. Write an exact definition. Identify the type of the problem.
3. Are we addressing a specific problem or a problem specific to a team? Is it a generic
problem across all business? (help to create certain frameworks or accelerators)
4. Who is the target audience?
5. How do you evaluate your solution outcome? Are there any evaluation metrics available?
6. What are the acceptance criteria for the solution? (e.g. for a classification task, accuracy should be above 65%)



QUESTIONNAIRE TO PREPARE PROPOSAL
Business Understanding
• What is the business problem we are trying to solve?
• Write an exact definition.
• Is it a prediction problem?
→ e.g. predicting company’s profit in next quarter.
• Are we doing a segmentation?
→ e.g. a customer segmentation for targeted marketing.
• Are we going to recommend something say a product to the
user?
• Is it anomaly detection or a fraud detection problem?
• Is it an optimization problem?
→ e.g. optimizing revenue of a company.



1. PREDICTION
• Classification
• Given a new individual observation, predicts which class it belongs to.
• e.g. whether a credit card customer will default or not, given data such as credit card balance, income, etc.
• Covid Discharge Status, viz., (Recovered, Expired, Shifted)
• Social media sentiment analysis to determine the emotion behind user-generated content
• Regression
• Given a new individual observation, estimates the value of a particular variable specific to that
individual.
• e.g. predicting the revenue for the next quarter
• Predicting the price of a house, given locality details



1. PREDICTION... CONTD...

• Scoring or Class Probability Estimation


• Related to the classification problem
• Instead of class prediction, predict a score representing the probability or likelihood that the
individual belongs to the class.
• e.g. *In NLP, a score for a matching statement against the threshold value.

*Engati Chatbot link - https://app.engati.com/static/standalone/bot.html?bot_key=889d005935e7437b
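A minimal sketch of class-probability estimation with scikit-learn; the data, model choice, and 0.7 threshold are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data: instead of a hard yes/no, score each individual.
X, y = make_classification(n_samples=200, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:5])[:, 1]   # P(class = 1) for five individuals
threshold = 0.7                          # assumed business threshold
print([(round(p, 2), bool(p >= threshold)) for p in proba])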



1. PREDICTION... CONTD...

• Survival Analysis/Churn Analysis


• Churn analytics is the process of measuring the rate at which customers quit the product, site, or
service.
• Analysis of data where the outcome is the time duration until the occurrence of an event of interest.
• e.g. customer lifetime with a provider



1. PREDICTION... CONTD... Churn Analysis

Based on historical data, predict the churn for the next quarter.



2. SEGMENTATION / CLUSTERING
Clustering attempts to group individuals based on similarity.
• e.g. segment customers into high spenders and low spenders based on their buying pattern and other data (a k-means sketch follows below).
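A minimal k-means sketch of such a segmentation; the two customer features and their values are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual spend, visits per month].
X = np.array([[200, 2], [250, 3], [1800, 12], [2200, 15], [300, 2], [2000, 10]])

# Two segments, e.g. low spenders vs high spenders.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # segment assignment per customer
print(km.cluster_centers_)  # average profile of each segment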



3. RECOMMENDATION / SIMILARITY MATCHING
• Similarity matching attempts to find similar individuals based on the data known about
them. This is useful in recommendation problem setting.
• e.g. Finding people similar to you who have purchased or liked similar products, recommending a
movie to a user based on his preferences and similar users’ interests.
• OTT platforms, E-Commerce platforms



4. ANOMALY DETECTION
• Profiling / Fraud Detection
• Profiling attempts to characterize the typical behavior of an individual or a group.

Consumer Characteristics
• Convenience driven
• Connectivity driven
• Personalization driven
Consumer Typology
• Loyalty
• Discount
• Impulsive
• Need-based
Psychographic
• Lifestyle
• Demographics
• Social Class

How do I collect this data?
1. Feedback
2. Survey customer interests and preferences
3. Keep profiles consistent and up-to-date – integrate 3rd-party sources

https://commence.com/blog/2020/06/16/customer-profiling-methods/



4. Fraud Analytics

https://www.crisil.com/en/home/our-businesses/global-research-and-risk-solutions/our-offerings/non-financial-risk/financial-crime-management/fraud-management/fraud-detection-and-analytics.html#



5. CAUSAL MODELLING / ROOT CAUSE ANALYSIS
Causal modeling helps to understand the causal relationship between events, or which events/actions influence others.
• What are the possible root causes for an anomaly detected?
• Did the advertisements influence the consumer's decision to purchase or not?
• What are the reasons for fraud in a bank?
• Lack of Training
• Competition to achieve incentives
• Overburdened Staff
• Low Compliance Level (not following RBI Guidelines)



6. MARKET BASKET ANALYSIS
Co-occurrence Grouping / Association Rule Discovery / Frequent Item set Mining
• Find the association between the entities based on the purchase transactions involving them.
• e.g. What items are purchased together by consumers at a supermarket.



7. DATA REDUCTION
• Replace a large dataset with a smaller set of data that contains most of the important information in the large dataset.
• Involves loss of information.
• Which data reduction strategy to follow?
• Aggregation / Sampling / Dimensionality reduction
• Examples
• Massive data sets of customers' dining preferences may be reduced to a much smaller data set revealing their cuisine preferences.
• A large time series of sensor data at one-second intervals may be reduced to hourly data, or to a smaller data set with only the changed values.
• ISRO weather data augmentation with semantic data project: Followed dimensionality reduction
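Two of these strategies sketched in Python: dimensionality reduction with PCA, and aggregation by resampling a time series. Data and parameter choices are illustrative.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Dimensionality reduction: project 10 features down to 2 components
# that capture the directions of maximum variance.
X = np.random.RandomState(0).rand(100, 10)
X_small = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_small.shape)

# Aggregation: reduce per-second sensor readings to hourly means.
ts = pd.Series(np.random.RandomState(0).rand(7200),
               index=pd.date_range("2024-01-01", periods=7200, freq="s"))
hourly = ts.resample("h").mean()
print(len(ts), "readings ->", len(hourly), "hourly values")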



QUESTIONS TO BE ASKED BASED ON TASK
• Prediction
• Do we know what variable (target) to be predicted?
• Is that target variable defined precisely?
• What values or ranges of values that this variable can take? [Ordinal / Categorical]
• Will modelling this target variable address all the problems defined in the scope, or only a sub-problem?
• Clustering
• Do we know the end objective? i.e. Is an EDA (Exploratory Data Analysis) path clearly defined
to see where our analysis is going?



SOLUTION APPROACH
• Is the proposed analytical solution formulated appropriately to solve the business problem OR is it an
approximation?
• Will the proposed solution address all the problems defined in the scope or only a sub-problem?
• What will be the benefits of the proposed solution? Benefit vs. Cost tradeoff.
• What will be the specific end objectives to be met by the proposed solution?
• What should be the anticipated outcomes by the proposed solution?



SOLUTION APPROACH
What are the deliverables? Data Science deliverables fall under 3 categories:
1. Analysis – A study using data to describe how a product
or program is working. Ex: Exploratory Data Analysis,
Diagnosis to highlight change in trend
2. Experiment – A scientific study to test a hypothesis. Ex:
Spending more money on digital advertising leads to
increased sales.
Alternate Hypothesis – "Mean sales increased after spending more on digital advertising" (a test sketch follows below).
3. Model – Machine learning model trained on data to
predict an outcome. Ex: Churn prediction to alert the
company about at-risk customers.

https://medium.com/data-science-at-microsoft/managing-a-data-science-project-87945ff79483
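For the experiment deliverable, a minimal sketch of testing the advertising hypothesis with a one-sided two-sample t-test; the weekly sales figures are simulated for illustration.

import numpy as np
from scipy import stats

# Hypothetical weekly sales before and after increasing digital ad spend.
rng = np.random.default_rng(0)
before = rng.normal(100, 10, 30)
after = rng.normal(108, 10, 30)

# H0: no increase in mean sales; H1: mean sales increased after the change.
t, p = stats.ttest_ind(after, before, alternative="greater")
print("t = %.2f, p = %.4f" % (t, p))  # p < 0.05 -> reject H0 at the 5% level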



DATA PREPARATION
• What are the important variables that you think we should collect?
• Are these variables readily available? Or is there an additional effort needed to collect these
variables?
• What are the types of data?
• e.g. Sensor data, ERP, e-commerce and SAP CRM data are structured (OLTP), Social networking
data is unstructured.
• Where are the locations of data in the system?
• e.g. Product master and sales transaction data in ERP SQL RDBMS database, OLAP data in
SQL server for BI reporting, Text data for customer review and sentiment from Tweets and FB
posts etc. [Internal to Organization, Acquire from 3rd party sources?]
• Where are the data coming from?
• e.g. data from sensor, sales data from ERP, online store



DATA PREPARATION ... CONTD ...
• Who are the current consumers of the data?
• e.g. Visualization tools, BI application etc.
• What are the methods to acquire data?
• e.g. Sensor data are ingested to data lake. ERP, e-commerce, and SAP CRM are inside
organization’s data center and proper access control needs to be granted to access the data. Social
networking data are retrieved from streaming API as a nightly job and are stored in a NoSQL
database etc.
• What are the integration points?
• e.g. IT team needs to provide database access and needs to build API services to access
certain data.
• Will it be practical to get all the relevant variables and load it to our workspace?



DATA PREPARATION ... CONTD ...
• What are the problems in acquiring the data?
• e.g. Sensor data are archived and deleted after 'x' days. Request needs to be raised to store the data and
to archive the data to make enough sample data for analyses and modelling.
• Social networking data may not be available for a longer term. All relevant data are captured by
existing systems, and request needs to be raised and approved for accessing data from servers.
• For the prediction problems, is sufficient amount of labelled examples available? Or is there a cost involved
in getting these values?
• e.g. a field survey may be needed to collect the response from a customer to see the likelihood of
joining a new plan.
• Are the training data drawn from a similar population on which the model to be applied? If not, are the
selection biases noted? What are the plans to compensate?



MODELLING
• Is the choice of model appropriate for the business problem? Is it in line with our prior knowledge of the
problem?
• Classification, scoring, clustering, etc.
• Does the modelling technique meet all the other requirements (functional and non-functional) of
the problem?
• Should various modelling techniques be tried and compared using appropriate evaluation metrics?
• Check the amount of data required, the generalization performance (i.e. how our model would perform on another sample), and the learning time.



EVALUATION

• Is there a plan for domain expert validation?


• If so, will the model be in a form that they can understand?
• Is there an evaluation metric set up by the business? (e.g. For a classification problem, there should be less
than x% of False Positives). Is that appropriate for the business problem?
• Is there a hold-out data (i.e. data used for validation / test) available? [70%-30% generally]
• Are the business costs and benefits considered into account?



EVALUATION
• For a classification problem, is there a threshold defined (for e.g. different thresholds can give different
implications in terms of benefits like reducing the threshold to a 0.70 can reduce the False Positives)
• For a regression problem, how will we evaluate the quality of prediction in the business context?
• For a clustering problem, how the clustering is interpreted in the context of the business problem?
• How will we measure the business impact of the final model? How will we justify the project expense
against the benefits? [ROI]



EXISTING SYSTEMS / REQUIREMENTS

• What are the existing/related systems within the capability that capture/use related
information? For e.g. A prediction model is already being used for fraud analysis. Can we
reuse the same transaction dataset for providing recommendations?
• What are the gaps?
• Who are the stakeholders?
• Who will be affected by this implementation?



ASSUMPTIONS / DEPENDENCIES / CHALLENGES

• Note down the assumptions; things like availability of necessary data, access to the
infrastructure, licenses etc.
• Any Licenses/Commercials needed in case of proprietary solutions?
• Note down the dependencies: things like dependency on setting up and access to the
infrastructure/tools, on access rights etc.
• Are there any other dependencies?
• Do you see any other problems/challenges?



IMPLEMENTATION

• Does the client have a technology preference?


• Does the client have limited / unlimited infrastructure?
[Deployment – Cloud / On-premise?]



Case Study – Bipolar Disorder
• Bipolar Disorder (BD) is a recurrent chronic disorder characterized by fluctuations in mood state and energy,
which affects over 1% of the World population.
• BD is a primary cause of disability among people, leading to functional and cognitive impairment, with
increased morbidity, especially death by suicide.
• Compared to a normal, mentally-stable individual, an individual suffering from Bipolar Disorder experiences
extreme mood fluctuations, classified into “manic episodes” and “depressive episodes”, which typically last
between days to months.
• While the manic episodes are characterized by racing thoughts, feeling of elation, extreme irritability etc., the
depressive episodes are characterized by feelings of extreme sadness, restlessness, trouble in concentration,
insomnia etc.



Case Study – Bipolar Disorder
The standard states of bipolar disorder are as follows:
• i) Bipolar I Disorder, ii) Bipolar II Disorder, iii) Cyclothymia, iv) Unspecified Bipolar.
• From a clinical viewpoint, Bipolar I is defined by the appearance of at least one manic episode. Patients may experience hypomanic or major depressive episodes prior to or after the manic episodes.
• Bipolar II, Cyclothymia and Unspecified vary in episodes between hypomania and depression, with each cycle lasting
between weeks to months.
• Hypomania experiences: Reduced need for sleep. Spending recklessly, like buying a car you cannot afford. Taking chances
you normally wouldn't take because you "feel lucky" Talking so fast that it's difficult for others to follow what's being said.

Objective: Derive the best model for Bipolar State Prediction!



Case Study – Bipolar Disorder
Company X intends to develop a Smart Healthcare
System for monitoring Bipolar Disorder.
As a Data Scientist working for the company, what
kind of questions would you ask in the Data
Collection / Preparation phases?
• What are the important variables that we need to
collect?
• What are the types of data?
• Locations of data in the system?
• Integration points?
• Problems in acquiring the data?
• Do we have sufficient labelled samples for
prediction?
• Are the training data drawn from a similar
population on which the model to be applied? If
not, are the selection biases noted?



Case Study – Bipolar Disorder
Data Scientist will consult the Domain expert (Psychiatrist or Psychologist) to find what variables are important
to collect during the Manic phase / Hypomanic phase / Depressive phase of the patient

• What are the important variables that we need to collect and location of data?
• Physiological Data [From Sensor]: Heart Rate, Electrodermal Activity (EDA), Oxygen Saturation
(SPO2), Blood Pressure etc.
• Behavioral Data [From Mobile App]: Self-assessment questionnaire to capture daily information
regarding sleep quality [hourly scale], physical activity [-3 for inactive to +3 for active], mood states
using GAD [7 point likert], HDRS [7 point likert], YMRS [5 point likert]. In addition, data on
alcohol intake, stress levels, motivation levels, concentration levels, menstrual cycle pattern,
irritability levels, and insomnia levels. The treating doctors will be asked to rate the patient progress
using scales from much worse (-3) to much better (+3). The behavioral data will be collected from a
Mobile App.



Case Study – Bipolar Disorder
• Integration points?
• The sensor data and behavioral data are integrated on a daily basis and presented to the ‘Health Analytics
Engine’ on the Cloud to perform the analytics.
• Problems in acquiring the data?
• Bipolar patients must cooperate and provide the behavioral data truthfully on a periodic basis.
• Do we have sufficient labelled samples for prediction?
• Need to devise a strategy to collect the data samples for period of 6 months, on a daily basis, to build the
labelled data.
• Are the training data drawn from a similar population on which the model to be applied? If not, are the selection biases
noted?
• All the data is captured from patients suffering from Bipolar Disorder, albeit in different cycles [Type I, Type II,
Cyclothymia etc].



Case Study – Bipolar Disorder
• What modelling techniques can be applied to predict the patient states?
• For a similar case in the US - Decision Tree, Random Forest, Support Vector Machine and Logistic Regression
models were applied, and the accuracy of Random Forest was the best.
• Outcome - Predict the patient states. [Multiclass Classification – Bipolar Type I, Bipolar Type II, Cyclothymia and Unspecified are the states]



A GUIDE TO DESIGNING A DATA SCIENCE PROJECT

• To get started, brainstorm possible ideas that might interest you.


• Write a proposal along the CRISP-DM Standards.
• Planning
• Keep a timeline with a To Do, In Progress, Completed and Parking section.
• Track the progress
• Keep track of how much progress you are making on your metrics.
• Maintain a code repo for a code review.
• Know when to stop
• Identify a minimum viable product (MVP) to help you know when to stop.



Data Science for Business by Tom Fawcett and Foster Provost, O'Reilly
https://www.linkedin.com/pulse/ask-questions-while-preparing-proposal-data-science-project-menon
http://www.acheronanalytics.com/acheron-blog/a-guide-to-designing-a-data-science-project

THANK YOU



DSECL ZG523
Introduction to Data Science
Dr. Vijayalakshmi
TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS



DATA
• Data is a collection of data objects and their
attributes.
• A collection of attributes describe an object.
• Object is also known as record, observation, case,
sample or instance.
• An attribute is a property or characteristic of an
object.
• Examples: eye color of a person,
temperature
• Attribute is also known as variable, field,
characteristic, or feature.



QUALITY OF DATA

Data quality issues


• Noise and outliers;
• Missing data
• Inconsistent data
• Duplicate data
• Data that is biased or unrepresentative of the phenomenon or population that the data is supposed to
describe [80% Male observations and 20% Female]



DATA QUALITY ISSUES

Find the issues in the given data.

Name    | Age | Date of Birth | Course ID | CGPA
Amy     | 24  | 01-Jan-1995   | CS 104    | 7.4
Ben     | 23  | Dec-01-1996   | CS 102    | 7.5
Cathy   | 25  | 01-Nov-1994   |           | 6.7
Diana   | 24  | Oct-01-1995   | CS 104    | 7.9
Ben     | 23  | Dec-01-1996   | CS 102    | 7.5
Eden    | 24  |               | CS 103    | 87.5
Fischer |     | 01-01-1959    | CS 105    | 7.0
Amy     | 24  | 01-Jan-1995   | CS 104    | 7.2


DATA QUALITY ISSUES
Find the issues in the given data.
1. Missing data
2. Inconsistent data format (DOB)
3. Duplicate data
4. Data inconsistency
5. Incorrect data [outlier]

Name    | Age       | Date of Birth | Course ID | CGPA
Amy     | 24        | 01-Jan-1995   | CS 104    | 7.4
Ben     | 23        | Dec-01-1996   | CS 102    | 7.5
Cathy   | 25        | 01-Nov-1994   | <missing> | 6.7
Diana   | 24        | Oct-01-1995   | CS 104    | 7.9
Ben     | 23        | Dec-01-1996   | CS 102    | 7.5
Eden    | 24        | <missing>     | CS 103    | 87.5
Fischer | <missing> | 01-01-1959    | CS 105    | 7.0
Amy     | 24        | 01-Jan-1995   | CS 104    | 7.2


PREPROCESSING ON DATA
• Improve Data Quality
• To better fit a specified data mining or machine learning technique or tool.
• Number of attributes in a data set is often reduced because many techniques are more
effective when the data has a relatively small number of attributes.
• Data correction corrects the errors in the data.
• Data cleansing removes irrelevant data.
• Data transformation changes data from one format to another.
• Correction improves the data quality.



ATTRIBUTE / FEATURE

An attribute is a property or characteristic of an


object.
• eye color of a person, temperature
Attribute is also known as variable, field,
characteristic, or feature.
The values used to represent an attribute may
have properties that are not properties of the
attribute itself.
• Average age of an employee may have a
meaning , whereas it makes no sense to talk
about the average employee ID.



ATTRIBUTE / FEATURE

The type of an attribute should tell us what


properties of the attribute are reflected in the
values used to measure it.
• For the age attribute, the properties of the
integers used to represent age are very much
the properties of the attribute. Even so, ages
have a maximum while integers do not.
• The ID attribute is distinct. The only valid
operation for employee IDs is to test whether
they are equal.



PROPERTIES OF ATTRIBUTES

Specify the type of an attribute by identifying the properties of numbers that


correspond to underlying properties of the attribute.
Properties include
• Distinctiveness =, !=
• Order <, >, ≥, ≤
• Addition +, −
• Multiplication * /
Based on these properties, we define four types of attributes: nominal, ordinal, interval,
and ratio.
Each attribute type possesses all of the properties and operations of the attribute types
above it.



TYPES OF ATTRIBUTES
Data
• Numerical
  • Ratio
  • Interval
• Categorical
  • Ordinal
  • Nominal
TYPES OF ATTRIBUTES
Nominal: distinctiveness only.
Ordinal: order; the data can be categorized and ranked.
Interval: the data can be categorized and ranked, and is evenly spaced.
Ratio: the data can be categorized, ranked, evenly spaced, and has a natural zero.

Each attribute type possesses all of the properties and operations of the attribute types above it.



TYPES OF ATTRIBUTES
Data
• Numerical
  • Ratio: Income, Height, Weight, Annual Sales, Age
  • Interval: Calendar dates, Temperature
• Categorical
  • Ordinal: Grades, Shirt Size (S, M, L, XL, XXL)
  • Nominal: Eye Color, Gender, Nationality
TYPES OF ATTRIBUTES EXAMPLE

Identify the types of attributes in the given data.

ID    | Age | Gender | Course ID | CGPA | Grade
19001 | 24  | Female | CS 104    | 7.4  | Good
19002 | 23  | Male   | CS 102    | 7.5  | Good
19003 | 25  | Female | CS 103    | 6.7  | Fair
19004 | 24  | Female | CS 104    | 7.9  | Good
19005 | 23  | Male   | CS 102    | 7.5  | Good
19006 | 24  | Female | CS 103    | 8.5  | Excellent
19007 | 26  | Male   | CS 105    | 7.0  | Good


TYPES OF ATTRIBUTES EXAMPLE

Identify the types of attributes in the given data (same table as above).

Answer: ID – Nominal; Age – Ratio; Gender – Nominal; Course ID – Nominal; CGPA – Ratio; Grade – Ordinal.
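In pandas, the nominal/ordinal distinction can be made explicit with categorical dtypes; a small illustrative sketch:

import pandas as pd

# Ordinal attribute: categories with a meaningful order (Fair < Good < Excellent),
# so order comparisons and min/max are valid.
grade = pd.Series(pd.Categorical(["Good", "Fair", "Excellent", "Good"],
                                 categories=["Fair", "Good", "Excellent"],
                                 ordered=True))
print(grade.min(), "<", grade.max())

# Nominal attribute: categories with no order; only ==/!= make sense.
gender = pd.Series(pd.Categorical(["Female", "Male", "Female"]))
print(gender.unique())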

INTRODUCTION TO DATA SCIENCE


ATTRIBUTES AND TRANSFORMATIONS

Introduction to Data Mining by Tan


INTRODUCTION TO DATA SCIENCE
ATTRIBUTES BY THE NUMBER OF VALUES
1. Discrete Attribute
• Has only a finite or countable set of values.
• Examples: zip codes, counts, or the set of words in a collection of documents.
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes.

2. Continuous Attribute
• Measurable data.
• Examples: temperature, height, age, weight.
• Continuous attributes are typically represented as floating-point variables.

INTRODUCTION TO DATA SCIENCE


ATTRIBUTES BY THE NUMBER OF VALUES
• Discrete data is countable, while continuous data is measurable.
• Discrete data contains distinct or separate values.
• Continuous data, on the other hand, can take any value within a range.
• Discrete data is graphically represented by a bar graph, whereas a histogram is used
to represent continuous data graphically.

INTRODUCTION TO DATA SCIENCE


TYPES OF ATTRIBUTES EXAMPLE

Identify the types of attributes in the given data – Discrete vs Continuous?

ID Age Gender Course ID CGPA Grade


19001 24 Female CS 104 7.4 Good
19002 23 Male CS 102 7.5 Good
19003 25 Female CS 103 6.7 Fair
19004 24 Female CS 104 7.9 Good
19005 23 Male CS 102 7.5 Good
19006 24 Female CS 103 8.5 Excellent
19007 26 Male CS 105 7.0 Good

INTRODUCTION TO DATA SCIENCE


TYPES OF ATTRIBUTES EXAMPLE

Identify the types of attributes in the given data – Discrete vs Continuous.

ID Age Gender Course ID CGPA Grade


19001 24 Female CS 104 7.4 Good
19002 23 Male CS 102 7.5 Good
19003 25 Female CS 103 6.7 Fair
19004 24 Female CS 104 7.9 Good
19005 23 Male CS 102 7.5 Good
19006 24 Female CS 103 8.5 Excellent
19007 26 Male CS 105 7.0 Good
Discrete Continuous Discrete Discrete Continuous Discrete

INTRODUCTION TO DATA SCIENCE


DATA FORMATS
Record data
• Transaction or Market Basket data – set of items
• Data Matrix – record data with only numeric attributes.
• Sparse Data Matrix – binary asymmetric data. 0/1 entries.
• Document term matrix
Graph data
• Data with relationships among objects – Web pages
• Data with objects as graphs – LOD cloud
Ordered data
• Sequential data or temporal data – Record data + time.
• Sequence data – genome representation
• Time series data – temporal autocorrelation
• Spatial data – spatial autocorrelation
INTRODUCTION TO DATA SCIENCE
RECORD DATA EXAMPLE

• Record data: flat file (CSV), RDBMS
• Transaction data: Banking, Retail, E-Commerce etc.
• Data matrix: SPSS data matrix
• Document-term matrix: frequency of terms that appear in documents, used in Information Retrieval
https://towardsdatascience.com/types-of-data-sets-in-data-science-data-mining-machine-learning-eb47c80af7a

INTRODUCTION TO DATA SCIENCE


GRAPH DATA EXAMPLE

Linked Open Data Cloud


https://lod-cloud.net/

INTRODUCTION TO DATA SCIENCE


ORDERED DATA EXAMPLE

• Sequential data: also called 'Temporal Data'; each record has a time associated.
Ex: money transfer transactions in Banking.
• Sequence data: positions instead of time stamps.
Ex: DNA sequence bases (G, T, A, C).

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS

INTRODUCTION TO DATA SCIENCE


TYPES OF DATA-SETS
1 Structured data
• Data containing a defined data type, format and structure.
• Example: transaction data, online analytical processing (OLAP) data cubes, traditional RDBMS,
CSV files and spreadsheets.
2 Semi structured data
• Textual data file with discernible pattern that enables parsing
• Example: XML data file, JSON data file
3 Quasi structured data
• Textual data with erratic data format that can be formatted with effort, tools and time
• Example: Web click-stream data [IP address, Timestamp, GeoCodes etc]
4 Unstructured data
• Data that has no inherent structure.
• Example: PDF, Images, Video, Email
INTRODUCTION TO DATA SCIENCE
TYPES OF DATA-SETS

Quasi Structured Data - Web click-stream data

INTRODUCTION TO DATA SCIENCE


TYPES OF DATA-SETS

5 Natural Language data
• Entity recognition, topic recognition, summarization, text completion, and sentiment analysis.
• Models trained in one domain don't generalize well to other domains. [Vocabulary]
6 Machine-generated data
• Machine-generated data is automatically created by a computer, process, application, or other
machine without human intervention.
• High volume and speed.
• Examples: web server logs, call detail records, network event logs.

INTRODUCTION TO DATA SCIENCE


TYPES OF DATA-SETS
7 Graph-based or network data
• Data can be shown in a graph. [Ex: Linked Open Data Cloud]
• A graph is a mathematical structure to model pair-wise relationships between objects.
• Graph or network data focuses on the relationship or adjacency of objects.
• Stored in graph databases with specialized query languages such as SPARQL.
• Example: DBPedia data in RDF format [RDF dump or through an endpoint]
[https://dbpedia.org/sparql]
8 Streaming data
• The data flows into the system when an event happens, instead of being loaded into a data store
in a batch.
• Examples: live sports or music events, the stock market.

INTRODUCTION TO DATA SCIENCE


TYPES OF DATA-SETS

INTRODUCTION TO DATA SCIENCE


CHARACTERISTICS OF DATA-SETS
1 Dimensionality
• Number of attributes.
• Curse of Dimensionality: the difficulties associated with analyzing high-dimensional data.
• Dimensionality reduction techniques [PCA, NMF, LDA etc.]
2 Sparsity
• For some data sets, such as those with asymmetric features, most attributes of an object have values of 0;
in many cases, fewer than 1% of the entries are non-zero.
• An advantage, because usually only the non-zero values need to be stored and manipulated.
3 Resolution
• The patterns in the data depend on the level of resolution.
• If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is
too coarse, the pattern may disappear.

INTRODUCTION TO DATA SCIENCE


Curse of Dimensionality

INTRODUCTION TO DATA SCIENCE


CHARACTERISTICS OF DATA-SETS

Scenario 1: All sensors are working
• Most of the sensor outputs are zero.
• High dimensionality, but little information: sparse data.
• Only the values of Sensors 1, 2 and 12 need to be stored and manipulated.
Scenario 2: Sensor 1 and Sensor 6 are malfunctioning
• Either fill in the data manually, or discard the Sensor 1 and Sensor 6 data in calculations.

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS

INTRODUCTION TO DATA SCIENCE


RETRIEVING DATA

Start with data already collected and stored within the organization.
Look outside the organization for high-quality data available for public and commercial use
(open-data providers).
Quality checks while retrieving data
• Check whether the retrieved data is equal to the data in the source document.
• Check that attributes have the right data types.

INTRODUCTION TO DATA SCIENCE


RETRIEVING DATA

Data Storage
• Database tables
• Text files
• Data marts
• Data warehouses
• Data lakes (raw data)
INTRODUCTION TO DATA SCIENCE
TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS

INTRODUCTION TO DATA SCIENCE


DATA PREPARATION

INTRODUCTION TO DATA SCIENCE


DATA CLEANSING

Focuses on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
Two types of errors
• Interpretation / Representation error
• Age > 130
• Height of a person is greater than 8 feet.
• Price is negative.
• Inconsistencies between data sources or against your company’s standardized values.
• Female and F
• Feet and meter
• Dollars and Pounds

INTRODUCTION TO DATA SCIENCE


DATA CLEANSING

INTRODUCTION TO DATA SCIENCE


DATA CLEANSING
Errors from data entry
• Cause
• Typos
• Errors due to lack of concentration
• Machine or hardware failure
• Detection
• Frequency table [Frequency is the number of times a specific data value occurs in your
dataset.]
• Correction
• Simple assignment statements
• If-then-else rules
White-spaces and typos
• Remove leading and trailing white-spaces.
• Change case of the alphabets from upper to lower. [Ex: SILK Framework – semantic matching]
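A minimal pandas sketch of the detection and correction steps above, assuming a hypothetical
'gender' column containing typos and stray white-space:

import pandas as pd

df = pd.DataFrame({"gender": ["Female", "female ", " MALE", "Male", "Femle"]})  # hypothetical data

# Detection: a frequency table exposes rare, misspelled categories
print(df["gender"].value_counts())

# Correction: strip white-space and normalise case, then apply simple assignment rules
df["gender"] = df["gender"].str.strip().str.lower()
df["gender"] = df["gender"].replace({"femle": "female"})  # fix a known typo
print(df["gender"].value_counts())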

INTRODUCTION TO DATA SCIENCE


DATA CLEANSING

Physically impossible values


• Examples
• Age > 130
• Height of a person is greater than 8 feet.
• Price is negative.
Outliers
• Use visualization techniques like box plots.
• Use statistical summary with minimum and maximum values.
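A minimal pandas sketch of both checks, on hypothetical 'age' and 'price' columns:

import pandas as pd

df = pd.DataFrame({"age": [24, 25, 140, 23], "price": [10.0, -5.0, 7.5, 8.0]})  # hypothetical data

# Statistical summary: the min/max rows quickly reveal impossible values
print(df.describe())

# Flag rows that violate simple validity rules
bad = df[(df["age"] > 130) | (df["price"] < 0)]
print(bad)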

INTRODUCTION TO DATA SCIENCE


DATA CLEANSING
Missing values

INTRODUCTION TO DATA SCIENCE


DATA CLEANSING

Deviations from code-book


• A code book is a description of your data. It contains things such as the number of variables per
observation, the number of observations, and what each encoding within a variable means. [Ex:
One-hot encoding for categorical values such as Gender]
• Discrepancies between the code-book and the data should be corrected.
Different units of measurement
• Pay attention to the respective units of measurement.
• Simple conversion can rectify.
Different levels of aggregation
• Data set containing data per week versus one containing data per work week.
• Data summarization will fix it.

INTRODUCTION TO DATA SCIENCE


COMBINING DATA

Two operations to combine information from different data sets.


• Joining
• Enriching an observation from one table with information from another table.
• Requires primary keys or candidate keys.
• Use views to virtually combine data.
• Appending or stacking
• Adding the observations of one table to those of another table.
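A minimal pandas sketch of both operations, using hypothetical customer and purchase tables:

import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "name": ["Amy", "Fischer"]})  # hypothetical tables
purchases = pd.DataFrame({"id": [1, 2], "amount": [250, 410]})

# Joining: enrich one table with columns from another via a key
joined = customers.merge(purchases, on="id", how="inner")

# Appending / stacking: add the observations of one table to those of another
more_customers = pd.DataFrame({"id": [3], "name": ["Ravi"]})
stacked = pd.concat([customers, more_customers], ignore_index=True)

print(joined)
print(stacked)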

INTRODUCTION TO DATA SCIENCE


TRANSFORMING DATA
Applying a mathematical transformation to the input variable.
• For a relationship of the form y = a·e^(bx), transforming y to log y makes the relationship
linear, since log y = log a + bx. (See the sketch after this list.)

Reducing number of variables.


Combining two variables into a new variable.
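A minimal NumPy sketch of the log transformation (the coefficients a = 2 and b = 0.5 are
illustrative assumptions):

import numpy as np

x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x)  # y = a * e^(b*x) with a = 2, b = 0.5

# Correlation with x improves once y is log-transformed (log y = log a + b*x)
print(np.corrcoef(x, y)[0, 1])          # noticeably below 1
print(np.corrcoef(x, np.log(y))[0, 1])  # exactly linear, so 1.0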

Introducing Data Science by Cielen, Meysman and Ali


INTRODUCTION TO DATA SCIENCE
TRANSFORMING DATA

Turning variables into dummies.


• Dummy variables can only take two values:
true(1) or false(0).
• Create separate columns for the classes
stored in one variable and indicate it with
1 if the class is present and 0 otherwise.
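A minimal pandas sketch, assuming a hypothetical 'gender' column:

import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})  # hypothetical column

# One column per class; 1 marks presence of the class, 0 otherwise
dummies = pd.get_dummies(df["gender"], prefix="gender")
print(dummies)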

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS

INTRODUCTION TO DATA SCIENCE


EXPLORATORY DATA ANALYSIS (EDA)

Use graphical techniques to gain an understanding of the data and the interactions between
variables.
Look at what can be learned from the data.
Statistical properties like distribution of data, correlation.
Discover outliers.

INTRODUCTION TO DATA SCIENCE


EXPLORATORY DATA ANALYSIS (EDA)

• Boxplot – can show the maximum, minimum, median, and other characterizing measures at
the same time.
• Histogram – a variable is cut into discrete categories and the number of occurrences
in each category is summed up and shown in the graph.
• Pareto diagram – a combination of the values and a cumulative distribution.
• Tabulation.
• Clustering and other modeling techniques can also be a part of exploratory analysis.

Refer - https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb

INTRODUCTION TO DATA SCIENCE


BOXPLOT [WHISKER PLOT]
• A boxplot incorporates the five-number
summary.
• The ends of the box are at the quartiles.
• The box length is the interquartile range.
• The median is marked by a line within the box.
• The whiskers outside the box extend to the Minimum
and Maximum observations.

INTRODUCTION TO DATA SCIENCE


BOXPLOT

Consider the ordered list of observations for ‘Age’ feature.


25,25,30,33,33,35,35,35,35,36,40,41,42,42,51
Draw a box-plot to represent the above data.
Median, Minimum, Maximum, First quartile, Third quartile, Interquartile Range

BoxPlot Calculator - http://www.alcula.com/calculators/statistics/box-plot/
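A minimal sketch that computes the five-number summary for this 'Age' list and draws the
box-plot. Note that the exact quartile values depend on the interpolation method; NumPy's
default linear interpolation gives Q1 = 33.0 and Q3 = 40.5:

import numpy as np
import matplotlib.pyplot as plt

age = [25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 42, 51]

# Five-number summary: minimum, Q1, median, Q3, maximum
q1, median, q3 = np.percentile(age, [25, 50, 75])
print(min(age), q1, median, q3, max(age), "IQR =", q3 - q1)

plt.boxplot(age, vert=False)  # whiskers extend toward the min and max observations
plt.xlabel("Age")
plt.show()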

INTRODUCTION TO DATA SCIENCE


SCATTERPLOT
• Determine if there appears to be a relationship, pattern, or trend between two numeric attributes.
• Provide a visualization of bi-variate data to see clusters of points and outliers, or correlation
relationships.

INTRODUCTION TO DATA SCIENCE


SCATTERPLOT

Analysis
• More tips given during the dinner time
compared to the lunch time
• Positive correlation between total bill amount
and tip given, i.e., more the bill amount,
more the tip paid.
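The plot above appears to be based on the classic 'tips' dataset that ships with seaborn;
assuming so, a minimal sketch to reproduce a similar scatterplot:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # total_bill, tip, and meal time per table
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()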

INTRODUCTION TO DATA SCIENCE


HeatMap
• Using visual cues in a heatmap.
• A heatmap is a way to visualize data in tabular format, where in place of the numbers,
you leverage colored cells that convey the relative magnitude of the numbers.
• Use color saturation to provide visual cues to quickly target the potential points of
interest.
• Always include a legend as a subtitle on the heatmap with color corresponding to the
conditional formatting color.
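A minimal sketch with seaborn, using a hypothetical table of counts (the data and labels
are assumptions, not the dataset-source figures shown next):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.default_rng(0).integers(0, 100, size=(4, 5))  # hypothetical counts

# Color saturation conveys relative magnitude; the colorbar acts as the legend
ax = sns.heatmap(data, annot=True, cmap="Blues", cbar_kws={"label": "Relative magnitude"})
ax.set_title("Heatmap with color saturation as visual cue")
plt.show()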

INTRODUCTION TO DATA SCIENCE


HeatMap

INTRODUCTION TO DATA SCIENCE


HeatMap
Interpretation
1. Social Networking and User Generated
Content are two data sources where year-on-year
maximum number of datasets have been added.
2. Publication data source is showing an average
year-on-year contribution.
3. Government and Life Sciences data sources
had maximum contributions in year 2011,
however the contributions thereon have dried up.

INTRODUCTION TO DATA SCIENCE


PARETO DIAGRAM [80-20 RULE]

• The Pareto chart is a very powerful tool for showing the relative importance of problems.
• For many events, roughly 80% of the effects come from 20% of the causes.

Examples:
• 80% of complaints come from 20% of customers
• 80% of sales come from 20% of clients
• 80% of computer crashes come from 20% of IT bugs

[Figure: Pareto chart with cumulative distribution and cut-off point]

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS

INTRODUCTION TO DATA SCIENCE


DATA QUALITY INDEX

https://www.deltapartnersgroup.com/
INTRODUCTION TO DATA SCIENCE
DATA QUALITY INDEX

http://www.dataintegration.ninja
INTRODUCTION TO DATA SCIENCE
IMPACT OF MISSING DATA

• Missing data imputation may distort the variable's distribution.
• Missing data can affect the performance of machine learning models.
• Most estimators in Python's scikit-learn library cannot handle missing values directly.

INTRODUCTION TO DATA SCIENCE


MISSING DATA MECHANISMS

• Understanding the mechanism of missing data will help us choose an appropriate
imputation method.
• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Not Missing At Random (NMAR)

INTRODUCTION TO DATA SCIENCE


MISSING COMPLETELY AT RANDOM (MCAR)
• The probability of a value being missing is the same for all observations.
• There is no relationship between the missing values and any other values in the dataset.
• Removing such missing values will not affect the inferences made.

INTRODUCTION TO DATA SCIENCE


MISSING AT RANDOM (MAR)
• The probability of a value being missing depends on available information,
i.e., it depends on other variables in the dataset.

INTRODUCTION TO DATA SCIENCE


NOT MISSING AT RANDOM (NMAR)

• The missing values themselves are an indication of a certain class.
• Example: records with Depression = yes have more missing values, so choose the
imputation technique appropriately.

INTRODUCTION TO DATA SCIENCE


IMPUTATION TECHNIQUES
A. Categorical Variables
B. Numerical Variables

INTRODUCTION TO DATA SCIENCE


IMPUTATION – CATEGORICAL VARIABLES

1. Imputation by deleting rows
• A data set with deleted rows may impact the classification process.
• Not applicable for smaller data sets.
2. Replace with the most frequent value
• May create an imbalanced dataset within the category.
(Both options are sketched below.)

Gender   Age   Years of employment   Cardiac Health
Male     28    7                     8
Male     45    21                    5
-        54    25                    7
Female   31    5                     8
Male     35    4                     8
Female   45    8                     9
Male     48    29                    4
INTRODUCTION TO DATA SCIENCE


IMPUTATION – CATEGORICAL VARIABLES
3. Create a classifier algorithm to predict missing values
• Use 'Age', 'Years of employment' and 'Cardiac Health' as the training dataset to
predict 'Gender'.
• Rows with blank values become the test dataset.
4. Unsupervised ML technique such as K-means clustering
• Cluster 'Age' and 'Years of employment' into 2 categories, then predict the missing
values for 'Gender' [possibility of the value falling into a cluster].

Gender   Age   Years of employment   Cardiac Health
Male     28    7                     8
Male     45    21                    5
-        54    25                    6
Female   31    5                     8
Male     35    4                     8
Female   45    8                     9
Male     48    29                    4

INTRODUCTION TO DATA SCIENCE


IMPUTATION – NUMERICAL VARIABLES
1. Deleting missing values
2. Create a classifier algorithm to predict values (similar to categorical variables)
3. Statistical methods
• Mean: the average value of the variable is imputed for the missing data, provided the
data does not contain extreme values (outliers).
• Median: the centre value of the variable after arranging in ascending order; preferred
when the data contains outliers.
• Mode: imputed with the variable value that has the maximum frequency of occurrence
[for large data sets where the variable value does not impact the overall outcome].

INTRODUCTION TO DATA SCIENCE


MEAN / MEDIAN IMPUTATION
• Used when data is MCAR / MAR.
• Assumes that the feature follows a normal distribution.
Advantages
• Easy to implement.
• A fast way of obtaining a complete dataset.
Disadvantages
• Mean imputation reduces the variance of the imputed variables.
• Mean imputation does not preserve relationships between variables, such as
correlations. (See the sketch below.)
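A minimal pandas sketch of mean and median imputation, on a hypothetical numeric feature:

import numpy as np
import pandas as pd

age = pd.Series([28, 45, np.nan, 31, 35, 45, 48])  # hypothetical feature with a gap

mean_imputed = age.fillna(age.mean())      # use when there are no extreme outliers
median_imputed = age.fillna(age.median())  # preferred when outliers are present

# Note how imputing the mean shrinks the variance relative to the observed data
print(mean_imputed.var(), age.var())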

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS

INTRODUCTION TO DATA SCIENCE


OUTLIERS

• An outlier is a data point that is significantly far away from most other data points. For
example, if everyone in your classroom is of average height with the exception of two
basketball players that are significantly taller than the rest of the class, these two data points
would be considered outliers.
• Data objects with behaviors that are very different from expectation are called outliers or
anomalies.
• Outliers can significantly skew the distribution of your data.
• Outliers can be identified using summary statistics and plots of the data.
• Algorithms like Linear Regression, K-Nearest Neighbor, Adaboost are sensitive to
noise.

INTRODUCTION TO DATA SCIENCE


Normal and Skewed Distributions

INTRODUCTION TO DATA SCIENCE


OUTLIER DETECTION USING NORMAL DISTRIBUTION

• About 99.7% of the observations of a variable following a normal distribution lie within
µ ± 3σ, so points outside this band can be flagged as outliers (see the sketch below).
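A minimal NumPy sketch of the µ ± 3σ rule, with two planted outliers in otherwise
normally distributed data:

import numpy as np

rng = np.random.default_rng(42)
x = np.append(rng.normal(50, 5, 1000), [95, 2])  # normal data plus two planted outliers

mu, sigma = x.mean(), x.std()
outliers = x[np.abs(x - mu) > 3 * sigma]  # points outside mu ± 3*sigma
print(outliers)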

INTRODUCTION TO DATA SCIENCE


Epicycles of Data Analysis
Step 1: Set expectations
First and foremost, set an expectation; this is the starting point of your analysis.
Step 2: Test expectations
Collect information or data and compare the data with your expectations. If the data matches
your expectations, you are done; if it does not, move on to step 3.
Step 3: Revise your expectations
Or fix the data, so that your expectations and the data match.

Iterating through this 3-step process is what we call the "epicycle of data analysis."
As you go through every stage of an analysis, you will need to go through the epicycle to
continuously refine your question, your exploratory data analysis, your formal models,
your interpretation, and your communication.

https://makemeanalyst.com/5-core-activities-data-analysis-epicycles-data-analysis/
INTRODUCTION TO DATA SCIENCE
Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T3)
The Art of Data Science by Roger D Peng and Elizabeth Matsui (R1)
Introducing Data Science by Cielen, Meysman and Ali
https://www.deltapartnersgroup.com/managing-data-quality-optimize-value-extraction
http://www.dataintegration.ninja/relationship-between-data-quality-and-master-data-management/

THANK YOU

INTRODUCTION TO DATA SCIENCE


INTRODUCTION TO DATA SCIENCE
MODULE #3: DATA SCIENCE PROCESS
IDS Course Team
BITS Pilani
TABLE OF CONTENTS
1. Confusion Matrix
2. ROC

INTRODUCTION TO DATA SCIENCE
Confusion Matrix
• A confusion matrix is a table that is often used to evaluate the performance of a
classification model (or "classifier").
• A confusion matrix shows what the machine learning algorithm did right and what
the algorithm did wrong (misclassification).
• It works on a set of test data for which the true values are known. There are two
possible predicted classes: "YES" and "NO".

INTRODUCTION TO DATA SCIENCE
Confusion Matrix

                  Actual: Y                               Actual: N
Predicted: Y      True Positive (TP)                      False Positive (FP) (Type I Error)
Predicted: N      False Negative (FN) (Type II Error)     True Negative (TN)

There are four quadrants in the confusion matrix, which are symbolized as below.
True Positive (TP): The number of instances that were positive and correctly classified as positive.
False Positive (FP): The number of instances that were negative and incorrectly classified as positive.
This is also known as Type 1 Error.
False Negative (FN): The number of instances that were positive and incorrectly classified as negative.
It is also known as Type 2 Error.
True Negative (TN): The number of instances that were negative and correctly classified as negative.

INTRODUCTION TO DATA SCIENCE
Confusion Matrix
Which type of misclassification is more serious: Type-I Error or Type-II Error?

Case I: Predicting whether a convict should be hanged or not. [Type I Error more serious]
False Positive: the algorithm predicts that the convict has committed the crime; in reality, he is innocent.
Verdict: he will be hanged.
False Negative: the algorithm predicts that the convict is innocent; in reality, he has committed the crime.
Verdict: he is released.

Case II: Predicting smog in a region and alerting the public. [Type II Error more serious]
False Positive: the algorithm predicts smog; in reality, there is no smog.
Verdict: people take precautions unnecessarily.
False Negative: the algorithm predicts no smog; in reality, there is smog.
Verdict: the heavy smog may cause health issues, since people have not taken precautions.

INTRODUCTION TO DATA SCIENCE
Confusion Matrix
Let us consider the example of a model predicting a tumour for a patient.

                  Actual: Y    Actual: N
Predicted: Y      10 (TP)      22 (FP)
Predicted: N      8 (FN)       60 (TN)

Interpretation:
True Positive (TP): the model predicted 'Tumour' and the patient has a tumour.
False Positive (FP): the model predicted 'Tumour', but the patient has no tumour. This is also known as Type 1 Error.
False Negative (FN): the model predicted 'No Tumour', but the patient actually has a tumour. It is also known as Type 2 Error.
True Negative (TN): the model predicted 'No Tumour' and the patient has no tumour.

Discuss the repercussions of Type 1 and Type 2 errors w.r.t. the patient and the hospital.

INTRODUCTION TO DATA SCIENCE
Confusion Matrix
True Positive Rate (TPR): the fraction of positive examples predicted correctly by the classifier:
TPR = TP / (TP + FN). This metric is also known as Recall, Sensitivity or Hit Rate.
False Negative Rate (FNR): the fraction of positive examples classified as negative by the classifier:
FNR = FN / (TP + FN).
False Positive Rate (FPR): the fraction of negative examples classified as positive by the classifier:
FPR = FP / (FP + TN). This metric is also known as the False Alarm Rate.
True Negative Rate (TNR): the fraction of negative examples classified correctly by the classifier:
TNR = TN / (TN + FP). This metric is also known as Specificity.

INTRODUCTION TO DATA SCIENCE
Confusion Matrix
Positive Predictive Value (PPV): the fraction of the examples classified as positive that are really
positive: PPV = TP / (TP + FP). It is also known as Precision.
Accuracy: how often the classifier is correct: Accuracy = (TP + TN) / Total.
Misclassification Rate or Error Rate: how often the classifier is wrong: Error Rate = (FP + FN) / Total.
F1 Score (F1): Recall (r) and Precision (p) are two widely used metrics employed in analyses where
detection of one of the classes is considered more significant than the others. It is defined in terms of
TPR and PPV as F1 = 2·TPR·PPV / (TPR + PPV) = 2TP / (2TP + FP + FN).

INTRODUCTION TO DATA SCIENCE
All Formulae
TPR (Recall) = TP / (TP + FN)
FNR = FN / (TP + FN)
FPR = FP / (FP + TN)
TNR = TN / (TN + FP)
Precision = TP / (TP + FP)
F1 = 2TP / (2TP + FP + FN)
Accuracy = (TP + TN) / Total
Error Rate = (FP + FN) / Total

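These formulae translate directly into code. A minimal Python sketch, assuming the raw counts
are already available; the sample numbers reproduce the CHF case study on the next slide:

def confusion_metrics(tp, fp, fn, tn):
    """Compute the standard metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "TPR/Recall": tp / (tp + fn),
        "FNR":        fn / (tp + fn),
        "FPR":        fp / (fp + tn),
        "TNR":        tn / (tn + fp),
        "Precision":  tp / (tp + fp),
        "F1":         2 * tp / (2 * tp + fp + fn),
        "Accuracy":   (tp + tn) / total,
        "Error Rate": (fp + fn) / total,
    }

# Counts from the CHF case study: TP=100, FP=10, FN=5, TN=50
print(confusion_metrics(100, 10, 5, 50))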
INTRODUCTION TO DATA SCIENCE
Case Study – CHF Prediction
Calculate the following metrics for the given confusion matrix:

                  Actual: Y    Actual: N
Predicted: Y      100 (TP)     10 (FP)
Predicted: N      5 (FN)       50 (TN)

1. True Positive Rate (TPR) [Recall / Sensitivity]
2. False Positive Rate (FPR)
3. False Negative Rate (FNR)
4. True Negative Rate (TNR) [Specificity]
5. Precision
6. F1 Score
7. Accuracy
8. Error Rate or Misclassification Rate

INTRODUCTION TO DATA SCIENCE
Case Study – CHF Prediction

                  Actual: Y    Actual: N
Predicted: Y      100 (TP)     10 (FP)
Predicted: N      5 (FN)       50 (TN)

1. True Positive Rate (TPR) [Recall / Sensitivity] = 100/105 = 0.95
2. False Positive Rate (FPR) = 10/60 = 0.17
3. False Negative Rate (FNR) = 5/105 = 0.048
4. True Negative Rate (TNR) [Specificity] = 50/60 = 0.83
5. Precision = 100/110 = 0.91
6. F1 Score = 200/215 = 0.93
7. Accuracy = 150/165 = 0.91
8. Error Rate or Misclassification Rate = 15/165 = 0.09

INTRODUCTION TO DATA SCIENCE
ROC Curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds.
It shows the trade-off between Sensitivity and Specificity.
An ROC curve plots two parameters:
• True Positive Rate
• False Positive Rate
For probabilistic classifiers, scikit-learn's default predict() uses 0.5 as the threshold, so its
confusion matrix is computed at that cutoff.

https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
https://towardsdatascience.com/understanding-the-roc-curve-in-three-visual-steps-795b1399481c

INTRODUCTION TO DATA SCIENCE


Data points for Likelihood of Repaying a Loan
• The probabilities usually range between 0 and 1.
• The higher the value, the more likely the person is to repay a loan.
• In the example in the figure, we've selected a threshold at 0.35:
  • All predictions at or above this threshold are classified as "will repay".
  • All predictions below this threshold are classified as "won't repay".

INTRODUCTION TO DATA SCIENCE


Altering the Threshold values
• Altering the threshold to the 0, 0.35, 0.5, 0.65 and 1 levels: notice how the FPR and TPR
change accordingly.
• Overall, we can see this is a trade-off. As we increase our threshold, we'll be better at
classifying negatives, but at the expense of misclassifying more positives.

INTRODUCTION TO DATA SCIENCE


ROC Curves: Plot TPR and FPR for every Cutoff

[Figure: ROC curve; the shaded region is the Area under the ROC Curve (AUC)]

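A minimal sketch of how such a curve can be produced with scikit-learn, using synthetic
scores (the label counts and score distributions here are illustrative assumptions, not the
figures above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.repeat([0, 1], 100)
# Hypothetical scores: positives tend to score higher than negatives
y_score = np.concatenate([rng.normal(0.4, 0.15, 100), rng.normal(0.65, 0.15, 100)])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR and FPR at every cutoff
print("AUC =", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()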
INTRODUCTION TO DATA SCIENCE


Ideal Classifier - Example

The higher the AUC, the better the classifier model.

INTRODUCTION TO DATA SCIENCE
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
https://towardsdatascience.com/understanding-the-roc-curve-in-three-visual-steps-795b1399481c

THANK YOU

INTRODUCTION TO DATA SCIENCE
