1 - Introduction to Data Science
• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs
• I have provided source information wherever necessary
• I have added and modified the content to suit the requirements of the course
Reference Books
R1 The Art of Data Science, Roger D. Peng and Elizabeth Matsui
R2 Ethics and Data Science, DJ Patil, Hilary Mason, and Mike Loukides
R3 Python Data Science Handbook: Essential Tools for Working with Data, Jake VanderPlas
R4 KDD, SEMMA and CRISP-DM: A Parallel Overview, Ana Azevedo and M.F. Santos, IADS-DM, 2008
Canvas
Platform
• Python / Jupyter Notebook / Google Colab
Dataset
• Datasets as we deem appropriate.
Webinar
• As per schedule
Source: https://analyticsindiamag.com/why-a-career-in-data-science-should-be-your-prime-focus-in-2020/
• Market Revenue
In recent years, with the shift toward analytics and data science, market revenue has increased.
The analytics, data science, and big data industry in India generated about $2.71 billion annually in 2018, and the revenue grew to $3.03 billion in 2019.
This 2019 figure is expected to nearly double by 2025.
Source: Data Science using Python and R – Chantal Larose & Daniel Larose
Source: Simplilearn – https://www.youtube.com/watch?v=X3paOmcrTjQ
Phases of Data Science Process
1. Problem Description – Clearly describe the project objectives; create a problem statement that can be solved using data science.
2. Data Preparation – Data cleaning/preparation is the more labor-intensive phase.
3. Exploratory Data Analysis – Gain insights into the data through graphical exploration.
4. Setup – Establish baseline model performance; partition the data, and balance the data if needed.
5. Modeling – The coding of the data science process; apply various algorithms to uncover hidden relationships.
6. Evaluation – Determine whether your models are any good; select the best-performing model from a set.
7. Deployment – Deploy the model for real-world problems.
Source: Data Science Using Python & R by Chantal Larose and Daniel Larose
• Text Analysis
Uses a homegrown tool called "DeepText" to analyze words from user posts and extract meaning from them
E.g., extracts people's interests and aligns photographs with text
• Targeted Advertising
Uses deep learning for targeted advertising
Forms clusters of users based on their preferences
and runs advertisements that appeal to them
• Personalized recommendation
Heavily relies on predictive analytics (a personalized
recommender system) to increase customer satisfaction
Analyzes the purchase history of customers, other customer
suggestions, and user ratings to recommend products
Source: https://data-flair.training/blogs/data-science-use-cases/
• Customer segmentation
Using various data-mining techniques, banks are able to segment their customers into high-value and low-value segments
Data scientists make use of clustering, logistic regression, decision trees, etc., to help banks understand the Customer Lifetime Value (CLV) of customers and group them into the appropriate segments
Source: https://data-flair.training/blogs/data-science-applications/
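As a rough illustration of the clustering technique mentioned above, here is a minimal sketch using scikit-learn's KMeans; the customer features and numbers are hypothetical, not from any bank's data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer features: [annual_spend, num_transactions, tenure_years]
X = np.array([
    [52000, 340, 8.0],
    [1200,   15, 0.5],
    [47000, 290, 6.5],
    [900,    10, 1.0],
    [30000, 150, 4.0],
])

# Scale the features so no single feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Cluster customers into two segments (e.g., high-value vs. low-value)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_scaled)
print(segments)  # cluster label per customer, e.g., [0 1 0 1 0]
```

In practice the cluster labels would then be profiled (e.g., by average CLV) to decide which segment counts as "high-value".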
• Identify the data analytics problem that can bring more business to the organization.
• Discover solutions after working on the data.
• Understand all the critical datasets and variables.
• Collect structured and unstructured data from reliable sources.
• Work on unstructured data, such as images and videos.
• Analyze data and find hidden patterns and insights.
• Clean data by removing missing values and outliers to improve accuracy.
• Apply different models and algorithms to find business solutions.
• Communicate the insights to clients with the help of visualization tools.
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data
There is no fixed threshold for the volume of data to be considered as big data
However, the term big data is used for massive scale data that is difficult to store, manage,
and process using traditional databases and data processing architectures
Specialized tools and frameworks are required to store, process, and analyze such data, for example: the Hadoop ecosystem, NoSQL databases, programming languages (Python, R), etc.
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data (5Vs): Velocity
• Velocity is the primary reason for the exponential growth of data in a short span of time
• Velocity of data refers to how fast the data is generated
• Data generated by certain sources can arrive at very high velocities. For example: social media data or sensor data
[Figure: the 5Vs of big data – Volume, Velocity, Variety, Veracity, Value]
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data (5Vs): Variety & Veracity
• Variety
Variety refers to the different forms of the data
Big data comes in different forms such as structured, unstructured, or semi-structured, e.g., text, image, audio, video, and sensor data
Big data systems need to be flexible enough to handle such variety of data
• Veracity
Veracity refers to how accurate the data is
To extract value from the data, the data needs to be cleaned to remove noise
Data-driven applications can reap the benefits of big data only when the data is meaningful and accurate
Therefore, cleansing of data is important so that incorrect and faulty data can be filtered out
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Characteristics of Big Data (5Vs): Value
• Value of data refers to the usefulness of data for the intended purpose
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
Analytics Goals: The Driver
• What drives the choice of technologies, algorithms, and frameworks used for analytics?
The goals of the analytic task at hand:
• To predict something
• To find patterns in the data
• To find relationships in the data
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
• Descriptive analysis
• Diagnostic analysis
• Predictive analysis
• Prescriptive analysis
Source: https://blogs.gartner.com/jason-mcnellis/2019/11/05/youre-likely-investing-lot-marketing-analytics-getting-right-insights/
Techniques / Algorithms:
• Regression
• Classification
• ML algorithms like linear regression, logistic regression, SVM
• Deep learning techniques
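To make the regression/classification distinction above concrete, here is a minimal scikit-learn sketch on synthetic data (all numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data: hours of machine usage vs. maintenance cost
hours = np.array([[10], [20], [30], [40], [50]])
cost = np.array([105, 210, 290, 420, 500])

# Regression: predict a continuous value
reg = LinearRegression().fit(hours, cost)
print(reg.predict([[35]]))  # estimated cost at 35 hours

# Classification: predict a discrete label (failure: yes/no)
failed = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(hours, failed)
print(clf.predict([[45]]))  # predicted failure label at 45 hours
```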
https://www.10xds.com/blog/cognitive-analytics-to-reinvent-business/
Cognitive Analytics
https://interestingengineering.com/cognitive-computing-more-human-than-artificial-intelligence
Cognitive Analytics
Involves Semantics, AI, Machine learning, Deep Learning, Natural Language
Processing, and Neural Networks.
Simulates human thought process to learn from the data and extract the hidden
patterns from data.
Uses all types of data: audio, video, text, images in the analytics process.
Although this is the top tier of analytics maturity, Cognitive Analytics can be used in
the prior levels.
According to Jean Francois Puget:
”It extends the analytics journey to areas that were unreachable with more
classical analytics techniques like business intelligence, statistics, and
operations research.”
https://www.ecapitaladvisors.com/blog/analytics-maturity/
https://www.xenonstack.com/insights/what-is-cognitive-analytics/
Descriptive Analytics – Example #1
Problem Statement:
"The market research team at Aqua Analytics Pvt. Ltd. is assigned a task to identify the profile of a typical customer for a digital fitness product."
Data captured:
• Gender
• Age (in years)
• Education (in years)
• Relationship Status (Single or Partnered)
https://medium.com/@ashishpahwa7/first-case-study-in-descriptive-analytics-a744140c39a4
Descriptive Analytics – Example #1
Problem Statement:
"During the 1980s, General Electric was selling different products to its customers, such as light bulbs, jet engines, windmills, and other related products. They also sold parts and services separately: they would sell you a product, you would use it until it needed repair, either through normal wear and tear or because it broke, and then you would come back to GE, which would sell you the parts and services to fix it. GE's model focused on how much GE was selling, in sales of operational equipment and in sales of parts and services, and on what GE needed to do to drive up those sales."
https://medium.com/parrotai/understand-data-analytics-framework-with-a-case-study-in-the-business-world-15bfb421028d
Diagnostic Analytics – Example #1
https://www.sganalytics.com/blog/change-management-analytics-adoption/
Predictive Analytics – Example #1
• Google launched Google Flu Trends (GFT), to collect predictive analytics
regarding the outbreaks of flu. It’s a great example of seeing big data
analytics in action.
• So, did Google manage to predict influenza activity in real-time by
aggregating search engine queries with this big data and adopting
predictive analytics?
• Even with a wealth of big data analytics on search queries, GFT
overestimated the prevalence of flu by over 50% in 2012-2013 and
2011-2012.
• They matched the search engine terms used by people in different regions of the world. When these queries were compared with traditional flu surveillance systems, Google found that the predictive analytics of the flu season pointed towards a correlation with higher search engine traffic for flu-related terms.
Predictive Analytics – Example #1
https://www.slideshare.net/VasileiosLampos/usergenerated-content-collective-and-personalised-inference-tasks
Predictive Analytics – Example #2
Colleen Jones applied predictive analytics to FootSmart (a niche online catalog
retailer) on a content marketing product. It was called the FootSmart Health
Resource Center (FHRC) and it consisted of articles, diagrams, quizzes and the like.
On analyzing the data around increased search engine visibility, FHRC was found
to help FootSmart reach more of the right kind of target customers.
They were receiving more traffic, primarily consisting of people that cared about foot
health conditions and their treatments.
FootSmart decided to push more content at FHRC and also improve its
merchandising of the product.
The result of such informed, data-driven decision making? A 36% increase in weekly sales.
https://www.footsmart.com/pages/health-resource-center
Predictive Analytics – Example #2
With this information, the provider can now use predictive analytics to get an idea of how
many more ophthalmology claims it might receive during the next year.
Then, using prescriptive analytics, the company can look at scenarios where the reimbursement costs for ophthalmology increase, decrease, or hold steady. These scenarios then allow them to make an informed decision about how to proceed in a way that's both cost-effective and beneficial to their customers.
Source: https://www.micromd.com/blogmd/5-types-medical-practices/
Image Source: https://technologyadvice.com/blog/healthcare/best-medical-practice-management-software/
Source: https://integratedmp.com/healthcare-practice-analytics-101-numbers-charts-dashboards-oh-my/
Source: https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
Self study
https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
https://www.youtube.com/watch?v=olpuyn6kemg
Table of Contents
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-Cycle
7 Challenges in Data-Driven Decision-Making
Data Analytics Methodologies
• A methodology is a set of guiding principles and processes used to plan, manage, and execute projects.
• It helps data analysts to reduce risks, avoid duplication of effort, and ultimately increase the impact of the project.
Need for a Methodology
• Framework for recording experience
• Allows projects to be replicated
• Aid to project planning and management
• "Comfort factor" for new adopters
• Demonstrates the maturity of data mining
• Reduces dependency on "stars"
• Encourages best practices and helps to obtain better results
Data Analytics Methodology
10 Questions the process aims to answer
Problem to Approach
1 What is the problem that you are trying to solve?
2 Are there available solutions to similar problems?
Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
Table of Contents
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-Cycle
7 Challenges in Data-Driven Decision-Making
CRISP-DM
[Figure: the CRISP-DM phases]
• Cross-Industry Standard Process for Data Mining
• People realized they needed a process to define data mining steps applicable across any industry, such as retail, e-commerce, healthcare, etc.
• Conceived by Daimler-Benz and Integral Solutions Ltd in 1996
• 6 high-level phases
CRISP-DM Phases
1. Business Understanding – What does the business need?
2. Data Understanding – What data do we have and need? Is it clean?
3. Data Preparation – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
• Run the data mining tools.
5. Evaluation – Which model best meets the business objectives?
• Determine if results meet business objectives.
• Identify business issues that should have been addressed earlier.
6. Deployment – How do stakeholders access the results?
• Put the resulting models into practice.
• Set up for continuous mining of the data.
CRISP-DM Phases and Tasks
[Figure: the CRISP-DM phases and their generic tasks]
Why CRISP-DM?
• CRISP-DM provides a uniform framework
• This methodology is cost-effective, as it includes a number of well-established processes for carrying out common data mining tasks across industries
• CRISP-DM encourages best practices and allows projects to be replicated
• This methodology provides a uniform framework for planning and managing a project
• Being a cross-industry standard, CRISP-DM can be implemented in any data science project irrespective of its domain
Table of Contents
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-Cycle
7 Challenges in Data-Driven Decision-Making
SEMMA
SEMMA
• SEMMA is not a data mining methodology but rather a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining.
• Enterprise Miner is data mining software for creating predictive and descriptive models for large volumes of data.
• Enterprise Miner can be used as part of any iterative data mining methodology adopted by the client. Naturally, steps such as formulating a well-defined business or research problem and assembling quality, representative data sources are critical to the overall success of any data mining project.
• SEMMA is focused on the model development aspects of data mining.
• SEMMA overlaps with the Data Preparation, Modeling, and Evaluation phases of CRISP-DM.
SEMMA Stages
1. Sample
• Sampling the data by extracting a portion of a large data set big enough to contain the significant information, yet small enough to manipulate quickly.
• Partitioning the data to create training and test samples (see the sketch after the stage list).
• Identifying dependent and independent variables influencing the process.
2. Explore
• Exploration of the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas.
• Perform univariate analysis (single variable) and multivariate analysis (relationships).
3. Modify
• Modification of the data by creating, selecting, and transforming the variables to focus the model selection process.
SEMMA Stages
4. Model
• Apply a variety of data mining techniques to produce a projected model [ML, deep learning, transfer learning].
5. Assess
• Assess the data by evaluating the usefulness and reliability of the findings from the data mining process, and estimate how well the model performs.
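A minimal sketch of the Sample stage in Python, assuming pandas and scikit-learn (the file name and column names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the full data set (hypothetical file and columns)
df = pd.read_csv("customers.csv")

# Sample: extract a portion big enough to be significant, small enough to handle
sample = df.sample(frac=0.10, random_state=42)

# Identify independent variables (features) and the dependent variable (target)
X = sample[["age", "income", "num_purchases"]]
y = sample["churned"]

# Partition into training and test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```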
Advantages and Disadvantages
Advantages:
• Focus on only “Model aspects of Data Mining”
• Useful in most machine learning projects where data comes from a single data source
Ex: Pima Indians Diabetes Dataset [predict diabetes], Titanic Dataset [predict passenger survival] from Kaggle
Disadvantages:
• Does not take into account the business understanding of a problem
• Disregards Data Collection and Processing from different data sources
https://www.diva-portal.org/smash/get/diva2:1250897/FULLTEXT01.pdf
Table of Contents
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-Cycle
7 Challenges in Data-Driven Decision-Making
SMAM
SMAM (Standard Methodology for Analytics Models)
http://www.datascienceassn.org/content/standard-methodology-analytical-models
SMAM Phases
Phase – Description
Use-case identification – Selection of the ideal approach from a list of candidates
Model requirements gathering – Understanding the conditions required for the model to function
Data preparation – Getting the data ready for the modeling
Modeling experiments – Scientific experimentation to solve the business question
Insight creation – Visualization and dash-boarding to provide insight
Proof of Value: ROI – Running the model in a small-scale setting to prove the value
Operationalization – Embedding the analytical model in operational systems
Model life-cycle – Governance around model lifetime and refresh
SMAM Phases
Phase I - Use Case Identification
• Brainstorming of Business / Management / SMEs (Domain) / IT (Data Scientist)
teams
• Discussion revolves around:
• Business Needs
• Expert inputs on the domain
• Data Availability
• Analytical Model Complexity – time and effort
• Outcome: Selected Use Case and roadmap for next phases
SMAM Phases
Phase II – Model Requirements Gathering
• Involved parties include Business / End-users / Data Scientists / IT
• Preparation of Model Requirement Document
• Business requirements
• IT requirements
• End user requirements
• Scoring requirements
• Data requirements
• Analytical model requirements
SMAM Phases
Phase III – Data Preparation
• Involved parties include IT / Data Administrators / DBA / Data Modelers and Data
Scientists
• Discussion on:
• Data Access
• Data Location
• Data Understanding
• Data Validation
• Data format [prepared by DBAs and consumed by Data Scientist]
• The process is agile; the data scientist tries out various approaches on smaller sets and may then ask IT / DBAs to perform the required transformations at scale.
SMAM Phases
Phase IV – Modeling Experiments
• Data Scientist:
• Creates testable hypotheses
• Defines model features
• Creates the analytical model
• Evaluates the analytical model
SMAM Phases
Phase V – Insight Creation
• Data Scientist:
• Analytical reporting [Inference] and Operational reporting [Prediction]
• Visualization and Dashboards
• Provide business usable insights
SMAM Phases
Phase VI – Proof of Value: ROI
• Quality of the analytical model is observed [Ex: Accuracy of the model is >90%]
• Analytical model is applied to new data and outcomes are measured to verify if
financially viable [for small POC].
• If ROI is positive for POC:
• Set up full-scale experiment with control groups
• Measure the model effectiveness
• Compute ROI and success criteria
• Involve Finance department / IT / End-users and Data Scientists in this phase
SMAM Phases
Phase VII – Operationalization
• The data scientist works with the IT department to create repeatable experimentation of the model, and a hand-over process for the model
• IT prepares the operational environment
• Audit structure
• Integration with existing / legacy applications
• Possible software development as a mobile / web app for end-user usage
SMAM Phases
Phase VIII – Model Lifecycle
• Involves maintenance of the analytical model in view of changing customer needs
• Two types of model changes:
a. Model Refresh – Model is trained with more recent data, leaving the model
structurally untouched
b. Model Upgrade – Initiated by availability of new data sources and a
business request to improve model performance.
• Involved are operational team, IT team, Data Scientists, DBAs, end-users
Table of Contents
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-Cycle
7 Challenges in Data-Driven Decision-Making
Big Data Analytics Life Cycle
• Big data differs from traditional data primarily in volume, velocity, variety, veracity, and value
• A step-by-step methodology is required to acquire, process, analyze, and visualize big data
Big Data Analytics Life Cycle
Stage I: Business Case Evaluation
• Create a well-defined business case and get approval
• Identify KPIs that define the assessment criteria, to make business goals SMART
(specific, measurable, attainable, relevant, timely)
• Business case must qualify as a ‘big data’ problem – volume, velocity, variety,
veracity, value
• Outcome: Budget requirements, identify software (tools), hardware, training
requirements
Big Data Analytics Life Cycle
Stage II: Data Identification
• Identify the datasets required for the project and their sources
• Guideline: identify as many sources as possible that help gain insights
• Sources can be internal / external to the enterprise
• Internal – data marts, data warehouses, or operational systems
• External – data within blogs, websites, etc.
Big Data Analytics Life Cycle
Stage III: Data Acquisition and Filtering
• Data is gathered from all the sources identified in the previous phase
• Data filtering is performed to remove corrupt and noisy data (a pandas sketch follows this list)
• Corrupt – records with missing or nonsensical values, or invalid data types
• Create metadata, which helps with data provenance, accuracy, and quality:
• Dataset size & structure
• Source information
• Date and time of creation
• Language-specific information
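A minimal sketch of acquisition-time filtering and metadata capture in pandas (the file name, columns, and metadata fields are hypothetical):

```python
import pandas as pd
from datetime import datetime, timezone

# Acquire raw data from an identified source (hypothetical file)
raw = pd.read_csv("claims_raw.csv")

# Filter corrupt records: drop rows with missing key fields,
# then drop rows with nonsensical values (negative claim amounts)
clean = raw.dropna(subset=["claim_id", "amount"])
clean = clean[clean["amount"] >= 0]

# Capture metadata to support provenance and quality checks
metadata = {
    "source": "claims_raw.csv",
    "acquired_at": datetime.now(timezone.utc).isoformat(),
    "rows_raw": len(raw),
    "rows_after_filtering": len(clean),
    "columns": list(clean.columns),
}
print(metadata)
```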
Big Data Analytics Life Cycle
Stage IV: Data Extraction
• Extract disparate data and transform it into a format that the underlying big data solution can use for analysis
Big Data Analytics Life Cycle
Stage V: Data Validation and Cleansing
• Big data solutions may receive redundant data across sources
• Redundancy can be used to interconnect datasets and fill in missing values, as in the example below:
• The first value in Dataset B is validated against its corresponding value in Dataset A
• The second value in Dataset B is not validated against its corresponding value in Dataset A
• If a value is missing, it is inserted from Dataset A
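A minimal sketch of this cross-dataset validation and fill-in with pandas (the datasets and values are hypothetical):

```python
import pandas as pd

# Two redundant sources keyed by record id (hypothetical sensor readings)
a = pd.DataFrame({"id": [1, 2, 3], "temp": [21.0, 19.5, 23.1]}).set_index("id")
b = pd.DataFrame({"id": [1, 2, 3], "temp": [21.0, None, 25.0]}).set_index("id")

# Validate: flag values in B that disagree with their counterpart in A
mismatch = b["temp"].notna() & (b["temp"] != a["temp"])
print(b[mismatch])  # records needing review (id 3 here)

# Cleanse: fill missing values in B from the redundant copy in A
b_filled = b.combine_first(a)
print(b_filled)
```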
Big Data Analytics Life Cycle
Stage VI: Data Aggregation and Representation
• Integrating multiple datasets together to arrive at a unified view
• Involves joining datasets based on common fields such as ID or date (a pandas sketch follows this list)
• Semantic standardization (e.g., "Surname" and "Last name" – the same value labeled differently in different datasets)
• Represent using a standard data format (e.g., a row-oriented database)
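A minimal sketch of joining two datasets on common fields after semantic standardization, using pandas (hypothetical data):

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "Surname": ["Rao", "Iyer"]})
orders = pd.DataFrame(
    {"id": [1, 2], "Last name": ["Rao", "Iyer"], "amount": [250, 410]}
)

# Semantic standardization: the same value is labeled differently per dataset
orders = orders.rename(columns={"Last name": "Surname"})

# Join on the common fields to arrive at a unified view
unified = customers.merge(orders, on=["id", "Surname"], how="inner")
print(unified)
```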
Big Data Analytics Life Cycle
Stage VII: Data Analysis
• Perform EDA (exploratory data analysis)
• Apply analytics: descriptive, diagnostic, predictive, or prescriptive
Big Data Analytics Life Cycle
Stage VIII: Data Visualization
• Use tools to graphically visualize and communicate the insights to business users
• Present dashboards
• Excel, Tableau, Power BI, etc.
Big Data Analytics Life Cycle
Stage IX: Utilization of Analysis Results
• Determine how and where the processed analysis results can be leveraged
• Results can be:
• Fed as input to enterprise systems (e.g., a customer analysis result fed into an OTT platform to assist recommendation)
• Used to refine the business process (e.g., consolidate transportation routes as part of the supply chain process)
• Used to generate alerts (e.g., send notifications to users via email or SMS about impending events)
Big Data Analytics Life Cycle
Case Study: Background
• Company X is an Insurance Company that deals with health and home insurance
• The company has a ‘Claim Management System’ which contains the claim data,
incident photographs and claim notes
• The company wants to invest in Big Data Analytics to “detect fraudulent claims in the
building sector”
• Let us see how the company uses the ‘Big Data Analytics’ Lifecycle to achieve the
objective of ‘detecting fraudulent claims in the building sector’
* Building insurance is a type of home insurance that covers the structure of the house against various kinds of danger or risk
Case Study: Detect Fraudulent Claims
[Figures: the big data analytics life-cycle stages applied, step by step, to detecting fraudulent claims]
Case Study: Detect Fraudulent Claims
• The machine learning model was incorporated into the existing claim
processing system to flag fraudulent claims.
Big Data Life-Cycle – II
Phase 1: Foundations
Phase 2: Acquisition
Phase 3: Preparation
Phase 4: Input and Access
Phase 5: Processing
Phase 6: Output and Interpretation
Phase 7: Storage
Phase 8: Integration
Phase 9: Analytics and Visualization
Phase 10: Consumption
Phase 11: Retention, Backup, and Archival
Phase 12: Destruction
Big Data Life-Cycle
Phase 1: Foundations
• Understanding and validating data requirements, solution scope, roles and
responsibilities, data infrastructure preparation, technical and non-technical
considerations, and understanding data rules in an organization.
Phase 2: Data Acquisition
• Data Acquisition refers to collecting data.
• Data sets can be obtained from various sources, both internal and external to the
business organizations.
• Data sources can be in:
• structured forms, such as data transferred from a data warehouse, a data mart, or various transaction systems.
• semi-structured forms, such as weblogs and system logs.
• unstructured forms, such as media files consisting of video, audio, and pictures.
Big Data Life-Cycle
Phase 3: Data Preparation
• Collected data (Raw Data) is rigorously checked for inconsistencies, errors, and
duplicates.
• Redundant, duplicated, incomplete, and incorrect data are removed.
• The objective is to have clean and usable data sets.
Phase 4: Data Input and Access
• Data input refers to sending data to planned target data repositories, systems, or
applications.
• Data can be stored in CRM (Customer Relationship Management) application, a data
lake or a data warehouse.
• Data access refers to accessing data using various methods.
• NoSQL is widely used to access big data.
Big Data Life-Cycle
Phase 5: Data Processing
• Processing the raw form of data.
• Convert data into a readable format giving it the form and the context.
• Interpret the data using the selected data analytics tools such as Hadoop MapReduce, Hive,
Pig, and Spark SQL.
• Data processing also includes activities
• Data annotation – refers to labeling the data.
• Data integration – aims to combine data existing in different sources, and provide a unified view
of data to the data consumers.
• Data representation – refers to the way data is processed, transmitted, and stored.
• Data aggregation – aims to compile data from databases into combined datasets to be used for data processing.
Big Data Life-Cycle
Phase 6: Data Output and Interpretation
• In the data output phase, the data is in a format which is ready for consumption by the
business users.
• Transform data into usable formats such as plain text, graphs, processed images, or
video files.
• This phase is also called data ingestion.
• Common Big Data ingestion tools are Sqoop, Flume, and Spark streaming.
• Interpreting the ingested data requires analyzing it and extracting information or meaning from it, to answer the questions related to the big data business solution.
Big Data Life-Cycle
Phase 7: Data Storage
• Store data in designed and designated storage units.
• Storage infrastructure can consist of storage area network (SAN), network-attached storage (NAS), or direct-attached storage (DAS) formats.
Phase 8: Data Integration
• Integration of stored data to different systems for various purposes.
• Integration of data lakes with a data warehouse or data marts.
Phase 9: Data Analytics and Visualization
• Integrated data can be useful and productive for data analytics and visualization.
• Business value is gained in this phase.
Big Data Life-Cycle
Phase 10: Data Consumption
• Data is turned into information ready for consumption by the internal or external users,
including customers of the business organization.
• Data consumption requires architectural input for policies, rules, regulations, principles, and guidelines.
Phase 11: Retention, Backup, and Archival
• Use established data backup strategies, techniques, methods, and tools.
• Identify, document, and obtain approval for the retention, backup, and archival
decisions.
Phase 12: Data Destruction
• There may be regulatory requirements to destroy a particular type of data after a certain amount of time.
• Confirm the destruction requirements with the data governance team in the business organization.
Table of Contents
1 Data Analytics
2 Data Analytics Methodologies
3 CRISP-DM
4 SEMMA
5 SMAM
6 Big Data Life-Cycle
7 Challenges in Data-Driven Decision-Making
Challenges in Data-Driven Decision-Making
1. Discrimination
• Algorithmic discrimination can come from various sources:
• Data used to train algorithms may have biases that lead to discriminatory decisions.
• Discrimination may arise from the use of a particular algorithm.
Challenges in Data-Driven Decision-Making
1. Racism embedded in US healthcare
In October 2019, researchers found that an algorithm used on more than 200
million people in US hospitals to predict which patients would likely need extra
medical care heavily favoured white patients over black patients. While race
itself wasn’t a variable used in this algorithm, another variable highly correlated
to race was, which was healthcare cost history. The rationale was that cost
summarizes how many healthcare needs a particular person has. For various
reasons, black patients incurred lower healthcare costs than white patients with
the same conditions on average.
Challenges in Data-Driven Decision-Making
2. Amazon's hiring algorithm
Amazon is one of the largest tech giants in the world, so it is no surprise that it is a heavy user of machine learning and artificial intelligence. In 2015, Amazon realized that its algorithm used for hiring employees was biased against women. The reason was that the algorithm was trained on the resumes submitted over the past ten years, and since most of the applicants were men, it learned to favor men over women.
Challenges in Data-Driven Decision-Making
2. Lack of transparency
• Transparency refers to the capacity to understand a computational model and therefore
contribute to the attribution of responsibility for consequences derived from its use.
• A model is transparent if a person can easily observe it and understand it.
• Three types of opacity (i.e. lack of transparency) in algorithmic decisions
• Intentional opacity
• Knowledge opacity
• Intrinsic opacity
Challenges in Data-Driven Decision-Making
3. Violation of privacy
• Misuse of users' personal data, and data aggregation by entities such as data brokers, may have direct implications for people's privacy. [Google faced a lawsuit for privacy violation in 2020]
4. Digital literacy
• Devote resources to digital and computer literacy programs, from children to the elderly.
• This enables society to make informed decisions about technologies it would otherwise not understand. [Cases of cyberbullying among the juvenile population]
5. Fuzzy responsibility
• As more and more decisions that affect millions of people are made automatically by
algorithms, we must be clear about who is responsible for the consequences of these
decisions. Transparency is often considered a fundamental factor in the clarity of the attribution of responsibility.
C H A L L E N G E S IN D AT A D R I V E N
D E C I S I O N -MA K I N G
6. Lack of ethical frameworks
• Algorithmic data-based decision-making processes generate important ethical dilemmas
regarding what actions are appropriate in light of the inferences made by algorithms.
• It is therefore essential that decisions be made in accordance with a clearly defined and
accepted ethical framework.
• There is no single method for introducing ethical principles into algorithmic decision
processes.
7. Lack of diversity
• Data-based algorithms and artificial intelligence techniques for decision-making have
been developed by homogeneous groups of IT professionals.
• Ensure that teams are diverse in terms of areas of knowledge as well as demographic
factors
References
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-project.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
https://medium.com/illumination-curated/big-data-lifecycle-management-629dfe16b78d
https://www.esadeknowledge.com/view/7-challenges-and-opportunities-in-data-based-decision-making-193560
Thank You
Big Data Analysis Life Cycle – Use Cases
• What is precision agriculture? Why is it a likely answer to climate change and food security?
https://www.youtube.com/watch?v=581Kx8wzTMc&ab_channel=Inter-AmericanDevelopmentBank
• Innovating for Agribusiness
https://www.youtube.com/watch?v=C4W0qSQ6A8U
• How Big Data Can Solve Food Insecurity
https://www.youtube.com/watch?v=4r_IxShUQuA&ab_channel=mSTARProject
• AI for AgriTech ecosystem in India – IBM Research
https://www.youtube.com/watch?v=hhoLSI4bW_4&ab_channel=IBMIndia
• Bringing Artificial Intelligence to agriculture to boost crop yield
https://www.youtube.com/watch?v=GSvT940uS_8&ab_channel=MicrosoftIndia
• Artificial intelligence could revolutionize farming industry
https://www.youtube.com/watch?v=cw3flTRrPts
DSECL ZG523
Introduction to Data Science
Dr. Vijayalakshmi
Table of Contents
1 Data Science Process
2 Case Study
Data Science Process
10 Questions the process aims to answer
• Problem to Approach
1 What is the problem that you are trying to solve?
2 How can you use data to answer the question? (CRISP-DM approach)
• Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
• Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
Source: CognitiveClass
Data Science Process
[Figure: the iterative data science methodology – Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback]
Data Science Process
From Problem to Approach
• Business Understanding
• Analytic Approach
From Requirements to Collection
• Data Requirements
• Data Collection
[Figure: the corresponding stages highlighted in the methodology diagram]
Table of Contents
1 Data Science Process
2 Case Study
Hospital Readmissions
Image Source: https://medium.com/nwamaka-imasogie/predicting-hospital-readmission-using-nlp-5f0fe6f1a705
Hospital Readmissions – Scenario
• Hospital readmission is a common problem in the healthcare sector, wherein a patient after discharge gets re-admitted to the hospital for reasons such as:
• Medication errors
• Medication noncompliance by the patient
• Fall injuries
• Lack of timely follow-up care
• Inadequate nutrition
• Inadequate discussion of palliative care [relief from suffering]
• Infection
• Failure to identify post-acute care needs, etc.
• Hospital readmissions may bring a bad name to the hospital / treating doctor / support staff, and lead to increased length of stay and expenditure for the hospital and the patient. Hence, it is a critical issue that needs addressing.
Hospital Readmissions – Scenario
• There is a limited budget for providing healthcare to the public.
• Hospital readmissions for re-occurring problems are considered a sign of failure in the healthcare system.
• There is a dire need to properly address the patient's condition prior to the initial patient discharge.
• An American healthcare insurance provider, healthcare authorities in the region, and IBM data scientists asked:
• What is the best way to allocate these funds to maximize their use in providing quality care?
Source: CognitiveClass
From Problem to Approach
[Figure: the data science methodology with Business Understanding and Analytic Approach highlighted]
Business Problem to Analytic Approach
Every Data Science activity is time-bound and involves cost and resources
Source: CognitiveClass
1. Business Understanding (Concept)
• What is the problem that you are trying to solve?
• Identify the goal.
Source: CognitiveClass
Case Study – 1. Business Understanding
Examining hospital readmissions [Insurance Company + Hospitals + Data Scientists]
• It was found that approximately 30% of individuals who finish rehab treatment would be readmitted to a rehab center within one year.
• 50% would be readmitted within five years.
• After reviewing some records, it was found that patients with heart failure were high on the readmission list.
Source: CognitiveClass
Case Study – 1. Business Understanding
• Data scientists proposed and organized an on-site workshop.
• It was found that a decision tree model could be applied to investigate this scenario, to determine the reason for this phenomenon. [Why – Diagnostic Analytics]
• The business sponsor's involvement throughout the project was critical, because the sponsor:
• Set the overall direction
• Remained committed and advised
• Got the necessary support when required
Source: CognitiveClass
Case Study – 1. Business Understanding
Finally, four business requirements were identified for whatever model would be built.
• Case study question
• What is the best way to allocate the limited healthcare budget to maximize its use?
• Business requirements
• To predict the risk of readmission. [Predictive Analytics]
• To predict readmission outcomes for those patients with Congestive Heart Failure (CHF).
• To understand the combination of events that led to the predicted outcome.
• To apply an easy-to-understand process to new patients, regarding their readmission risk.
Source: CognitiveClass
2. Analytic Approach (Concept)
Available data
• Patient data, readmissions data, CHF data, etc.
How can we use data to answer the question? Choose the analytic approach based on the type of question:
• Descriptive (current data) – What happened?
• Diagnostic (statistical analysis) – Why is this happening?
• Predictive (forecasting) – What if these trends continue? What will happen next?
• Prescriptive – How do we solve it?
2. Analytic Approach (Concept)
The analytic approach can be selected once a clear understanding of the question is established:
• E.g., if the question is to determine the probabilities of an action, a predictive model may be appropriate.
Source: CognitiveClass
Analytic Approach – Decision Tree (Concept)
What is a decision tree?
1. An algorithm that represents a set of questions and decisions using a tree-like structure.
2. It provides a procedure to decide which questions to ask, and when to ask them, to predict the value of an outcome.
[Figure: an example decision tree]
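A minimal sketch of a decision tree classifier with scikit-learn (synthetic data; the feature names are illustrative, not from the case study):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic patient records: [age, prior_admissions]
X = [[45, 0], [62, 3], [70, 5], [38, 1], [55, 2], [80, 6]]
y = [0, 1, 1, 0, 0, 1]  # 1 = readmitted, 0 = not readmitted

# Fit a shallow tree; each internal node is a question about one feature
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# Inspect the learned questions and decisions
print(export_text(tree, feature_names=["age", "prior_admissions"]))

# Score a new patient for readmission risk
print(tree.predict([[66, 4]]))
```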
Case Study – 2. Analytic Approach
• A decision tree classification model was used to identify the combination of conditions leading to each patient's outcome.
• Examining the variables in each of the nodes reveals those conditions.
Source: CognitiveClass
Case Study – 2. Analytic Approach
• For non-data scientists, a decision tree classification model is easy to understand and apply, to score new patients for their risk of readmission.
Case Study – 2. Analytic Approach
• Clinicians can readily see what conditions are causing a patient to be scored as high-risk.
• Multiple models can be built and applied at various points during a hospital stay.
Source: CognitiveClass
From Data Requirements to Data Collection
[Figure: the data science methodology with Data Requirements and Data Collection highlighted]
3. Data Requirements (Concept)
• If our goal is to make a "Biryani" but we don't have the right ingredients, then the success of making a good Biryani will be compromised.
• If the "recipe" is the problem to be solved, then the data are the ingredients.
• The data scientist must ask: which data is required, how can it be sourced or collected, and how can it be understood and worked with?
Source: CognitiveClass
Case Study – 3. Data Requirements
• Data requirements for the case study included selecting a suitable list of patients from the health insurance provider's member base, in order to put together the patients' clinical histories.
Source: CognitiveClass
Case Study – 3. Data Requirements
Defining the data
• The content and format suitable for the decision tree classifier need to be defined.
• Format
• The raw data is transactional: a given patient can have thousands of records that represent all their attributes.
• This model requires one record per patient.
• The data analytics specialists collected the transaction records from patient records.
Source: CognitiveClass
4. Data Collection (Concept)
• Determine whether additional content is required.
• Revise data requirements if needed.
• Assess content, quality, and initial insights of the data.
• Identify gaps in the data.
• Decide how to extract, merge, and archive the data.
[Figure: data collection activities – Extract, Merge, Archive]
Source: CognitiveClass
4. Data Collection (Concept)
• Once data collection is completed, the data scientist performs an assessment to make sure all the required data is available.
• As with the purchase of ingredients for making a meal, some ingredients may be out of season, more difficult to obtain, or cost more than originally planned.
• At this stage, the data requirements are reviewed and a decision is made as to whether more or less data is required for the collection.
• The gaps in the data are identified, and plans for filling or replacing them must be made.
• Once this step is complete, essentially, the ingredients are now ready for washing and cutting.
Source: CognitiveClass
4. Data Collection (Concept)
• The collected data is explored using descriptive statistics and visualization to assess its content and quality.
Case Study – 4. Data Collection
This case study required data about:
• Demographics, clinical and coverage information of patients, provider information, claims records, as well as pharmaceutical and other information related to all the diagnoses of the CHF patients.
Source: CognitiveClass
Case Study – 4. Data Collection
This case study also required other data that was not available:
• Pharmaceutical records
• Information on drugs
This data source was not yet integrated with the rest of the data sources.
In such situations:
• It is okay to postpone decisions about unavailable data and to try to capture it later.
• This can happen even after obtaining intermediate results from predictive modeling.
• If the results indicate that drug information may be important for a good model, you will spend time trying to get it.
Case Study – 4. Data Collection
Data Pre-processing and Merging Data
• Database administrators and programmers often work together to extract data from different sources and then combine them.
From Data Understanding to Data Preparation
[Figure: the data science methodology with Data Understanding and Data Preparation highlighted]
Source: CognitiveClass
5. Data Understanding (Concept)
This section of the methodology answers the question:
• Is the data you collected representative of the problem to be solved?
Descriptive statistics:
• Univariate statistics
• Pairwise correlations
• Histograms
Missing values:
• Sometimes a missing value means "no" or "0" (zero), or sometimes simply "we do not know".
A variable may contain invalid or misleading values:
• E.g., a numeric variable called "age" containing 0 to 100 and 999, where "triple-9" actually means "missing", will be treated as a valid value unless we correct it.
Source: CognitiveClass
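A minimal sketch of such a data-understanding pass in pandas (the file and column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical file

# Recode sentinel values: here 999 in "age" actually means "missing"
df["age"] = df["age"].replace(999, np.nan)

# Univariate statistics and missing-value counts
print(df["age"].describe())
print(df.isna().sum())

# Pairwise correlations between numeric variables
print(df.corr(numeric_only=True))

# Histogram of a single variable (requires matplotlib)
df["age"].hist(bins=20)
```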
Case Study – 5. Data Understanding
Data understanding is an iterative process.
• Originally, the meaning of CHF admission was decided on the basis of a primary diagnosis of CHF.
• However, preliminary data analysis and clinical experience revealed that the initial definition did not cover all cases of CHF admission.
• Secondary and tertiary diagnoses were added, creating a more complete definition of CHF admission.
• This is one example of the iterative processes in the methodology.
• The more we work with the problem and the data, the more we learn and the more the model can be adjusted, which ultimately leads to a better resolution of the problem.
Source: CognitiveClass
6. Data Preparation (Concept)
• In a way, data preparation is like removing dirt and washing vegetables.
• Compared to data collection and understanding, data preparation is the most time-consuming phase – 70% to 90% of overall project time.
• Automating collection and preparation can reduce this to as little as 50%.
• The data preparation phase of the methodology answers the question: in what ways can the data be prepared for modeling?
Source: CognitiveClass
6. Data Preparation (Concept)
Feature Engineering
• The process of using domain knowledge of the data to create features that make ML algorithms work.
• A feature is a characteristic that might help solve the problem.
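A minimal sketch of feature engineering with pandas; the columns and derived features are hypothetical examples, not the case-study variables:

```python
import pandas as pd

# Hypothetical patient-level data
df = pd.DataFrame({
    "height_m": [1.70, 1.60, 1.82],
    "weight_kg": [80, 55, 95],
    "date_of_birth": pd.to_datetime(["1955-02-01", "1980-07-15", "1948-11-30"]),
})

# Domain knowledge: BMI is a known risk factor, so derive it as a feature
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Derive age, then bin it into coarse, clinically meaningful groups
df["age"] = (pd.Timestamp("2024-01-01") - df["date_of_birth"]).dt.days // 365
df["age_group"] = pd.cut(df["age"], bins=[0, 40, 65, 120],
                         labels=["young", "middle", "senior"])
print(df[["bmi", "age", "age_group"]])
```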
Case Study – 6. Data Preparation
Defining what CHF means:
• First, the set of diagnosis-related group codes needed to be identified, as CHF implies certain kinds of fluid buildup.
• Data scientists also needed to consider that CHF is only one type of heart failure.
• Clinical guidance was needed to get the right codes for CHF.
• CHF occurs when the heart muscle does not pump blood as much as it should, which leads to fluid build-up in the lungs.
Source: CognitiveClass
Case Study – 6. Data Preparation
• The timing of events needed to be evaluated in order to define whether a particular CHF admission was an initial admission or a readmission.
Source: CognitiveClass
Case Study – 6. Data Preparation
• The data included multiple records for each patient.
• Transactional records included claims submitted for physician, laboratory, hospital, and clinical services.
• Also included were records describing all the diagnoses, procedures, prescriptions, and other information about in-patients and out-patients.
Source: CognitiveClass
Case Study – 6. Data Preparation
Aggregating data to patient level
• A given patient could have hundreds or even thousands of records, depending on their clinical history.
• All the transactional records were aggregated to the patient level, yielding a single record for each patient (see the sketch below).
• This is required for the decision-tree classification method used for modeling.
• Many new columns were created representing the information in the transactions.
• E.g., frequency and most recent visits to doctors, clinics, and hospitals, with diagnoses, procedures, prescriptions, and so forth.
• Co-morbidities with CHF were also considered, such as diabetes, hypertension, and many other diseases and chronic conditions that could impact the risk of readmission for CHF.
Source: CognitiveClass
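A minimal sketch of this transaction-to-patient aggregation in pandas (the columns and values are hypothetical):

```python
import pandas as pd

# Hypothetical transactional records: many rows per patient
tx = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_date": pd.to_datetime(["2024-01-03", "2024-02-14", "2024-03-01",
                                  "2024-01-20", "2024-02-02"]),
    "diagnosis": ["CHF", "diabetes", "CHF", "hypertension", "CHF"],
})

# Aggregate to one record per patient, as the decision tree requires
patient_level = tx.groupby("patient_id").agg(
    n_visits=("visit_date", "count"),
    most_recent_visit=("visit_date", "max"),
    n_chf_diagnoses=("diagnosis", lambda d: (d == "CHF").sum()),
    has_diabetes=("diagnosis", lambda d: (d == "diabetes").any()),
)
print(patient_level)
```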
Case Study – 6. Data Preparation
Completing the data set
Here is a list of the variables that were ultimately used in building the model:
• Measures: Gender, Age, Primary Diagnosis Related Group (DRG), Length of Stay, CHF Diagnosis
Source: CognitiveClass
C A S E S T U D Y - 6. D AT A
P R E PA R AT I O N
99
• These patients met all of the criteria for this case study.
• The data (patient records) were then split into training and testing sets for building and
validating the model, respectively.
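A sketch of such a split with scikit-learn, on synthetic stand-in data; the 70/30 ratio is an assumption, since the slides do not state the proportions used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # patient-level features (synthetic)
y = rng.integers(0, 2, size=200)   # re-admission outcome (1 = re-admitted)

# Stratify on y so training and testing sets keep the same re-admission rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```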
Source: CognitiveClass

FROM DATA MODELING TO EVALUATION
[Methodology diagram: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback]
Source: CognitiveClass

7. DATA MODELING (CONCEPT)
In what way can the data be visualized to get to the answer that is required?
Modeling is based on the analytic approach.
Data modeling focuses on developing models that are either descriptive or predictive.
• Descriptive Models
  • What happened?
  • Use statistics.
• Predictive Models
  • What will happen?
  • Use machine learning.
  • Try to generate yes/no type outcomes.
• A training set is used for developing the predictive model.
• Training set
• Contains historical data in which the outcomes are already known.
• Acts like a gauge to determine if the model needs to be calibrated.
• The data scientist will try different algorithms to ensure that the variables in play are actually required.
• Success of data compilation, preparation, and modeling depends on the quality of the data.
• Like the quality of ingredients in cooking, the quality of data sets the stage for the outcome.
• If data quality is bad, the outcome will be bad.
• Constant refinement, adjustment, and tweaking within each step are essential to ensure a solid outcome.
• The end goal is to build a model that can answer the original question.
Source: CognitiveClass
7. DATA MODELING – Concept of Confusion Matrix
Since data modeling for the case study involves a confusion matrix, let us first understand the confusion matrix (discussed below and in detail later in this document).
Source: CognitiveClass

CASE STUDY - 7. DATA MODELING
A decision tree to predict CHF re-admission was built.
In this first model, the default relative misclassification cost of 1-to-1 was used.
The overall accuracy in classifying the yes and no outcomes was 85%, but the model correctly predicted only 45% of the "yes" outcomes.
• Meaning, when it was actually YES, the model predicted YES only 45% of the time.
The question is:
• How could the accuracy of the model be improved in predicting the yes outcome?
Source: CognitiveClass
• There are many aspects to model building – one of those is parameter tuning to improve the model.
• With a prepared training set, the first decision tree classification model could be built.

CONFUSION MATRIX
Consider a classifier predicting whether patients have a disease: "yes" would mean they have the disease, and "no" would mean they don't have the disease.
• The classifier made a total of 165 predictions, i.e., 165 patients were being tested for the presence of that disease.
• Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.

                   Predicted: No    Predicted: Yes
N = 165
Actual: No         50               10
Actual: Yes        5                100
True positives (TP):
• The model predicted yes, and the patients do have the disease.
True negatives (TN):
• The model predicted no, and the patients don't have the disease.
False positives (FP) / Type I error:
• The model predicted YES, but the patients don't actually have the disease.
False negatives (FN) / Type II error:
• The model predicted NO, but the patients actually have the disease.

                   Predicted: No    Predicted: Yes
N = 165
Actual: No         TN = 50          FP = 10
Actual: Yes        FN = 5           TP = 100
Term                     Description                              Calculation
Accuracy                 Overall, how often is the classifier     (TP+TN)/total = (100+50)/165 = 0.91
                         correct?
Misclassification Rate   Overall, how often is it wrong?          (FP+FN)/total = (10+5)/165 = 0.09
True Positive Rate       When it's actually YES, how often        TP/actual YES = 100/105 = 0.95
(Sensitivity / Recall)   does it predict YES?
False Positive Rate      When it's actually NO, how often         FP/actual NO = 10/60 = 0.17
(Type I Error)           does it predict YES?
True Negative Rate       When it's actually NO, how often         TN/actual NO = 50/60 = 0.83
(Specificity)            does it predict NO? Equivalent to
                         1 minus the False Positive Rate.
Precision                When it predicts YES, how often is       TP/predicted YES = 100/110 = 0.91
                         it correct?
Prevalence               How often does the YES condition         Actual YES/total = 105/165 = 0.64
                         actually occur in our sample?
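These numbers can be reproduced programmatically. The sketch below rebuilds the 165-prediction example from its four counts and checks the headline metrics with scikit-learn.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

# Reconstruct the 165 labels from the counts: TN=50, FP=10, FN=5, TP=100.
y_true = np.array([0]*50 + [0]*10 + [1]*5 + [1]*100)
y_pred = np.array([0]*50 + [1]*10 + [0]*5 + [1]*100)

print(confusion_matrix(y_true, y_pred))   # [[50 10], [5 100]]
print(accuracy_score(y_true, y_pred))     # 150/165 ≈ 0.91
print(precision_score(y_true, y_pred))    # 100/110 ≈ 0.91
print(recall_score(y_true, y_pred))       # 100/105 ≈ 0.95
```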
8. EVALUATION (CONCEPT)
The quality of the developed model is assessed.
Before the model gets deployed, evaluate whether it really answers the initial question.
Two phases
Diagnostic measures phase
• Ensures that the model works as intended.
• If the model is a predictive model, a decision tree can be used to assess whether the answer provided by the model is aligned with the initial design.
Statistical significance testing phase
• Ensures that the data is being properly handled and interpreted within the model.
CASE STUDY - 8. EVALUATION
One way is to find the optimal model through a diagnostic measure based on tuning one of the parameters in model building.
Specifically, we'll see how to tune the relative cost of misclassifying yes and no outcomes (see the sketch after this list).
• The optimal model is the one giving the maximum separation between the blue ROC curve and the red baseline.
• The ROC curve quantifies how well a binary classification model performs in classifying the yes and no outcomes when some discrimination criterion is varied.
• In this case, the criterion is a relative misclassification cost.
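The case study performed this tuning in its own modeling tool. Purely as an illustration, the same idea can be sketched in scikit-learn, where a relative misclassification cost of c-to-1 can be approximated by weighting the "yes" class; the data and parameters below are synthetic assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 1).astype(int)  # imbalanced "yes"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Weighting the "yes" class by `cost` mimics a relative misclassification
# cost of cost-to-1, mirroring the four models compared in the case study.
for cost in [1, 2, 3, 4]:
    model = DecisionTreeClassifier(class_weight={0: 1, 1: cost},
                                   max_depth=5, random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"cost {cost}:1  accuracy={accuracy_score(y_test, y_pred):.2f}"
          f"  sensitivity={recall_score(y_test, y_pred):.2f}")
```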
Source: CognitiveClass
We can see that model 3, with a relative misclassification cost of 4-to-1, is the best of the four models.
Source: CognitiveClass
FROM DEPLOYMENT TO FEEDBACK
[Methodology diagram: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback]
This stage covers:
• Deployment of the model.
• The importance of incorporating feedback to refine the model.
• The refined model must be redeployed.
• This process should be repeated as often as necessary.
Source: CognitiveClass

9. DEPLOYMENT (CONCEPT)
• Making the model relevant and useful in addressing the initial question involves getting the stakeholders familiar with the tool produced.
Source: CognitiveClass

CASE STUDY - 9. DEPLOYMENT
Understanding the results
• In preparation for model deployment, the next step was to assimilate the knowledge for
the business group who would be designing and managing the intervention program to
reduce readmission risk.
• In this scenario, the business people translated the model results so that the clinical staff
could understand how to identify high-risk patients and design suitable intervention
actions.
• The goal was to reduce the likelihood that these patients would be readmitted within 30
days after discharge.
• During the business requirements stage, the Intervention Program Director and her team
had wanted an application that would provide automated, near real-time risk
assessments of congestive heart failure.
Gathering application requirements
• It also had to be easy for clinical staff to use, preferably through a browser-based application on a tablet that each staff member could carry around.
Additional Requirements
• Processes for tracking and monitoring patients receiving the intervention would have to be developed in collaboration with IT developers and database administrators, so that the results could go through the feedback stage and the model could be refined over time.
Source: CognitiveClass

10. FEEDBACK (CONCEPT)
Feedback from users is used to refine the model.
Assess the model for performance and impact.
The value of the model will depend on successfully incorporating feedback and making adjustments for as long as the solution is required.
Throughout the data science methodology, each step sets the stage for the next.
This makes the methodology cyclical and ensures refinement at each stage.
Once the model has been evaluated and the data scientist trusts that it will work, it
will be deployed and will undergo the final test:
Its real use in real time in the field.
Source: CognitiveClass

CASE STUDY - 10. FEEDBACK
The feedback stage included these steps:
1. The review process would be defined and put into place, with overall responsibility for measuring the results of the model on the CHF patient population; clinical management executives would have overall responsibility for the review process.
2. CHF patients receiving the intervention would be tracked, and their re-admission outcomes recorded.
3. The intervention would then be measured to determine how effective it was in reducing re-admissions.
Source: CognitiveClass
For ethical reasons, CHF patients would not be split into control and treatment groups.
Instead, re-admission rates would be compared before and after the implementation of the model.
After the deployment and feedback stages, the impact of the intervention program on
re-admission rates would be reviewed after the first year of its implementation.
Then the model would be refined, based on all of the data compiled after model
implementation and the knowledge gained throughout these stages.
Source: CognitiveClass
Redeployment
The intervention actions and processes would be reviewed and very likely refined as well, based on the experience and knowledge gained through initial deployment and feedback.
Finally, the refined model and intervention actions would be redeployed, with the feedback process continued throughout the life of the intervention program.
Source: CognitiveClass
DATA SCIENCE PROCESS - SUMMARY
Learn the importance of
• Understanding the question
• Picking the most effective analytic approach
Learn to work with the data through iterative stages
• Collect the appropriate data
• Understand the data
• Prepare the data for modeling
Learn how to
• Evaluate and deploy the model
• Get feedback on it
• Use the feedback constructively so as to improve the model
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE # 4 : DATA SCIENCE TEAMS
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course materials freely available online.
DATA SCIENCE TEAMS
) Find ways to put data into new projects using an established Learn-Plan-Test-Measure process.
Democratize data.
) Scale a data science team to the whole company and even clients.
Measure the impact.
) Evaluate what part DS teams have in your decision-making process and give them credit for it.
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM
3 Business analyst
) A business analyst basically realizes a CAO’s functions but on the operational level.
) This implies converting business expectations into data analysis.
) If your core data scientist lacks domain expertise, a business analyst bridges this gap.
4 Data scientist
) A data scientist is a person who solves business tasks using machine learning and
data mining techniques.
) The role can be narrowed down to data preparation and cleaning with further
model training and evaluation.
) Preferred skills: R, SAS, Python, Matlab, SQL, noSQL, Hive, Pig, Hadoop, Spark
The job of a data scientist is often divided into two roles:
[4A] Machine Learning Engineer
) A machine learning engineer combines software engineering and modeling skills
by determining which model to use and what data should be used for each model.
) Probability and statistics are also their forte.
6 Data engineer
) Data engineers implement, test, and maintain infrastructural components that
data architects design.
) Realistically, the role of an engineer and the role of an architect can be combined in
one person.
) Preferred skills: SQL, noSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl
) An application engineer or other developers from front-end units will oversee end-
user data visualization.
) Preferred skills: programming, JavaScript (for visualization), SQL, noSQL.
DATA SCIENTIST [1/2]
Data scientists are responsible for discovering insights from massive amounts of
structured and unstructured data to help shape or meet specific business needs and
goals.
Role
) Main objective is to organize and analyze large amounts of data, often using
software specifically designed for the task.
Responsibility
) Chief responsibility is data analysis, a process that begins with data collection and
ends with business decisions made on the basis of the data scientist’s final data
analytics results.
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
DATA SCIENTIST [2/2]
Stitch Fix's Michael Hochster defines two types of data scientists:
Type A stands for Analysis
) This person is a statistician that makes sense of data without necessarily having strong programming skills.
) Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc.
Type B stands for Building
) These folks use data in production.
) They're excellent software engineers with some statistics background who build recommendation systems, personalization use cases, etc.
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
DATA SCIENTIST REQUIREMENTS - INDUSTRY-WISE
Business
) Data analysis of business data can inform decisions around efficiency,
inventory, production errors, customer loyalty and more.
E-commerce
) Improve customer service, find trends, and develop services or products.
Finance
) Data on accounts, credit and debit transactions, and similar financial data, as well as security and compliance, including fraud detection.
Government
) Inform decisions, support constituents, and monitor overall satisfaction, security, and compliance.
Science
) Collect, share, and analyze data from experiments in a better way.
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
Social networking
) targeted advertising, improve customer satisfaction, establish trends in location data
and enhance features and services.
) Ongoing data analysis of posts, tweets, blogs and other social media can
help businesses constantly improve their services.
Healthcare
) Electronic medical records require a dedication to big data, security, and compliance.
) Improve health services and uncover trends that might go unnoticed otherwise.
Telecommunications
) All electronics collect data, and all that data needs to be stored, managed,
maintained and analyzed.
) Data scientists help companies to improve products and keep customers happy by
delivering the features they want.
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
SKILL SET FOR A DATA SCIENTIST
PROGRAMMING: The most fundamental of a data scientist's skills. Programming improves your statistics skills, helps you analyze large datasets, and gives you the ability to create your own tools.
QUANTITATIVE ANALYSIS: Improves your ability to run experimental analyses, scale your data strategy, and implement machine learning.
DATA SCIENCE TEAM BUILDING
https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
ORGANISATION OF DATA SCIENCE TEAM
[1] Decentralized
) Data scientists report into specific business units (ex: Marketing) or functional units (ex: Product Recommendations) within a company.
[2] Functional
) Resource allocation driven by an enterprise agenda.
) Analysts are located in the functions where the most analytical activity takes place, but may also provide services to the rest of the corporation.
) Little coordination.
[3] Consulting
) Resources allocated based on availability.
[4] Centralized
) Data scientists are members of a core group, reporting to a head of data science or analytics.
) Stronger ownership and management of resources.
[6] Federated
) Same as the "Center of Excellence" model, with need-based operational involvement.
Building an Analytics-Driven Organization, Accenture
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
INTRODUCTION TO DATA SCIENCE
MODULE # 3 : DATA SCIENCE PROPOSAL
IDS Course Team
BITS Pilani
https://medium.com/data-science-at-microsoft/managing-a-data-science-project-87945ff79483
Based on historical data, predicting customer churn for the next quarter (churn analysis).
https://commence.com/blog/2020/06/16/customer-profiling-methods/
https://www.crisil.com/en/home/our-businesses/global-research-and-risk-solutions/our-offerings/non-financial-risk/financial-crime-management/
fraud-management/fraud-detection-and-analytics.html#
https://medium.com/data-science-at-microsoft/managing-a-data-science-project-87945ff79483
• What are the existing/related systems within the capability that capture/use related information? E.g., a prediction model is already being used for fraud analysis – can we reuse the same transaction dataset for providing recommendations?
• What are the gaps?
• Who are the stakeholders?
• Who will be affected by this implementation?
• Note down the assumptions; things like availability of necessary data, access to the
infrastructure, licenses etc.
• Any Licenses/Commercials needed in case of proprietary solutions?
• Note down the dependencies: things like dependency on setting up and access to the
infrastructure/tools, on access rights etc.
• Are there any other dependencies?
• Do you see any other problems/challenges?
• What are the important variables that we need to collect, and where is the data located?
• Physiological Data [From Sensor]: Heart Rate, Electrodermal Activity (EDA), Oxygen Saturation
(SPO2), Blood Pressure etc.
• Behavioral Data [From Mobile App]: Self-assessment questionnaire to capture daily information
regarding sleep quality [hourly scale], physical activity [-3 for inactive to +3 for active], mood states
using GAD [7 point likert], HDRS [7 point likert], YMRS [5 point likert]. In addition, data on
alcohol intake, stress levels, motivation levels, concentration levels, menstrual cycle pattern,
irritability levels, and insomnia levels. The treating doctors will be asked to rate the patient's progress using scales from much worse (-3) to much better (+3). The behavioral data will be collected from a mobile app.
THANK YOU
TABLE OF CONTENTS
1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS
[Diagram: Data → Numerical (Interval) and Categorical (Ordinal, Nominal)]
TYPES OF ATTRIBUTES
Nominal: Distinctiveness
1. Discrete Attribute
• Has only a finite or countably infinite set of values.
• E.g., zip codes, counts.
• Discrete attributes are often represented as integer variables.
2. Continuous Attribute
• Measurable data.
• E.g., temperature, height, age, weight.
• Continuous attributes are typically represented as floating-point variables.
• Record data (flat file/CSV, RDBMS): e.g., banking, retail, e-commerce data.
• Data matrix: e.g., an SPSS data matrix.
• Document data: frequency of terms that appear in documents; used in Information Retrieval.
https://towardsdatascience.com/types-of-data-sets-in-data-science-data-mining-machine-learning-eb47c80af7a
Data Storage
• Database tables
• Text files
• Data marts
• Data warehouses
• Data lakes (raw data)
DATA CLEANSING
Focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
Two types of errors
• Interpretation / Representation error
• Age > 130
• Height of a person is greater than 8 feet.
• Price is negative.
• Inconsistencies between data sources or against your company’s standardized values.
• Female and F
• Feet and meter
• Dollars and Pounds
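A minimal pandas sketch of such checks, on a hypothetical table; the column names and thresholds are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "age":       [25, 140, 37],        # 140 is an interpretation error
    "height_cm": [170, 260, 182],      # 260 cm exceeds 8 feet (~244 cm)
    "gender":    ["Female", "F", "M"], # inconsistent coding across sources
    "price":     [10.5, -3.0, 7.2],    # a negative price is invalid
})

# Flag interpretation/representation errors with simple range checks.
errors = df[(df["age"] > 130) | (df["height_cm"] > 244) | (df["price"] < 0)]
print(errors)

# Standardize inconsistent categorical values to one coding scheme.
df["gender"] = df["gender"].replace({"Female": "F", "Male": "M"})
```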
DATA EXPLORATION
Use graphical techniques to gain an understanding of the data and the interactions between variables.
Look at what can be learned from the data.
Statistical properties like distribution of data, correlation.
Discover outliers.
• Boxplot – can show the maximum, minimum, median, and other characterizing measures at
the same time.
• Histogram – In a histogram a variable is cut into discrete categories and the number of occurrences
in each category are summed up and shown in the graph.
• Pareto diagram – a combination of the values and a cumulative distribution.
• Tabulation.
• Clustering and other modeling techniques can also be a part of exploratory analysis.
Refer - https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb
Analysis (tips dataset)
• More tips are given at dinner time compared to lunch time.
• Positive correlation between total bill amount and tip given, i.e., the higher the bill amount, the higher the tip paid.
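Observations of this kind can be reproduced on the well-known tips dataset that ships with seaborn; a minimal sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Distribution of tips at lunch vs. dinner.
sns.boxplot(data=tips, x="time", y="tip")
plt.show()

# Relationship between total bill and tip; the correlation quantifies it.
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.show()
print(tips["total_bill"].corr(tips["tip"]))  # ~0.68, a positive correlation
```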
Ex:
80% of complaints come from 20% of customers
80% of sales come from 20% of clients
80% of computer crashes come from 20% of IT bugs
DATA QUALITY INDEX
Source: https://www.deltapartnersgroup.com/

IMPACT OF MISSING DATA
Source: http://www.dataintegration.ninja
OUTLIERS
• An outlier is a data point that is significantly far away from most other data points. For example, if everyone in your classroom is of average height except two basketball players who are significantly taller than the rest of the class, these two data points would be considered outliers.
• Data objects with behaviors that are very different from expectation are called outliers or
anomalies.
• Outliers can significantly skew the distribution of your data.
• Outliers can be identified using summary statistics and plots of the data.
• Algorithms like Linear Regression, K-Nearest Neighbor, Adaboost are sensitive to
noise.
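One common summary-statistics approach is the 1.5×IQR rule used by boxplots; a minimal sketch on the classroom-height example:

```python
import pandas as pd

# Class heights in cm; the last two values are the basketball players.
heights = pd.Series([160, 165, 158, 170, 168, 162, 210, 215])

q1, q3 = heights.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers.
outliers = heights[(heights < lower) | (heights > upper)]
print(outliers)  # flags the two unusually tall values
```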
Iterating through this 3-step process is what we call the "epicycle of data analysis." As you go through every stage of an analysis, you will need to go through the epicycle to continuously refine your question, your exploratory data analysis, your formal models, your interpretation, and your communication.
https://makemeanalyst.com/5-core-activities-data-analysis-epicycles-data-analysis/
REFERENCES
Introduction to Data Mining by Tan, Steinbach and Vipin Kumar (T3)
The Art of Data Science by Roger D Peng and Elizabeth Matsui (R1)
Introducing Data Science by Cielen, Meysman and Ali
https://www.deltapartnersgroup.com/managing-data-quality-optimize-value-extraction
http://www.dataintegration.ninja/relationship-between-data-quality-and-master-data-management/
THANK YOU
Confusion Matrix
• A Confusion matrix is a table that is often used to evaluate the performance of a
classification model (or “classifier”).
• A Confusion Matrix shows what the machine learning algorithm did right and what
the algorithm did wrong (misclassification).
• It works on a set of test data for which the true values are known. There are two
possible predicted classes: “YES” and “NO”.
                        Actual Values
                        Y                        N
Predicted    Y          True Positive            False Positive (Type I Error)
Values       N          False Negative           True Negative
                        (Type II Error)

There are four quadrants in the confusion matrix, which are symbolized as below.
True Positive (TP) : The number of instances that were positive and correctly classified as positive.
False Positive (FP): The number of instances that were negative and incorrectly classified as positive. This is also known as Type 1 Error.
False Negative (FN): The number of instances that were positive and incorrectly classified as negative.
It is also known as Type 2 Error.
True Negative (TN): The number of instances that were negative and correctly classified as negative.
Which type of misclassification is more serious – Type-I Error or Type-II Error?
Case I : Predicting whether a convict should be hanged or not? [Type I Error more Serious]
False Positive – Algorithm predicts that the convict has committed the crime, in reality, he is innocent.
Verdict: He will be hanged.
False Negative – Algorithm predicts that the convict is innocent, in reality, he has done the crime.
Verdict: He is released.
Case II : Predicting Smog in a region and alerting the public [Type II Error more Serious]
False Positive – Algorithm predicts smog, in reality, there is NO SMOG.
Verdict: People will take precaution unnecessarily.
False Negative – Algorithm predicts NO SMOG, in reality, there is SMOG.
Verdict: The high Smog may cause health issues in the people, since they have not taken precaution.
Let us consider an example of a model predicting a tumour for a patient.

                        Actual Values
                        Y        N
Predicted    Y          10       22
Values       N           8       60

Interpretation:
True Positive (TP): Model predicted 'Tumour' and the patient has a tumour.
False Positive (FP): Model predicted 'Tumour', but the patient has no tumour. This is also known as Type 1 Error.
False Negative (FN): Model predicted 'No Tumour' but the patient actually has a tumour. It is also known as Type 2 Error.
True Negative (TN): Model predicted 'No Tumour' and the patient has no tumour.

Discuss the repercussions of Type 1 and Type 2 errors w.r.t. the patient and the hospital.
True Positive Rate (TPR): The fraction of positive examples predicted correctly by the classifier. This metric is also known as Recall, Sensitivity, or Hit rate.
False Negative Rate (FNR): The fraction of positive examples classified as the negative class by the classifier.
False Positive Rate (FPR): The fraction of negative examples classified as the positive class by the classifier. This metric is also known as the False Alarm Rate.
True Negative Rate (TNR): The fraction of negative examples classified correctly by the classifier. This metric is also known as Specificity.
Positive Predictive Value (PPV): The fraction of the examples classified as positive that are really positive. It is also known as Precision.
Accuracy: How often is the classifier correct?
Misclassification Rate or Error Rate: How often is the classifier wrong?
F1 Score (F1): Combines Recall (r) and Precision (p), two widely used metrics, employed in analyses where detection of one of the classes is considered more significant than the others.
All Formulae
FPR = FP / (FP + TN)                TNR = TN / (TN + FP)
TPR (Recall) = TP / (TP + FN)       FNR = FN / (TP + FN)
Precision = TP / (TP + FP)          F1 = 2TP / (2TP + FP + FN)
Accuracy = (TP + TN) / Total        Error Rate = (FP + FN) / Total
CASE STUDY – CHF PREDICTION
Calculate the following metrics for the given confusion matrix:
                        Actual Values
                        Y              N
Predicted    Y          100 (TP)       10 (FP)
Values       N            5 (FN)       50 (TN)

1. True Positive Rate (TPR) [Recall / Sensitivity]
2. False Positive Rate (FPR)
3. False Negative Rate (FNR)
4. True Negative Rate (TNR) [Specificity]
5. Precision
6. F1 Score
7. Accuracy
8. Error Rate or Misclassification Rate
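A worked answer in plain Python, reading TP, FP, FN, and TN straight off the matrix above:

```python
TP, FP, FN, TN = 100, 10, 5, 50
total = TP + FP + FN + TN             # 165

tpr = TP / (TP + FN)                  # 100/105 ≈ 0.95  (recall / sensitivity)
fpr = FP / (FP + TN)                  # 10/60   ≈ 0.17
fnr = FN / (TP + FN)                  # 5/105   ≈ 0.05
tnr = TN / (TN + FP)                  # 50/60   ≈ 0.83  (specificity)
precision = TP / (TP + FP)            # 100/110 ≈ 0.91
f1 = 2 * TP / (2 * TP + FP + FN)      # 200/215 ≈ 0.93
accuracy = (TP + TN) / total          # 150/165 ≈ 0.91
error_rate = (FP + FN) / total        # 15/165  ≈ 0.09
```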
ROC Curve
An ROC curve (receiver operating characteristic curve) is a
graph showing the performance of a classification model at
all classification thresholds.
It shows the trade-off between Sensitivity and Specificity
ROC curve plots two parameters:
439 /
27
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
https://towardsdatascience.com/understanding-the-roc-curve-in-three-visual-steps-795b1399481c
• Altering the threshold across the 0, 0.35, 0.5, 0.65, and 1 levels, notice how the FPR and TPR change accordingly.
• Overall, we can see this is a trade-off. As we increase our threshold, we'll be better at classifying negatives, but this is at the expense of misclassifying more positives.
Area under the ROC Curve (AUC)
The AUC summarizes the whole ROC curve in a single number: 0.5 corresponds to the random baseline and 1.0 to perfect separation (see the sketch below).
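A sketch of computing the ROC curve and its AUC with scikit-learn, on synthetic data for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Score each example with a predicted probability of the positive class.
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y, scores):.2f}")
plt.plot([0, 1], [0, 1], "--", label="baseline (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```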
THANK YOU