Unit 2 - Data Science
DR.A.SHANTHINI
ASSOCIATE PROFESSOR
DEPARTMENT OF DSBS
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
Data Science Team
► There are certain key roles required for the complete and successful functioning of a data science team.
► Each role plays a crucial part in delivering a successful analytics project.
► There is no hard and fast rule for staffing the listed roles; they can be fewer or more depending on the scope of the project, the skills of the participants, and the organizational structure.
► A data science team needs these roles to execute analytics projects successfully.
Key roles of a successful analytics project
► Business User
► Project Sponsor
► Project Manager
► Business Intelligence Analyst
► Database Administrator
► Data Engineer
► Data Scientist
Key Roles for a Data Analytics Project:
► Business User:
∙ The business user is the one who understands the domain area of the project and typically benefits from its results.
∙ This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be used in operations.
∙ A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.
Cont..
► Project Sponsor:
∙ The project sponsor is the one responsible for initiating the project. The sponsor provides the actual requirements for the project and presents the core business problem.
∙ He or she generally provides the funding and gauges the degree of value delivered by the final outputs of the team working on the project.
∙ This person sets the priorities and shapes the desired outputs.
► Project Manager:
∙ This person ensures that the key milestones and objectives of the project are met on time and at the expected quality.
Cont..
► Business Intelligence Analyst:
∙ The business intelligence analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.
∙ This person generally creates dashboards and reports and knows about the data feeds and sources.
► Database Administrator (DBA):
∙ The DBA provisions and configures the database environment to support the analytics needs of the team working on the project.
∙ His or her responsibilities may include providing access to key databases or tables and making sure that the appropriate security levels are in place for the data repositories.
Cont..
► Data Engineer:
∙ The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
∙ The data engineer works closely with the data scientist to help shape data into the right form for analysis.
► Data Scientist:
∙ The data scientist provides subject matter expertise for analytical techniques, data modelling, and applying the correct analytical techniques to a given business problem.
∙ Data scientists design and apply analytical methods and approaches to the data available for the concerned project.
Overview of the Data Analytics Lifecycle
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Phase 2: Data Preparation
► Includes the steps to explore, preprocess, and condition data prior to modeling and analysis.
► Exploring the data is done by preparing an analytics sandbox.
► To get the data into the sandbox, the team needs to perform ETLT: a combination of extracting data from data sources, transforming it, and loading it into the sandbox (a minimal sketch follows this list).
► The team must decide how to condition and transform the data to get it into a format that facilitates subsequent analysis.
► The team may perform data visualizations to help team members understand the data, including its trends, outliers, and relationships among data variables.
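To make the ETLT step concrete, here is a minimal sketch in Python/pandas that extracts data from a source file, loads it into a sandbox unchanged, and only then transforms it. The file name, the column names, and the use of SQLite as the sandbox are illustrative assumptions, not prescribed by the lifecycle.

# Minimal ETLT sketch: extract raw data, load it into the analytic sandbox
# unchanged, then transform inside the sandbox. File names, column names,
# and the SQLite sandbox are illustrative assumptions.
import pandas as pd
import sqlite3

# Extract: read raw data from a source system (hypothetical CSV export).
raw = pd.read_csv("sales_export.csv")

# Load: land the raw data in the sandbox first, so analysts keep access to
# the untransformed records.
sandbox = sqlite3.connect("analytics_sandbox.db")
raw.to_sql("sales_raw", sandbox, if_exists="replace", index=False)

# Transform: clean and reshape inside the sandbox for analysis.
clean = raw.dropna(subset=["order_id"]).copy()
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
clean["revenue"] = clean["quantity"] * clean["unit_price"]
clean.to_sql("sales_clean", sandbox, if_exists="replace", index=False)

Loading before transforming preserves the raw records in the sandbox, which is the main difference between ETLT and classic ETL.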
Phase 2: Data Preparation (Cont..)
► Preparing the Analytic Sandbox - Expect the sandbox to be large. It may contain raw data, aggregated data, and other data types that are less commonly used in organizations. Sandbox size can vary greatly depending on the project. A good rule is to plan for the sandbox to be at least 5–10 times the size of the original datasets.
► Performing ETLT - In ETL, users perform extract, transform, load processes to extract data from a datastore, perform data transformations, and load the data back into the datastore. The analytic sandbox approach differs slightly; it advocates extract, load, and then transform.
► Learning About the Data - Understand what constitutes a reasonable value and expected output versus what is a surprising finding, and identify additional data sources.
► Data Conditioning - The process of cleaning data, normalizing datasets, and performing transformations on the data; often viewed as a preprocessing step (see the sketch after this list).
► Survey and Visualize - A useful step is to leverage data visualization tools to gain an overview of the data.
► Common Tools for the Data Preparation Phase - Hadoop, Alpine Miner, OpenRefine, Data Wrangler
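As a concrete illustration of data conditioning and a quick survey step, the sketch below cleans, normalizes, and summarizes a dataset with pandas. The input file and the column names (customer_age, monthly_spend) are hypothetical, and the plot assumes matplotlib is available.

# Minimal data-conditioning sketch: cleaning and normalization with pandas.
import pandas as pd

df = pd.read_csv("customers.csv")          # assumed input file

# Cleaning: drop duplicate rows and fill missing numeric values with the median.
df = df.drop_duplicates()
df["customer_age"] = df["customer_age"].fillna(df["customer_age"].median())

# Normalizing: rescale a numeric column to the [0, 1] range (min-max scaling).
spend = df["monthly_spend"]
df["monthly_spend_scaled"] = (spend - spend.min()) / (spend.max() - spend.min())

# A quick survey/visualization step: summary statistics and a histogram.
print(df.describe())
df["monthly_spend_scaled"].plot(kind="hist", title="Scaled monthly spend")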
Phase 3: Model Planning
► The data science team identifies candidate models to apply to the data for clustering, classifying, or finding relationships in the data, depending on the goal of the project.
► Data Exploration and Variable Selection - Activities focus mainly on data hygiene and on assessing the quality of the data itself. A common way to conduct this step involves using tools to perform data visualizations. Approaching the data exploration in this way aids the team in previewing the data and assessing relationships between variables at a high level (see the sketch after this list).
► Model Selection - Choose an analytical technique based on the end goal of the project. In the case of machine learning and data mining, these techniques are grouped into several general families, such as classification, association rules, and clustering.
► Common Tools for the Model Planning Phase - R, SQL Analysis Services, SAS/ACCESS
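A minimal sketch of data exploration and variable selection during model planning, assuming a hypothetical feature table with a numeric 0/1 target column named churned; the 0.2 correlation cutoff is an arbitrary illustrative choice.

# Preview relationships between candidate variables via pairwise correlations.
import pandas as pd

df = pd.read_csv("churn_features.csv")     # hypothetical feature table

# The correlation matrix over numeric columns gives a high-level view of
# which variables move together and which may be redundant.
corr = df.select_dtypes("number").corr()
print(corr.round(2))

# Keep features whose absolute correlation with the target exceeds a threshold.
target_corr = corr["churned"].drop("churned").abs()
candidates = target_corr[target_corr > 0.2].index.tolist()
print("Candidate variables:", candidates)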
Phase 4: Model Building
► The data science team needs to develop datasets for training, testing, and production purposes.
► An analytical model is developed and fit on the training data and evaluated (scored) against the test data (see the sketch after this list).
► Although the logic required to develop models can be highly complex, the actual duration of this phase can be short compared to the time spent preparing the data and defining the approaches.
► During this phase, users run models from analytical software packages, such as R or SAS, on file extracts and small datasets for testing purposes. On a small scale, they assess the validity of the model and its results.
► Common Tools for the Model Building Phase
► Commercial Tools - SAS Enterprise Miner, SPSS Modeler, MATLAB, Alpine Miner, STATISTICA, and Mathematica
► Free or Open Source Tools - R and PL/R, Octave, WEKA, Python, SQL
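The sketch below illustrates the train/test workflow described above using scikit-learn; the dataset, the feature columns, and the choice of logistic regression are assumptions for illustration only.

# Minimal model-building sketch: split data into training and test sets,
# fit a model on the training data, and score it on the held-out test data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("churn_features.csv")      # hypothetical feature table
X = df[["tenure_months", "monthly_spend"]]  # assumed predictor columns
y = df["churned"]                           # assumed binary target

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                 # fit on training data

# Evaluate (score) against the test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))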
Phase 5: Communicate Results
► The team considers how best to articulate the findings and outcomes to the various team
members and stakeholders, taking into account caveats, assumptions, and any limitations of the
results.
► The key is to remember that the team must be rigorous enough with the data to determine
whether it will prove or disprove the hypotheses outlined in Phase 1 (discovery)
► If the results are valid, identify the aspects of the results that stand out and may provide salient
findings when it comes time to communicate them.
► If the results are not valid, think about adjustments that can be made to refine and iterate on the
model to make it valid.
► Depending on what emerged as a result of the model, the team may need to spend time
quantifying the business impact of the results to help prepare for the presentation and
demonstrate the value of the findings.
► The team will have documented the key findings and major insights derived from the analysis.
The deliverable of this phase will be the most visible portion of the process to the outside
stakeholders and sponsors, so take care to clearly articulate the results, methodology, and
business value of the findings
Phase 6: Operationalize
► The team sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users.
► This phase represents the first time that most analytics teams approach deploying the new analytical methods or models in a production environment.
► This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment.
► Part of the operationalizing phase includes creating a mechanism for performing ongoing monitoring of model accuracy (see the sketch below).
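One possible shape for such a monitoring mechanism is sketched below: the deployed model is re-scored on newly labeled records, and an alert is raised if accuracy drifts too far from its deployment baseline. The file names, baseline accuracy, drift threshold, and feature columns are illustrative assumptions.

# Minimal monitoring sketch for the operationalize phase.
import pandas as pd
import joblib
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.85        # assumed accuracy measured at deployment time
DRIFT_THRESHOLD = 0.05          # assumed tolerated drop before alerting

model = joblib.load("churn_model.joblib")          # previously trained model
recent = pd.read_csv("recent_scored_records.csv")  # new data with true labels

predictions = model.predict(recent[["tenure_months", "monthly_spend"]])
current_accuracy = accuracy_score(recent["churned"], predictions)

if BASELINE_ACCURACY - current_accuracy > DRIFT_THRESHOLD:
    # In a real pipeline this would raise an alert or open a retraining ticket.
    print(f"Model accuracy dropped to {current_accuracy:.2f}; investigate.")
else:
    print(f"Model accuracy {current_accuracy:.2f} is within tolerance.")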
Key Deliverables of an Analytics Project
Four main deliverables.