
University of Mumbai

Third Year CSE (AIML) Engineering

Course Code – CSC601

Course Name – Data Analytics and Visualization

By
Prof. A.V.Phanse
Data Analytics Life Cycle
 The life cycle of Data Analytics is a structured approach to generate, collect,
process, use, and analyze the data for effective decision making.

 It offers a systematic way to manage data, converting it into information that can be used to fulfill organizational and project goals.

 The Data Analytics Lifecycle is designed specifically for Big Data problems and
data science projects.

 The lifecycle has six phases, and project work can occur in several phases at
once. For most phases in the lifecycle, the movement can be either forward or
backward.

 Based on the new information received, data analysts can decide whether to continue with their current research or scrap it and redo the entire analysis. Throughout the process, they are guided by the Data Analytics Life Cycle.
The Data Analytics Lifecycle primarily consists of six phases.
Key Roles for a Successful Analytics Project
 Despite the strong focus on the emerging role of the data scientist, there are actually seven key roles that need to be fulfilled for a well-functioning data science team to execute analytic projects successfully.
 Although seven roles are listed, fewer or more people can accomplish the work
depending on the scope of the project, the organizational structure, and the
skills of the participants.
 The seven roles are as follows:
1. Business User:
 The business user is the one who understands the main area of the project and generally benefits the most from its results.
 This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be used in operations.
 A business manager, line manager, or deep subject matter expert in the project domain typically fulfills this role.

2. Project Sponsor
 The project sponsor is the one responsible for initiating the project. The sponsor provides the actual requirements for the project and presents the basic business issue.
 The sponsor generally provides the funding and measures the degree of value delivered by the final output of the team working on the project.
 This person introduces the core business problem and shapes the desired output.

3. Project Manager
 This person ensures that the key milestones and objectives of the project are met on time and at the expected quality.
4. Business Intelligence Analyst
 The Business Intelligence Analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.
 This person generally creates dashboards and reports and knows about the data feeds and sources.

5. Database Administrator (DBA):
 The DBA provisions and configures the database environment to support the analytics needs of the team working on the project.
 The DBA's responsibilities may include providing access to key databases or tables and ensuring that the appropriate security levels are in place for the data repositories.

6. Data Engineer:
 The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox.
 The data engineer works jointly with the data scientist to help provision data in the right form for analysis.
7. Data Scientist:
 The data scientist provides subject matter expertise in analytical techniques and data modeling and applies the correct analytical techniques to the given business issues.
 The data scientist ensures the overall analytical objectives are met.
 Data scientists outline and apply analytical methods and approaches to the data available for the project.
Data Analytics Life Cycle
Phase 1: Discovery

 The Discovery phase is the foundational step in the Data Analytics Lifecycle. It
involves understanding the business context, framing the problem, and laying
the groundwork for the analytics project.

 In this phase, the team learns the business domain, including relevant history
such as whether the organization or business unit has attempted similar projects
in the past from which they can learn.

 The team assesses the resources available to support the project in terms of
people, technology, time, and data.

 In this phase, the data science team must learn and investigate the problem,
develop context and understanding, and learn about the data sources needed
and available for the project.

 In addition, the team formulates initial hypotheses that can later be tested with data.
1. Learning the Business Domain

 Understanding the domain area of the problem is essential. In many cases, data scientists will have deep computational and quantitative knowledge that can be broadly applied across many disciplines.
 At this early stage in the process, the team needs to determine how much
business or domain knowledge the data scientist needs to develop models
in Phases 3 and 4.
 The earlier the team can make this assessment, the sooner it can decide what resources are needed for the project team and ensure the right balance of domain knowledge and technical expertise.

2. Assessing Resources

 As part of the discovery phase, the team needs to assess the resources like
technology, tools, systems, data, and people available to support the
project.
 In addition, the team tries to evaluate the level of analytical sophistication within the organization and the gaps that may exist related to tools, technology, and skills.
 In addition to the skills and computing resources, it is advisable to take
inventory of the types of data available to the team for the project. The
team will need to determine whether it must collect additional data,
purchase it from outside sources, or transform existing data.
 After taking inventory of the tools, technology, data, and people, it is
necessary to consider if the team has sufficient resources to succeed on
this project, or if additional resources are needed.
 Negotiating for resources at the start of the project, while scoping the goals, objectives, and feasibility, is generally more useful than doing so later in the process, and it ensures sufficient time to execute the project properly.

3. Framing the Problem

 Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to write down the problem statement and share it with the key stakeholders.
 As part of this activity, it is important to identify the main objectives of the
project, identify what needs to be achieved in business terms, and identify
what needs to be done to meet the needs. Additionally, consider the
objectives and the success criteria for the project.
 It is best practice to share the statement of goals and success criteria with
the team and confirm alignment with the project sponsor’s expectations.
 Perhaps equally important is to establish failure criteria. The failure criteria
will guide the team in understanding when it is best to stop trying or settle
for the results that have been gleaned from the data.
 Many times people will continue to perform analyses past the point when
any meaningful insights can be drawn from the data.
 Establishing criteria for both success and failure helps the participants avoid
unproductive effort and remain aligned with the project sponsors
4. Identifying Key Stakeholders

 Another important step is to identify the key stakeholders and their interests in the project.
 During these discussions, the team can identify the success criteria, key
risks, and stakeholders, which should include anyone who will benefit from
the project or will be significantly impacted by the project.
 When interviewing stakeholders, learn about the domain area and any
relevant history from similar analytics projects.
 Depending on the number of stakeholders and participants, the team may
consider outlining the type of activity and participation expected from each
stakeholder and participant.
5. Interviewing the Analytics Sponsor

 At the beginning, project sponsors may have a predetermined solution that may
not necessarily realize the desired outcome. In these cases, the team must use
its knowledge and expertise to identify the true underlying problem and
appropriate solution.
 When interviewing the main stakeholders, the team needs to take time to
thoroughly interview the project sponsor, who tends to be the one funding the
project or providing the high-level requirements.
 This person understands the problem and usually has an idea of a potential
working solution. It is critical to thoroughly understand the sponsor’s
perspective to guide the team in getting started on the project.
Following is a brief list of common questions that are helpful to ask during the
discovery phase when interviewing the project sponsor.

 What business problem is the team trying to solve?
 What is the desired outcome of the project?
 What data sources are available?
 What industry issues may impact the analysis?
 What timelines need to be considered?
 Who could provide insight into the project?
 Who has final decision-making authority on the project?
 How will the focus and scope of the problem change if the
following dimensions change:
 Time: Analyzing 1 year or 10 years’ worth of data?
 People: Assess impact of changes in resources on project
timeline.
 Risk: Conservative to aggressive
 Resources: None to unlimited (tools, technology, systems)
 Size and attributes of data: Including internal and external
data sources
6. Developing Initial Hypotheses

 Developing a set of initial hypotheses (IHs) is a key facet of the discovery phase. This step involves forming ideas that the team can test with data.
 Generally, it is best to come up with a few primary hypotheses to test and
then be creative about developing several more.
 These IHs form the basis of the analytical tests the team will use in later
phases and serve as the foundation for the findings in Phase 5.
 Another part of this process involves gathering and assessing hypotheses
from stakeholders and domain experts who may have their own perspective
on what the problem is, what the solution should be, and how to arrive at a
solution.
 These stakeholders would know the domain area well and can offer
suggestions on ideas to test as the team formulates hypotheses during this
phase.
 The team will likely collect many ideas that may illuminate the operating
assumptions of the stakeholders. These ideas will also give the team
opportunities to expand the project scope into adjacent spaces where it
makes sense or design experiments in a meaningful way to address the
most important interests of the stakeholders.
7. Identifying Potential Data Sources

 The team should perform five main activities during this step.
Phase 2: Data Preparation

 The second phase of the Data Analytics Lifecycle involves data preparation,
which includes the steps to explore, preprocess, and condition data prior to
modeling and analysis.
 In this phase, the team needs to create an environment in which it can explore
the data. Usually, this is done by preparing an analytics sandbox.
 To get the data into the sandbox, the team needs to perform ETLT, a combination of extracting, transforming, and loading data into the sandbox.
 Once the data is in the sandbox, the team needs to learn about the data and
become familiar with it.
 Understanding the data in detail is critical to the success of the project. The
team also must decide how to condition and transform data to get it into a
format to facilitate subsequent analysis.
 The team may perform data visualizations to help team members understand
the data, including its trends, outliers, and relationships among data variables.
1. Preparing the Analytic Sandbox

 The sandbox can include everything from summary-level aggregated data and structured data to raw data feeds and unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.
 Sandbox size can vary greatly depending on the project. A good rule is to plan
for the sandbox to be at least 5–10 times the size of the original datasets, partly
because copies of the data may be created that serve as specific tables or data
stores for specific kinds of analysis in the project.
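As an illustrative worked example of this rule of thumb (the figures are assumptions, not from the source): if the original datasets for a project total roughly 200 GB, the team should plan for roughly 1 TB to 2 TB of sandbox storage (5 to 10 times 200 GB) to leave room for working copies, derived tables, and intermediate results.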
2. Performing ETLT (Extract, Transform, Load, Transform)

 In ETL, users perform extract, transform, load processes to extract data from a
data store, perform data transformations, and load the data back into the data
store.
 However, the analytic sandbox approach differs slightly; it advocates ELT, i.e., extract, load, and then transform.
 In this case, the data is extracted in its raw form and loaded into the data store,
where analysts can choose to transform the data into a new state or leave it in
its original, raw condition.
 The reason for this approach is that there is significant value in preserving the
raw data and including it in the sandbox before any transformations take place.

 The team may want clean data and aggregated data and may need to keep a
copy of the original data to compare against or look for hidden patterns that
may have existed in the data before the cleaning stage. This process can be
summarized as ETLT to reflect the fact that a team may choose to perform ETL
in one case and ELT in another.
 ETL is valuable when it comes to data quality, data security, and data
compliance. It can also save money on data warehousing costs. However, ETL is
slow when ingesting unstructured data, and it can lack flexibility.
 ELT is fast when ingesting large amounts of raw, unstructured data. It also brings
flexibility to your data integration and data analytics strategies. However, ELT
sacrifices data quality, security, and compliance in many cases.

ETLT is a “best of both worlds” approach to data integration that (1) speeds up data
ingestion while (2) ensuring data quality and securing sensitive data in accordance
with industry compliance standards.
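The ELT pattern described above can be sketched in a few lines of Python, using pandas and SQLite as a stand-in for the analytic sandbox; the file, table, and column names here are hypothetical and purely illustrative.

    import sqlite3
    import pandas as pd

    # Extract: pull the raw data from a source system (hypothetical CSV export)
    raw = pd.read_csv("call_logs_raw.csv")

    # Load: land the raw, untransformed copy in the sandbox so it is preserved
    conn = sqlite3.connect("analytic_sandbox.db")
    raw.to_sql("call_logs_raw", conn, if_exists="replace", index=False)

    # Transform: build a cleaned, aggregated table while the raw table stays untouched
    clean = raw.dropna(subset=["duration_sec"])
    daily = clean.groupby("call_date", as_index=False)["duration_sec"].sum()
    daily.to_sql("call_logs_daily", conn, if_exists="replace", index=False)

In an ETLT setup, a first lightweight transformation (for example, masking sensitive fields) could be applied before the raw load, with the heavier analytical transformations still performed inside the sandbox afterwards.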
3. Learning About the Data

 Determines the data available to the team early in the project
 Highlights gaps – identifies data not currently available
 Identifies data outside the organization that might be useful

The sample data inventory table shown below demonstrates one way to organize
the type of data inventory.
4. Data Conditioning

 Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
 Data conditioning can involve many complex steps to join or merge datasets or otherwise get datasets into a state that enables analysis in further phases.
 Part of this phase involves deciding which aspects of particular datasets will be useful to analyze in later steps.

Key Tasks (a brief pandas sketch follows this list):
 Handle missing values using methods like mean/mode imputation or advanced algorithms.
 Detect and treat outliers to avoid skewing results.
 Normalize or standardize data for consistency across variables.
 Encode categorical variables using techniques like one-hot or label encoding.
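A minimal pandas and scikit-learn sketch of these conditioning tasks; the data file and column names (listings_raw.csv, price, beds, room_type) are hypothetical placeholders, not a prescribed schema.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("listings_raw.csv")

    # Handle missing values: mean imputation for a numeric column, mode for a categorical one
    df["price"] = df["price"].fillna(df["price"].mean())
    df["room_type"] = df["room_type"].fillna(df["room_type"].mode()[0])

    # Detect and treat outliers with a simple IQR rule
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[(df["price"] >= q1 - 1.5 * iqr) & (df["price"] <= q3 + 1.5 * iqr)]

    # Normalize/standardize numeric variables for consistency across scales
    df[["price", "beds"]] = StandardScaler().fit_transform(df[["price", "beds"]])

    # Encode categorical variables with one-hot encoding
    df = pd.get_dummies(df, columns=["room_type"])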
5. Survey and Visualize
 After the team has collected and obtained
at least some of the datasets needed for
the subsequent analysis, a useful step is to
leverage data visualization tools to gain an
overview of the data.
 Seeing high-level patterns in the data
enables one to understand characteristics
about the data very quickly.
 This step uses visual tools to detect outliers, gaps, or trends that might impact the analysis.

Visualization Techniques:

1. Univariate Analysis: Histograms and boxplots to explore individual variables.
2. Multivariate Analysis: Scatter plots and heatmaps to study relationships.
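A short sketch of how these univariate and multivariate views might be produced with pandas, matplotlib, and seaborn; the dataset and column names are assumed for illustration only.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv("listings_clean.csv")   # hypothetical conditioned dataset

    # Univariate: distribution and outliers of a single variable
    df["price"].plot(kind="hist", bins=50, title="Price distribution")
    plt.show()
    df.boxplot(column="price")
    plt.show()

    # Multivariate: relationships among variables
    df.plot(kind="scatter", x="beds", y="price")
    plt.show()
    sns.heatmap(df.corr(numeric_only=True), annot=True)   # correlation heatmap
    plt.show()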
Common Tools for the Data Preparation Phase
Several tools are commonly used for this phase:

1. Hadoop can perform massively parallel ingest and custom analysis for web
traffic parsing, GPS location analytics, genomic analysis, and combining of
massive unstructured data feeds from multiple sources.
2. Alpine Miner provides a graphical user interface (GUI) for creating analytic workflows, including data manipulations and a series of analytic events such as staged data-mining techniques on PostgreSQL and other Big Data sources.
3. OpenRefine (formerly called Google Refine) is “a free, open source,
powerful tool for working with messy data.” It is a popular GUI-based tool
for performing data transformations, and it’s one of the most robust free
tools currently available.
4. Data Wrangler is an interactive tool for data cleaning and transformation.
Wrangler was developed at Stanford University and can be used to perform
many transformations on a given dataset. In addition, data transformation
outputs can be put into Java or Python. The advantage of this feature is that
a subset of the data can be manipulated in Wrangler via its GUI, and then
the same operations can be written out as Java or Python code to be
executed against the full, larger dataset offline in a local analytic sandbox.
Why Data Preparation is Critical

Improves Model Accuracy: High-quality data ensures better predictive power and
insights.
Saves Time: Identifying and fixing issues early avoids delays in later phases.
Mitigates Risks: Reduces the chance of biased or misleading results caused by
poor-quality data.
Ensures Business Value: Reliable data leads to actionable insights that drive
business decisions.
The Data Preparation phase lays the groundwork for accurate and meaningful
analytics. Without a robust preparation process, even the most advanced models
and techniques can yield unreliable results.
Phase 3: Model Planning
 In this phase, the data science team identifies candidate models to apply to the data for clustering, classifying, or finding relationships in the data, depending on the goal of the project.
 The table shown below summarizes the results of an exercise of this type, involving several domain areas and the types of models previously used for a classification type of problem, after conducting research on churn models in multiple industry verticals.
 Performing this sort of exercise gives the team ideas of how others have solved
similar problems and presents the team with a list of candidate models to try as
part of the model planning phase.
1. Data Exploration and Variable Selection

The objective of the data exploration is to understand the relationships among the
variables to inform selection of the variables and methods and to understand the
problem domain.
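One simple way to let exploration inform variable selection is to rank numeric variables by their correlation with the outcome of interest; the sketch below assumes a hypothetical conditioned dataset and a price target, neither of which is specified in the source.

    import pandas as pd

    df = pd.read_csv("listings_clean.csv")    # hypothetical conditioned dataset

    # Rank candidate explanatory variables by absolute correlation with the target
    target = "price"
    corr = df.corr(numeric_only=True)[target].drop(target).abs().sort_values(ascending=False)
    print(corr.head(10))   # shortlist of variables worth considering in model planning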
2. Model Selection
Common Tools for the Model Planning Phase

By the end of the Model Planning phase, the team will have a clear strategy for building and
validating models, ensuring a focused and effective approach to solving the problem.
Phase 4: Model Building
 In this phase, the data science team needs to develop datasets for training, testing, and production purposes.

 These datasets enable the data scientist to develop the analytical model and
train it, while holding aside some of the data for testing the model.

 During this process, it is critical to ensure that the training and test datasets are
sufficiently robust for the model and analytical techniques.
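A minimal scikit-learn sketch of holding aside part of the data for testing while training on the rest; the churn_history.csv file, its columns, and the choice of classifier are assumptions made purely for illustration.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("churn_history.csv")      # hypothetical labeled dataset (numeric features)
    X = df.drop(columns=["churned"])           # explanatory variables
    y = df["churned"]                          # target label

    # Hold aside 20% of the data for testing; the rest is used for training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    print("Accuracy on held-out test data:", accuracy_score(y_test, model.predict(X_test)))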
Creating robust models that are suitable to a specific situation requires thoughtful
consideration to ensure the models being developed ultimately meet the objectives
outlined in Phase 1.

Questions to consider include these:


Common Tools for the Model Building Phase
Phase 5: Communicate Results

 After executing the model, the team needs to compare the outcomes of the
modeling to the criteria established for success and failure.

 In this phase, the team considers how best to articulate the findings and
outcomes to the various team members and stakeholders, taking into account
caveats, assumptions, and any limitations of the results.

 Because the presentation is often circulated within an organization, it is critical to articulate the results properly and position the findings in a way that is appropriate for the audience.
 As a result of this phase, the team will have documented the key findings and
major insights derived from the analysis.
Phase 6: Operationalize
 In the final phase, the team communicates the benefits of the project more
broadly and sets up a pilot project to deploy the work in a controlled way before
broadening the work to a full enterprise or ecosystem of users.
 In this phase, the team approaches deploying the new analytical methods or models in a production environment.
 Part of the operationalizing phase includes creating a mechanism for performing
ongoing monitoring of model accuracy and, if accuracy degrades, finding ways to
retrain the model.
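A sketch of the kind of ongoing monitoring and retraining trigger described above; the model file, data source, accuracy threshold, and column names are all hypothetical.

    import joblib
    import pandas as pd
    from sklearn.metrics import accuracy_score

    ACCURACY_THRESHOLD = 0.80                        # hypothetical retraining trigger

    model = joblib.load("churn_model.joblib")        # previously deployed model
    recent = pd.read_csv("recent_labeled_data.csv")  # new data with known outcomes

    X_recent = recent.drop(columns=["churned"])
    y_recent = recent["churned"]

    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    if accuracy < ACCURACY_THRESHOLD:
        # Accuracy has degraded: retrain on the latest data and redeploy
        model.fit(X_recent, y_recent)
        joblib.dump(model, "churn_model.joblib")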

Why the Operationalize Phase is Critical?

Ensures Business Impact: Converts analytics into actionable tools that influence
real-world decisions.
Maintains Model Effectiveness: Ongoing monitoring ensures the solution remains
relevant and reliable.
Promotes Scalability: Proper deployment allows the solution to be scaled across
the organization.
Closes the Analytics Loop: By integrating results into workflows, this phase
completes the journey from data to decision-making.
The Operationalize phase is where analytics becomes actionable, ensuring the
project delivers sustained value and supports continuous innovation within the
organization.
 Although these seven roles represent many interests within a project, these
interests usually overlap, and most of them can be met with four main
deliverables.

 When presenting to other audiences with more quantitative backgrounds, focus more time on the methodology and findings.
 The audience will be more interested in the techniques, especially if the team
developed a new way of processing or analyzing data that can be reused in the
future or applied to similar problems.
 In addition, use imagery or data visualization when possible.
Why is data analytics lifecycle essential?
The Data Analytics Lifecycle (DAL) is essential because it provides a structured
framework for executing analytics projects effectively and delivering actionable
insights. Here’s a breakdown of why it is critical:

1. Ensures Alignment with Business Goals


The lifecycle begins with understanding the business problem and setting
objectives, ensuring that analytics efforts directly address organizational needs.
It avoids wasting resources on projects that do not contribute to business
outcomes.
2. Promotes a Systematic Approach
The lifecycle organizes the analytics process into distinct, manageable stages
(Discovery, Data Preparation, Model Building, etc.).
This ensures that all necessary steps are followed, reducing the likelihood of
errors or missed opportunities.
3. Encourages Collaboration
Analytics projects require input from multiple stakeholders, including business
leaders, data scientists, engineers, and domain experts.
DAL fosters cross-functional teamwork by defining roles and responsibilities
clearly.
4. Handles Complex Data Challenges
Modern data is often large, unstructured, and scattered across various systems.
The lifecycle ensures systematic data collection, cleaning, and preparation, creating
a solid foundation for analysis.
5. Drives Better Decision-Making
By following a structured approach, insights generated are reliable, reproducible,
and directly tied to actionable outcomes.
Results are communicated effectively through dashboards, reports, and
visualizations tailored to decision-makers.

6. Enhances Efficiency and Scalability


A well-defined lifecycle streamlines workflows, minimizes redundancy, and
ensures efficient use of resources.
The approach is scalable, making it applicable to both small and large analytics
projects.
7. Mitigates Risks
Risks, such as misaligned objectives, poor-quality data, or unscalable solutions,
are identified and addressed at appropriate stages.
Iterative validation ensures that errors are caught early.
8. Encourages Continuous Improvement
Lessons learned from each project can be documented and fed back into future
projects.
It promotes a culture of iteration, ensuring analytics solutions improve over time.

9. Bridges the Gap Between Technical and Non-Technical Stakeholders


The lifecycle facilitates clear communication between data experts and business
stakeholders.
It ensures that technical solutions are aligned with business contexts and that
insights are actionable for decision-makers.

10. Adapts to Diverse Analytics Needs


The lifecycle is flexible enough to handle a wide range of analytics problems, from
descriptive analytics (What happened?) to predictive analytics (What will happen?)
and prescriptive analytics (What should we do?).

In summary, the Data Analytics Lifecycle is essential for ensuring that analytics
projects are purposeful, efficient, and impactful. It transforms raw data into insights
that drive business success while fostering collaboration and reducing risks.
Case Study:

 Airbnb is an American company operating an online marketplace for short- and long-term homestays and experiences in various countries and regions.
 It acts as a broker and charges a commission from each booking.
How Airbnb used big data to propel its growth

1. Enhanced Search Features
2. Guiding Hosts to the Perfect Price
3. Driving Company Growth

 Airbnb has taught some valuable lessons when it comes to treating big data as the voice of the customer.
 The takeaway from the success of Airbnb for any company is to:
1. Consider data as the soul of your business.
2. Hire data scientists who can decipher what customers need just by looking
at the data.
3. Make data-driven product decisions that will drive success.
Phase 1: Discovery

Airbnb holds the following data:

1. Accommodates: the number of guests the rental can accommodate
2. Bedrooms: number of bedrooms included in the rental
3. Bathrooms: number of bathrooms included in the rental
4. Beds: number of beds included in the rental
5. Price: nightly price for the rental
6. Minimum nights: minimum number of nights a guest can stay in the rental
7. Maximum nights: maximum number of nights a guest can stay in the rental
8. Number of reviews: number of reviews that previous guests have left
9. Images: photos taken by guests
Phase 2: Data Preparation
Airbnb is a big user of Hadoop technology: all the unstructured information about the rooms, room owners, and room locations is sorted and analyzed using the open-source framework. The Apache Hive data warehouse is used on top of Hadoop, holding about 1.5 petabytes of data.

Behavioral Aspect: determined by how the user interacts with the Airbnb website.
Dimensional Factor: device used, language, and location preferred.
Sentiment: lodging reviews, survey results, and ratings are vital deciding factors.
Imputed: captures the location preference of the traveler, for example, city vs. local towns.

Phase 3: Model Planning

 Airbnb wanted to track user sentiment by analyzing reviews
 Airbnb wanted to estimate the likelihood of a guest booking based on price
 Location factors affect price and booking
Phase 4: Model Building

There are several data science techniques being used by Airbnb to learn more about its users:
1. A/B Testing
2. Image Recognition and Analysis
3. Natural Language Processing
4. Predictive Modelling
5. Regression Analysis
6. Collaborative Filtering
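As a rough illustration of how a regression analysis (technique 5) might relate the listing attributes from Phase 1 to nightly price, here is a minimal sketch; the data file and column names are hypothetical, and this is not Airbnb's actual model.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    listings = pd.read_csv("airbnb_listings.csv")   # hypothetical export of the Phase 1 attributes

    X = listings[["accommodates", "bedrooms", "bathrooms", "beds", "number_of_reviews"]]
    y = listings["price"]                            # nightly price to be explained

    model = LinearRegression().fit(X, y)
    # Coefficients indicate how each attribute is associated with nightly price
    print(dict(zip(X.columns, model.coef_.round(2))))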
Phase 5: Communicate Results
 The insight gained from its data enables Airbnb to ensure the company is concentrating
on signing up hosts in the most popular destinations around the world at peak times –
and ensure accommodation is marketed at an appropriate price.
 For example, Airbnb helps its hosts determine the right price for their property or room
using certain algorithms.
 The appropriate price of accommodation is determined by a number of data points, such
as location, transport links, type of accommodation, time of year, etc.
 The data science team at Airbnb redesigned the algorithm and removed the "Neighbourhood" link for visitors from Asian countries. Instead, they listed top travel destinations in Singapore, China, Japan, and Korea. The result was astonishing: Asian visitors' conversion rates increased by 10%.
Phase 6: Operationalize

 At the heart of the Airbnb site is its search. Carefully tuned, its search has been
designed to inspire, amaze and delight customers at every step.
 But it wasn’t always such a walk in the park. Originally, Airbnb didn’t know
what kind of data to give customers, so it settled on a model which returned
the highest quality listings within a certain radius based on the user’s search.
 Because Airbnb uses data to constantly improve itself, it is also forging into new frontiers where laws and regulations have yet to catch up.
 A case in point is the launch of its "price tips" feature a little over a year ago. With Price Tips, a host can look at the calendar to see which dates are likely to be booked at their current price, as well as which aren't, and get pricing suggestions.
Thank You
