Module 1 (3)
By
Prof. A.V.Phanse
Data Analytics Life Cycle
The Data Analytics Lifecycle is a structured approach to generating, collecting,
processing, using, and analyzing data for effective decision making.
The Data Analytics Lifecycle is designed specifically for Big Data problems and
data science projects.
The lifecycle has six phases, and project work can occur in several phases at
once. For most phases in the lifecycle, the movement can be either forward or
backward.
Based on new information received, data analysts can decide whether to
continue with their current approach or scrap it and redo the analysis.
Throughout the process, they are guided by the Data Analytics Lifecycle.
The Data Analytics Lifecycle primarily consists of six phases: Discovery, Data Preparation, Model Planning, Model Building, Communicate Results, and Operationalize.
Key Roles for a Successful Analytics Project
Despite the strong focus on the emerging role of the data scientist, there are
actually seven key roles that need to be fulfilled for a well-functioning data
science team to execute analytic projects successfully.
Although seven roles are listed, fewer or more people can accomplish the work
depending on the scope of the project, the organizational structure, and the
skills of the participants.
The seven roles are as follows:
1. Business User :
The business user is the person who understands the domain area of the project
and generally benefits from its results.
This user can advise and consult the project team on the value of the results
obtained and on how the outputs will be operationalized.
The business manager, line manager, or a deep subject matter expert in the
project domain fulfills this role.
2. Project Sponsor
The Project Sponsor is the person responsible for initiating the project.
The Project Sponsor provides the actual requirements for the project and
presents the core business problem.
He or she generally provides the funding and gauges the degree of value from
the final outputs of the team working on the project.
This person introduces the core concern and shapes the desired output.
3. Project Manager
This person ensures that key milestones and objectives of the project are met
on time and at the expected quality.
4. Business Intelligence Analyst :
The business intelligence analyst provides business domain expertise based on a
deep understanding of the data, key performance indicators (KPIs), key metrics,
and business intelligence from a reporting perspective.
5. Database Administrator (DBA) :
The DBA provisions and configures the database environment to support the
analytics needs of the working team.
6. Data Engineer :
The data engineer brings deep technical skills to assist with tuning SQL queries
for data management and data extraction, and provides support for data ingestion
into the analytic sandbox.
The data engineer works closely with the data scientist to help shape data in
the right ways for analysis.
7. Data Scientist :
The data scientist provides subject matter expertise for analytical techniques,
data modeling, and applying valid analytical techniques to given business
problems.
He or she ensures that the overall analytical objectives are met.
Data scientists design and execute analytical methods and approaches with the
data available to the project.
Data Analytics Life Cycle
Phase 1: Discovery
The Discovery phase is the foundational step in the Data Analytics Lifecycle. It
involves understanding the business context, framing the problem, and laying
the groundwork for the analytics project.
In this phase, the team learns the business domain, including relevant history
such as whether the organization or business unit has attempted similar projects
in the past from which they can learn.
The team assesses the resources available to support the project in terms of
people, technology, time, and data.
In this phase, the data science team must learn and investigate the problem,
develop context and understanding, and learn about the data sources needed
and available for the project.
In addition, the team formulates initial hypotheses (IHs) that can later be
tested with data.
1. Learning the Business Domain
2. Assessing Resources
As part of the discovery phase, the team needs to assess the resources like
technology, tools, systems, data, and people available to support the
project.
In addition, the team tries to evaluate the level of analytical sophistication
within the organization and gaps that may exist related to the tools, technology,
and skills available.
In addition to the skills and computing resources, it is advisable to take
inventory of the types of data available to the team for the project. The
team will need to determine whether it must collect additional data,
purchase it from outside sources, or transform existing data.
After taking inventory of the tools, technology, data, and people, it is
necessary to consider if the team has sufficient resources to succeed on
this project, or if additional resources are needed.
Negotiating for resources at the start of the project, while scoping the
goals, objectives, and feasibility, is generally more useful than doing so later
in the process, and it ensures sufficient time to execute the project properly.
At the beginning, project sponsors may have a predetermined solution that may
not necessarily realize the desired outcome. In these cases, the team must use
its knowledge and expertise to identify the true underlying problem and
appropriate solution.
When interviewing the main stakeholders, the team needs to take time to
thoroughly interview the project sponsor, who tends to be the one funding the
project or providing the high-level requirements.
This person understands the problem and usually has an idea of a potential
working solution. It is critical to thoroughly understand the sponsor’s
perspective to guide the team in getting started on the project.
When interviewing the project sponsor during the discovery phase, it is helpful
to ask common questions such as what business problem the team is trying to
solve, what the desired outcome of the project is, and what data sources are
available.
Developing a set of IHs is a key facet of the discovery phase. This step
involves forming ideas that the team can test with data.
Generally, it is best to come up with a few primary hypotheses to test and
then be creative about developing several more.
These IHs form the basis of the analytical tests the team will use in later
phases and serve as the foundation for the findings in Phase 5.
Another part of this process involves gathering and assessing hypotheses
from stakeholders and domain experts who may have their own perspective
on what the problem is, what the solution should be, and how to arrive at a
solution.
These stakeholders would know the domain area well and can offer
suggestions on ideas to test as the team formulates hypotheses during this
phase.
The team will likely collect many ideas that may illuminate the operating
assumptions of the stakeholders. These ideas will also give the team
opportunities to expand the project scope into adjacent spaces where it
makes sense or design experiments in a meaningful way to address the
most important interests of the stakeholders.
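To make the idea of testable initial hypotheses concrete, here is a minimal sketch, assuming a hypothetical IH such as "listings with photos receive higher guest ratings"; the rating arrays are synthetic stand-ins, not real project data.

```python
# Minimal sketch: testing one initial hypothesis (IH) with data in a later phase.
# The two rating samples below are synthetic, hypothetical stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ratings_with_photos = rng.normal(4.6, 0.3, 200)
ratings_without_photos = rng.normal(4.4, 0.3, 200)

# Two-sample t-test: is the difference in mean ratings statistically significant?
t_stat, p_value = stats.ttest_ind(ratings_with_photos, ratings_without_photos)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```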
7. Identifying Potential Data Sources
The team should perform five main activities during this step:
1. Identify data sources.
2. Capture aggregate data sources.
3. Review the raw data.
4. Evaluate the data structures and tools needed.
5. Scope the sort of data infrastructure needed for this type of problem.
Phase 2: Data Preparation
The second phase of the Data Analytics Lifecycle involves data preparation,
which includes the steps to explore, preprocess, and condition data prior to
modeling and analysis.
In this phase, the team needs to create an environment in which it can explore
the data. Usually, this is done by preparing an analytics sandbox.
To get the data into the sandbox, the team needs to perform ETLT: a
combination of extracting, transforming, and loading data into the sandbox.
Once the data is in the sandbox, the team needs to learn about the data and
become familiar with it.
Understanding the data in detail is critical to the success of the project. The
team also must decide how to condition and transform data to get it into a
format to facilitate subsequent analysis.
The team may perform data visualizations to help team members understand
the data, including its trends, outliers, and relationships among data variables.
1. Preparing the Analytic Sandbox
The team obtains an analytic sandbox (also called a workspace) in which it can
explore data without interfering with live production databases.
2. Performing ETLT
In ETL, users perform extract, transform, load processes to extract data from a
data store, perform data transformations, and load the data back into the data
store.
However, the analytic sandbox approach differs slightly; it advocates ELT:
extract, load, and then transform.
In this case, the data is extracted in its raw form and loaded into the data store,
where analysts can choose to transform the data into a new state or leave it in
its original, raw condition.
The reason for this approach is that there is significant value in preserving the
raw data and including it in the sandbox before any transformations take place.
The team may want clean data and aggregated data and may need to keep a
copy of the original data to compare against or look for hidden patterns that
may have existed in the data before the cleaning stage. This process can be
summarized as ETLT to reflect the fact that a team may choose to perform ETL
in one case and ELT in another.
ETL is valuable when it comes to data quality, data security, and data
compliance. It can also save money on data warehousing costs. However, ETL is
slow when ingesting unstructured data, and it can lack flexibility.
ELT is fast when ingesting large amounts of raw, unstructured data. It also brings
flexibility to your data integration and data analytics strategies. However, ELT
sacrifices data quality, security, and compliance in many cases.
ETLT is a “best of both worlds” approach to data integration that (1) speeds up data
ingestion while (2) ensuring data quality and securing sensitive data in accordance
with industry compliance standards.
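As a rough illustration of the ELT pattern described above, the sketch below uses a local SQLite database as a stand-in for the analytic sandbox; the table and column names are hypothetical.

```python
# Minimal ELT-style sketch for an analytic sandbox (SQLite as a stand-in).
import sqlite3
import pandas as pd

# Raw extract (hypothetical records; in practice this comes from source systems)
raw = pd.DataFrame({
    "listing_id": [1, 2, 2, 3],
    "price": [120.0, None, None, 95.0],
    "city": ["Pune", "Mumbai", "Mumbai", None],
})

conn = sqlite3.connect("sandbox.db")

# Extract + Load: land the data in its raw form so the original state is preserved
raw.to_sql("listings_raw", conn, if_exists="replace", index=False)

# Transform: clean inside the sandbox while keeping the raw table for comparison
clean = pd.read_sql("SELECT * FROM listings_raw", conn)
clean = clean.drop_duplicates().dropna()
clean.to_sql("listings_clean", conn, if_exists="replace", index=False)

conn.close()
```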
3. Learning About the Data
The sample data inventory table shown below demonstrates one way to organize a
data inventory of this type.
4. Data Conditioning
Key Tasks:
Handle missing values using methods like mean/mode imputation or
advanced algorithms.
Detect and treat outliers to avoid skewing results.
Normalize or standardize data for consistency across variables.
Encode categorical variables using techniques like one-hot or label encoding.
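The following is a minimal sketch of these conditioning tasks using pandas and scikit-learn; the column names and values are hypothetical.

```python
# Minimal data-conditioning sketch; "age", "income", and "city" are hypothetical columns.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, None, 47, 35, 120],        # contains a missing value and an outlier
    "income": [40000, 52000, None, 61000, 58000],
    "city": ["Pune", "Mumbai", "Pune", "Delhi", "Mumbai"],
})
num_cols = ["age", "income"]

# 1. Handle missing values with mean imputation
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])

# 2. Treat outliers by capping values outside 1.5 * IQR
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Standardize numeric variables (zero mean, unit variance)
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# 4. One-hot encode the categorical variable
df = pd.concat(
    [df.drop(columns="city"), pd.get_dummies(df["city"], prefix="city")],
    axis=1,
)
print(df)
```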
5. Survey and Visualize
After the team has collected and obtained at least some of the datasets needed
for the subsequent analysis, a useful step is to leverage data visualization
tools to gain an overview of the data.
Seeing high-level patterns in the data enables one to understand characteristics
about the data very quickly.
This step uses visual tools to detect outliers, gaps, or trends that might
impact the analysis.
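As a small illustration of this survey step, the sketch below plots a histogram and a box plot of a synthetic, hypothetical "nightly_price" column to reveal skew and potential outliers.

```python
# Minimal visual-survey sketch on a synthetic, hypothetical column.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"nightly_price": rng.lognormal(mean=4.5, sigma=0.6, size=1000)})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: reveals the overall shape and skew of the distribution
ax1.hist(df["nightly_price"], bins=40)
ax1.set_title("Distribution of nightly_price")

# Box plot: highlights potential outliers beyond the whiskers
ax2.boxplot(df["nightly_price"])
ax2.set_title("Outlier check for nightly_price")

plt.tight_layout()
plt.show()
```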
Visualization Techniques: histograms, box plots, scatter plots, and heat maps.
Common Tools for the Data Preparation Phase
1. Hadoop can perform massively parallel ingest and custom analysis for web
traffic parsing, GPS location analytics, genomic analysis, and combining of
massive unstructured data feeds from multiple sources.
2. Alpine Miner provides a graphical user interface (GUI) for creating analytic
workflows, including data manipulations and a series of analytic events
such as staged data-mining techniques on Postgres SQL and other Big Data
sources.
3. OpenRefine (formerly called Google Refine) is “a free, open source,
powerful tool for working with messy data.” It is a popular GUI-based tool
for performing data transformations, and it’s one of the most robust free
tools currently available.
4. Data Wrangler is an interactive tool for data cleaning and transformation.
Wrangler was developed at Stanford University and can be used to perform
many transformations on a given dataset. In addition, data transformation
outputs can be put into Java or Python. The advantage of this feature is that
a subset of the data can be manipulated in Wrangler via its GUI, and then
the same operations can be written out as Java or Python code to be
executed against the full, larger dataset offline in a local analytic sandbox
Why Data Preparation is Critical
Improves Model Accuracy: High-quality data ensures better predictive power and
insights.
Saves Time: Identifying and fixing issues early avoids delays in later phases.
Mitigates Risks: Reduces the chance of biased or misleading results caused by
poor-quality data.
Ensures Business Value: Reliable data leads to actionable insights that drive
business decisions.
The Data Preparation phase lays the groundwork for accurate and meaningful
analytics. Without a robust preparation process, even the most advanced models
and techniques can yield unreliable results.
Phase 3: Model Planning
In this phase, the data science team identifies candidate models to apply to the
data for clustering, classifying, or finding relationships in the data,
depending on the goal of the project.
The table shown below summarizes the results of one such exercise: after
conducting research on churn models in multiple industry verticals, the team
catalogued several domain areas and the types of models previously used for
this classification problem.
Performing this sort of exercise gives the team ideas of how others have solved
similar problems and presents the team with a list of candidate models to try as
part of the model planning phase.
1. Data Exploration and Variable Selection
The objective of data exploration is to understand the relationships among the
variables, to inform the selection of variables and methods, and to understand
the problem domain.
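A minimal sketch of this kind of exploration, assuming synthetic churn-style data with hypothetical column names, is shown below; pairwise correlations are one simple way to inform variable selection.

```python
# Minimal data-exploration sketch on synthetic, hypothetical churn-style data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, 500),
    "monthly_charges": rng.normal(70, 20, 500),
    "support_calls": rng.poisson(2, 500),
})
# Synthetic churn flag loosely related to support_calls
df["churned"] = (df["support_calls"] + rng.normal(0, 1, 500) > 3).astype(int)

corr = df.corr()

# Pairwise correlations help identify candidate predictors and redundancy
print(corr.round(2))

# Variables most correlated with the outcome are candidates to keep
print(corr["churned"].drop("churned").abs().sort_values(ascending=False))
```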
2. Model Selection
In this sub-phase, the team chooses an analytical technique, or a short list of
candidate techniques, based on the end goal of the project.
Common Tools for the Model Planning Phase
Tools commonly used in this phase include R, SQL Analysis services, and SAS/ACCESS.
By the end of the Model Planning phase, the team will have a clear strategy for building and
validating models, ensuring a focused and effective approach to solving the problem.
Phase 4: Model Building
In this phase, the data science team needs to develop datasets for training,
testing, and production purposes.
These datasets enable the data scientist to develop the analytical model and
train it, while holding aside some of the data for testing the model.
During this process, it is critical to ensure that the training and test datasets are
sufficiently robust for the model and analytical techniques.
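A minimal model-building sketch along these lines, using scikit-learn with synthetic data in place of the project's real training and test sets, might look like this.

```python
# Minimal model-building sketch with a train/test split on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's prepared dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold aside 30% of the data for testing the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data to check that the model generalizes
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```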
Creating robust models that are suitable to a specific situation requires thoughtful
consideration to ensure the models being developed ultimately meet the objectives
outlined in Phase 1.
Phase 5: Communicate Results
After executing the model, the team needs to compare the outcomes of the
modeling to the criteria established for success and failure.
In this phase, the team considers how best to articulate the findings and
outcomes to the various team members and stakeholders, taking into account
caveats, assumptions, and any limitations of the results.
Phase 6: Operationalize
In the final phase, the team delivers final reports, briefings, code, and
technical documents, and may run a pilot project to implement the models in a
production environment.
Ensures Business Impact: Converts analytics into actionable tools that influence
real-world decisions.
Maintains Model Effectiveness: Ongoing monitoring ensures the solution remains
relevant and reliable.
Promotes Scalability: Proper deployment allows the solution to be scaled across
the organization.
Closes the Analytics Loop: By integrating results into workflows, this phase
completes the journey from data to decision-making.
The Operationalize phase is where analytics becomes actionable, ensuring the
project delivers sustained value and supports continuous innovation within the
organization.
Although these seven roles represent many interests within a project, these
interests usually overlap, and most of them can be met with four main
deliverables: a presentation for project sponsors, a presentation for analysts,
the code for technical people, and technical specifications for implementing
the code.
In summary, the Data Analytics Lifecycle is essential for ensuring that analytics
projects are purposeful, efficient, and impactful. It transforms raw data into insights
that drive business success while fostering collaboration and reducing risks.
Case Study: Airbnb
Airbnb has taught some valuable lessons when it comes to treating big data
as the voice of the customer.
The takeaway from the success of Airbnb for any company is to:
1. Consider data as the soul of your business.
2. Hire data scientists who can decipher what customers need just by looking
at the data.
3. Make data-driven product decisions that will drive success.
Phase 1: Discovery
Behavioral Aspect: determined by how the user interacts with the Airbnb website.
Dimensional Factor: device used, preferred language, and location.
Sentiment: lodging reviews, survey results, and ratings are vital deciding factors.
Imputed: inferred attributes, such as the traveler's location preference, for example, city vs. local towns.
Several data science techniques are used by Airbnb to learn more about its users:
1. A/B Testing
2. Image Recognition and Analysis
3. Natural Language Processing
4. Predictive Modelling
5. Regression Analysis
6. Collaborative Filtering
Phase 5: Communicate Results
The insight gained from its data enables Airbnb to ensure that it concentrates
on signing up hosts in the most popular destinations around the world at peak
times, and that accommodation is marketed at an appropriate price.
For example, Airbnb helps its hosts determine the right price for their property or room
using certain algorithms.
The appropriate price of accommodation is determined by a number of data points, such
as location, transport links, type of accommodation, time of year, etc.
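The exact pricing algorithm is not described here, so the following is only a rough sketch of the general idea: a regression model trained on hypothetical listing features to suggest a nightly price.

```python
# Rough price-suggestion sketch using linear regression on hypothetical
# listing features (not Airbnb's actual algorithm or data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: distance to city centre (km), bedrooms, month of year
X = np.array([
    [1.0, 1, 7],
    [5.5, 2, 7],
    [0.5, 3, 12],
    [8.0, 1, 3],
    [2.5, 2, 12],
])
y = np.array([120, 95, 240, 60, 150])  # observed nightly prices

model = LinearRegression().fit(X, y)

# Suggest a price for a new listing: 2 km from the centre, 2 bedrooms, August
print("Suggested nightly price:", round(model.predict([[2.0, 2, 8]])[0], 2))
```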
The data science team at Airbnb redesigned the algorithm and removed the
"Neighbourhood" link for visitors from Asian countries, instead listing top
travel destinations in Singapore, China, Japan, and Korea. The result was
striking: conversion rates for Asian visitors increased by 10%.
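A result like this is typically checked for statistical significance; the sketch below applies a two-proportion z-test to hypothetical visitor and conversion counts (not Airbnb's actual numbers).

```python
# Minimal A/B-test evaluation sketch with hypothetical counts.
from statsmodels.stats.proportion import proportions_ztest

conversions = [540, 600]    # control vs. redesigned experience
visitors = [10000, 10000]

# Two-proportion z-test: is the lift in conversion rate statistically significant?
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```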
Phase 6: Operationalize
At the heart of the Airbnb site is its search. Carefully tuned, its search has been
designed to inspire, amaze and delight customers at every step.
But it wasn’t always such a walk in the park. Originally, Airbnb didn’t know
what kind of data to give customers, so it settled on a model which returned
the highest quality listings within a certain radius based on the user’s search.
Because Airbnb uses data to constantly improve itself, it is also forging into
new frontiers where laws and regulations have yet to catch up.
A case in point is the launch of its "price tips" feature a little over a year
ago. With Price Tips, a host can look at the calendar to see which dates are
likely to be booked at their current price, as well as which aren't, and get
pricing suggestions.
Thank You