Unit 2 - Data Science
DR.A.SHANTHINI
ASSOCIATE PROFESSOR
DEPARTMENT OF DSBS
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
Data Science Team
► There are certain key roles required for the complete and successful functioning of a data science team.
► Each role plays a crucial part in delivering a successful analytics project.
► There is no hard and fast rule for staffing the listed roles; they can be fewer or more depending on the scope of the project, the skills of the participants, and the organizational structure.
► A data science team needs these roles to execute analytics projects successfully.
Key roles of a successful analytics project
► Business User
► Project Sponsor
► Project Manager
► Business Intelligence Analyst
► Database Administrator
► Data Engineer
► Data Scientist
Key Roles for a Data Analytics Project:
► Business User:
∙ The business user is the one who understands the domain area of the project and typically benefits from its results.
∙ This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be used in operations.
∙ A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.
Cont..
► Project Sponsor:
∙ The project sponsor is the one responsible for initiating the project. The sponsor provides the actual requirements for the project and presents the core business problem.
∙ He or she generally provides the funding and gauges the degree of value delivered by the final outputs of the team working on the project.
∙ This person sets the priorities and shapes the desired outputs.
► Project Manager:
∙ This person ensures that the key milestones and objectives of the project are met on time and at the expected quality.
Cont..
► Business Intelligence Analyst:
∙ The business intelligence analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.
∙ This person generally creates dashboards and reports and knows about the data feeds and sources.
► Database Administrator (DBA):
∙ The DBA provisions and configures the database environment to support the analytics needs of the team working on the project.
∙ His or her responsibilities may include providing access to key databases or tables and making sure that the appropriate security levels are in place for the data repositories.
Cont..
► Data Engineer:
∙ The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
∙ The data engineer works closely with the data scientist to help shape data into the right form for analysis.
► Data Scientist:
∙ The data scientist provides subject matter expertise for analytical techniques, data modelling, and applying the correct analytical techniques to a given business problem.
∙ Data scientists design and apply analytical methods and approaches to the data available for the concerned project.
Overview of the Data Analytics Lifecycle
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Phase 2: Data Preparation
► Includes the steps to explore, preprocess, and condition data prior to modeling and analysis.
► Exploring the data is done by preparing an analytics sandbox.
► To get the data into the sandbox, the team needs to perform ETLT: a combination of extracting data from data sources, transforming it, and loading it into the sandbox (a minimal sketch follows this list).
► The team must decide how to condition and transform the data to get it into a format that facilitates subsequent analysis.
► The team may perform data visualizations to help team members understand the data, including its trends, outliers, and relationships among data variables.
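To make the ETLT step concrete, here is a minimal sketch in Python/pandas that extracts data from a source file, loads it into a sandbox unchanged, and only then transforms it. The file name, the column names, and the use of SQLite as the sandbox are illustrative assumptions, not prescribed by the lifecycle.

# Minimal ETLT sketch: extract raw data, load it into the analytic sandbox
# unchanged, then transform inside the sandbox. File names, column names,
# and the SQLite sandbox are illustrative assumptions.
import pandas as pd
import sqlite3

# Extract: read raw data from a source system (hypothetical CSV export).
raw = pd.read_csv("sales_export.csv")

# Load: land the raw data in the sandbox first, so analysts keep access to
# the untransformed records.
sandbox = sqlite3.connect("analytics_sandbox.db")
raw.to_sql("sales_raw", sandbox, if_exists="replace", index=False)

# Transform: clean and reshape inside the sandbox for analysis.
clean = raw.dropna(subset=["order_id"]).copy()
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
clean["revenue"] = clean["quantity"] * clean["unit_price"]
clean.to_sql("sales_clean", sandbox, if_exists="replace", index=False)

Loading before transforming preserves the raw records in the sandbox, which is the main difference between ETLT and classic ETL.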
Phase 2: Data Preparation (Cont..)
► Preparing the Analytic Sandbox - Expect the sandbox to be large. It may contain raw data, aggregated data, and other data types that are less commonly used in organizations. Sandbox size can vary greatly depending on the project. A good rule is to plan for the sandbox to be at least 5–10 times the size of the original datasets.
► Performing ETLT - In ETL, users perform extract, transform, load processes to extract data from a datastore, perform data transformations, and load the data back into the datastore. The analytic sandbox approach differs slightly; it advocates extract, load, and then transform.
► Learning About the Data - Understand what constitutes a reasonable value and expected output versus what is a surprising finding, and identify additional data sources.
► Data Conditioning - The process of cleaning data, normalizing datasets, and performing transformations on the data; often viewed as a preprocessing step (see the sketch after this list).
► Survey and Visualize - A useful step is to leverage data visualization tools to gain an overview of the data.
► Common Tools for the Data Preparation Phase - Hadoop, Alpine Miner, OpenRefine, Data Wrangler
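As a concrete illustration of data conditioning and a quick survey step, the sketch below cleans, normalizes, and summarizes a dataset with pandas. The input file and the column names (customer_age, monthly_spend) are hypothetical, and the plot assumes matplotlib is available.

# Minimal data-conditioning sketch: cleaning and normalization with pandas.
import pandas as pd

df = pd.read_csv("customers.csv")          # assumed input file

# Cleaning: drop duplicate rows and fill missing numeric values with the median.
df = df.drop_duplicates()
df["customer_age"] = df["customer_age"].fillna(df["customer_age"].median())

# Normalizing: rescale a numeric column to the [0, 1] range (min-max scaling).
spend = df["monthly_spend"]
df["monthly_spend_scaled"] = (spend - spend.min()) / (spend.max() - spend.min())

# A quick survey/visualization step: summary statistics and a histogram.
print(df.describe())
df["monthly_spend_scaled"].plot(kind="hist", title="Scaled monthly spend")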
Phase 3: Model Planning
► The data science team identifies candidate models to apply to the data for clustering, classifying, or finding relationships in the data, depending on the goal of the project.
► Data Exploration and Variable Selection - Activities focus mainly on data hygiene and on assessing the quality of the data itself. A common way to conduct this step involves using tools to perform data visualizations. Approaching the data exploration in this way aids the team in previewing the data and assessing relationships between variables at a high level (see the sketch after this list).
► Model Selection - Choose an analytical technique based on the end goal of the project. In the case of machine learning and data mining, these techniques are grouped into several general families, such as classification, association rules, and clustering.
► Common Tools for the Model Planning Phase - R, SQL Analysis Services, SAS/ACCESS
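A minimal sketch of data exploration and variable selection during model planning, assuming a hypothetical feature table with a numeric 0/1 target column named churned; the 0.2 correlation cutoff is an arbitrary illustrative choice.

# Preview relationships between candidate variables via pairwise correlations.
import pandas as pd

df = pd.read_csv("churn_features.csv")     # hypothetical feature table

# The correlation matrix over numeric columns gives a high-level view of
# which variables move together and which may be redundant.
corr = df.select_dtypes("number").corr()
print(corr.round(2))

# Keep features whose absolute correlation with the target exceeds a threshold.
target_corr = corr["churned"].drop("churned").abs()
candidates = target_corr[target_corr > 0.2].index.tolist()
print("Candidate variables:", candidates)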
Phase 4: Model Building
► The data science team needs to develop datasets for training, testing, and production purposes.
► An analytical model is developed and fit on the training data and evaluated (scored) against the test data (see the sketch after this list).
► Although the logic required to develop models can be highly complex, the actual duration of this phase can be short compared to the time spent preparing the data and defining the approaches.
► During this phase, users run models from analytical software packages, such as R or SAS, on file extracts and small datasets for testing purposes. On a small scale, they assess the validity of the model and its results.
► Common Tools for the Model Building Phase
► Commercial Tools - SAS Enterprise Miner, SPSS Modeler, MATLAB, Alpine Miner, STATISTICA, and Mathematica
► Free or Open Source Tools - R and PL/R, Octave, WEKA, Python, SQL
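The sketch below illustrates the train/test workflow described above using scikit-learn; the dataset, the feature columns, and the choice of logistic regression are assumptions for illustration only.

# Minimal model-building sketch: split data into training and test sets,
# fit a model on the training data, and score it on the held-out test data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("churn_features.csv")      # hypothetical feature table
X = df[["tenure_months", "monthly_spend"]]  # assumed predictor columns
y = df["churned"]                           # assumed binary target

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                 # fit on training data

# Evaluate (score) against the test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))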
Phase 5: Communicate Results
► The team considers how best to articulate the findings and outcomes to the various team
members and stakeholders, taking into account caveats, assumptions, and any limitations of the
results.
► The key is to remember that the team must be rigorous enough with the data to determine
whether it will prove or disprove the hypotheses outlined in Phase 1 (discovery)
► If the results are valid, identify the aspects of the results that stand out and may provide salient
findings when it comes time to communicate them.
► If the results are not valid, think about adjustments that can be made to refine and iterate on the
model to make it valid.
► Depending on what emerged as a result of the model, the team may need to spend time
quantifying the business impact of the results to help prepare for the presentation and
demonstrate the value of the findings.
► The team will have documented the key findings and major insights derived from the analysis.
The deliverable of this phase will be the most visible portion of the process to the outside
stakeholders and sponsors, so take care to clearly articulate the results, methodology, and
business value of the findings
Phase 6: Operationalize
► The team sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users.
► This phase represents the first time that most analytics teams approach deploying the new analytical methods or models in a production environment.
► This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment.
► Part of the operationalizing phase includes creating a mechanism for performing ongoing monitoring of model accuracy (see the sketch below).
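One possible shape for such a monitoring mechanism is sketched below: the deployed model is re-scored on newly labeled records, and an alert is raised if accuracy drifts too far from its deployment baseline. The file names, baseline accuracy, drift threshold, and feature columns are illustrative assumptions.

# Minimal monitoring sketch for the operationalize phase.
import pandas as pd
import joblib
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.85        # assumed accuracy measured at deployment time
DRIFT_THRESHOLD = 0.05          # assumed tolerated drop before alerting

model = joblib.load("churn_model.joblib")          # previously trained model
recent = pd.read_csv("recent_scored_records.csv")  # new data with true labels

predictions = model.predict(recent[["tenure_months", "monthly_spend"]])
current_accuracy = accuracy_score(recent["churned"], predictions)

if BASELINE_ACCURACY - current_accuracy > DRIFT_THRESHOLD:
    # In a real pipeline this would raise an alert or open a retraining ticket.
    print(f"Model accuracy dropped to {current_accuracy:.2f}; investigate.")
else:
    print(f"Model accuracy {current_accuracy:.2f} is within tolerance.")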
Key Deliverables of an Analytics Project
Four main deliverables.