1 Introduction To Data Science

data science thapar

Uploaded by

onlyforuse0709
Introduction to Data Science

Introduction

• Data Science is an interdisciplinary field that focuses on extracting knowledge
from data sets, which are typically very large.
• The field encompasses analysis, preparing data for analysis, and presenting
findings to inform high-level decisions in an organization.
• As such, it incorporates skills from computer science, mathematics, statistics,
inform.
• Some of the techniques used in Data Science include machine learning,
visualization, pattern recognition, probability modeling, data
engineering, signal processing, etc.
Prerequisites for Data Science
The following are the three essential traits of a Data Scientist:

Curiosity: Only by asking questions will you gain a better understanding of the
business problem.
Common Sense: To identify new ways to solve business problems and to detect
priority problems.
Communication Skills: A Data Scientist needs to communicate findings to
business teams so that they can act on the insights.
Skills required for a Data Scientist

Domain Knowledge:
• To get useful information out of raw data that
benefits a company’s business.
• To know the business model of the
company.
• To ask the right questions to produce valuable
results.

Math Skills:
• Linear Algebra, Calculus, and other concepts
of mathematics help us to understand the
complex behavior of Machine Learning
algorithms.
• Probability and statistics are mainly used in
predictive modeling and clustering.
Computer Science:
• To implement Data Science techniques using programming
languages like Python, R, SQL, Scala, Julia, JavaScript, etc.
• To deal with varied databases and cloud networks to process the
data.
• Knowledge about algorithms, relational and non-relational
databases, Distributed Computing, and Machine Learning.

Communication Skills:
• To communicate well when working in a team.
• To draw conclusions from the data analysis and present
them.
Data Science Three skill tracks:

Engineering, Modeling, Analysis

Engineering
• Involves building the data pipeline infrastructure.
• It involves the software and the hardware used to store the data and perform
data ETL (i.e., extract, transform, and load).
• Store and compute data on the cloud.
• The fundamental building block for automation is maintaining the data
pipeline through modular, well-commented code and version control.
• Key tasks involved are:
1. Data Environment: Designing and setting up the entire environment to
support the data science workflow is a prerequisite for data science
projects. It may include setting up storage in the cloud, a Kafka
platform, Hadoop and Spark clusters, etc.
2. Data Management: Automated data collection, which includes parsing
logs (depending on the stage of the company and the type of
industry you are in), web scraping, API queries, and interrogating data
streams. Determine and construct data schemas to support analytical
and modeling needs. Use tools, processes, and guidelines to ensure data is
correct, standardized, and documented.
3. Production: Involves the whole pipeline, from data access,
preprocessing, and modeling to final deployment. It is necessary to make
the system work smoothly with all existing software stacks.
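The ETL step described above can be sketched in a few lines. This is only a minimal illustration using Python's standard library: the sample data, the `orders` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for real storage.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might come from logs, an API, or a file.
RAW_CSV = """order_id,region,amount
1,north, 120.50
2,south,80
3,north,not_available
4,east,42.25
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalise fields and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # skip malformed records
        clean.append((int(row["order_id"]), row["region"].strip().lower(), amount))
    return clean

def load(records, conn):
    """Load: write the cleaned records into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for real storage
load(transform(extract(RAW_CSV)), conn)
total_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
print(total_by_region)
```

Keeping each stage in its own small function is one way to get the modular, testable pipeline code that the slides recommend.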
Analysis
• Analysis turns raw information into insights in a fast and often exploratory
way.
• In general, an analyst needs to have decent domain knowledge, do
exploratory analysis efficiently, and present the results using storytelling.
• Key points are:

1. Domain Knowledge: An understanding of the organization or industry where
you apply data science. You can’t make sense of data without context.
2. Exploratory Analysis: The team looks at as much data as possible so that
decision-makers can get a sense of what is worth pursuing further. It often
involves different ways to slice and aggregate data.
3. Storytelling: The art of telling people what the numbers signify. It
usually requires data summarization, aggregation, and visualization. It is
crucial to answer the following questions before you begin
creating a data story:
• Who is your audience?
• What do you want your audience to know or do?
• How can you use data to help make your point?
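As a rough illustration of "slicing and aggregating" in exploratory analysis, the sketch below groups a handful of records in two different ways. The `sales` records and the `aggregate` helper are invented for this example; a real project would more likely use a data-frame library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records, invented for illustration.
sales = [
    {"region": "north", "channel": "web", "revenue": 100.0},
    {"region": "north", "channel": "store", "revenue": 300.0},
    {"region": "south", "channel": "web", "revenue": 200.0},
    {"region": "south", "channel": "web", "revenue": 400.0},
]

def aggregate(rows, key, value, func):
    """Group rows by `key` and apply `func` to each group's `value` list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}

# Slice: keep only web sales, then aggregate in two different ways.
web_only = [r for r in sales if r["channel"] == "web"]
mean_by_region = aggregate(sales, "region", "revenue", mean)
web_total_by_region = aggregate(web_only, "region", "revenue", sum)

print(mean_by_region)
print(web_total_by_region)
```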
Modeling
• A process that dives deeper into the data to discover patterns we don’t
readily see.
• Modeling occupies only a small part of a typical data scientist’s day-to-day
time.
• Some of the modeling approaches are:

1. Supervised Learning: In supervised learning, each sample corresponds to a
response measurement. There are two flavors of supervised learning:
regression and classification.
• In regression, the response is a real number, such as the total net sales
in 2017 for a company or the yield of corn next year for a state.
• The goal of regression is to approximate the response measurement as
closely as possible.
• In classification, the response is a class label, such as a dichotomous
response of yes/no.
• The response can also have more than two categories, such as four
segments of customers.
2. Unsupervised Learning: In unsupervised learning, there is no response
variable. Clustering is a commonly used approach for this kind of data analysis.
3. Customized model development: A data scientist may need to develop new
models to accommodate the subtleties of the problem at hand. For example,
people may use Bayesian models to include domain knowledge as the
modeling process’s prior distribution.
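To make the idea of domain knowledge as a prior concrete, here is a small sketch of a Beta-Binomial update, one of the simplest Bayesian models. The scenario (a conversion rate believed to be around 10%) and the prior Beta(2, 18) are assumptions chosen purely for illustration.

```python
def beta_binomial_update(prior_a, prior_b, successes, trials):
    """Posterior Beta parameters after observing binomial data under a Beta prior."""
    return prior_a + successes, prior_b + (trials - successes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical prior: domain experts expect roughly a 10% rate -> Beta(2, 18).
prior_a, prior_b = 2, 18

# Hypothetical new data: 3 successes in 10 trials.
post_a, post_b = beta_binomial_update(prior_a, prior_b, successes=3, trials=10)

prior_mean = beta_mean(prior_a, prior_b)      # 2 / 20
posterior_mean = beta_mean(post_a, post_b)    # 5 / 30
print(prior_mean, posterior_mean)
```

The posterior mean lands between the prior belief (0.10) and the raw observed rate (0.30), which is how the prior tempers a small sample.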
Data Science Prerequisites
What type of problem are you solving?
Description:
• The primary analytic problem is to summarize and explore a data set with
descriptive statistics (mean, standard deviation, and so forth) and visualization
methods.
• Data description is often used to check data, find the appropriate data
preprocessing method, and demonstrate the model results.
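A minimal sketch of a description step, using Python's `statistics` module on made-up numbers. The `values` list is hypothetical and deliberately contains one suspicious reading.

```python
import statistics

# Hypothetical measurements; the last value looks like a data-entry error.
values = [12.0, 15.0, 11.0, 14.0, 13.0, 95.0]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A large gap between the mean and the median, as here, is a typical signal that the data should be checked for outliers before modeling.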
Comparison:
• The first common modeling problem is to compare different groups. Is A better
in some way than B? Or, with more groups: Is there any difference among A, B,
and C in a particular aspect?
• Commonly used statistical tests are the chi-square test, t-test, and ANOVA. There
are also Bayesian approaches.
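As one example of a group comparison, the sketch below computes Welch's t statistic (a t-test variant that does not assume equal variances) for two made-up samples. The data and the function name `welch_t` are illustrative assumptions; turning the statistic into a p-value requires the t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical measurements for groups A and B.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.4, 4.1, 4.8, 4.5]

t_stat = welch_t(group_a, group_b)
print(t_stat)  # a positive value means A's sample mean is higher than B's
```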
Clustering:
• Please note that clustering is unsupervised learning; there are no response
variables. The most common clustering algorithms include K-Means and
Hierarchical Clustering.
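The K-Means idea can be sketched on one-dimensional data: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The data, the starting centroids, and the function name `kmeans_1d` are assumptions for illustration; practical implementations handle multiple dimensions and choose starting centroids more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean (unchanged if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

Note that no response variable appears anywhere: the groups are found from the points alone.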
Classification:
• For classification problems, there are one or more label columns to define the
ground truth of classes. We use other features of the training dataset as
explanatory variables for model training. We can use the trained classifier to
predict the labels of a new observation.
• The random forest algorithm is usually used as the baseline model to set model
performance expectations.
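The train-then-predict workflow for classification can be sketched with a very small model. Note that this is not a random forest: a 1-nearest-neighbour classifier is swapped in because it fits in a few lines of standard-library Python. The training data, labels, and the function name `predict_1nn` are hypothetical.

```python
def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: dist2(train_X[i], x))
    return train_y[best]

# Hypothetical training set: two explanatory features and a yes/no label column.
train_X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9)]
train_y = ["no", "no", "yes", "yes"]

# Predict labels for new, unlabeled observations.
new_observations = [(0.9, 1.1), (3.8, 4.0)]
predictions = [predict_1nn(train_X, train_y, x) for x in new_observations]
print(predictions)
```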
Regression:
• Generally used for prediction and to answer questions such as:
• What will the temperature be tomorrow? What is the projected net income for
the next season? How much inventory should we have?
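A question like "what is the projected net income for the next season?" can be sketched as a simple linear regression. The figures below are invented and perfectly linear to keep the arithmetic easy to follow; `fit_line` is a hypothetical helper implementing ordinary least squares for a single predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical history: net income (in arbitrary units) over four past seasons.
seasons = [1, 2, 3, 4]
net_income = [10.0, 12.0, 14.0, 16.0]

intercept, slope = fit_line(seasons, net_income)
forecast = intercept + slope * 5  # projection for season 5
print(intercept, slope, forecast)
```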
Optimization:
• It is an extension of the comparison problem and can solve problems such as:
• What is the best route to deliver the packages? What is the optimal
advertisement strategy to promote a new product?
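The link between comparison and optimization can be seen in a brute-force route search: compare every possible delivery order and keep the shortest. The depot, the stop coordinates, and the function name `route_length` are assumptions for illustration. Because the number of orderings grows factorially, this approach is only workable for a handful of stops.

```python
from itertools import permutations
from math import dist

# Hypothetical depot and delivery stops as (x, y) coordinates.
depot = (0.0, 0.0)
stops = [(0.0, 3.0), (4.0, 3.0), (4.0, 0.0)]

def route_length(order):
    """Total length of the trip depot -> stops in the given order -> depot."""
    path = [depot, *order, depot]
    return sum(dist(a, b) for a, b in zip(path, path[1:]))

# Compare all candidate orderings and keep the shortest one.
best_route = min(permutations(stops), key=route_length)
best_length = route_length(best_route)
print(best_route, best_length)
```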
