
Unit 1 Full Notes


Unit 1

Foundations of Data Science

Dr. Sonakshi Vij


VSE&T
Course Outcomes:
CO1: Understand and identify the basic concepts of data science for performing data analysis.
CO2: Apply & perform pre-processing steps along with data visualization to get insights from data.
CO3: Analyze and apply different modules of data science to evaluate mathematical and scientific problems of data analysis.
CO4: Develop the model for data analysis and evaluate the model's performance to optimize business decisions and create competitive advantage with data analytics.
Books
Text Books:
1. Wes McKinney, Python for Data Analysis, First Edition, O'Reilly Media.
2. Kennedy Behrman, Foundational Python for Data Science, First Edition, Pearson Publication.
3. Bharti Motwani, Data Analytics Using Python, Wiley Publication.
Reference Books:
1. Allen Downey, Jeffrey Elkner, Chris Meyers, Learning with Python, Dreamtech Press.
2. Reema Thareja, Python Programming Using Problem Solving Approach, Oxford University Press.
What is Data?
Data Science Skillset
Data Scientist vs Other Analytics Roles
Responsibilities of a Data Scientist
Data Scientist Description
Data Scientist vs Data Engineer
Roles Related to Data Science
Data Scientist Skillset
Data Scientist: Job Description
Why Python?
Need of Python for Data Science
• Python is a multi-purpose, interpreted programming
language whose many advantages make it well suited to
streamlining work with massive and complex data sets.
• Python has a number of distinguishing characteristics
that make it the best option for data analysis.
Python: Easy to learn
• Python focuses on both simplicity and readability, while
also providing a plethora of useful options for data
analysts/scientists.
• As a result, even novices can easily use its relatively
simple syntax to create effective solutions for complex
scenarios, with just a few lines of code.
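As a small illustration (the exam scores are made up), a complete analysis task really can fit in a few readable lines of Python:

```python
# Hypothetical exam scores for a class.
scores = [72, 85, 90, 66, 78]

average = sum(scores) / len(scores)  # mean of the list
top = max(scores)                    # highest score

print(f"average={average}, top={top}")
```

Even a novice can read this top to bottom: no type declarations, no boilerplate, just the computation itself.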
Python: Flexibility
• Python’s extreme versatility is another powerful attribute
that makes it popular among data scientists and analysts.
• With it, data models can be created, data sets can be
systematized, ML-powered algorithms and web services can
be developed, and data mining can be performed to complete
various tasks in a short amount of time.
Python: Huge Library Collection
• It has many completely free libraries that are open to the public.
• That is a key factor that makes Python essential for data analysis and in
data science as well.
• Users who work in the field of data science are probably familiar with
names like Pandas, SciPy, StatsModels, and other libraries that are
widely used in the data science community.
• It’s worth noting that libraries are constantly expanding, providing
robust solutions.
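As a sketch of what these libraries buy you, here is a pandas one-liner summary over a tiny, made-up table (this assumes pandas is installed; the column names and values are hypothetical):

```python
import pandas as pd

# Toy data set (hypothetical values) to illustrate pandas' one-line summaries.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51],
    "income": [30000, 45000, 62000, 58000],
})

print(df["income"].mean())  # average of one column
print(df.describe())        # summary statistics for every numeric column
```

What would take dozens of lines of hand-written looping comes for free from the library.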
Python: Graphics and Visualization
• Visual information is well known for being much easier to
comprehend, work with, and recall.
• Python provides users with a plethora of different visualization
options. As a consequence, it is a must-have tool for all of data
science, not just data processing.
• By developing numerous charts and graphics, as well as web-ready
interactive plots, data analysts can make data more accessible.
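As one illustration (not from the slides), a minimal chart using matplotlib, one of the visualization libraries commonly paired with Python; the sales figures are made up:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 160]

fig, ax = plt.subplots()
ax.bar(months, sales)           # one bar per month
ax.set_title("Monthly sales")
ax.set_ylabel("Units sold")
fig.savefig("sales.png")        # chart written to an image file
```

Libraries such as Plotly or Bokeh extend the same idea to web-ready interactive plots.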
Python: Built-In Data Analytics Tool
• Python’s built-in analytics tools make it a perfect tool
for processing complex data.
• Python’s built-in analytics tools can also easily
detect patterns, correlate information in extensive
data sets, and provide better insights, in addition to other
critical metrics for evaluating performance.
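One concrete example of a built-in tool is the standard-library statistics module, which ships with every Python installation; the latency figures below are hypothetical:

```python
import statistics

# Hypothetical daily response times (ms) from a service log.
latencies = [120, 118, 125, 130, 122]

print(statistics.mean(latencies))    # central tendency
print(statistics.median(latencies))  # robust to outliers
print(statistics.stdev(latencies))   # spread (sample standard deviation)
```

No third-party installation is needed for this level of analysis.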
In a Nutshell
• Any company’s ability to derive information and insights from
data, make good strategic decisions, and stay competitive is
directly linked to its performance.
• Python is an internationally recognized programming language that
can assist users in managing data better for a variety of reasons.
• Python is the most popular language among data analysts and data
scientists due to its extensive set of graphical options and
visualization tools that make data more accessible.
• Furthermore, it is constantly evolving and becoming more
powerful, multi-featured, and versatile.
Data Understanding
• Data understanding (IBM Case Study) involves accessing
the data and exploring it using tables and graphics
that can be organized in IBM® SPSS® Modeler using
the CRISP-DM project tool.
• This enables you to determine the quality of the data and
describe the results of these steps in the project
documentation.
If you understand data, you'll understand the process of gathering feedback in a
better manner. Feedback helps you find out what that group of customers think
about your product.
Data Understanding to Data Preparation
Domain Knowledge
• In data science, the term domain knowledge is used to refer to the
general background knowledge of the field or environment to which
the methods of data science are being applied.
• Data science, as a discipline, can be thought of as the study of tools
used to model data, generate insights from data, and make decisions
based on data.
• These are generic tools applicable to many fields such as
engineering, law, medicine, and finance.
Domain Knowledge
1. Computer Science and Programming
• Computer science and programming refer to the
study of computational tools like programming
languages, software libraries, and other tools.
• The knowledge of programming is essential for anyone
who wishes to apply data science to problems in their
field.
Domain Knowledge
2. Statistics and Machine Learning
• Statistics and machine learning form the theoretical
foundations of data science methods and algorithms.
• An understanding of the theoretical underpinnings of data
science is required to know the limits of the methods
being applied, as well as to interpret the results of the data
science process properly.
Domain Knowledge
3. Domain Knowledge
• Domain knowledge refers to expertise in the general discipline or field to which data
science is applied.
• An expert or specialist in a field such as biotech is said to possess domain
knowledge of that industry.
• The first two items in the list above are essential skills that are required by all
practitioners of data science and are common to all applications of data science
regardless of the domain.
• On the other hand, domain knowledge is more specialized.
• The lack of domain knowledge makes it difficult to apply the right methods as well
as to judge their performance properly.
• In fact, the application of domain knowledge must be pervasive throughout the data
science process in order for it to be effective.
Data Science Process and Domain Knowledge

• The data science process can be divided into four sub-processes, as
described below.
• The following figure summarizes the data science process:
Data Science Process and Domain Knowledge
1. Problem Definition
• The first step in any data science project is defining the problem to be solved.
• It starts from a generic description of the problem and involves
defining desired performance criteria.
• Defining the problem is an easy step for a simple problem like
predicting credit default, where the problem definition is simply
predicting the probability of default based on the data on past
borrowers.
• On the other hand, consider a problem in robotics or medicine, where a
person without any domain knowledge cannot even define the pattern
they are looking for in the data.
Data Science Process and Domain Knowledge
2. Data Cleaning and Feature Engineering
• Data collected in any field is seldom clean and ready
for use.
• The process of preparing the data for the modeling process
is data cleaning and feature engineering.
• Data cleaning and feature engineering involve
transforming the data.
• Incorrectly transformed data can lead to spurious results.
Data Science Process and Domain Knowledge
2. Data Cleaning and Feature Engineering
• For example, while analyzing the relationship between, say, stock price
and financial results like cash flows, one might scale down cash flows.
• However, the scaling would introduce a look-ahead bias in the data as
the naïve scaling process will use future data to scale past data.
• It will lead to spurious results in any analysis based on incorrectly
transformed data.
• Further, domain knowledge is required in choosing the correct features
from the data, which will provide the most predictive power.
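The cash-flow scaling pitfall above can be sketched with made-up numbers: naive scaling divides every observation by the full-sample maximum, so early points are scaled using values from their future, while a point-in-time (expanding) scheme uses only data seen so far:

```python
# Hypothetical quarterly cash flows.
cash_flows = [10.0, 12.0, 8.0, 20.0, 15.0]

# Naive scaling: divides every point by the FULL-sample maximum,
# so early points are scaled using future data (look-ahead bias).
full_max = max(cash_flows)
naive = [x / full_max for x in cash_flows]

# Point-in-time scaling: each point uses only the maximum observed SO FAR.
expanding = [x / max(cash_flows[: i + 1]) for i, x in enumerate(cash_flows)]

print(naive)      # the first value already "knows" the future peak of 20
print(expanding)  # the first value is scaled only by data available at the time
```

Only domain knowledge of how the data arrives over time flags the first version as biased.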
Data Science Process and Domain Knowledge

3. Model Building
• The model-building step involves fitting a model to data.
• The model built here is used to solve the problem defined
in the first step.
• The choice of an appropriate model is essential to the
success of the data science process.
• Again, this choice depends on the field of application and
is enhanced by strong domain knowledge.
Data Science Process and Domain Knowledge
4. Performance Measurement
• Performance measurement is the final step in the data science process; it
involves measuring how the model performs on new, out-of-sample data
that was not used while building the model.
• The choice of performance metrics and thresholds is primarily driven by
domain knowledge.
• For example, when building a model to predict credit defaults, a false negative
(predicting a potential defaulter to be in good credit) is costlier than a false
positive (predicting a non-defaulter to be a defaulter).
• Such asymmetries will be different across disciplines, and it would be hard to
detect them without domain knowledge.
• Further, the costs of model failure can only be accurately
estimated by a person with domain knowledge.
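The cost asymmetry can be made concrete with assumed (entirely hypothetical) per-error costs; note that the "better" model is not the one with fewer mistakes:

```python
# Hypothetical costs: a missed defaulter (false negative) costs far more
# than wrongly flagging a good customer (false positive).
COST_FN = 1000.0   # assumed loss from lending to an undetected defaulter
COST_FP = 50.0     # assumed loss from turning away a good customer

def total_cost(false_negatives, false_positives):
    return false_negatives * COST_FN + false_positives * COST_FP

# Model A misses 10 defaulters; Model B misses 2 but flags 80 extra good customers.
print(total_cost(10, 0))   # cost of model A
print(total_cost(2, 80))   # cost of model B: cheaper despite more errors overall
```

Setting COST_FN and COST_FP sensibly is exactly the kind of judgment only a domain expert can make.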
Beyond the Curriculum: Case Study for Domain Knowledge
Domain Knowledge: Case Study: Beyond the Curriculum

Case Study: Predicting Credit Card Delinquency


• Predicting credit card delinquency is a common problem in
consumer finance, where a credit card provider must decide whether
to issue credit cards to a particular customer. It also helps the
provider make risk assessments and strategic decisions.
• The data consists of about 100,000 individual customers with data on
10 attributes, including one indicating whether the customer was
delinquent.
Domain Knowledge: Case Study: Beyond the Curriculum

Step 1: Problem Definition


• In this case, the problem is easy to define: predict the
value of the delinquency indicator.
Step 2: Data Cleaning and Feature Engineering
• Data cleaning and feature engineering form an important
part of the process in our case.
• The reason behind this is that the data is imbalanced,
meaning that it does not have an equal representation of
delinquents and non-delinquents.
Domain Knowledge: Case Study: Beyond the Curriculum

Step 2: Data Cleaning and Feature Engineering


• In fact, the data has 93% non-delinquents, which is expected in the real
world as most people do not default on their credit card debt.
• This imbalance can affect the choice of model and performance metric
used.
• It will also affect the quality of the model.
• A domain expert in credit risk would be aware that such an imbalance
would create the need for methods to tackle the issue.
• A simple solution would be to resample the data, which creates a balanced
but smaller dataset.
• This trade-off between the quality and quantity of data requires domain
knowledge to identify and adjust.
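The simple resampling fix mentioned above can be sketched as undersampling the majority class (labels here are synthetic, mimicking the 93%/7% split in the case study):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical imbalanced labels: 93% non-delinquent (0), 7% delinquent (1).
labels = [0] * 930 + [1] * 70

majority = [y for y in labels if y == 0]
minority = [y for y in labels if y == 1]

# Undersample the majority class down to the minority-class size:
# the result is balanced, but much smaller (the quality/quantity trade-off).
balanced = random.sample(majority, len(minority)) + minority

print(len(labels), len(balanced))  # 1000 rows shrink to 140
```

In practice each label would carry its feature row along with it; the shrinkage from 1,000 to 140 rows is the trade-off the slide describes.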
Domain Knowledge: Case Study: Beyond the Curriculum

Step 3: Model Building


• When building a model for predicting credit delinquency,
a domain expert would be familiar with past approaches to
the problem.
• In this case, the literature includes extensive use of
logistic regression.
• Thus, it can provide a good starting point for model
selection and a baseline to benchmark new approaches.
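A minimal logistic-regression baseline might look like the sketch below, assuming scikit-learn is available; the features, labels, and the rule generating them are entirely made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical features: [credit utilization, missed-payment rate], both in [0, 1].
X = rng.random((200, 2))
# Made-up labeling rule: high utilization plus missed payments -> delinquent (1).
y = ((X[:, 0] + X[:, 1]) > 1.2).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]   # predicted probability of delinquency
preds = model.predict(X)

print(f"training accuracy: {(preds == y).mean():.2f}")
```

A new, more complex approach would then have to beat this baseline to justify itself.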
Domain Knowledge: Case Study: Beyond the Curriculum

Step 4: Performance Measurement


• Selecting good performance measurement criteria is a critical step in
developing a credit risk model, as an incorrect metric can lead to a
wrong model being selected.
• In our case, where the data is imbalanced, a naïve model that labels all
customers as non-delinquents will have very high accuracy, but in the
process, it may label all defaulters as non-delinquents as well.
• Such a problem requires a model that predicts most of the defaulters
correctly while misidentifying only a few non-delinquents.
Domain Knowledge: Case Study: Beyond the Curriculum

Step 4: Performance Measurement


• This requirement should be reflected in the performance metric.
• Further, in this case, mislabeling a delinquent customer is
costlier than mislabeling a non-delinquent, and such a
trade-off requires a proper understanding of the costs and
risks involved.
• It cannot be a completely objective decision based on a
single metric; subjective judgment based on domain
knowledge is required.
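The accuracy trap on imbalanced data can be sketched in a few lines (synthetic labels mimicking the 93%/7% split): a model that predicts "non-delinquent" for everyone scores 93% accuracy while catching zero defaulters:

```python
# Hypothetical test set: 93 non-delinquents (0) and 7 delinquents (1).
y_true = [0] * 93 + [1] * 7

# A naive model that labels every customer "non-delinquent".
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the delinquent class: fraction of actual defaulters caught.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # high accuracy, zero recall
```

This is why metrics such as recall, precision, or cost-weighted error, chosen with domain knowledge, matter more here than raw accuracy.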
Structured and Unstructured Data
