
1. What is Data Science?

Data science is defined as the field of study of the scientific methods, algorithms, tools, and processes that extract useful insights from vast amounts of data.

It also enables data scientists to discover hidden patterns in raw data. This allows us to deal with Big Data, including its extraction, organisation, preparation, and analysis.

The data can be either structured or unstructured.

Data science process:

The data science process involves the following steps:

1. Define the problem statement:

The first step in the data science lifecycle is to define the problem that needs to be solved. This involves clearly articulating the business objective and understanding the key requirements and constraints.
Effective problem definition requires a systematic approach. Data scientists can employ techniques such as:

 Stakeholder interviews: Engaging with key stakeholders to understand their requirements, expectations, and pain points.
 Problem framing: Breaking down the overarching problem into smaller, more manageable sub-problems.
 Defining success criteria: Establishing clear and measurable criteria for evaluating the success of the data science project.
 Setting priorities: Identifying the most critical aspects of the problem that need to be addressed first.
 Documenting requirements: Documenting the problem statement, goals, and constraints to ensure that all team members are aligned.

2. Data collection:
Once the problem has been defined, the next step is to collect and
prepare the relevant data for analysis.
This involves identifying the data sources, acquiring the data,
and transforming it into a format suitable for analysis.
Data scientists can collect data from various sources, including
internal databases, external APIs, web scraping, and surveys.
During the data collection process, it is essential to ensure the
privacy and security of the data, especially when dealing with
sensitive or personally identifiable information.
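
As a brief illustration, here is a minimal sketch of collecting data from an external API in Python. The URL, parameters, and field names are hypothetical placeholders, not a real service:

import requests
import pandas as pd

# Hypothetical endpoint; replace with a real data source.
API_URL = "https://api.example.com/v1/sales"

response = requests.get(API_URL, params={"region": "south", "year": 2024}, timeout=30)
response.raise_for_status()  # fail early if the request did not succeed

# Assume the API returns a JSON list of records; load it into a DataFrame.
records = response.json()
df = pd.DataFrame(records)
df.to_csv("raw_sales.csv", index=False)  # persist the raw data before any cleaning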

3. Data Cleaning / Preparing Data for Analysis:

Before diving into the analysis, data scientists need to prepare the data by cleaning, transforming, and restructuring it. This involves tasks such as the following (a small pandas sketch follows the list):

 Data cleaning: Removing outliers, handling missing values, and resolving inconsistencies.
 Data integration: Combining data from different sources and resolving any discrepancies or conflicts.
 Feature engineering: Creating new features that capture relevant information and improve the performance of machine learning models.
 Data reduction: Reducing the dimensionality of the data to focus on the most informative variables.
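
A minimal pandas sketch of the cleaning and feature engineering steps above, assuming a hypothetical sales dataset with "price" and "units_sold" columns:

import pandas as pd

df = pd.read_csv("raw_sales.csv")  # hypothetical raw data from the collection step

# Data cleaning: fill missing numeric values with the median,
# and drop rows that are missing the quantity entirely.
df["price"] = df["price"].fillna(df["price"].median())
df = df.dropna(subset=["units_sold"])

# Outlier removal: keep prices within 3 standard deviations of the mean.
mean, std = df["price"].mean(), df["price"].std()
df = df[(df["price"] - mean).abs() <= 3 * std]

# Feature engineering: derive a new feature from existing columns.
df["revenue"] = df["price"] * df["units_sold"]

df.to_csv("clean_sales.csv", index=False)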
4. Data Exploration / Analysing the Data:

 Once the data has been collected and prepared, the next step is to explore and analyse the data. This involves applying statistical techniques and data visualisation to gain insights and identify patterns and relationships.

The significance of data exploration

 Data exploration is a crucial step in the data science lifecycle, as it allows data scientists to understand the characteristics and quirks of the data.
 Through data exploration, they can uncover hidden insights, identify outliers or anomalies, and validate assumptions.
 Data exploration also helps data scientists identify potential data quality issues or biases that may influence the analysis.
 By visualising the data and conducting exploratory analyses, they can gain a holistic understanding of the dataset and make informed decisions about subsequent analyses.
Methods for thorough data analysis

Data scientists employ various methods and techniques to analyse data effectively. These methods include the following (a short sketch of the first and third follows the list):

 Descriptive statistics: Calculating summary statistics, such as mean, median, and standard deviation, to summarise the data.
 Statistical modelling: Applying statistical models, such as regression or time series analysis, to uncover relationships and make predictions.
 Data visualisation: Creating charts, graphs, and interactive visualisations to present the data in a meaningful and engaging way.
 Machine learning: Using machine learning algorithms to identify patterns, classify data, or make predictions.
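
A minimal sketch of descriptive statistics and a simple visualisation with pandas and matplotlib, assuming the hypothetical cleaned dataset from the previous step:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clean_sales.csv")  # hypothetical cleaned dataset

# Descriptive statistics: mean, median, and standard deviation of one column,
# then a full numeric summary of the whole dataset.
print(df["price"].mean(), df["price"].median(), df["price"].std())
print(df.describe())

# Data visualisation: a histogram to inspect the distribution of prices.
df["price"].plot(kind="hist", bins=30, title="Price distribution")
plt.xlabel("price")
plt.show()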

5. Model Building and Evaluation:

 In the model-building and evaluation stage, data scientists develop and refine predictive models based on the insights gained from the previous stages.

Building a data model: what you need to know

 Building a data model entails selecting a suitable algorithm or technique that aligns with the problem and the characteristics of the data.
 Data scientists can choose from a wide range of models, including linear regression, decision trees, neural networks, and support vector machines.

Evaluating your data model’s performance

 To evaluate the performance of a data model, data scientists employ various evaluation metrics, such as accuracy, precision, recall, and F1 score (see the sketch after this list).
 These metrics quantify the model’s predictive accuracy and allow for the comparison of different models or approaches.
 Data scientists should also perform a thorough analysis of the model’s strengths and weaknesses.
 This includes assessing potential biases or errors, determining the model’s interpretability, and identifying areas for improvement.
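
A minimal scikit-learn sketch of building and evaluating a classifier. It uses a bundled toy dataset purely for illustration; a real project would use its own prepared data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load a toy dataset and hold out a test set for honest evaluation.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple model; in practice several candidates would be compared.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Evaluate with the metrics named above.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))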

6. Model Deployment:

After successfully building and evaluating the data model, the next crucial phase in the data science lifecycle is deployment and maintenance.

Deployment strategies

Deploying a data model requires careful planning to minimise disruptions and ensure its practical utility. Common deployment strategies include the following (a minimal real-time serving sketch follows the list):

 Batch Processing: Running the model periodically to analyse large volumes of data in batches, suitable for scenarios with less urgency.
 Real-time Processing: Enabling the model to process data in real time, providing instantaneous insights and predictions, ideal for applications requiring quick responses.
 Cloud Deployment: Leveraging cloud platforms for deployment, offering scalability, flexibility, and accessibility, and facilitating easier updates and maintenance.
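
A minimal sketch of the real-time strategy, exposing a trained model over HTTP with Flask. Flask is one common choice; the model file name and the request format here are assumptions for illustration:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load a previously trained and pickled model (hypothetical file name).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = request.get_json()["features"]
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(port=5000)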

Once deployed, continuous monitoring and maintenance are essential to sustain the model’s performance.

Key considerations include:

 Performance Monitoring: Regularly assessing the model’s accuracy and responsiveness to ensure it aligns with the expected outcomes.
 Data Drift Detection: Monitoring changes in the input data distribution to identify potential shifts that might impact the model’s performance (a small sketch follows the list).
 Updating Models: Periodically updating the model to incorporate new data, adapt to changing patterns, and improve predictive capabilities.
 Security Measures: Implementing robust security measures to protect the model and data, especially when dealing with sensitive information.
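
As one illustration of data drift detection, a two-sample Kolmogorov–Smirnov test can compare the training distribution of a feature against recent production data. The file and column names here are hypothetical:

import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("train_features.csv")    # data the model was trained on
recent = pd.read_csv("recent_features.csv")  # newly observed production data

# Compare the distribution of one feature between training and production.
statistic, p_value = ks_2samp(train["price"], recent["price"])

# A small p-value suggests the two samples come from different distributions,
# i.e. the feature may have drifted and the model may need retraining.
if p_value < 0.05:
    print(f"Possible drift detected in 'price' (p = {p_value:.4f})")
else:
    print("No significant drift detected")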

Data Science Profile:

A data scientist's profile includes their responsibilities, skills, and education.
Data scientists collect, analyze, and interpret large amounts of data to help businesses find patterns and solve problems.
They use statistical methods, machine learning, and other tools to extract meaning from data.
They also present their results in a clear way and communicate with company leaders.

Skills:

Data scientists need technical, analytical, and communication skills. They also need to be persistent and to have software engineering skills.

1. Data Analyst:

Data analysts are the individuals responsible for reviewing data to identify key information about the business and its customers.
Data analysis is the process of collecting, processing, and analysing data to extract meaningful insights; data analysts also support the decision-making process.
2. Data Scientist:

Data scientists are the individuals who discover data sources and analyse the information for trends and patterns.
Based on those patterns and trends, data scientists generate predictive models and build machine learning algorithms.

3. Data Engineer:

Data engineers are the experts responsible for designing, maintaining, and optimizing the data infrastructure used for data management and transformation.
Data engineers are in charge of creating pipelines that convert raw data into valuable formats for data scientists to use.

4. Business Analyst:

Business analysts are the people who help a business organization fulfil its goals; they assess the organization, analyze its data, and improve its systems and processes for the future.
They are experts in forecasting, budgeting, and allocating resources in the business.
5. Data Architect:

Data architects are the IT professionals who use their computer science and design skills to analyze and review the data infrastructure of a business, plan the databases that will be needed in the future, and implement useful solutions.

Big Data and its characteristics:

Big Data refers to amounts of data so large that they cannot be processed by traditional data storage or processing units.
It is used by many multinational companies to process data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.

Types of Big Data:

Structured data is data that has a standardized format and is organized into tables with rows and columns, while unstructured data is data that doesn't fit into a structured format.

 Structured data

This data is organized and easy to search because it has a fixed record format. It's usually stored in data warehouses and is often in the form of numbers and text.
Structured data is typically tabular, with rows and columns that clearly define data attributes.
Examples of structured data include customer contact information, such as first name, last name, and phone number.
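
A tiny sketch of structured data represented as a table in pandas; the records are invented for illustration:

import pandas as pd

# Structured data: fixed columns (attributes), one row per record.
customers = pd.DataFrame({
    "first_name": ["Asha", "Ravi"],
    "last_name": ["Patel", "Kumar"],
    "phone": ["98450 12345", "99860 67890"],
})
print(customers)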

 Unstructured data

This data doesn't fit neatly into a data table because of its size or nature. It's often stored in its native format and can be human or machine generated.
Unstructured data can include multimedia files, emails, text messages, mobile activity, social media posts, satellite imagery, and more.
Unstructured data is usually stored in data lakes, which are repositories that store data in its original format or after a basic cleaning process.

 Semi-structured data

Semi-structured data is a type of data that is not purely structured, but also not completely unstructured.
It contains some level of organization or structure, but it does not conform to a rigid schema or data model, and it may contain elements that are not easily categorized or classified. It lacks a fixed or rigid schema.
Semi-structured data is typically characterized by the use of metadata or tags that provide additional information about the data elements.
For example, an XML document might contain tags that indicate the structure of the document.
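
A short sketch of reading such a semi-structured XML snippet with Python's standard library; the document content is invented for illustration:

import xml.etree.ElementTree as ET

# A small semi-structured document: tags give it some structure,
# but records need not share a rigid tabular schema.
xml_doc = """
<customers>
  <customer id="1">
    <name>Asha Patel</name>
    <note>Prefers email contact</note>
  </customer>
  <customer id="2">
    <name>Ravi Kumar</name>
  </customer>
</customers>
"""

root = ET.fromstring(xml_doc)
for customer in root.findall("customer"):
    name = customer.findtext("name")
    note = customer.findtext("note", default="(no note)")  # optional element
    print(customer.get("id"), name, note)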
There are five V's of Big Data that explain its characteristics.

5 V's of Big Data


o Volume:
The name ‘Big Data’ itself relates to a size that is enormous.
Volume refers to the huge amount of data. To determine the value of data, the size of the data plays a very crucial role.
If the volume of data is very large, then it is actually considered ‘Big Data’.

o Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data, velocity refers to the data flowing in from sources like machines, networks, social media, mobile phones, etc.
There is a massive and continuous flow of data. This determines the potential of the data: how fast the data is generated and processed to meet demands.

o Variety:
It refers to the nature of the data: structured, semi-structured, and unstructured.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources, both inside and outside an enterprise. It can be structured, semi-structured, or unstructured.

o Veracity:
It refers to inconsistencies and uncertainty in data; the data that is available can sometimes get messy, and its quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Example: Data in bulk could create confusion, whereas too little data could convey only half or incomplete information.
o Value:
Value is an essential characteristic of Big Data. It is not just any data that we process or store; it is valuable and reliable data that we store, process, and analyse.
Sources of Data:

What are the different sources of data? The following are the two sources of data:

1. Internal sources
 When data is collected from the reports and records of the organisation itself, the sources are known as internal sources.
 For example, a company publishes its annual report on profit and loss, total sales, loans, wages, etc.
2. External sources
 When data is collected from sources outside the organisation, the sources are known as external sources.
 For example, if a tour and travel company obtains information on Karnataka tourism from the Karnataka Transport Corporation, that would be known as an external source of data.
Types of Data

A) Primary data
 Primary data means first-hand information collected by an investigator.
 It is collected for the first time.
 It is original and more reliable.
 For example, the population census conducted by the Government of India every ten years is primary data.

Data that is raw, original, and extracted directly from official sources is known as primary data.
This type of data is collected directly through techniques such as questionnaires, interviews, and surveys.
The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden during data processing.
A few methods of collecting primary data:

1. Interview method:

In this method, data is collected by interviewing the target audience; the person conducting the interview is called the interviewer, and the person answering is known as the interviewee.
Some basic business- or product-related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing.
Interviews can be both structured and unstructured, such as personal or formal interviews conducted by telephone, face to face, email, etc.

2. Survey method:

The survey method is the process of research where a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video.
Surveys can be conducted in both online and offline modes, such as through website forms and email. The survey answers are then stored for analysis.
Examples are online surveys or surveys through social media polls.

3. Observation method:

The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using some data-collecting tool and stores the observed data in the form of text, audio, video, or other raw formats.
In this method, the data is collected directly through observation rather than by posing questions to the participants.
For example, observing a group of customers and their behavior towards certain products.
4. Experimental method:

The experimental method is the process of collecting data by performing experiments, research, and investigation.

B) Secondary data
 Secondary data refers to second-hand information.
 It is not originally collected; rather, it is obtained from already published or unpublished sources.
 For example, the address of a person taken from a telephone directory or the phone number of a company taken from Just Dial is secondary data.

Secondary data is data that has already been collected and is reused again for some valid purpose.
This type of data is previously recorded from primary data, and it has two types of sources: internal sources and external sources.

Other sources:

1. Sensor data: With the advancement of IoT devices, the sensors of these devices collect data that can be used for sensor data analytics to track the performance and usage of products.

2. Satellite data: Satellites collect terabytes of images and data on a daily basis through surveillance cameras, which can be used to gather useful information.

3. Web traffic: Thanks to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the keywords and queries searched most often.
