
1. What is Data Science?

Data science is defined as the field of study of the scientific methods, algorithms, tools, and processes that extract useful insights from vast amounts of data.

It also enables data scientists to discover hidden patterns in raw data. This allows us to deal with Big Data, including its extraction, organisation, preparation, and analysis.

The data can be either structured or unstructured.

Data science process:

The data science process involves the following steps:

1. Define the problem statement:

The first step in the data science lifecycle is to define the problem that needs to be solved. This involves clearly articulating the business objective and understanding the key requirements and constraints.
Effective problem definition requires a systematic approach. Data scientists can employ techniques such as:

 Stakeholder interviews: Engaging with key stakeholders to understand their requirements, expectations, and pain points.
 Problem framing: Breaking down the overarching problem into smaller, more manageable sub-problems.
 Defining success criteria: Establishing clear and measurable criteria for evaluating the success of the data science project.
 Setting priorities: Identifying the most critical aspects of the problem that need to be addressed first.
 Documenting requirements: Documenting the problem statement, goals, and constraints to ensure that all team members are aligned.

2. Data collection:
Once the problem has been defined, the next step is to collect and
prepare the relevant data for analysis.
This involves identifying the data sources, acquiring the data,
and transforming it into a format suitable for analysis.
Data scientists can collect data from various sources, including
internal databases, external APIs, web scraping, and surveys.
During the data collection process, it is essential to ensure the
privacy and security of the data, especially when dealing with
sensitive or personally identifiable information.
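
As a brief illustration, here is a minimal sketch of collecting data from an external API in Python. The URL, parameters, and field names are hypothetical placeholders, not a real service:

import requests
import pandas as pd

# Hypothetical endpoint; replace with a real data source.
API_URL = "https://api.example.com/v1/sales"

response = requests.get(API_URL, params={"region": "south", "year": 2024}, timeout=30)
response.raise_for_status()  # fail early if the request did not succeed

# Assume the API returns a JSON list of records; load it into a DataFrame.
records = response.json()
df = pd.DataFrame(records)
df.to_csv("raw_sales.csv", index=False)  # persist the raw data before any cleaning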

3. Data Cleaning / Preparing Data for Analysis:

Before diving into the analysis, data scientists need to prepare the data by cleaning, transforming, and restructuring it. This involves tasks such as the following (a small pandas sketch follows the list):

 Data cleaning: Removing outliers, handling missing values, and resolving inconsistencies.
 Data integration: Combining data from different sources and resolving any discrepancies or conflicts.
 Feature engineering: Creating new features that capture relevant information and improve the performance of machine learning models.
 Data reduction: Reducing the dimensionality of the data to focus on the most informative variables.
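
A minimal pandas sketch of the cleaning and feature engineering steps above, assuming a hypothetical sales dataset with "price" and "units_sold" columns:

import pandas as pd

df = pd.read_csv("raw_sales.csv")  # hypothetical raw data from the collection step

# Data cleaning: fill missing numeric values with the median,
# and drop rows that are missing the quantity entirely.
df["price"] = df["price"].fillna(df["price"].median())
df = df.dropna(subset=["units_sold"])

# Outlier removal: keep prices within 3 standard deviations of the mean.
mean, std = df["price"].mean(), df["price"].std()
df = df[(df["price"] - mean).abs() <= 3 * std]

# Feature engineering: derive a new feature from existing columns.
df["revenue"] = df["price"] * df["units_sold"]

df.to_csv("clean_sales.csv", index=False)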
4. Data Exploration / Analysing the Data:

 Once the data has been collected and prepared, the next step is to explore and analyse the data. This involves applying statistical techniques and data visualisation to gain insights and identify patterns and relationships.

The significance of data exploration

 Data exploration is a crucial step in the data science lifecycle, as it allows data scientists to understand the characteristics and quirks of the data.
 Through data exploration, they can uncover hidden insights, identify outliers or anomalies, and validate assumptions.
 Data exploration also helps data scientists identify potential data quality issues or biases that may influence the analysis.
 By visualising the data and conducting exploratory analyses, they can gain a holistic understanding of the dataset and make informed decisions about subsequent analyses.
Methods for thorough data analysis

Data scientists employ various methods and techniques to analyse data effectively. These methods include the following (a short sketch of the first and third follows the list):

 Descriptive statistics: Calculating summary statistics, such as mean, median, and standard deviation, to summarise the data.
 Statistical modelling: Applying statistical models, such as regression or time series analysis, to uncover relationships and make predictions.
 Data visualisation: Creating charts, graphs, and interactive visualisations to present the data in a meaningful and engaging way.
 Machine learning: Using machine learning algorithms to identify patterns, classify data, or make predictions.
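
A minimal sketch of descriptive statistics and a simple visualisation with pandas and matplotlib, assuming the hypothetical cleaned dataset from the previous step:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clean_sales.csv")  # hypothetical cleaned dataset

# Descriptive statistics: mean, median, and standard deviation of one column,
# then a full numeric summary of the whole dataset.
print(df["price"].mean(), df["price"].median(), df["price"].std())
print(df.describe())

# Data visualisation: a histogram to inspect the distribution of prices.
df["price"].plot(kind="hist", bins=30, title="Price distribution")
plt.xlabel("price")
plt.show()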

5. Model Building and Evaluation:

 In the model-building and evaluation stage, data scientists develop and refine predictive models based on the insights gained from the previous stages.

Building a data model: what you need to know

 Building a data model entails selecting a suitable algorithm or technique that aligns with the problem and the characteristics of the data.
 Data scientists can choose from a wide range of models, including linear regression, decision trees, neural networks, and support vector machines.

Evaluating your data model’s performance

 To evaluate the performance of a data model, data scientists employ various evaluation metrics, such as accuracy, precision, recall, and F1 score (see the sketch after this list).
 These metrics quantify the model’s predictive accuracy and allow for the comparison of different models or approaches.
 Data scientists should also perform a thorough analysis of the model’s strengths and weaknesses.
 This includes assessing potential biases or errors, determining the model’s interpretability, and identifying areas for improvement.
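
A minimal scikit-learn sketch of building and evaluating a classifier. It uses a bundled toy dataset purely for illustration; a real project would use its own prepared data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load a toy dataset and hold out a test set for honest evaluation.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple model; in practice several candidates would be compared.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Evaluate with the metrics named above.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))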

6. Model Deployment:

After successfully building and evaluating the data model, the next crucial phase in the data science lifecycle is deployment and maintenance.

Deployment strategies

Deploying a data model requires careful planning to minimise disruptions and ensure its practical utility. Common deployment strategies include the following (a minimal real-time serving sketch follows the list):

 Batch Processing: Running the model periodically to analyse large volumes of data in batches, suitable for scenarios with less urgency.
 Real-time Processing: Enabling the model to process data in real time, providing instantaneous insights and predictions, ideal for applications requiring quick responses.
 Cloud Deployment: Leveraging cloud platforms for deployment, offering scalability, flexibility, and accessibility, and facilitating easier updates and maintenance.
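
A minimal sketch of the real-time strategy, exposing a trained model over HTTP with Flask. Flask is one common choice; the model file name and the request format here are assumptions for illustration:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load a previously trained and pickled model (hypothetical file name).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = request.get_json()["features"]
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(port=5000)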

Once deployed, continuous monitoring and maintenance are essential to sustain the model’s performance.

Key considerations include:

 Performance Monitoring: Regularly assessing the model’s accuracy and responsiveness to ensure it aligns with the expected outcomes.
 Data Drift Detection: Monitoring changes in the input data distribution to identify potential shifts that might impact the model’s performance (a small sketch follows the list).
 Updating Models: Periodically updating the model to incorporate new data, adapt to changing patterns, and improve predictive capabilities.
 Security Measures: Implementing robust security measures to protect the model and data, especially when dealing with sensitive information.
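
As one illustration of data drift detection, a two-sample Kolmogorov–Smirnov test can compare the training distribution of a feature against recent production data. The file and column names here are hypothetical:

import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_csv("train_features.csv")    # data the model was trained on
recent = pd.read_csv("recent_features.csv")  # newly observed production data

# Compare the distribution of one feature between training and production.
statistic, p_value = ks_2samp(train["price"], recent["price"])

# A small p-value suggests the two samples come from different distributions,
# i.e. the feature may have drifted and the model may need retraining.
if p_value < 0.05:
    print(f"Possible drift detected in 'price' (p = {p_value:.4f})")
else:
    print("No significant drift detected")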

Data Science Profile:

A data scientist's profile includes their responsibilities, skills, and education.
Data scientists collect, analyze, and interpret large amounts of data to help businesses find patterns and solve problems.
They use statistical methods, machine learning, and other tools to extract meaning from data.
They also present their results in a clear way and communicate with company leaders.

Skills:

Data scientists need technical, analytical, and communication skills. They also need to be persistent and to have software engineering skills.

1. Data Analyst:

Data analysts are the individuals responsible for reviewing data to identify key information about the business and its customers.
Data analysis is the process of collecting, processing, and analysing data to extract meaningful insights; data analysts also support the decision-making process.
2. Data Scientist:

Data scientists are the individuals who discover data sources and analyse the information for trends and patterns.
Based on those patterns and trends, data scientists generate predictive models and build machine learning algorithms.

3. Data Engineer:

Data engineers are the experts responsible for designing, maintaining, and optimizing the data infrastructure used for data management and transformation.
Data engineers are in charge of creating pipelines that convert raw data into valuable formats for data scientists to use.

4. Business Analyst:

Business analysts are the people who help a business organization fulfil its goals; they assess the organization, analyze its data, and improve its systems and processes for the future.
They are experts in forecasting, budgeting, and allocating resources in the business.
5. Data Architect:

Data architects are the IT professionals who use their computer science and design skills to analyze and review the data infrastructure of a business, plan the databases that will be needed in the future, and implement useful solutions.

Big Data and its characteristics:

Big Data refers to amounts of data so large that they cannot be processed by traditional data storage or processing units.
It is used by many multinational companies to process data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.

Types of Big Data:

Structured data is data that has a standardized format and is organized into tables with rows and columns, while unstructured data is data that doesn't fit into a structured format.

 Structured data

This data is organized and easy to search because it has a fixed record format. It's usually stored in data warehouses and is often in the form of numbers and text.
Structured data is typically tabular, with rows and columns that clearly define data attributes.
Examples of structured data include customer contact information, such as first name, last name, and phone number.
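
A tiny sketch of structured data represented as a table in pandas; the records are invented for illustration:

import pandas as pd

# Structured data: fixed columns (attributes), one row per record.
customers = pd.DataFrame({
    "first_name": ["Asha", "Ravi"],
    "last_name": ["Patel", "Kumar"],
    "phone": ["98450 12345", "99860 67890"],
})
print(customers)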

 Unstructured data

This data doesn't fit neatly into a data table because of its size or nature. It's often stored in its native format and can be human or machine generated.
Unstructured data can include multimedia files, emails, text messages, mobile activity, social media posts, satellite imagery, and more.
Unstructured data is usually stored in data lakes, which are repositories that store data in its original format or after a basic cleaning process.

 Semi-structured data

Semi-structured data is a type of data that is not purely structured, but also not completely unstructured.
It contains some level of organization or structure, but it does not conform to a rigid schema or data model, and it may contain elements that are not easily categorized or classified. It lacks a fixed or rigid schema.
Semi-structured data is typically characterized by the use of metadata or tags that provide additional information about the data elements.
For example, an XML document might contain tags that indicate the structure of the document.
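
A short sketch of reading such a semi-structured XML snippet with Python's standard library; the document content is invented for illustration:

import xml.etree.ElementTree as ET

# A small semi-structured document: tags give it some structure,
# but records need not share a rigid tabular schema.
xml_doc = """
<customers>
  <customer id="1">
    <name>Asha Patel</name>
    <note>Prefers email contact</note>
  </customer>
  <customer id="2">
    <name>Ravi Kumar</name>
  </customer>
</customers>
"""

root = ET.fromstring(xml_doc)
for customer in root.findall("customer"):
    name = customer.findtext("name")
    note = customer.findtext("note", default="(no note)")  # optional element
    print(customer.get("id"), name, note)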
There are five V's of Big Data that explain its characteristics.

5 V's of Big Data


o Volume:
The name ‘Big Data’ itself relates to a size that is enormous.
Volume refers to the huge amount of data. To determine the value of data, the size of the data plays a very crucial role.
If the volume of data is very large, then it is actually considered ‘Big Data’.

o Velocity:
Velocity refers to the high speed of accumulation of data.
In Big Data, velocity refers to the data flowing in from sources like machines, networks, social media, mobile phones, etc.
There is a massive and continuous flow of data. This determines the potential of the data: how fast the data is generated and processed to meet demands.

o Variety:
It refers to the nature of the data: structured, semi-structured, and unstructured.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources, both inside and outside an enterprise. It can be structured, semi-structured, or unstructured.

o Veracity:
It refers to inconsistencies and uncertainty in data; the data that is available can sometimes get messy, and its quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Example: Data in bulk could create confusion, whereas too little data could convey only half or incomplete information.
o Value:
Value is an essential characteristic of Big Data. It is not just any data that we process or store; it is valuable and reliable data that we store, process, and analyse.
Sources of Data:

What are the different sources of data? The following are the two sources of data:

1. Internal sources
 When data is collected from the reports and records of the organisation itself, the sources are known as internal sources.
 For example, a company publishes its annual report on profit and loss, total sales, loans, wages, etc.
2. External sources
 When data is collected from sources outside the organisation, the sources are known as external sources.
 For example, if a tour and travel company obtains information on Karnataka tourism from the Karnataka Transport Corporation, that would be known as an external source of data.
Types of Data

A) Primary data
 Primary data means first-hand information collected by an investigator.
 It is collected for the first time.
 It is original and more reliable.
 For example, the population census conducted by the Government of India every ten years is primary data.

Data that is raw, original, and extracted directly from official sources is known as primary data.
This type of data is collected directly through techniques such as questionnaires, interviews, and surveys.
The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden during data processing.
A few methods of collecting primary data:

1. Interview method:

In this method, data is collected by interviewing the target audience; the person conducting the interview is called the interviewer, and the person answering is known as the interviewee.
Some basic business- or product-related questions are asked and noted down in the form of notes, audio, or video, and this data is stored for processing.
Interviews can be both structured and unstructured, such as personal or formal interviews conducted by telephone, face to face, email, etc.

2. Survey method:

The survey method is the process of research where a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video.
Surveys can be conducted in both online and offline modes, such as through website forms and email. The survey answers are then stored for analysis.
Examples are online surveys or surveys through social media polls.

3. Observation method:

The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using some data-collecting tool and stores the observed data in the form of text, audio, video, or other raw formats.
In this method, the data is collected directly through observation rather than by posing questions to the participants.
For example, observing a group of customers and their behavior towards certain products.
4. Experimental method:

The experimental method is the process of collecting data by performing experiments, research, and investigation.

B) Secondary data
 Secondary data refers to second-hand information.
 It is not originally collected; rather, it is obtained from already published or unpublished sources.
 For example, the address of a person taken from a telephone directory or the phone number of a company taken from Just Dial is secondary data.

Secondary data is data that has already been collected and is reused again for some valid purpose.
This type of data is previously recorded from primary data, and it has two types of sources: internal sources and external sources.

Other sources:

1. Sensor data: With the advancement of IoT devices, the sensors of these devices collect data that can be used for sensor data analytics to track the performance and usage of products.

2. Satellite data: Satellites collect terabytes of images and data on a daily basis through surveillance cameras, which can be used to gather useful information.

3. Web traffic: Thanks to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the keywords and queries searched most often.
