Data Science Process and Machine Learning

Uploaded by

cs235214205

Data Science Process

The data science process consists of six stages:

1. Discovery or Setting the research goal

2. Retrieving data

3. Data preparation

4. Data exploration

5. Data modeling

6. Presentation and automation

• Fig. 1.3.1 shows the data science design process.

• Step 1: Discovery or Defining research goal

This step involves understanding the business question and defining the goal of
the project. It also identifies the internal and external data sources that can
help to answer that question.

• Step 2: Retrieving data

This step is the collection of the data required for the project. It involves
gaining a business understanding of the data we have and deciphering what each
piece of data means. This could entail determining exactly what data is required
and the best methods for obtaining it, as well as what each data point means in
terms of the company. If we are given a data set by a client, for example, we
need to know what each column and row represents.

• Step 3: Data preparation

Data can have many inconsistencies, such as missing values, blank columns and
incorrect data formats, which need to be cleaned. We need to process, explore
and condition the data before modeling. Clean data gives better predictions.
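The cleaning described above can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the column names `age` and `income`, and the choice to fill gaps with the column mean are all hypothetical, and other strategies (dropping rows, using the median) may suit a real project better.

```python
from statistics import mean

# Hypothetical raw records: None marks a missing value, and "age" arrives as text.
raw_rows = [
    {"age": "34", "income": 52000.0},
    {"age": "41", "income": None},
    {"age": None, "income": 61000.0},
    {"age": "29", "income": 48000.0},
]

def clean(rows):
    """Convert ages to integers and fill missing values with the column mean."""
    ages = [int(r["age"]) for r in rows if r["age"] is not None]
    incomes = [r["income"] for r in rows if r["income"] is not None]
    age_fill = round(mean(ages))      # assumed strategy: mean imputation
    income_fill = mean(incomes)
    cleaned = []
    for r in rows:
        cleaned.append({
            "age": int(r["age"]) if r["age"] is not None else age_fill,
            "income": r["income"] if r["income"] is not None else income_fill,
        })
    return cleaned

cleaned_rows = clean(raw_rows)
```

After this step every row has an integer `age` and a numeric `income`, so later stages do not need to special-case missing values.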

• Step 4: Data exploration

Data exploration is about gaining a deeper understanding of the data. Try to
understand how variables interact with each other, the distribution of the data
and whether there are outliers. To achieve this, use descriptive statistics,
visual techniques and simple modeling. This step is also called Exploratory
Data Analysis.
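As a rough illustration of descriptive statistics and outlier checks, the sketch below summarizes one numeric column and flags outliers with the common 1.5 × IQR rule. The sample values are hypothetical, the IQR rule is one convention among several, and `statistics.quantiles` needs Python 3.8 or later.

```python
from statistics import mean, median, quantiles

# Hypothetical measurements for a single variable.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

# Quartiles using the module's default ("exclusive") method.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Points further than 1.5 * IQR outside the quartiles are flagged as outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

summary = {
    "mean": mean(values),
    "median": median(values),
    "min": min(values),
    "max": max(values),
    "outliers": outliers,
}
```

Comparing the mean with the median in `summary` is a quick way to see how strongly a single extreme value pulls on the distribution.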

• Step 5: Data modeling

In this step, the actual model building process starts. Here, the data scientist
splits the dataset into training and testing sets. Techniques like association,
classification and clustering are applied to the training data set. The model,
once prepared, is tested against the "testing" dataset.
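The split into training and testing data can be sketched as follows. The function name `train_test_split`, the 25% test fraction and the fixed seed are assumptions made for this example, not something the source prescribes.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of 20 records, represented here by their ids.
data = list(range(20))
train, test = train_test_split(data)
```

The model is fitted only on `train`; `test` stays untouched until evaluation, so the reported score reflects data the model has not seen.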

• Step 6: Presentation and automation

Deliver the final baselined model with reports, code and technical documents in
this stage. The model is deployed into a real-time production environment after
thorough testing. In this stage, the key findings are communicated to all
stakeholders. This helps to decide if the project results are a success or a
failure based on the inputs from the model.

What is Machine Learning?

Machine learning is a branch of artificial intelligence that enables algorithms
to uncover hidden patterns within datasets, allowing them to make predictions on
new, similar data without explicit programming for each task. Traditional
machine learning combines data with statistical tools to predict outputs,
yielding actionable insights. This technology finds applications in diverse
fields such as image and speech recognition, natural language processing,
recommendation systems, fraud detection, portfolio optimization, and automating
tasks.

Types of Machine Learning

Machine learning algorithms can be trained in many ways, with each method
having its pros and cons. Based on these methods and ways of learning,
machine learning is broadly categorized into four main types:

1. Supervised machine learning

This type of ML involves supervision, where machines are trained on labeled
datasets and enabled to predict outputs based on the provided training. The
labeled dataset specifies that some input and output parameters are already
mapped. Hence, the machine is trained with the input and corresponding output.
In subsequent phases, the machine is made to predict the outcome for a test
dataset.

For example, consider an input dataset of parrot and crow images. Initially, the
machine is trained to understand the pictures, including the parrot and crow’s
color, eyes, shape, and size. Post-training, an input picture of a parrot is
provided, and the machine is expected to identify the object and predict the
output. The trained machine checks for the various features of the object, such
as color, eyes, shape, etc., in the input picture, to make a final prediction. This is
the process of object identification in supervised machine learning.

The primary objective of the supervised learning technique is to map the input
variable (a) with the output variable (b). Supervised machine learning is further
classified into two broad categories:

• Classification: These are algorithms that address classification problems
where the output variable is categorical, for example yes or no, true or
false, male or female, etc. Real-world applications of this category
include spam detection and email filtering.

Some known classification algorithms include the Random Forest Algorithm,
Decision Tree Algorithm, Logistic Regression Algorithm, and Support Vector
Machine Algorithm.
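To make the idea of a categorical output concrete, here is a minimal sketch of logistic regression trained with batch gradient descent on a single feature. The data (a count of suspicious words per email, labeled 1 for spam and 0 for not spam), the learning rate and the number of epochs are all hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical labeled training data: feature value -> class label.
xs = [0, 1, 2, 3, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(xs, ys)

def predict(x):
    """Return the predicted class (0 or 1) for a new input."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```

Calling `predict` on a new email's word count returns one of the two categories, which is what distinguishes classification from regression.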

• Regression: Regression algorithms handle regression problems where the
output variable is continuous. They model the relationship between the
input and output variables, which is linear in the simplest case.
Examples include weather prediction, market trend analysis, etc.

Popular regression algorithms include the Simple Linear Regression Algorithm,
Multivariate Regression Algorithm, Decision Tree Algorithm, and Lasso
Regression.
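Simple linear regression can be sketched with the ordinary least squares formulas for one input variable. The function name `fit_line` and the sample points are assumptions for this example; the points were chosen to lie exactly on y = 2x + 1 so the fit is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data lying on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Predict a continuous output for a new input."""
    return slope * x + intercept
```

Unlike a classifier, `predict` returns a continuous value, so it can produce outputs that never appeared in the training data.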

2. Unsupervised machine learning

Unsupervised learning refers to a learning technique that is devoid of
supervision. Here, the machine is trained using an unlabeled dataset and is
enabled to predict the output without any supervision. An unsupervised learning
algorithm aims to group the unsorted dataset based on the input's similarities,
differences, and patterns.

For example, consider an input dataset of images of a fruit-filled container.
Here, the contents of the images are not labeled for the machine learning
model. When we input the dataset into the ML model, the task of the model is to
identify patterns in the objects, such as color, shape, or differences seen in
the input images, and categorize them. Upon categorization, the machine then
predicts the output as it gets tested with a test dataset.

Unsupervised machine learning is further classified into two types:

• Clustering: The clustering technique refers to grouping objects into
clusters based on parameters such as similarities or differences between
objects. For example, grouping customers by the products they purchase.

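Some known clustering algorithms include the K-Means Clustering Algorithm,
Mean-Shift Algorithm, and DBSCAN Algorithm. Principal Component Analysis and
Independent Component Analysis are often listed alongside them, but they are
dimensionality-reduction techniques rather than clustering algorithms.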
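A minimal k-means sketch in plain Python shows how grouping by similarity works. It assumes two-dimensional points, hand-picked starting centroids and a fixed number of iterations, all of which are hypothetical simplifications; library implementations choose starting centroids and stopping rules more carefully.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c                  # keep an empty cluster's centroid as is
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

No labels are supplied at any point; the two groups emerge only from the distances between the points.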

• Association: Association learning refers to identifying relations that
frequently occur between the variables of a large dataset. It determines
the dependency of various data items and maps associated variables.
Typical applications include web usage mining and market data analysis.

Popular association rule mining algorithms include the Apriori Algorithm,
Eclat Algorithm, and FP-Growth Algorithm.
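Association rule algorithms are built on two basic measures, support and confidence. The sketch below computes both for a handful of hypothetical shopping baskets; it illustrates the measures only and is not an implementation of Apriori, Eclat or FP-Growth.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

bread_milk_support = support({"bread", "milk"})           # 3 of 5 baskets
bread_to_milk_confidence = confidence({"bread"}, {"milk"})  # 3 of the 4 bread baskets
```

A rule such as "bread implies milk" is usually kept only if both its support and its confidence exceed thresholds chosen by the analyst.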

3. Semi-supervised learning

Semi-supervised learning comprises characteristics of both supervised and
unsupervised machine learning. It uses a combination of labeled and unlabeled
datasets to train its algorithms. Using both types of datasets, semi-supervised
learning overcomes the drawbacks of the two approaches described above.
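One common way to combine labeled and unlabeled data is self-training, sketched below. This is an illustrative choice, not a method named in the source: the one-dimensional data, the labels "A" and "B", and the nearest-centroid rule are all hypothetical.

```python
def self_train(labeled, unlabeled):
    """Repeatedly pseudo-label the unlabeled point that lies closest to a
    class centroid, then treat it as labeled in the next round."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Recompute one centroid per class from everything labeled so far.
        centroids = {}
        for label in sorted({lab for _, lab in labeled}):
            members = [x for x, lab in labeled if lab == label]
            centroids[label] = sum(members) / len(members)
        # Choose the most confident (point, label) pair: smallest distance.
        best_x, best_label = min(
            ((x, lab) for x in remaining for lab in centroids),
            key=lambda pair: abs(pair[0] - centroids[pair[1]]),
        )
        labeled.append((best_x, best_label))
        remaining.remove(best_x)
    return labeled

# Hypothetical data: two labeled examples and four unlabeled ones.
labeled = [(1.0, "A"), (9.0, "B")]
unlabeled = [2.0, 3.0, 7.5, 8.0]
result = dict(self_train(labeled, unlabeled))
```

Only two points carried labels at the start, yet every point ends up with one, which is the practical appeal when labeling is expensive.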

Consider the example of a college student. A student learning a concept under a
teacher's supervision in college is analogous to supervised learning. In
unsupervised learning, a student self-learns the same concept at home without a
teacher's guidance. Meanwhile, a student revising the concept after learning it
under the direction of a teacher in college is a semi-supervised form of
learning.

4. Reinforcement learning

Reinforcement learning is a feedback-based process. Here, the AI component (the
agent) observes its surroundings, takes actions by trial and error, learns from
experience, and improves its performance. The agent is rewarded for each good
action and penalized for every wrong move. Thus, the agent aims to maximize its
rewards by performing good actions.

Unlike supervised learning, reinforcement learning lacks labeled data, and the
agents learn via experience only. Consider video games. Here, the game
specifies the environment, and each move of the reinforcement agent leads to a
new state. The agent receives feedback in the form of rewards and penalties,
which affects the overall game score. The ultimate goal of the agent is to
achieve a high score.
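The reward-and-penalty loop can be sketched with tabular Q-learning on a tiny game. Everything here is a hypothetical setup chosen for illustration: a corridor of five positions, a reward of 1.0 for reaching the last position, a penalty of 0.1 for every other move, and the particular learning rate, discount factor and exploration rate.

```python
import random

N_STATES = 5          # positions 0..4; reaching position 4 ends the episode
ACTIONS = (-1, +1)    # index 0 moves left, index 1 moves right

def step(state, action):
    """Environment: apply the move and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.1   # reward at the goal, small penalty otherwise
    return next_state, reward, done

def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:           # explore: try a random move
                a = rng.randrange(2)
            else:                                # exploit: best known move
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy policy for the non-terminal positions 0..3.
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
```

No position is ever labeled with the correct move; the agent infers from the accumulated rewards alone that moving right is the better action everywhere.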

Reinforcement learning is applied across different fields such as game theory,
information theory, and multi-agent systems. Reinforcement learning is further
divided into two types of methods or algorithms:

• Positive reinforcement learning: This refers to adding a reinforcing
stimulus after a specific behavior of the agent, which makes it more likely
that the behavior will occur again in the future, e.g., adding a reward
after a behavior.

• Negative reinforcement learning: This refers to strengthening a specific
behavior because it stops or avoids a negative outcome.
