
Predictive Data Analytics with Python

Introduction
Predictive data analytics is a field that focuses on extracting insights from historical data in order to make
predictions or forecasts about future events or trends. It involves using statistical and machine learning
techniques to analyze large datasets and uncover patterns, correlations, and trends that can be used to
make informed predictions.

Python, as a powerful programming language, provides a wide range of libraries and tools for predictive
data analytics. Some of the popular libraries used in Python for predictive analytics include NumPy,
pandas, scikit-learn, TensorFlow, and Keras. These libraries provide various functionalities for data
manipulation, preprocessing, modeling, and evaluation.

The predictive data analytics process typically involves several steps:

1. Data collection: Gathering relevant data from various sources, such as databases, APIs, or CSV files.

2. Data preprocessing: Cleaning and transforming the collected data to ensure it is suitable for analysis.
This may involve tasks such as handling missing values, removing outliers, and normalizing or scaling the
data.

3. Exploratory data analysis (EDA): Analyzing the data to gain insights and understand its characteristics.
This may involve visualizations, summary statistics, and correlation analysis.

4. Feature selection: Identifying the most relevant features or variables that are likely to have an impact
on the prediction task. This step helps in reducing complexity and improving model performance.

5. Model selection and training: Choosing an appropriate predictive model based on the problem at
hand and training it using the historical data. This could involve using techniques such as regression,
classification, or time series forecasting.

6. Model evaluation: Assessing the performance of the trained model using evaluation metrics and
validation techniques. This helps in determining how well the model is likely to perform on unseen data.

7. Prediction and deployment: Using the trained model to make predictions on new or future data. The
predictions can be used for decision-making, optimization, or forecasting purposes. The model can be
deployed as an application, API, or integrated into existing systems.
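
To make these steps concrete, here is a minimal sketch of the workflow with pandas and scikit-learn. The file name `historical_data.csv`, the feature columns, and the `target` column are placeholders for illustration, and the classifier is just one possible model choice:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1-2. Collect and preprocess the historical data (file and column names are placeholders)
data = pd.read_csv("historical_data.csv").dropna()

# 4. Select the features and the target to predict
X = data[["feature_1", "feature_2", "feature_3"]]
y = data["target"]

# 5. Split, scale, and train a model (here a simple classifier)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression().fit(X_train_scaled, y_train)

# 6. Evaluate on the held-out data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test_scaled)))

# 7. Predict on new data prepared in the same format
# predictions = model.predict(scaler.transform(new_data))
```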

Python's versatility and extensive ecosystem of libraries make it an ideal choice for predictive data
analytics. Its ease of use, vast community support, and extensive documentation make it accessible to
both beginners and experienced data analysts. With Python, you can leverage the power of machine
learning and statistical techniques to gain valuable insights and make accurate predictions based on your
data.

Essential libraries in Python:

There are several essential Python libraries for predictive data analytics. Here are some of the most
commonly used libraries:

1. NumPy: NumPy is a fundamental library for scientific computing in Python. It provides powerful tools
for handling multi-dimensional arrays and performing mathematical operations. NumPy is essential for
data manipulation and preprocessing tasks.

2. pandas: pandas is a versatile library that provides data structures and tools for data analysis. It offers
powerful data manipulation and data wrangling capabilities, such as data cleaning, merging, reshaping,
and filtering. pandas is widely used for data preprocessing and exploratory data analysis.

3. scikit-learn: scikit-learn is a popular machine learning library in Python. It offers a wide range of
algorithms and tools for various tasks, including regression, classification, clustering, and dimensionality
reduction. scikit-learn provides an intuitive and consistent API for training and evaluating predictive
models.

4. TensorFlow: TensorFlow is an open-source library developed by Google for machine learning and deep
learning. It provides a flexible and efficient framework for building and deploying machine learning
models, especially neural networks. TensorFlow is commonly used for tasks such as image recognition,
natural language processing, and time series forecasting.

5. Keras: Keras is a high-level neural network library that runs on top of TensorFlow. It provides a user-
friendly interface for building and training deep learning models. Keras simplifies the process of
designing complex neural networks and is widely used for tasks like image classification, sentiment
analysis, and text generation.

6. StatsModels: StatsModels is a library focused on statistical modeling and analysis. It offers a wide
range of statistical models, including regression analysis, time series analysis, and hypothesis testing.
StatsModels provides comprehensive statistical tools for predictive modeling and inference.

7. XGBoost: XGBoost is an optimized implementation of the gradient boosting algorithm. It is known for
its speed and performance and is widely used for classification and regression problems. XGBoost
provides excellent predictive power and is particularly effective for handling structured data.

8. Matplotlib: Matplotlib is a powerful visualization library for creating static, animated, and interactive
visualizations in Python. It provides a wide range of plotting functions and customization options, making
it suitable for creating various types of charts, graphs, and plots.

These are just a few of the essential libraries used in predictive data analytics with Python. Depending on
your specific requirements, you may also find other libraries, such as seaborn, plotly, PyTorch, or
LightGBM, beneficial for your predictive modeling tasks.
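
Assuming these packages are installed, they are typically imported with the following conventional aliases:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import tensorflow as tf
from tensorflow import keras
import xgboost as xgb
from sklearn.linear_model import LinearRegression
```
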
Data Preprocessing:

https://towardsdatascience.com/data-preprocessing-in-python-pandas-part-6-dropping-duplicates-e35e46bcc9d6

https://www.numpyninja.com/post/data-preprocessing-using-pandas-drop-and-drop_duplicates-functions

https://www.futurelearn.com/info/courses/data-analytics-python-data-wrangling-and-ingestion/0/steps/186666
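
The links above cover dropping duplicates and basic wrangling with pandas. As a minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Bob', 'Cara'],
                   'score': [10, 15, 15, None]})

# Drop exact duplicate rows (keeps the first occurrence)
df = df.drop_duplicates()

# Drop any rows that still contain missing values
df = df.dropna()
print(df)
```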

Data transformation using a function or mapping:

Data Transformation in Data Mining


Raw data is difficult to trace or understand. That's why it needs to be preprocessed before
retrieving any information from it. Data transformation is a technique used to convert the
raw data into a suitable format that efficiently eases data mining and retrieves strategic
information. Data transformation includes data cleaning techniques and a data reduction
technique to convert the data into the appropriate form.

Data transformation is an essential data preprocessing technique that must be performed
on the data before data mining to provide patterns that are easier to understand.

Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data pipeline
for data analytics projects. Organizations that use on-premises data warehouses generally
use an ETL (extract, transform, and load) process, in which data transformation is the
middle step. Today, most organizations use cloud-based data warehouses to scale
compute and storage resources with latency measured in seconds or minutes. The
scalability of the cloud platform lets organizations skip preload transformations and load
raw data into the data warehouse, then transform it at query time.

Data integration, migration, data warehousing, and data wrangling may all involve data
transformation. Data transformation increases the efficiency of business and analytic
processes, and it enables businesses to make better data-driven decisions. During the
data transformation process, an analyst will determine the structure of the data. This could
mean that data transformation may be:

o Constructive: The data transformation process adds, copies, or replicates data.
o Destructive: The system deletes fields or records.
o Aesthetic: The transformation standardizes the data to meet requirements or parameters.
o Structural: The database is reorganized by renaming, moving, or combining columns.

Data Transformation Techniques


There are several data transformation techniques that can help structure and clean up the
data before analysis or storage in a data warehouse. Let's study all techniques used for
data transformation, some of which we have already studied in data reduction and data
cleaning.
1. Data Smoothing

Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or
reduce any variance or any other noise form.

The concept behind data smoothing is that it will be able to identify simple changes to
help predict different trends and patterns. This serves as a help to analysts or traders who
need to look at a lot of data which can often be difficult to digest for finding patterns that
they wouldn't see otherwise.

We have seen how the noise is removed from the data using the techniques such as
binning, regression, clustering.

o Binning: This method splits the sorted data into the number of bins and smoothens the
data values in each bin considering the neighborhood values around it.
o Regression: This method identifies the relation between two dependent attributes so that
if we have one attribute, it can be used to predict the other attribute.
o Clustering: This method groups similar data values to form clusters. The values that lie
outside a cluster are known as outliers.

2. Attribute Construction

In the attribute construction method, new attributes are created from the existing attributes to
construct a new data set that eases data mining. The new attributes are created and applied
to assist the mining process, which simplifies the original data and
makes the mining more efficient.

For example, suppose we have a data set referring to measurements of different plots,
i.e., we may have the height and width of each plot. So here, we can construct a new
attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the
relations among the attributes in a data set.
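
In pandas, such an attribute can be constructed directly from the existing columns; the plot measurements below are made up for illustration:

```python
import pandas as pd

plots = pd.DataFrame({'height': [10, 12, 8], 'width': [5, 6, 7]})

# Construct the new attribute 'area' from the existing 'height' and 'width' attributes
plots['area'] = plots['height'] * plots['width']
print(plots)
```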

3. Data Aggregation

Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data
sources into a data analysis description. This is a crucial step since the accuracy of data
analysis insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to
produce relevant results. The collection of data is useful for everything from decisions
concerning financing or business strategy of the product, pricing, operations, and
marketing strategies.

For example, we have a data set of sales reports of an enterprise that has quarterly sales
of each year. We can aggregate the data to get the enterprise's annual sales report.
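
A short sketch of this aggregation with pandas, using made-up quarterly revenue figures:

```python
import pandas as pd

sales = pd.DataFrame({
    'year':    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    'quarter': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4'],
    'revenue': [100, 120, 90, 150, 110, 130, 95, 160],
})

# Aggregate the quarterly figures into an annual sales report
annual_sales = sales.groupby('year')['revenue'].sum()
print(annual_sales)
```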

4. Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range such as
[-1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below.

Consider that we have a numeric attribute A and we have n number of observed values
for attribute A that are V1, V2, V3, ….Vn.

o Min-max normalization: This method implements a linear transformation on the original
data. Let us consider that we have minA and maxA as the minimum and maximum values
observed for attribute A, and Vi is the value of attribute A that has to be normalized.
Min-max normalization maps Vi to V'i in a new, smaller range [new_minA, new_maxA].
The formula for min-max normalization is given below:

V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

For example, suppose we have $12,000 and $98,000 as the minimum and maximum values for the
attribute income, and [0.0, 1.0] is the range into which we have to map the value $73,600.
The value $73,600 would be transformed using min-max normalization as follows:

((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 ≈ 0.716

o Z-score normalization: This method normalizes the value for attribute A using
the mean and standard deviation. The following formula is used for Z-score
normalization:

V'i = (Vi - Ā) / σA

Here Ā and σA are the mean and standard deviation for attribute A, respectively.
For example, suppose we have a mean and standard deviation for attribute A of $54,000 and
$16,000, and we have to normalize the value $73,600 using z-score normalization:

(73,600 - 54,000) / 16,000 = 1.225

o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal
point in the value. This movement of the decimal point depends on the maximum absolute
value of A. The formula for decimal scaling is given below:

V'i = Vi / 10^j

Here j is the smallest integer such that max(|V'i|) < 1.


For example, the observed values for attribute A range from -986 to 917, and the maximum
absolute value for attribute A is 986. Here, to normalize each value of attribute A using
decimal scaling, we have to divide each value of attribute A by 1000, i.e., j=3.
So, the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917.
The normalization parameters such as mean, standard deviation, the maximum absolute
value must be preserved to normalize the future data uniformly.
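
These formulas can be checked with a few lines of Python, reusing the numbers from the examples above:

```python
import numpy as np

# Min-max normalization of $73,600 with min $12,000, max $98,000 into [0.0, 1.0]
v, min_a, max_a = 73_600, 12_000, 98_000
new_min, new_max = 0.0, 1.0
print((v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min)   # ~0.716

# Z-score normalization of $73,600 with mean $54,000 and standard deviation $16,000
print((73_600 - 54_000) / 16_000)                                      # 1.225

# Decimal scaling of the values -986 and 917 (max |v| = 986, so j = 3)
values = np.array([-986, 917])
print(values / 10 ** 3)                                                # [-0.986  0.917]
```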

5. Data Discretization

This is a process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels. This makes the data easier to study
and analyze. If a data mining task handles a continuous attribute, then its discrete values
can be replaced by constant quality attributes. This improves the efficiency of the task.

This method is also called a data reduction mechanism as it transforms a large dataset
into a set of categorical data. Discretization also uses decision tree-based algorithms to
produce short, compact, and accurate results when using discrete values.
Data discretization can be classified into two types: supervised discretization, where the
class information is used, and unsupervised discretization, where it is not. Based on the
direction in which the process proceeds, discretization can follow a 'top-down splitting
strategy' or a 'bottom-up merging strategy'.

For example, the values for the age attribute can be replaced by the interval labels such
as (0-10, 11-20…) or (kid, youth, adult, senior).
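
With pandas, this kind of discretization of an age attribute can be sketched with `pd.cut`; the ages and bin edges below are made up:

```python
import pandas as pd

ages = pd.Series([4, 15, 23, 45, 70])

# Replace continuous age values with interval labels
age_groups = pd.cut(ages, bins=[0, 10, 20, 60, 100],
                    labels=['kid', 'youth', 'adult', 'senior'])
print(age_groups)
```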

6. Data Generalization

It converts low-level data attributes to high-level data attributes using concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches:

o Data cube process (OLAP) approach.
o Attribute-oriented induction (AOI) approach.

For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a
higher conceptual level into a categorical value (young, old).

Data Transformation Process


The entire process for transforming data is known as ETL (Extract, Transform, and
Load). Through the ETL process, analysts can convert data to its desired format.
Here are the steps involved in the data transformation process:
1. Data Discovery: During the first stage, analysts work to understand and identify data in
its source format. To do this, they will use data profiling tools. This step helps analysts
decide what they need to do to get data into its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data mapping is
essential to many data processes, and one misstep can lead to incorrect analysis and ripple
through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original source.
These may include structured sources such as databases or streaming sources such as
customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to
write code to complete the transformation. Often, analysts generate this code with the
help of data transformation platforms or tools.
5. Review: After transforming the data, analysts need to check it to ensure everything has
been formatted correctly.
6. Sending: The final step involves sending the data to its target destination. The target
might be a data warehouse or a database that handles both structured and unstructured
data.

Advantages of Data Transformation


Transforming data can help businesses in a variety of ways. Here are some of the essential
advantages of data transformation, such as:

o Better Organization: Transformed data is easier for both humans and computers to use.
o Improved Data Quality: There are many risks and costs associated with bad data. Data
transformation can help your organization eliminate quality issues such as missing values
and other inconsistencies.
o Perform Faster Queries: You can quickly and easily retrieve transformed data thanks to it
being stored and standardized in a source location.
o Better Data Management: Businesses are constantly generating data from more and
more sources. If there are inconsistencies in the metadata, it can be challenging to
organize and understand it. Data transformation refines your metadata, so it's easier to
organize and understand.
o More Use Out of Data: While businesses may be collecting data constantly, a lot of that
data sits around unanalyzed. Transformation makes it easier to get the most out of your
data by standardizing it and making it more usable.

Disadvantages of Data Transformation


While data transformation comes with a lot of benefits, still there are some challenges to
transforming data effectively, such as:

o Data transformation can be expensive. The cost is dependent on the specific infrastructure,
software, and tools used to process data. Expenses may include licensing, computing
resources, and hiring necessary personnel.
o Data transformation processes can be resource-intensive. Performing transformations in
an on-premises data warehouse after loading or transforming data before feeding it into
applications can create a computational burden that slows down other operations. If you
use a cloud-based data warehouse, you can do the transformations after loading because
the platform can scale up to meet demand.
o Lack of expertise and carelessness can introduce problems during transformation. Data
analysts without appropriate subject matter expertise are less likely to notice incorrect data
because they are less familiar with the range of accurate and permissible values.
o Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information to its prior format for a different application.

Ways of Data Transformation


There are several different ways to transform data, such as:
o Scripting: Data transformation through scripting involves using Python or SQL to write the
code to extract and transform data. Python and SQL are scripting languages that allow you
to automate certain tasks in a program. They also allow you to extract information from
data sets. Scripting languages require less code than traditional programming languages,
so scripting is less labor-intensive.
o On-Premises ETL Tools: ETL tools remove the need to hand-script the data transformation
by automating the process. On-premises ETL tools are hosted on company servers. While
these tools can help save you time, using them often requires extensive expertise and
significant infrastructure costs.
o Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in the
cloud. These tools are often the easiest for non-technical users to utilize. They allow you
to collect data from any cloud source and load it into your data warehouse. With cloud-
based ETL tools, you can decide how often you want to pull data from your source, and
you can monitor your usage.

In predictive data analytics, transforming data using functions or mappings can be a crucial step
in preparing the data for modeling. Transformations help in improving the quality and structure
of the data, handling outliers, normalizing variables, and creating new features. Here are some
common techniques for transforming data using functions or mappings:

1. Scaling and Normalization:


- Min-Max Scaling: Rescales the data to a specific range (e.g., between 0 and 1) using the
minimum and maximum values of the variable.
- Standardization: Transforms the data to have zero mean and unit variance by subtracting the
mean and dividing by the standard deviation.

2. Logarithmic Transformations:
- Log Transformation: Applies the logarithm function to the data, which can help in handling
skewed distributions and reducing the impact of outliers.
3. Power Transformations:
- Square Root Transformation: Takes the square root of the data, which can be useful for
reducing the impact of large values and handling right-skewed distributions.
- Box-Cox Transformation: A more general power transformation that optimizes the
transformation parameter lambda to achieve the best approximation of a normal distribution.

4. Binning and Discretization:


- Binning: Divides continuous variables into discrete bins or intervals. This can be useful for
converting continuous data into categorical or ordinal variables.
- Discretization: Maps continuous variables to a set of discrete values or categories based on
predefined thresholds or statistical measures.

5. Encoding Categorical Variables:


- One-Hot Encoding: Creates binary columns for each unique category in a categorical
variable, representing the presence or absence of a category.
- Label Encoding: Assigns a unique numerical label to each category in a categorical variable,
allowing algorithms to work with categorical data.

6. Feature Engineering:
- Polynomial Features: Generates higher-order polynomial features by combining existing
features through multiplication or exponentiation.
- Interaction Terms: Creates new features by capturing interactions between existing features
(e.g., multiplying two variables).

These are just a few examples of transformations that can be applied to data using functions or
mappings. The choice of transformation techniques depends on the nature of the data, the
problem at hand, and the requirements of the predictive modeling task. It is important to
carefully analyze and understand the data before applying any transformations to ensure their
effectiveness and suitability for the specific problem.
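
As a rough illustration, the sketch below applies several of these transformations to a small made-up dataset with pandas, NumPy, and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures

df = pd.DataFrame({'income': [20000, 35000, 50000, 120000],
                   'age': [22, 35, 46, 58],
                   'city': ['Pune', 'Delhi', 'Pune', 'Mumbai']})

# Min-max scaling and standardization of the income column
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']]).ravel()
df['income_std'] = StandardScaler().fit_transform(df[['income']]).ravel()

# Log transformation to reduce right skew
df['income_log'] = np.log(df['income'])

# One-hot encoding of the categorical 'city' column
df = pd.get_dummies(df, columns=['city'])

# Degree-2 polynomial features from 'age' (columns: age, age^2)
age_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(df[['age']])
print(df.head())
print(age_poly)
```
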
In data analytics, replacing values in a dataset is a common task, often required to handle missing
data, outliers, or to transform specific values. Python provides various ways to replace values in
a dataset. Let's explore a few techniques:

1. Replacing with a constant value:


You can replace specific values in a dataset with a constant value using pandas' `replace()`
function. Here's an example:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Replace all occurrences of 4 with 0 in column 'A'


data['A'] = data['A'].replace(4, 0)
print(data)
```

Output:
```
A B
0 1 6
1 2 7
2 3 8
3 0 9
4 5 10
```

2. Replacing with a calculated value:


You can replace values based on some calculation or condition using pandas' `replace()`
function or other methods. Here's an example:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Replace all odd values in column 'A' with their squares
odd_mask = data['A'] % 2 != 0
data.loc[odd_mask, 'A'] = data.loc[odd_mask, 'A'] ** 2
print(data)
```

Output:
```
    A   B
0   1   6
1   2   7
2   9   8
3   4   9
4  25  10
```
3. Replacing missing values:
Missing values are often replaced with appropriate values to avoid issues during analysis.
Pandas provides the `fillna()` function to replace missing values. Here's an example:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': [1, np.nan, 3, 4, 5], 'B': [6, 7, np.nan, 9, 10]})

# Replace missing values with the mean of the respective columns
data = data.fillna(data.mean())
print(data)
```

Output:
```
      A     B
0  1.00   6.0
1  3.25   7.0
2  3.00   8.0
3  4.00   9.0
4  5.00  10.0
```

In this example, missing values in columns 'A' and 'B' are replaced with the mean of each
column.
These are just a few examples of replacing values in a dataset using Python and pandas.
Depending on your specific requirements, you can also use other methods like `map()`, `apply()`,
or conditional statements to replace values.
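
For instance, a conditional replacement can also be written with `numpy.where`, and coded values can be replaced through a mapping dictionary with `map()`; a small sketch:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Conditional replacement: set values greater than 3 to 0
data['A'] = np.where(data['A'] > 3, 0, data['A'])

# Mapping-based replacement of coded values
codes = pd.Series(['M', 'F', 'M'])
print(codes.map({'M': 'Male', 'F': 'Female'}))
print(data)
```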

Handling missing data is an important task in data analytics as missing values can impact the
accuracy and reliability of analysis and modeling. Python provides several libraries and
techniques to handle missing data. Here are some common approaches:

1. Identifying missing values:


Before handling missing data, it's crucial to identify where the missing values are located in the
dataset. In pandas, missing values are typically represented as NaN (Not a Number) or None.
You can use functions like `isnull()`, `notnull()`, or `isna()` to detect missing values. For
example:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, None, 4, 5], 'B': [6, None, 8, 9, 10]})

# Check for missing values in the DataFrame


print(data.isnull())
```

Output:
```
A B
0 False False
1 False True
2 True False
3 False False
4 False False
```

2. Dropping missing values:


If the missing values are deemed irrelevant or if they are present in a small number of rows,
you may choose to drop those rows or columns. Pandas provides the `dropna()` function to
remove rows or columns containing missing values. Here's an example:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, None, 4, 5], 'B': [6, None, 8, 9, 10]})

# Drop rows with missing values


data = data.dropna()
print(data)
```

Output:
```
     A     B
0  1.0   6.0
3  4.0   9.0
4  5.0  10.0
```

3. Filling missing values:


Instead of dropping missing values, you can fill them with appropriate values. Pandas provides
the `fillna()` function to replace missing values with specific values or calculated values. Here's
an example:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, None, 4, 5], 'B': [6, None, 8, 9, 10]})

# Fill missing values with the mean of respective columns


data = data.fillna(data.mean())
print(data)
```

Output:
```
     A      B
0  1.0   6.00
1  2.0   8.25
2  3.0   8.00
3  4.0   9.00
4  5.0  10.00
```

In this example, missing values are filled with the mean of each respective column.

4. Imputation techniques:
Advanced techniques can be used to impute missing values based on patterns or relationships
in the data. Common imputation methods include mean imputation, median imputation, mode
imputation, or more sophisticated techniques like regression imputation or k-nearest neighbors
imputation. Libraries like scikit-learn and fancyimpute provide various imputation algorithms.
For example, scikit-learn's `SimpleImputer` can fill each missing value with the column mean:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({'A': [1, 2, None, 4, 5], 'B': [6, None, 8, 9, 10]})

# Fill each missing value with the mean of its column
data[['A', 'B']] = SimpleImputer(strategy='mean').fit_transform(data)
print(data)
```
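
As a sketch of one of the more sophisticated options mentioned above, scikit-learn's `KNNImputer` fills each missing value using the nearest rows (here the 2 nearest neighbours):

```python
import pandas as pd
from sklearn.impute import KNNImputer

data = pd.DataFrame({'A': [1, 2, None, 4, 5], 'B': [6, None, 8, 9, 10]})

# Impute each missing value from the 2 nearest neighbouring rows
imputer = KNNImputer(n_neighbors=2)
data[['A', 'B']] = imputer.fit_transform(data)
print(data)
```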


Data Analytics types:

Types of analytics explained — descriptive, predictive, prescriptive, and more

(Adobe Communications Team, 11-08-2022)

Most business leaders have a general understanding of data analytics
and many companies have departments dedicated to gathering and
interpreting information about customers, processes, and markets.
But there is more than one kind of analytics — and each tells a
different story about your business. Understanding the different
types of analytics can help you choose the ones that will benefit
your business most and ultimately drive business objectives.

This post will explore the three most common types of data analytics
and one less known model. This information will help you gain better
insight into what your data says about your business so you can make
adjustments to meet your goals.

• Descriptive analytics
• Predictive analytics
• Prescriptive analytics
• Diagnostic analytics
• The future of analytics

Types of business analytics

The process of business analytics is an essential tool for interpreting
and applying the vast amount of data your company collects and
organizes. From customer behavior and conversion rates to revenue
and business processes, the information generated by your
company’s operations has to tell a helpful story to benefit you.
Business analytics is the process that helps turn those data points
into actionable insights.

The four different types of business analytics are descriptive,
predictive, prescriptive, and diagnostic. Exploring the distinctions
between these models can help you learn how to use each to support
your business goals.
Descriptive analytics

Descriptive analytics examines what happened in the past. You're
utilizing descriptive analytics when you examine past data sets for
patterns and trends. This is the core of most businesses’ analytics
because it answers important questions like how much you sold and
if you hit specific goals. It’s easy to understand even for non-data
analysts.

Descriptive analytics functions by identifying what metrics you want
to measure, collecting that data, and analyzing it. It turns the
stream of facts your business has collected into information you can
act on, plan around, and measure.

Examples of descriptive analytics include:

• Annual revenue reports
• Survey response summaries
• Year-over-year sales reports

The main difficulty of descriptive analytics is its limitations. It's a
helpful first step for decision makers and managers, but it can't go
beyond analyzing data from past events. Once descriptive analytics
is done, it’s up to your team to ask how or why those trends
occurred, brainstorm and develop possible responses or solutions,
and choose how to move forward.
Predictive analytics

Predictive analytics is what it sounds like — it aims to predict likely
outcomes and make educated forecasts using historical data.
Predictive analytics extends trends into the future to see possible
outcomes. This is a more complex version of data analytics because
it uses probabilities for predictions instead of simply interpreting
existing facts.

Use predictive analytics by first identifying what you want to predict
and then bringing existing data together to project possibilities to a
particular date. Statistical modeling or machine learning are
commonly used with predictive analytics. This is how you answer
planning questions such as how much you might sell or if you’re on
track to hit your Q4 targets.

A business is in a better position to set realistic goals and avoid risks
if they use data to create a list of likely outcomes. Predictive
analytics can keep your team or the company as a whole aligned on
the same strategic vision.

Examples of predictive analytics include:

• Ecommerce businesses that use a customer's browsing and
purchasing history to make product recommendations.
• Financial organizations that need help determining whether a
customer is likely to pay their credit card bill on time.
• Marketers who analyze data to determine the likelihood that
new customers will respond favorably to a given campaign or
product offering.

The primary challenge with predictive analytics is that the insights it
generates are limited to the data. First, that means that smaller or
incomplete data sets will not yield predictions as accurate as larger
data sets might. Getting good business intelligence (BI) from
predictive analytics requires sufficient data, but what counts as
“sufficient” depends on the industry, business, audience, and the
use case.

Additionally, the challenge of predictive analytics being restricted to
the data simply means that even the best algorithms with the
biggest data sets can’t weigh intangible or distinctly human factors.
A sudden economic shift or even a change in the weather can affect
spending, but a predictive analytics model can’t account for those
variables.
Prescriptive analytics

Prescriptive analytics uses the data from a variety of sources —
including statistics, machine learning, and data mining — to identify
possible future outcomes and show the best option. Prescriptive
analytics is the most advanced of the three types because it provides
actionable insights instead of raw data. This methodology is how you
determine what should happen, not just what could happen.

Using prescriptive analytics enables you to not only envision future
outcomes, but to understand why they will happen. Prescriptive
analytics also can predict the effect of future decisions, including
the ripple effects those decisions can have on different parts of the
business. And it does this in whatever order the decisions may occur.

Prescriptive analytics is a complex process that involves many
variables and tools like algorithms, machine learning, and big data.
Proper data infrastructures need to be established or this type of
analytics could be a challenge to manage.

Examples of prescriptive analytics include:

• Calculating client risk in the insurance industry to determine
what plans and rates an account should be offered.
• Discovering what features to include in a new product to
ensure its success in the market, possibly by analyzing data like
customer surveys and market research to identify what
features are most desirable for customers and prospects.
• Identifying tactics to optimize patient care in healthcare, like
assessing the risk for developing specific health problems in the
future and targeting treatment decisions to reduce those risks.

The most common issue with prescriptive analytics is that it requires
a lot of data to produce useful results, but a large amount of data
isn’t always available. This type of analytics could easily become
inaccessible for most.

Though the use of machine learning dramatically reduces the
possibility of human error, an additional downside is that it can't
always account for all external variables since it often relies on
machine learning algorithms.

Diagnostic analytics

Another common type of analytics is diagnostic analytics and it helps
explain why things happened the way they did. It's a more complex
version of descriptive analytics, extending beyond what happened
to why it happened.

Diagnostic analytics identifies trends or patterns in the past and
then goes a step further to explain why the trends occurred the way
they did. It’s a logical step after descriptive analytics because it
answers questions like why a certain amount was sold or why Q1
targets were hit.

Diagnostic analytics is also a useful tool for businesses that want
more confidence to duplicate good outcomes and avoid negative
ones. Descriptive analytics can tell you what happened but then it is
up to your team to figure out what to do with that data. Diagnostic
analytics applies data to figure out why something happened so you
can develop better strategies without so much trial and error.
Examples of diagnostic analytics include:

• Why did year-over-year sales go up?
• Why did a certain product perform above expectations?
• Why did we lose customers in Q3?

The main flaw of diagnostic analytics is that, by focusing on past
occurrences, it is limited in providing actionable observations about the
future. Understanding the causal relationships and sequences
may be enough for some businesses, but it may not provide
sufficient answers for others. For the latter, managing big data will
likely require more advanced analytics solutions and you might have
to implement additional tools — venturing into predictive or
prescriptive analytics — to find meaningful insights.

The future of analytics

The use of analytics in business is not new, but it is on a steep
growth trajectory. Fueled by huge data sets streaming in from the
IoT, advancements in AI, and the growth of self-service BI tools, the
use of analytics in business has yet to peak.

The US Bureau of Labor Statistics predicts huge growth in the
number of research analysts in the coming years, projecting a "much
faster than average" growth rate of 19%. Additionally, some of the
industry’s top experts in data science and analytics predict the ideal
candidate for businesses in the future will be a person who can both
understand and speak data.

Even as the need for analytics experts grows, the market for self-
service tools continues to escalate as well. A report from Allied
Market Research expects the self-service BI market to reach $14.19
billion by 2026, and Gartner cites the growth of business-composed
data and analytics, that focuses on people, “shifting from IT to
business.”

Implementing more advanced analytics — and for some businesses
bringing analytics into business strategy — will continue to become
more important for companies of all sizes.
Association rules:

What are association rules?


Association rules are "if-then" statements, that help to show the probability of
relationships between data items, within large data sets in various types of databases.
Association rule mining has a number of applications and is widely used to help
discover sales correlations in transactional data or in medical data sets.

Use cases for association rules


In data science, association rules are used to find correlations and co-occurrences
between data sets. They are ideally used to explain patterns in data from seemingly
independent information repositories, such as relational databases and transactional
databases. The act of using association rules is sometimes referred to as "association
rule mining" or "mining associations."

Below are a few real-world use cases for association rules:

• Medicine. Doctors can use association rules to help diagnose patients.
There are many variables to consider when making a diagnosis, as many
diseases share symptoms. By using association rules and machine learning-
fueled data analysis, doctors can determine the conditional probability of a
given illness by comparing symptom relationships in the data from past
cases. As new diagnoses get made, machine learning models can adapt the
rules to reflect the updated data.

• Retail. Retailers can collect data about purchasing patterns, recording
purchase data as item barcodes are scanned by point-of-sale systems.
Machine learning models can look for co-occurrence in this data to
determine which products are most likely to be purchased together. The
retailer can then adjust marketing and sales strategy to take advantage of
this information.
• User experience (UX) design. Developers can collect data on how
consumers use a website they create. They can then use associations in the
data to optimize the website user interface -- by analyzing where users
tend to click and what maximizes the chance that they engage with a call to
action, for example.

• Entertainment. Services like Netflix and Spotify can use association rules to
fuel their content recommendation engines. Machine learning models
analyze past user behavior data for frequent patterns, develop association
rules and use those rules to recommend content that a user is likely to
engage with, or organize content in a way that is likely to put the most
interesting content for a given user first.
How association rules work
Association rule mining, at a basic level, involves the use of machine learning models to
analyze data for patterns, or co-occurrences, in a database. It identifies frequent if-
then associations, which themselves are the association rules.

An association rule has two parts: an antecedent (if) and a consequent (then). An
antecedent is an item found within the data. A consequent is an item found in
combination with the antecedent.

Association rules are created by searching data for frequent if-then patterns and using
the criteria support and confidence to identify the most important
relationships. Support is an indication of how frequently the items appear in the data.
Confidence indicates the number of times the if-then statements are found true. A
third metric, called lift, can be used to compare confidence with expected confidence,
or how many times an if-then statement is expected to be found true.

Association rules are calculated from itemsets, which are made up of two or more
items. If rules are built from analyzing all the possible itemsets, there could be so
many rules that they hold little meaning. For that reason, association rules are typically
created from itemsets that are well represented in the data.
Measures of the effectiveness of association rules
The strength of a given association rule is measured by two main parameters: support
and confidence. Support refers to how often a given rule appears in the database being
mined. Confidence refers to the amount of times a given rule turns out to be true in
practice. A rule may show a strong correlation in a data set because it appears very
often but may occur far less when applied. This would be a case of high support, but
low confidence.

Conversely, a rule might not particularly stand out in a data set, but continued analysis
shows that it occurs very frequently. This would be a case of high confidence and low
support. Using these measures helps analysts separate causation from correlation and
allows them to properly value a given rule.

A third value parameter, known as the lift value, is the ratio of the rule's confidence to the
expected confidence (the support of the consequent). If the lift value is less than 1, there is a
negative association between the items; if it is greater than 1, there is a positive association;
and if the ratio equals 1, the items are independent and there is no correlation.

Uses of association rules in data mining


In data mining, association rules are useful for analyzing and predicting customer
behavior. They play an important part in customer analytics, market basket analysis,
product clustering, catalog design and store layout.

Programmers use association rules to build programs capable of machine learning.
Machine learning is a type of artificial intelligence (AI) that seeks to build programs
with the ability to become more efficient without being explicitly programmed.

Examples of association rules in data mining


A classic example of association rule mining refers to a relationship between diapers
and beers. The example, which seems to be fictional, claims that men who go to a
store to buy diapers are also likely to buy beer. Data that would point to that might
look like this:

A supermarket has 200,000 customer transactions. About 4,000 transactions, or about
2% of the total number of transactions, include the purchase of diapers. About 5,500
transactions (2.75%) include the purchase of beer. Of those, about 3,500 transactions,
or 1.75%, include both the purchase of diapers and beer. If the two purchases were
independent, the expected overlap would be only about 0.055% of transactions (2% x 2.75%),
far lower than the observed 1.75%. The fact that about 87.5% of diaper purchases (3,500
of 4,000) also include beer indicates a link between diapers and beer.
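
These percentages, and the lift they imply, can be checked with a few lines of Python:

```python
total_transactions = 200_000
diaper_transactions = 4_000
beer_transactions = 5_500
both = 3_500

support = both / total_transactions                             # 0.0175 -> 1.75%
confidence = both / diaper_transactions                         # 0.875  -> 87.5%
lift = confidence / (beer_transactions / total_transactions)    # about 31.8

print(f"support={support:.4f}, confidence={confidence:.3f}, lift={lift:.1f}")
```

A lift far above 1 confirms that the two items appear together much more often than independence would predict.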

How does Association Rule Learning work?


Association rule learning works on the concept of the if-then statement, such as "if A, then
B."

Here the "if" element is called the antecedent, and the "then" element is called the consequent.
These types of relationships, where we can find some association or relation between
two items, are known as single cardinality. It is all about creating rules, and if the number of
items increases, then the cardinality also increases accordingly. So, to measure the
associations between thousands of data items, there are several metrics. These metrics
are given below:

o Support
o Confidence
o Lift

Let's understand each of them:

Support
Support is the frequency of A, or how frequently an itemset appears in the dataset. It is
defined as the fraction of the transactions T that contain the itemset X. It can be written as:

Support(X) = (Number of transactions containing X) / (Total number of transactions T)

Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the
items X and Y occur together in the dataset when the occurrence of X is already given. It
is the ratio of the transactions that contain both X and Y to the number of transactions that
contain X:

Confidence(X → Y) = (Transactions containing both X and Y) / (Transactions containing X)
                  = Support(X and Y) / Support(X)

Lift
Lift measures the strength of a rule and can be defined by the formula below:

Lift(X → Y) = Support(X and Y) / (Support(X) * Support(Y)) = Confidence(X → Y) / Support(Y)

It is the ratio of the observed support measure to the expected support if X and Y were
independent of each other. It has three possible ranges of values:

o If Lift = 1: The probability of occurrence of the antecedent and the consequent is independent of
each other.
o Lift > 1: It determines the degree to which the two itemsets are dependent on each other.
o Lift < 1: It tells us that one item is a substitute for the other, which means one item has
a negative effect on the other.
Apriori Algorithm
Apriori algorithm refers to the algorithm which is used to calculate the association rules
between objects. It means how two or more objects are related to one another. In other
words, we can say that the apriori algorithm is an association rule learning approach that
analyzes whether people who bought product A also bought product B.

The primary objective of the apriori algorithm is to create the association rule between
different objects. The association rule describes how two or more objects are related to
one another. Apriori algorithm is also called frequent pattern mining. Generally, you
operate the Apriori algorithm on a database that consists of a huge number of
transactions. Let's understand the apriori algorithm with the help of an example; suppose
you go to Big Bazar and buy different products. It helps the customers buy their products
with ease and increases the sales performance of the Big Bazar. In this tutorial, we will
discuss the apriori algorithm with examples.

Introduction
We take an example to understand the concept better. You must have noticed that the
Pizza shop seller makes a pizza, soft drink, and breadstick combo together. He also offers
a discount to customers who buy these combos. Have you ever wondered why he does
so? He thinks that customers who buy pizza also buy soft drinks and breadsticks. However,
by making combos, he makes it easy for the customers. At the same time, he also increases
his sales performance.

Similarly, you go to Big Bazar, and you will find biscuits, chips, and Chocolate bundled
together. It shows that the shopkeeper makes it comfortable for the customers to buy
these products in the same place.

The above two examples are the best examples of Association Rules in Data Mining. It
helps us to learn the concept of apriori algorithms.

What is Apriori Algorithm?


Apriori algorithm refers to an algorithm that is used in mining frequent product sets and
relevant association rules. Generally, the apriori algorithm operates on a database
containing a huge number of transactions, for example, the items customers buy at a Big
Bazar.
Apriori algorithm helps the customers to buy their products with ease and increases the
sales performance of the particular store.

Components of Apriori algorithm


The given three components comprise the apriori algorithm.

1. Support
2. Confidence
3. Lift

Let's take an example to understand this concept.

As we have already discussed above, you need a huge database containing a large number of
transactions. Suppose you have 4,000 customer transactions in a Big Bazar. You have to
calculate the Support, Confidence, and Lift for two products, say Biscuits and
Chocolate, because customers frequently buy these two items together.

Out of 4,000 transactions, 400 contain Biscuits, whereas 600 contain Chocolate, and 200
transactions include both Biscuits and Chocolate. Using this data, we
will find out the support, confidence, and lift.

Support
Support refers to the default popularity of any product. You find the support as a quotient
of the division of the number of transactions comprising that product by the total number
of transactions. Hence, we get

Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)

= 400/4000 = 10 percent.

Confidence
Confidence refers to the likelihood that customers who bought Biscuits also bought
Chocolate. So, to get the confidence, you need to divide the number of transactions that
comprise both Biscuits and Chocolate by the total number of transactions involving Biscuits.

Hence,
Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions
involving Biscuits)

= 200/400

= 50 percent.

It means that 50 percent of customers who bought biscuits bought chocolates also.

Lift
Consider the above example; lift refers to the increase in the likelihood of selling Chocolate
when you sell Biscuits. The mathematical equation of lift is given below.

Lift = Confidence (Biscuits → Chocolate) / Support (Chocolate)

Here Support (Chocolate) = 600 / 4,000 = 15 percent, so Lift = 50 / 15 ≈ 3.33.

It means that customers who buy Biscuits are about 3.33 times more likely to also buy
Chocolate than customers in general. If the lift value is below one, it
means that people are unlikely to buy both items together. The larger the value, the
better the combination.

How does the Apriori Algorithm work in Data


Mining?
We will understand this algorithm with the help of an example

Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}.
The database comprises six transactions where 1 represents the presence of the product
and 0 represents the absence of the product.

Transaction ID   Rice   Pulse   Oil   Milk   Apple

t1               1      1       1     0
t2               0      1       1     1
t3               0      0       0     1
t4               1      1       0     1
t5               1      1       1     0
t6               1      1       1     1

The Apriori Algorithm makes the following assumptions:

o All subsets of a frequent itemset must be frequent.
o All supersets of an infrequent itemset must be infrequent.
o Fix a threshold support level. In our case, we have fixed it at 50 percent.

Step 1

Make a frequency table of all the products that appear in all the transactions. Now, shortlist
the frequency table to include only those products that meet the threshold support level.
We get the following frequency table.

Product Frequency (Number of transactions)

Rice (R) 4

Pulse(P) 5

Oil(O) 4

Milk(M) 4

The above table indicates the products frequently bought by the customers.

Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given frequency
table.

Itemset Frequency (Number of transactions)

RP 4

RO 3

RM 2

PO 4

PM 3

OM 2

Step 3

Applying the same threshold support of 50 percent, consider the itemsets that meet it. In
our case, these are the pairs that appear in at least 3 transactions.

Thus, we get RP, RO, PO, and PM.

Step 4

Now, look for a set of three products that the customers buy together. We get the given
combination.

1. RP and RO give RPO
2. PO and PM give POM

Step 5

Calculate the frequency of the two itemsets, and you will get the given frequency table.
Itemset Frequency (Number of transactions)

RPO 3

POM 2

If you implement the threshold assumption, you can figure out that the customers' set of
three products is RPO.

We have considered an easy example to discuss the apriori algorithm in data mining. In
reality, you find thousands of such combinations.
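
On real datasets this calculation is usually done with a library rather than by hand. The sketch below uses the third-party `mlxtend` package (which must be installed separately) on the six transactions from the table above, keeping only the Rice, Pulse, Oil, and Milk columns that are fully given there:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions t1-t6 (Rice, Pulse, Oil, Milk from the table above)
transactions = pd.DataFrame({
    'Rice':  [1, 0, 0, 1, 1, 1],
    'Pulse': [1, 1, 0, 1, 1, 1],
    'Oil':   [1, 1, 0, 0, 1, 1],
    'Milk':  [0, 1, 1, 1, 0, 1],
}, dtype=bool)

# Frequent itemsets with support >= 50%
frequent = apriori(transactions, min_support=0.5, use_colnames=True)

# Association rules filtered by confidence
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(frequent)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```

Note that `mlxtend` expects the transactions as a one-hot (boolean) DataFrame, as shown here.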

How to improve the efficiency of the Apriori Algorithm?
There are various methods used to improve the efficiency of the Apriori algorithm:

Hash-based itemset counting

In hash-based itemset counting, you exclude any k-itemset whose corresponding hash bucket
count is below the threshold, since it cannot be frequent.

Transaction Reduction

In transaction reduction, a transaction that does not contain any frequent k-itemset is of no
value in subsequent scans and can be removed.

Apriori Algorithm in data mining


We have already discussed an example of the apriori algorithm related to the frequent
itemset generation. Apriori algorithm has many applications in data mining.

The primary requirements to find the association rules in data mining are given below.

Use Brute Force


Analyze all the rules and find the support and confidence levels for the individual rule.
Afterward, eliminate the values which are less than the threshold support and confidence
levels.

The two-step approach

The two-step approach is a better option for finding the association rules than the Brute
Force method.

Step 1

In this article, we have already discussed how to create the frequency table and calculate
itemsets having a greater support value than that of the threshold support.

Step 2

To create association rules, you need to use a binary partition of the frequent itemsets.
You need to choose the ones having the highest confidence levels.

In the above example, you can see that the RPO combination was the frequent itemset.
Now, we find out all the rules using RPO.

RP → O, RO → P, PO → R, O → RP, P → RO, R → PO

You can see that there are six different combinations. Therefore, if you have n elements,
there will be 2^n - 2 candidate association rules.

Advantages of Apriori Algorithm


o It is used to calculate large itemsets.
o Simple to understand and apply.

Disadvantages of Apriori Algorithms


o Apriori algorithm is an expensive method to find support since the calculation has to pass
through the whole database.
o Sometimes, you need a huge number of candidate rules, so it becomes computationally
more expensive.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the database.
The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree
represents an item of the itemset.

The root node represents null, while the lower nodes represent the itemsets. The associations of the
nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while
forming the tree.

Frequent Pattern Algorithm Steps


The frequent pattern growth method lets us find the frequent pattern without candidate
generation.

Let us see the steps followed to mine the frequent pattern using frequent pattern growth
algorithm:
#1) The first step is to scan the database to find the occurrences of the itemsets in the database.
This step is the same as the first step of Apriori. The count of 1-itemsets in the database is called
support count or frequency of 1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
#3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the itemset in it. The itemset with the max count is taken at the top, the
next itemset with lower count and so on. It means that the branch of the tree is constructed with
transaction itemsets in descending order of count.
#4) The next transaction in the database is examined. The itemsets are ordered in descending
order of count. If any itemset of this transaction is already present in another branch (for example
in the 1st transaction), then this transaction branch would share a common prefix to the root.
This means that the common itemset is linked to the new node of another itemset in this
transaction.

#5) Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked according to
transactions.
#6) The next step is to mine the created FP Tree. For this, the lowest node is examined first along
with the links of the lowest nodes. The lowest node represents the frequency pattern length 1.
From this, traverse the path in the FP Tree. This path or paths are called a conditional pattern
base.
Conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring with
the lowest node (suffix).

#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
Example Of FP-Growth Algorithm
Support threshold=50%, Confidence= 60%
Table 1
Transaction List of items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3

1. Count of each item


Table 2
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Sort the itemset in descending order.
Table 3
Item Count
I2 5
I1 4
I3 4
I4 4
3. Build FP Tree
1. Considering the root node null.
2. The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1},
{I3:1}, where I2 is linked as a child to root, I1 is linked to I2 and I3 is linked to
I1.
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to I2
and I4 is linked to I3. But this branch would share I2 node as common as it is
already used in T1.
4. Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as a
child to I3. The count is {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root
node, hence it will be incremented by 1. Similarly I1 will be incremented by 1 as
it is already linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7. T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3},
{I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4},
{I3:3}, {I4:1}.
4. Mining of FP-tree is summarized below:
1. The lowest node item I5 is not considered as it does not have a min support count,
hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}.
Therefore, considering I4 as the suffix, the prefix paths will be {I2, I1, I3:1} and {I2, I3:
1}. This forms the conditional pattern base.
3. The conditional pattern base is considered a transaction database, an FP-tree is
constructed. This will contain {I2:2, I3:2}, I1 is not considered as it does not meet
the min support count.
4. This path will generate all combinations of frequent patterns :
{I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix path would be: {I2,I1:3},{I2:1}. This will generate a 2-node FP-
tree: {I2:4, I1:3} and the frequent patterns generated are: {I2,I3:4}, {I1,I3:3},
{I2,I1,I3:3}.
6. For I1, the prefix path would be: {I2:4} this will generate a single node FP-tree:
{I2:4} and frequent patterns are generated: {I2, I1:4}.
Item Conditional Pattern Base Conditional FP-tree Frequent Patterns Generated
I4 {I2,I1,I3:1},{I2,I3:1} {I2:2, I3:2} {I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
I3 {I2,I1:3},{I2:1} {I2:4, I1:3} {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}
I1 {I2:4} {I2:4} {I2,I1:4}
Advantages Of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori
which scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages Of FP-Growth Algorithm
1. The FP tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive in terms of memory, since the whole tree has to be kept in memory.
3. When the database is large, the tree may not fit in the available memory.
FP Growth vs Apriori
Pattern Generation
FP Growth: generates patterns by constructing an FP tree.
Apriori: generates patterns by pairing the items into singletons, pairs and triplets.

Candidate Generation
FP Growth: there is no candidate generation.
Apriori: uses candidate generation.

Process
FP Growth: the process is faster; the runtime increases linearly with the number of itemsets.
Apriori: the process is comparatively slower; the runtime increases exponentially with the number of itemsets.

Memory Usage
FP Growth: a compact version of the database is saved.
Apriori: the candidate combinations are saved in memory.
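To tie the walk-throughs above to working code, the sketch below assumes the third-party mlxtend library is installed (it is not part of scikit-learn) and runs both FP-Growth and Apriori on the Table 1 transactions, so the mined frequent itemsets can be compared with the hand-computed ones:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

# Transactions from Table 1
transactions = [
    ["I1", "I2", "I3"],
    ["I2", "I3", "I4"],
    ["I4", "I5"],
    ["I1", "I2", "I4"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3", "I4"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Support threshold = 50% (min_sup = 3 out of 6 transactions)
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
print(apriori(onehot, min_support=0.5, use_colnames=True))
```

Both calls should return the same frequent itemsets; only the mining strategy differs.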

What Is a Regression?
Regression is a statistical method used in finance, investing, and other
disciplines that attempts to determine the strength and character of the
relationship between one dependent variable (usually denoted by Y) and a
series of other variables (known as independent variables).

Also called simple regression or ordinary least squares (OLS), linear regression is the most
common form of this technique. Linear regression establishes the linear relationship between two
variables based on a line of best fit. Linear regression is thus graphically depicted using a
straight line with the slope defining how the change in one variable impacts a change in the
other. The y-intercept of a linear regression relationship represents the value of one variable
when the value of the other is zero. Non-linear regression models also exist, but are far more
complex.

Regression analysis is a powerful tool for uncovering the associations between variables observed
in data, but it cannot easily indicate causation. It is used in several contexts in business,
finance, and economics. For instance, it is used to help investment managers value assets and
understand the relationships between factors such as commodity prices and the stocks of
businesses dealing in those commodities.

Regression as a statistical technique should not be confused with the concept of regression to
the mean (mean reversion).

KEY TAKEAWAYS

 A regression is a statistical technique that relates a dependent variable to one or more
independent (explanatory) variables.
 A regression model is able to show whether changes observed in the
dependent variable are associated with changes in one or more of the
explanatory variables.
 It does this by essentially fitting a best-fit line and seeing how the data
is dispersed around this line.
 Regression helps economists and financial analysts in things ranging
from asset valuation to making predictions.
 In order for regression results to be properly interpreted, several
assumptions about the data and the model itself must hold.
Understanding Regression
Regression captures the correlation between variables observed in a data set
and quantifies whether those correlations are statistically significant or not.

The two basic types of regression are simple linear regression and multiple
linear regression, although there are non-linear regression methods for more
complicated data and analysis. Simple linear regression uses one
independent variable to explain or predict the outcome of the dependent
variable Y, while multiple linear regression uses two or more independent
variables to predict the outcome (while holding all others constant).
Regression can help finance and investment professionals as well as
professionals in other businesses. Regression can also help predict sales for
a company based on weather, previous sales, GDP growth, or other types of
conditions. The capital asset pricing model (CAPM) is an often-used
regression model in finance for pricing assets and discovering the costs of
capital.

Regression Analysis in Machine learning


Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
More specifically, Regression analysis helps us to understand how the value of the
dependent variable is changing corresponding to an independent variable when other
independent variables are held fixed. It predicts continuous/real values such
as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A that runs various advertisements every year and
records the sales generated from them. The list below shows the advertisement spend of the
company in the last 5 years and the corresponding sales:
Now, the company wants to do the advertisement of $200 in the year 2019 and wants to
know the prediction about the sales for this year. So to solve such type of prediction
problems in machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on one or more predictor
variables. It is mainly used for prediction, forecasting, time series modeling, and determining
cause-and-effect relationships between variables.

In Regression, we plot a graph between the variables which best fits the given datapoints; using
this plot, the machine learning model can make predictions about the data. In simple words,
"Regression finds a line or curve that fits the datapoints on the target-predictor graph in
such a way that the vertical distance between the datapoints and the regression line is
minimized." The distance between the datapoints and the line tells whether the model has captured
a strong relationship or not.

Some examples of regression can be as:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:


o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable, also
called as a predictor.
o Outliers: Outlier is an observation which contains either very low value or very high value
in comparison to other observed values. An outlier may hamper the result, so it should be
avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, the
condition is called Multicollinearity. It should not be present in the dataset, because it creates
problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our algorithm
does not perform well even with training dataset, then such problem is
called underfitting.

Why do we use Regression Analysis?


As mentioned above, Regression analysis helps in the prediction of a continuous variable. There
are various scenarios in the real world where we need future predictions, such as weather
conditions, sales, marketing trends, etc. For such cases we need a technique that can make
predictions accurately; Regression analysis is that statistical method, and it is widely used in
machine learning and data science. Below are some other reasons for using Regression analysis:

o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and
shows the relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be explained
using the below image. Here we are predicting the salary of an employee on the
basis of the year of experience.

o Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.

Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve
the classification problems. In classification problems, we have dependent
variables in a binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes
or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear
regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (logistic function) to map predicted values to
probabilities. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value.
o x = input to the function
o e = base of the natural logarithm.

When we provide the input values (data) to the function, it gives the S-curve as follows:

o It uses the concept of threshold levels: values above the threshold level are
rounded up to 1, and values below the threshold level are rounded down to 0.
There are three types of logistic regression:

o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)
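A minimal NumPy sketch of the sigmoid mapping and thresholding described above (the input values are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function: maps any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])    # hypothetical inputs
probs = sigmoid(x)
labels = (probs >= 0.5).astype(int)          # threshold at 0.5

print("probabilities:", np.round(probs, 3))  # values between 0 and 1
print("class labels:  ", labels)             # 0 below the threshold, 1 above
```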

Linear Regression in Machine Learning


Linear regression is one of the easiest and most popular Machine Learning algorithms. It
is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product
price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and
one or more independent (x) variables, hence it is called linear regression. Since linear
regression models a linear relationship, it finds how the value of the dependent variable changes
according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:

y= a0+a1x+ ε

Here,

Y= Dependent Variable (Target Variable)


X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.

Types of Linear Regression


Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.

Linear Regression Line


A straight line showing the relationship between the dependent and independent variables
is called a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable increases on
X-axis, then such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases on
the X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line, which means
the error between the predicted values and the actual values should be minimized. The best fit
line will have the least error.

Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit line;
to calculate this we use a cost function.
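A minimal scikit-learn sketch of fitting a simple linear regression is shown below; the experience/salary numbers are made up to mirror the salary-vs-experience example mentioned earlier, and LinearRegression fits the intercept and coefficient by ordinary least squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: years of experience vs salary
X = np.array([[1], [2], [3], [4], [5]])             # independent variable (years)
y = np.array([30000, 35000, 41000, 46000, 52000])   # dependent variable (salary)

# Fit the model; this finds the intercept a0 and coefficient a1 that
# minimize the squared error between predictions and actual values
model = LinearRegression()
model.fit(X, y)

print("a0 (intercept):", model.intercept_)
print("a1 (coefficient):", model.coef_[0])
print("Predicted salary for 6 years:", model.predict([[6]])[0])
```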

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except for how it is used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The below
image is showing the logistic function:

Note: Logistic regression uses the concept of predictive modeling like regression; therefore, it
is called logistic regression. However, it is used to classify samples, so it falls under
the classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the boundary
between the classes 0 and 1. Values above the threshold tend toward 1, and values below the
threshold tend toward 0.

Assumptions for Logistic Regression:


o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above
equation by (1-y):

y / (1-y);  0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation,
it becomes:

log[y / (1-y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

import pandas as pd

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

# Load the Titanic dataset

titanic_data = pd.read_csv('titanic.csv')

# Preprocess the data

# Drop irrelevant columns and handle missing values

titanic_data = titanic_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)

titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].mean())

titanic_data['Fare'] = titanic_data['Fare'].fillna(titanic_data['Fare'].mean())

titanic_data['Sex'] = titanic_data['Sex'].map({'female': 0, 'male': 1})

# Split the data into input features (X) and target variable (y)

X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

# Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a LogisticRegression model

model = LogisticRegression()

# Fit the model to the training data

model.fit(X_train, y_train)

# Make predictions on the test data

y_pred = model.predict(X_test)

# Evaluate the model's accuracy

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

import pandas as pd

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset

data = pd.read_csv('dataset.csv')

# Split the data into input features (X) and target variable (y)
X = data.drop('target_variable', axis=1)

y = data['target_variable']

# Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a LogisticRegression model

model = LogisticRegression()

# Fit the model to the training data

model.fit(X_train, y_train)

# Make predictions on the test data

y_pred = model.predict(X_test)

# Evaluate the model's accuracy

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

# Print the confusion matrix

confusion_mat = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")

print(confusion_mat)

Classification :

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability
of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.

Why is it called Naïve Bayes?


The Naïve Bayes algorithm comprises the two words Naïve and Bayes, which can be
described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple, without
depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)
Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the probability
of a hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.


P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:


Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and corresponding target variable


"Play". So using this dataset we need to decide that whether we should play or not on a
particular day according to the weather conditions. So to solve this problem, we need to
follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes
8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Likelihood table for the weather conditions:

Weather     No     Yes     P(Weather)

Overcast    0      5       5/14 = 0.35

Rainy       2      2       4/14 = 0.29

Sunny       2      3       5/14 = 0.35

All         P(No) = 4/14 = 0.29     P(Yes) = 10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|NO)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)

Hence on a Sunny day, Player can play the game.
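The same calculation can be reproduced programmatically. The sketch below is a pandas-based illustration (not a scikit-learn model) that rebuilds the frequency table from the 14-row Outlook/Play dataset above and applies Bayes' theorem for the "Sunny" case:

```python
import pandas as pd

# Outlook/Play dataset from the example above
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]
df = pd.DataFrame({"Outlook": outlook, "Play": play})

# Frequency table (counts of each Outlook value per Play class)
freq = pd.crosstab(df["Outlook"], df["Play"])
print(freq)

# Bayes' theorem: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_yes = (df["Play"] == "Yes").mean()                              # 10/14
p_sunny = (df["Outlook"] == "Sunny").mean()                       # 5/14
p_sunny_given_yes = freq.loc["Sunny", "Yes"] / freq["Yes"].sum()  # 3/10

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print("P(Yes|Sunny) =", round(p_yes_given_sunny, 2))              # approx 0.60
```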

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.


o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further split
the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


 Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.

 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.

 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.

 Pruning: Pruning is the process of removing the unwanted branches from the tree.

 Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further. It continues the process until it reaches the leaf node of the tree.
The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
step 3. Continue this process until a stage is reached where the nodes cannot be classified
further; the final nodes are then called leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or Not. So, to solve this problem, the decision tree starts with
the root node (Salary attribute by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node based on the corresponding labels. The
next decision node further gets split into one decision node (Cab facility) and one leaf
node. Finally, the decision node splits into two leaf nodes (Accepted offers and Declined
offer). Consider the below diagram:
Attribute Selection Measures
While implementing a Decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. With this measurement,
we can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a


dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:

Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no
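To make the formula concrete, here is a small self-contained Python sketch (the class labels and the candidate split are made up for illustration) that computes the entropy of a set of labels and the information gain of a split:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p)) over each class."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * math.log2(p)
    return result

def information_gain(parent, subsets):
    """Entropy(parent) minus the weighted average entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical node with 9 "yes" and 5 "no" samples, split into two children
parent = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4]

print("Entropy(S):", round(entropy(parent), 3))
print("Information Gain:", round(information_gain(parent, children), 3))
```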

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j (Pj)^2
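As a brief illustration of the two measures in practice, the sketch below (assuming scikit-learn is installed and using its bundled iris dataset) trains one tree with the Gini index and one with entropy, i.e. information gain, as the attribute selection measure:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# criterion="gini" uses the Gini index, criterion="entropy" uses information gain
for criterion in ["gini", "entropy"]:
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    tree.fit(X_train, y_train)
    print(criterion, "accuracy:", tree.score(X_test, y_test))
```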

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning. There are mainly two types
of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Introduction
Let’s start with a practical example of using the Naive Bayes Algorithm.

Assume this is a situation you’ve got into in your data science project:

You are working on a classification problem and have generated your set
of hypotheses, created features, and discussed the importance of variables.
Within an hour, stakeholders want to see the first cut of the model.

What will you do? You have hundreds of thousands of data points and several
variables in your training data set. In such a situation, if I were in your place, I
would have used ‘Naive Bayes,‘ which can be extremely fast relative to
other classification algorithms. It works on Bayes’ theorem of probability to
predict the class of unknown data sets.

In this article, I’ll explain the basics of Naive Bayes (NB) in machine learning
(ML), so that the next time you come across large data sets, you can bring this
classification algorithm in ML (which is a part of AI) to action. In addition, if you
are a newbie in Python or R, you should not be overwhelmed by the presence
of available codes in this article.

Learning Objectives


 Understand the definition and working of the Naive Bayes algorithm.


 Get to know the various applications, pros, and cons of the classifier.
 Learn how to implement the NB Classifier or bayesian classification in R
and Python with a sample project.


Table of contents

 Introduction
 What Is the Naive Bayes Algorithm?
 Sample Project to Apply Naive Bayes
 How Do Naive Bayes Algorithms Work?
 What Are the Pros and Cons of Naive Bayes?
 Applications of Naive Bayes Algorithms
 How to Build a Basic Model Using Naive Bayes in Python & R?
 Tips to Improve the Power of the NB Model
 Conclusion
 Frequently Asked Questions

What Is the Naive Bayes Algorithm?


It is a classification technique based on Bayes’ Theorem with an independence
assumption among predictors. In simple terms, a Naive Bayes classifier
assumes that the presence of a particular feature in a class is unrelated to the
presence of any other feature.

The Naïve Bayes classifier is a popular supervised machine learning algorithm


used for classification tasks such as text classification. It belongs to the family
of generative learning algorithms, which means that it models the distribution of
inputs for a given class or category. This approach is based on the assumption
that the features of the input data are conditionally independent given the class,
allowing the algorithm to make predictions quickly and accurately.

In statistics, naive Bayes classifiers are considered as simple probabilistic


classifiers that apply Bayes’ theorem. This theorem is based on the probability
of a hypothesis, given the data and some prior knowledge. The naive Bayes
classifier assumes that all features in the input data are independent of each
other, which is often not true in real-world scenarios. However, despite this
simplifying assumption, the naive Bayes classifier is widely used because of its
efficiency and good performance in many real-world applications.

Moreover, it is worth noting that naive Bayes classifiers are among the simplest
Bayesian network models, yet they can achieve high accuracy levels when
coupled with kernel density estimation. This technique involves using a kernel
function to estimate the probability density function of the input data, allowing
the classifier to improve its performance in complex scenarios where the data
distribution is not well-defined. As a result, the naive Bayes classifier is a
powerful tool in machine learning, particularly in text classification, spam
filtering, and sentiment analysis, among others.

For example, a fruit may be considered to be an apple if it is red, round, and


about 3 inches in diameter. Even if these features depend on each other or
upon the existence of the other features, all of these properties independently
contribute to the probability that this fruit is an apple and that is why it is known
as ‘Naive’.

An NB model is easy to build and particularly useful for very large data sets.
Along with simplicity, Naive Bayes is known to outperform even highly
sophisticated classification methods.

Bayes theorem provides a way of computing posterior probability P(c|x) from


P(c), P(x) and P(x|c). Look at the equation below:

Above,

 P(c|x) is the posterior probability of class (c, target)


given predictor (x, attributes).
 P(c) is the prior probability of class.
 P(x|c) is the likelihood which is the probability of the predictor given class.
 P(x) is the prior probability of the predictor.

Sample Project to Apply Naive Bayes

Problem Statement

HR analytics is revolutionizing the way human resources departments operate,


leading to higher efficiency and better results overall. Human resources have
been using analytics for years.

However, the collection, processing, and analysis of data have been largely
manual, and given the nature of human resources dynamics and HR KPIs, the
approach has been constraining HR. Therefore, it is surprising that HR
departments woke up to the utility of machine learning so late in the game. Here
is an opportunity to try predictive analytics in identifying the employees most
likely to get promoted.


How Do Naive Bayes Algorithms Work?


Let’s understand it using an example. Below I have a training data set of


weather and corresponding target variable ‘Play’ (suggesting possibilities of
playing). Now, we need to classify whether players will play or not based on
weather condition. Let’s follow the below steps to perform it.
1. Convert the data set into a frequency table.

2. Create a Likelihood table by finding the probabilities, for example Overcast
probability = 0.29 and probability of playing = 0.64.

3. Use the Naive Bayes equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the above-discussed method of posterior probability.

P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here P( Sunny | Yes) * P(Yes) is in the numerator, and P (Sunny) is in the


denominator.
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)=
9/14 = 0.64

Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.

Naive Bayes uses a similar method to predict the probability of different
classes based on various attributes. This algorithm is mostly used in text
classification (NLP) and with problems having multiple classes.

What Are the Pros and Cons of Naive Bayes?


Pros:

 It is easy and fast to predict the class of a test data set. It also performs well in
multi-class prediction.
 When the assumption of independence holds, the classifier performs better
compared to other machine learning models like logistic regression or
decision trees, and requires less training data.
 It performs well with categorical input variables compared to
numerical variables. For numerical variables, a normal distribution is
assumed (bell curve, which is a strong assumption).

Cons:

 If a categorical variable has a category in the test data set that was not
observed in the training data set, the model will assign it a 0 (zero) probability
and will be unable to make a prediction. This is often known as "Zero
Frequency". To solve this, we can use a smoothing technique. One of
the simplest smoothing techniques is called Laplace estimation.
 On the other side, Naive Bayes is also known as a bad estimator, so the
probability outputs from predict_proba are not to be taken too seriously.
 Another limitation of this algorithm is the assumption of independent
predictors. In real life, it is almost impossible that we get a set of predictors
which are completely independent.

Applications of Naive Bayes Algorithms

 Real-time Prediction: Naive Bayesian classifier is an eager learning


classifier and it is super fast. Thus, it could be used for making predictions
in real time.
 Multi-class Prediction: This algorithm is also well known for multi class
prediction feature. Here we can predict the probability of multiple classes
of target variable.
 Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayesian
classifiers mostly used in text classification (due to better result in multi
class problems and independence rule) have higher success rate as
compared to other algorithms. As a result, it is widely used in Spam
filtering (identify spam e-mail) and Sentiment Analysis (in social media
analysis, to identify positive and negative customer sentiments)
 Recommendation System: Naive Bayes Classifier and Collaborative
Filtering together builds a Recommendation System that uses machine
learning and data mining techniques to filter unseen information and
predict whether a user would like a given resource or not.
How to Build a Basic Model Using Naive Bayes in Python & R?
Again, scikit learn (python library) will help here to build a Naive Bayes model
in Python. There are five types of NB models under the scikit-learn library:

 Gaussian Naive Bayes: gaussiannb is used in classification tasks and


it assumes that feature values follow a gaussian distribution.
 Multinomial Naive Bayes: It is used for discrete counts. For example, in a text
classification problem, instead of recording whether a word occurs in the document,
we count how often the word occurs in the document; you can think of it as "the
number of times outcome x_i is observed over the n trials".
 Bernoulli Naive Bayes: The binomial model is useful if your feature vectors
are boolean (i.e. zeros and ones). One application would be text
classification with ‘bag of words’ model where the 1s & 0s are “word
occurs in the document” and “word does not occur in the document”
respectively.
 Complement Naive Bayes: It is an adaptation of Multinomial NB where the
complement of each class is used to calculate the model weights. So, this
is suitable for imbalanced data sets and often outperforms the MNB on
text classification tasks.
 Categorical Naive Bayes: Categorical Naive Bayes is useful if the features
are categorically distributed. We have to encode the categorical variable
in the numeric format using the ordinal encoder for using this algorithm.

Python Code:
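A minimal Python sketch of the same idea, assuming scikit-learn is installed, is shown below; it trains a Gaussian Naive Bayes model on scikit-learn's bundled iris dataset rather than on the original article's dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the data and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Gaussian Naive Bayes model and evaluate it
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```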
R Code:
require(e1071) # Holds the Naive Bayes Classifier
Train <- read.csv(file.choose())
Test <- read.csv(file.choose())

# Make sure the target variable is a two-class classification problem only
levels(Train$Item_Fat_Content)

# Train the Naive Bayes model and predict on the test set
model <- naiveBayes(Item_Fat_Content~., data = Train)
class(model)
pred <- predict(model, Test)
table(pred)

Above, we looked at the basic NB Model. You can improve the power of this
basic model by tuning parameters and handling assumptions intelligently. Let’s
look at the methods to improve the performance of this model. I recommend
you go through this document for more details on Text classification using
Naive Bayes.

Tips to Improve the Power of the NB Model


Here are some tips for improving the power of a Naive Bayes model:
 If continuous features do not have a normal distribution, we should use a
transformation or other methods to convert them to a normal distribution.
 If the test data set has the zero frequency issue, apply a smoothing technique
such as Laplace correction to predict the class of the test data set.
 Remove correlated features, as highly correlated features are effectively counted
twice in the model, which can over-inflate their importance.
 Naive Bayes classifiers have limited options for parameter tuning, such as
alpha=1 for smoothing and fit_prior=[True|False] to learn class prior
probabilities or not, among a few others (a small illustration follows this list). I would
recommend focusing on your pre-processing of data and feature selection.
 You might think of applying a classifier combination technique
like ensembling, bagging, or boosting, but these methods would not
help, since their purpose is to reduce variance and Naive Bayes has no
variance to minimize.
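As a small illustration of the tuning options mentioned above, the sketch below uses a toy word-count matrix (made up for illustration) with scikit-learn's MultinomialNB and its alpha (Laplace/additive smoothing) and fit_prior parameters:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count features for 4 "documents" and their class labels (1 = spam, 0 = not spam)
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 2, 3],
              [0, 1, 4]])
y = np.array([1, 1, 0, 0])

# alpha controls additive (Laplace) smoothing; fit_prior toggles learning class priors
model = MultinomialNB(alpha=1.0, fit_prior=True)
model.fit(X, y)

print(model.predict([[1, 1, 0]]))        # predicted class for a new document
print(model.predict_proba([[1, 1, 0]]))  # probability estimates (use with caution)
```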

Conclusion

In this article, we looked at one of the supervised machine learning algorithms,


“Naive Bayes”, mainly used for classification. Congrats: if you’ve thoroughly read and
understood this article, you’ve already taken your first step toward mastering
this algorithm. From here, all you need is practice.

Further, I would suggest you focus more on data pre-processing and feature
selection before applying the algorithm. In a future post, I will discuss text
and document classification using Naive Bayes in more detail.
Did you find this article helpful? Please share your opinions/thoughts in the
comments section below.

Key Takeaways

 The Naive Bayes algorithm is one of the most popular and


simple machine learning classification algorithms.
 It is based on the Bayes’ Theorem for calculating probabilities and
conditional probabilities.
 You can use it for real-time and multi-class predictions, text
classifications, spam filtering, sentiment analysis, and a lot more.

Frequently Asked Questions


Q1. What is naive in Naive Bayes classifier?

A. Naive Bayes classifier assumes features are independent of each other.


Since that is rarely possible in real-life data, the classifier is called naive.

Q2. What are the 3 different Naive Bayes classifiers?

A. Out of the 5 different Naive Bayes classifiers under sklearn.naive_bayes, the


3 most widely used ones are Gaussian, Multinomial, and Bernoulli.

Q3. What is the difference between Naive Bayes and Maximum Entropy?

A. In Naive Bayes classifier each feature is used independently, while in


Maximum Entropy classifier, the model uses search-based optimization to find
weights for features.
Reference: http://sentiment.christopherpotts.net/classifiers.html

Scikit-learn is a popular open-source machine learning library in Python. It provides a wide


range of tools and algorithms for various machine learning tasks such as classification,
regression, clustering, dimensionality reduction, and model selection. Scikit-learn is built on top
of other scientific computing libraries in Python, such as NumPy, SciPy, and matplotlib, and it is
designed to be easy to use, efficient, and accessible to both beginners and experienced machine
learning practitioners.
Key Features of Scikit-learn:

1. Consistent API: Scikit-learn provides a consistent API across different algorithms, making it
easy to switch between different models and experiment with different techniques. The API
follows a similar structure with fit-predict paradigm for training and making predictions.

2. Wide Range of Algorithms: Scikit-learn offers a comprehensive selection of machine learning


algorithms, including popular ones such as linear regression, logistic regression, decision trees,
random forests, support vector machines (SVM), k-nearest neighbors (KNN), and many more.
These algorithms are implemented efficiently and optimized for performance.

3. Preprocessing and Feature Extraction: Scikit-learn provides a variety of tools for data
preprocessing, feature scaling, and feature extraction. It includes functions for handling missing
values, encoding categorical variables, scaling numerical features, and transforming data into a
suitable format for machine learning algorithms.

4. Model Evaluation and Validation: Scikit-learn offers utilities for model evaluation and
validation, including functions for cross-validation, hyperparameter tuning, and model selection.
It provides metrics for evaluating model performance such as accuracy, precision, recall, F1-
score, and many others.

5. Integration with Other Libraries: Scikit-learn integrates well with other libraries in the Python
ecosystem. It can be easily combined with libraries like pandas for data manipulation, matplotlib
and seaborn for data visualization, and NumPy and SciPy for advanced numerical computations.

6. Community and Documentation: Scikit-learn has a vibrant community with active


development and ongoing support. It offers comprehensive documentation with examples,
tutorials, and API references, making it easier for users to get started and learn how to use the
library effectively.

Overall, Scikit-learn is a powerful and versatile library for machine learning in Python. Whether
you're a beginner or an experienced practitioner, it provides the necessary tools and algorithms to
solve a wide range of machine learning problems efficiently.
To install scikit-learn (also known as sklearn) in Python, you can use pip, which is a package
manager for Python. Here are the steps to install scikit-learn:

1. Open a terminal or command prompt.

2. Make sure you have pip installed. You can check by running the following command:
```
pip --version
```

3. If pip is not installed, you can install it by following the official pip installation instructions for
your operating system.

4. Once pip is installed, you can install scikit-learn by running the following command:
```
pip install scikit-learn
```

This command will download and install the latest stable version of scikit-learn along with its
dependencies.

5. After the installation is complete, you can verify if scikit-learn is installed properly by
importing it in a Python script or the Python interpreter:
```python
import sklearn
print(sklearn.__version__)
```

This will print the version of scikit-learn installed on your system.

That's it! You have successfully installed scikit-learn in your Python environment. Now you can
start using scikit-learn for various machine learning tasks in your Python projects.

scikit-learn provides several built-in datasets that you can easily load and use for machine
learning tasks. These datasets are useful for practicing, prototyping, and testing machine learning
algorithms. Here are some of the commonly used datasets available in scikit-learn:

1. `load_iris`: This dataset contains measurements of sepal length, sepal width, petal length, and
petal width of 150 iris flowers, along with their corresponding species labels (setosa, versicolor,
and virginica). It is a popular dataset for classification tasks.

2. `load_digits`: This dataset consists of 8x8 images of digits (0-9). It is commonly used for digit
recognition tasks and has a total of 1,797 samples.

3. `load_boston`: This dataset contains housing prices in Boston, Massachusetts. It includes
various features such as crime rate, average number of rooms, and accessibility to highways. It
was often used for regression tasks, but note that it has been deprecated and removed from recent
versions of scikit-learn (1.2 and later); `fetch_california_housing` is the usual replacement.

4. `load_diabetes`: This dataset contains measurements related to diabetes, such as blood


pressure, body mass index (BMI), and age, along with the target variable of disease progression.
It is commonly used for regression tasks.
5. `fetch_california_housing`: This dataset contains housing-related information for various
regions in California, including features such as average income, average house age, and average
number of rooms. It is often used for regression tasks.

To load these datasets, you can use the corresponding functions provided by scikit-learn, such as
`load_iris()`, `load_digits()`, `load_boston()`, `load_diabetes()`, and `fetch_california_housing()`.
Here's an example of loading the iris dataset:

```python
from sklearn.datasets import load_iris

# Load the iris dataset


iris = load_iris()

# Get the features (X) and target variable (y)


X = iris.data
y = iris.target

# Print the shape of the data


print("Shape of features:", X.shape)
print("Shape of target variable:", y.shape)
```

In this example, we use the `load_iris()` function to load the iris dataset. The `iris` object
contains the data and target variables. We assign the features to `X` and the target variable to `y`.
Finally, we print the shape of the data to verify its dimensions.

You can explore other datasets and their respective attributes by referring to the scikit-learn
documentation. Additionally, scikit-learn provides methods to split the datasets into training and
test sets, making it easier to evaluate and train machine learning models using these datasets.
Matplotlib is a popular plotting library in Python that is often used in conjunction with scikit-
learn for data visualization and analysis. It provides a wide range of functions and tools for
creating various types of plots, such as line plots, scatter plots, bar plots, histograms, and more.
Matplotlib can be integrated seamlessly with scikit-learn to visualize data, evaluate model
performance, and analyze results. Here are some common use cases of using Matplotlib with
scikit-learn:

1. Data Visualization: Matplotlib can be used to visualize the data before and after applying
machine learning algorithms. You can plot scatter plots to visualize the relationship between two
features, histograms to examine the distribution of a feature, or line plots to track the
performance of a model over iterations.

2. Model Evaluation: Matplotlib can help visualize the results of model evaluation. For example,
you can plot a confusion matrix to understand the classification performance of a model, plot
precision-recall curves or ROC curves to analyze the trade-off between precision and recall or
true positive rate and false positive rate, respectively.

3. Feature Importance: If you are using models that provide feature importance scores, such as
decision trees or random forests, you can use Matplotlib to plot the importance of each feature.
This can help you understand which features are most relevant in predicting the target variable.

4. Model Comparison: If you are comparing the performance of multiple models, you can use
Matplotlib to create bar plots or box plots to visualize and compare their evaluation metrics, such
as accuracy, F1-score, or mean squared error.

Here's a simple example of using Matplotlib with scikit-learn to visualize a scatter plot:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Get the features (X) and target variable (y)
X = iris.data
y = iris.target

# Create a scatter plot
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Iris Dataset: Sepal Length vs Sepal Width')
plt.show()
```

In this example, we load the iris dataset using `load_iris()` from scikit-learn. We assign the
features to `X` and the target variable to `y`. Then, we use `plt.scatter()` from Matplotlib to
create a scatter plot, where the x-axis represents the sepal length and the y-axis represents the
sepal width. The color of each point is determined by the target variable (class label) using the
`c` parameter. We add labels and a title to the plot using `plt.xlabel()`, `plt.ylabel()`, and
`plt.title()`. Finally, we display the plot using `plt.show()`.
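
To illustrate use case 2, the sketch below trains a simple classifier on the iris data and plots its confusion matrix. The choice of Logistic Regression and the 80/20 split are just assumptions for the example, and `ConfusionMatrixDisplay.from_estimator` requires scikit-learn 1.0 or later:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# Train a simple classifier on the iris data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Plot the confusion matrix for the test-set predictions
ConfusionMatrixDisplay.from_estimator(
    model, X_test, y_test, display_labels=iris.target_names
)
plt.title('Confusion Matrix on the Iris Test Set')
plt.show()
```
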
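
For use case 3, tree-based models expose their scores through the `feature_importances_` attribute, which can be drawn as a bar plot; the random forest settings below are again only illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and plot its feature importance scores
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)

plt.bar(iris.feature_names, forest.feature_importances_)
plt.ylabel('Importance')
plt.title('Random Forest Feature Importances (Iris)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```
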
Scikit-learn provides a wide range of algorithms and tools for regression and classification tasks.
Here's an overview of how to perform regression and classification using scikit-learn:

Regression with Scikit-learn:

1. Prepare the data: Load your dataset and split it into features (X) and the target variable (y).

2. Split the data: Split the dataset into training and test sets using `train_test_split()` from scikit-
learn's `model_selection` module. This allows you to train your model on a portion of the data
and evaluate its performance on unseen data.

3. Choose a regression algorithm: Select a regression algorithm from scikit-learn's regression models, such as Linear Regression, Decision Tree Regression, Random Forest Regression, or Support Vector Regression.

4. Create and train the model: Create an instance of the chosen regression algorithm, then use the
`fit()` method to train the model on the training data.

5. Make predictions: Use the trained model to make predictions on the test set using the
`predict()` method.

6. Evaluate the model: Assess the performance of the regression model by comparing the
predicted values with the actual target values. Calculate evaluation metrics such as mean squared
error (MSE), mean absolute error (MAE), or R-squared score using functions from scikit-learn's
`metrics` module.

Classification with Scikit-learn:

1. Prepare the data: Load your dataset and split it into features (X) and the target variable (y).

2. Split the data: Split the dataset into training and test sets using `train_test_split()` from scikit-
learn's `model_selection` module.

3. Choose a classification algorithm: Select a classification algorithm from scikit-learn's classification models, such as Logistic Regression, Decision Tree Classifier, Random Forest Classifier, or Support Vector Machines (SVM).

4. Create and train the model: Create an instance of the chosen classification algorithm, then use
the `fit()` method to train the model on the training data.

5. Make predictions: Use the trained model to make predictions on the test set using the
`predict()` method.

6. Evaluate the model: Evaluate the performance of the classification model by comparing the
predicted class labels with the true labels. Calculate evaluation metrics such as accuracy,
precision, recall, or F1-score using functions from scikit-learn's `metrics` module.

Here's an example code snippet for both regression and classification using scikit-learn:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
# load_boston was removed in scikit-learn 1.2, so the California Housing dataset is used here
from sklearn.datasets import fetch_california_housing, load_iris

# Regression example
housing = fetch_california_housing()
X_reg = housing.data
y_reg = housing.target

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)

y_pred_reg = reg_model.predict(X_test_reg)
mse = mean_squared_error(y_test_reg, y_pred_reg)
print("Mean Squared Error:", mse)

# Classification example
iris = load_iris()
X_cls = iris.data
y_cls = iris.target

X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42
)

cls_model = LogisticRegression(max_iter=1000)  # max_iter raised to avoid convergence warnings
cls_model.fit(X_train_cls, y_train_cls)

y_pred_cls = cls_model.predict(X_test_cls)
accuracy = accuracy_score(y_test_cls, y_pred_cls)
print("Accuracy:", accuracy)
```

In this example, we load the California Housing dataset for regression (the Boston Housing dataset originally referenced here has been removed from recent scikit-learn releases) and the Iris dataset for classification. We split the datasets into training and test sets using `train_test_split()`. Then, we create instances of the Linear Regression model and Logistic Regression model, fit them to the training data, and make predictions on the test data. Finally, we evaluate the performance of the models using the mean squared error for regression and accuracy for classification.
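
As mentioned in step 6 of both workflows, the `metrics` module offers more than a single score. Continuing directly from the snippet above (it reuses `y_test_reg`, `y_pred_reg`, `y_test_cls`, `y_pred_cls`, and `iris` from that example), a sketch of a few additional metrics might look like this:

```python
from sklearn.metrics import mean_absolute_error, r2_score, classification_report

# Additional regression metrics on the same test split
print("Mean Absolute Error:", mean_absolute_error(y_test_reg, y_pred_reg))
print("R-squared:", r2_score(y_test_reg, y_pred_reg))

# Per-class precision, recall, and F1-score for the classifier
print(classification_report(y_test_cls, y_pred_cls, target_names=iris.target_names))
```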

Here's an example of how to apply data preprocessing methods to the Iris dataset from
scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()

# Separate features (X) and target variable (y)
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preprocessing

# 1. Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Label encoding (if target variable is categorical)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Print the preprocessed data
print("Scaled training data:")
print(X_train_scaled)
print("Scaled test data:")
print(X_test_scaled)
print("Encoded training labels:")
print(y_train_encoded)
print("Encoded test labels:")
print(y_test_encoded)
```

In this example, we first load the Iris dataset using `load_iris()` from scikit-learn. We separate the
features (X) and the target variable (y). Then, we split the dataset into training and test sets using
`train_test_split()`.

Next, we apply two common data preprocessing methods:

1. Standardization: We use `StandardScaler` from scikit-learn's `preprocessing` module to standardize the features. The `fit_transform()` method is used to compute the mean and standard deviation of the training data and scale it accordingly. The `transform()` method is used to scale the test data using the same mean and standard deviation obtained from the training data.

2. Label encoding: In case the target variable is categorical, we can use `LabelEncoder` from
scikit-learn's `preprocessing` module to encode the categorical labels into numerical values. The
`fit_transform()` method is used to fit the label encoder on the training labels and transform them
into encoded values. The `transform()` method is then used to encode the test labels using the
same encoder.

Finally, we print the preprocessed data to see the results.

Note that different datasets may require different preprocessing steps, and you can explore other
preprocessing techniques provided by scikit-learn, such as one-hot encoding, handling missing
values, or feature scaling methods like min-max scaling. The choice of preprocessing methods
depends on the specific requirements of your machine learning task.
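
For instance, here is a brief sketch of three of those alternatives, missing-value imputation, min-max scaling, and one-hot encoding, applied to small made-up arrays (the values are arbitrary and only serve to show the API):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# A tiny made-up feature matrix with a missing value
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 400.0]])

# 1. Fill missing values with the column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# 2. Rescale each feature to the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_imputed)
print(X_scaled)

# 3. One-hot encode a small categorical column
colors = np.array([['red'], ['green'], ['red']])
encoder = OneHotEncoder()
print(encoder.fit_transform(colors).toarray())
```

Which of these steps you apply, and in what order, depends on the characteristics of your dataset and the model you plan to train.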
