Predictive Data Analytics With Python
Introduction
Predictive data analytics is a field that focuses on extracting insights from historical data in order to make
predictions or forecasts about future events or trends. It involves using statistical and machine learning
techniques to analyze large datasets and uncover patterns, correlations, and trends that can be used to
make informed predictions.
Python, as a powerful programming language, provides a wide range of libraries and tools for predictive
data analytics. Some of the popular libraries used in Python for predictive analytics include NumPy,
pandas, scikit-learn, TensorFlow, and Keras. These libraries provide various functionalities for data
manipulation, preprocessing, modeling, and evaluation.
A typical predictive analytics workflow involves the following steps:
1. Data collection: Gathering relevant data from various sources, such as databases, APIs, or CSV files.
2. Data preprocessing: Cleaning and transforming the collected data to ensure it is suitable for analysis.
This may involve tasks such as handling missing values, removing outliers, and normalizing or scaling the
data.
3. Exploratory data analysis (EDA): Analyzing the data to gain insights and understand its characteristics.
This may involve visualizations, summary statistics, and correlation analysis.
4. Feature selection: Identifying the most relevant features or variables that are likely to have an impact
on the prediction task. This step helps in reducing complexity and improving model performance.
5. Model selection and training: Choosing an appropriate predictive model based on the problem at
hand and training it using the historical data. This could involve using techniques such as regression,
classification, or time series forecasting.
6. Model evaluation: Assessing the performance of the trained model using evaluation metrics and
validation techniques. This helps in determining how well the model is likely to perform on unseen data.
7. Prediction and deployment: Using the trained model to make predictions on new or future data. The
predictions can be used for decision-making, optimization, or forecasting purposes. The model can be
deployed as an application, API, or integrated into existing systems.
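As a minimal illustration of how these steps fit together in Python, the sketch below trains and evaluates a simple classifier with scikit-learn. The dataset (scikit-learn's built-in breast cancer data) and the model choice are assumptions made purely for demonstration; any labelled dataset and suitable estimator could be substituted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: collect the data and preprocess it (here: scaling numeric features)
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: choose and train a model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 6: evaluate on unseen data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 7: predict for a new observation (here, the first test sample)
print("Predicted class:", model.predict(X_test[:1]))
```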
Python's versatility and extensive ecosystem of libraries make it an ideal choice for predictive data
analytics. Its ease of use, vast community support, and extensive documentation make it accessible to
both beginners and experienced data analysts. With Python, you can leverage the power of machine
learning and statistical techniques to gain valuable insights and make accurate predictions based on your
data.
There are several essential Python libraries for predictive data analytics. Here are some of the most
commonly used libraries:
1. NumPy: NumPy is a fundamental library for scientific computing in Python. It provides powerful tools
for handling multi-dimensional arrays and performing mathematical operations. NumPy is essential for
data manipulation and preprocessing tasks.
2. pandas: pandas is a versatile library that provides data structures and tools for data analysis. It offers
powerful data manipulation and data wrangling capabilities, such as data cleaning, merging, reshaping,
and filtering. pandas is widely used for data preprocessing and exploratory data analysis.
3. scikit-learn: scikit-learn is a popular machine learning library in Python. It offers a wide range of
algorithms and tools for various tasks, including regression, classification, clustering, and dimensionality
reduction. scikit-learn provides an intuitive and consistent API for training and evaluating predictive
models.
4. TensorFlow: TensorFlow is an open-source library developed by Google for machine learning and deep
learning. It provides a flexible and efficient framework for building and deploying machine learning
models, especially neural networks. TensorFlow is commonly used for tasks such as image recognition,
natural language processing, and time series forecasting.
5. Keras: Keras is a high-level neural network library that runs on top of TensorFlow. It provides a user-
friendly interface for building and training deep learning models. Keras simplifies the process of
designing complex neural networks and is widely used for tasks like image classification, sentiment
analysis, and text generation.
6. StatsModels: StatsModels is a library focused on statistical modeling and analysis. It offers a wide
range of statistical models, including regression analysis, time series analysis, and hypothesis testing.
StatsModels provides comprehensive statistical tools for predictive modeling and inference.
7. XGBoost: XGBoost is an optimized implementation of the gradient boosting algorithm. It is known for
its speed and performance and is widely used for classification and regression problems. XGBoost
provides excellent predictive power and is particularly effective for handling structured data.
8. Matplotlib: Matplotlib is a powerful visualization library for creating static, animated, and interactive
visualizations in Python. It provides a wide range of plotting functions and customization options, making
it suitable for creating various types of charts, graphs, and plots.
These are just a few of the essential libraries used in predictive data analytics with Python. Depending on
your specific requirements, you may also find other libraries, such as seaborn, plotly, PyTorch, or
LightGBM, beneficial for your predictive modeling tasks.
Data Preprocessing (further reading):
https://towardsdatascience.com/data-preprocessing-in-python-pandas-part-6-dropping-duplicates-e35e46bcc9d6
https://www.numpyninja.com/post/data-preprocessing-using-pandas-drop-and-drop_duplicates-functions
https://www.futurelearn.com/info/courses/data-analytics-python-data-wrangling-and-ingestion/0/steps/186666
Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data pipeline
for data analytics projects. Organizations that use on-premises data warehouses generally
use an ETL (extract, transform, and load) process, in which data transformation is the
middle step. Today, most organizations use cloud-based data warehouses to scale
compute and storage resources with latency measured in seconds or minutes. The
scalability of the cloud platform lets organizations skip preload transformations and load
raw data into the data warehouse, then transform it at query time.
Data integration, migration, data warehousing, and data wrangling may all involve data
transformation. Data transformation increases the efficiency of business and analytic
processes, and it enables businesses to make better data-driven decisions. During the
data transformation process, an analyst determines the structure of the data. Depending
on that structure, data transformation may involve:
1. Data Smoothing
Data smoothing is a process used to remove noise from the dataset using certain
algorithms. It highlights the important features present in the dataset and helps in
predicting patterns. When collecting data, it can be manipulated to eliminate or reduce
any variance or other forms of noise.
The concept behind data smoothing is that it will be able to identify simple changes to
help predict different trends and patterns. This serves as a help to analysts or traders who
need to look at a lot of data which can often be difficult to digest for finding patterns that
they wouldn't see otherwise.
Noise can be removed from the data using techniques such as binning, regression, and
clustering:
o Binning: This method splits the sorted data into a number of bins and smooths the
values in each bin using the neighboring values around them (see the sketch after this list).
o Regression: This method identifies the relation between two dependent attributes so that
one attribute can be used to predict the other.
o Clustering: This method groups similar data values into clusters. Values that lie outside
any cluster are known as outliers.
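A small pandas sketch of smoothing by bin means is shown below; the sample values and the choice of three bins are illustrative assumptions, not taken from the text.

```python
import pandas as pd

# Hypothetical sorted values to be smoothed
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Split the sorted data into 3 equal-frequency bins and replace each value
# with the mean of its bin (smoothing by bin means)
bins = pd.qcut(prices, q=3)
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"original": prices, "smoothed": smoothed}))
```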
2. Attribute Construction
In the attribute construction method, the new attributes consult the existing attributes to
construct a new data set that eases data mining. New attributes are created and applied
to assist the mining process from the given attributes. This simplifies the original data and
makes the mining more efficient.
For example, suppose we have a data set containing measurements of different plots,
i.e., the height and width of each plot. Here we can construct a new attribute 'area'
from the attributes 'height' and 'width'. This also helps in understanding the relations
among the attributes in a data set.
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data
sources into a data analysis description. This is a crucial step since the accuracy of data
analysis insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to
produce relevant results. The collection of data is useful for everything from decisions
concerning financing or business strategy of the product, pricing, operations, and
marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales
of each year. We can aggregate the data to get the enterprise's annual sales report.
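For instance, the annual report in the example above can be produced with a pandas group-by; the quarterly figures below are made-up illustrative values.

```python
import pandas as pd

# Hypothetical quarterly sales figures for an enterprise
sales = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [120, 150, 130, 170, 160, 180, 175, 200],
})

# Aggregate the quarterly figures into an annual sales report
annual_sales = sales.groupby("year")["sales"].sum()
print(annual_sales)
```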
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-
1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n number of observed values
for attribute A that are V1, V2, V3, ….Vn.
o Min-max normalization: This method maps a value v of attribute A to a new value v' in
the range [new_min, new_max] using the formula:
v' = ((v − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min
For example, suppose the minimum and maximum values for the attribute income are
$12,000 and $98,000, and [0.0, 1.0] is the range to which we have to map the value
$73,600. Using min-max normalization, $73,600 is transformed to
(73,600 − 12,000) / (98,000 − 12,000) ≈ 0.716.
o Z-score normalization: This method normalizes the value for attribute A using the mean
and standard deviation. The following formula is used for Z-score normalization:
v' = (v − Ā) / σ_A
Here Ā and σ_A are the mean and standard deviation of attribute A, respectively.
For example, suppose the mean and standard deviation for attribute A are $54,000 and
$16,000, and we have to normalize the value $73,600 using z-score normalization. The
normalized value is (73,600 − 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal
point in the value. The movement of the decimal point depends on the maximum absolute
value of A. The formula for decimal scaling is:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
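The three normalization methods can be sketched in a few lines of NumPy. The sample income values below are assumptions chosen to echo the $12,000–$98,000 example; only the formulas themselves come from the text.

```python
import numpy as np

income = np.array([12000.0, 35000.0, 54000.0, 73600.0, 98000.0])  # assumed sample values

# Min-max normalization to the range [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization using the mean and standard deviation
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that the largest absolute scaled value is below 1 (here j = 5)
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```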
5. Data Discretization
This is a process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels. This makes the data easier to study
and analyze. If a data mining task handles a continuous attribute, then its discrete values
can be replaced by constant quality attributes. This improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset
into a set of categorical data. Discretization also uses decision tree-based algorithms to
produce short, compact, and accurate results when using discrete values.
Data discretization can be classified into two types: supervised discretization, which uses
the class information, and unsupervised discretization, which does not. Depending on the
direction in which the process proceeds, discretization may follow a 'top-down splitting
strategy' or a 'bottom-up merging strategy'.
For example, the values for the age attribute can be replaced by the interval labels such
as (0-10, 11-20…) or (kid, youth, adult, senior).
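A quick way to do this in pandas is `pd.cut`; the ages and bin edges below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical continuous ages to be discretized
ages = pd.Series([3, 9, 15, 24, 37, 52, 68, 80])

# Replace continuous ages with interval labels
intervals = pd.cut(ages, bins=[0, 10, 20, 60, 100])

# Or with concept labels such as kid, youth, adult, senior
labels = pd.cut(ages, bins=[0, 10, 20, 60, 100],
                labels=["kid", "youth", "adult", "senior"])
print(pd.DataFrame({"age": ages, "interval": intervals, "label": labels}))
```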
6. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches: the data cube
(OLAP) approach and the attribute-oriented induction (AOI) approach.
For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a
higher conceptual level into a categorical value (young, old).
Advantages of data transformation:
o Better Organization: Transformed data is easier for both humans and computers to use.
o Improved Data Quality: There are many risks and costs associated with bad data. Data
transformation can help your organization eliminate quality issues such as missing values
and other inconsistencies.
o Perform Faster Queries: Transformed data can be retrieved quickly and easily because it
is stored and standardized in a source location.
o Better Data Management: Businesses are constantly generating data from more and
more sources. If there are inconsistencies in the metadata, it can be challenging to
organize and understand it. Data transformation refines your metadata, so it's easier to
organize and understand.
o More Use Out of Data: While businesses may be collecting data constantly, a lot of that
data sits around unanalyzed. Transformation makes it easier to get the most out of your
data by standardizing it and making it more usable.
Disadvantages of data transformation:
o Data transformation can be expensive. The cost depends on the specific infrastructure,
software, and tools used to process data. Expenses may include licensing, computing
resources, and hiring the necessary personnel.
o Data transformation processes can be resource-intensive. Performing transformations in
an on-premises data warehouse after loading or transforming data before feeding it into
applications can create a computational burden that slows down other operations. If you
use a cloud-based data warehouse, you can do the transformations after loading because
the platform can scale up to meet demand.
o Lack of expertise and carelessness can introduce problems during transformation. Data
analysts without appropriate subject matter expertise are less likely to notice incorrect data
because they are less familiar with the range of accurate and permissible values.
o Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information to its prior format for a different application.
In predictive data analytics, transforming data using functions or mappings can be a crucial step
in preparing the data for modeling. Transformations help in improving the quality and structure
of the data, handling outliers, normalizing variables, and creating new features. Here are some
common techniques for transforming data using functions or mappings:
2. Logarithmic Transformations:
- Log Transformation: Applies the logarithm function to the data, which can help in handling
skewed distributions and reducing the impact of outliers.
3. Power Transformations:
- Square Root Transformation: Takes the square root of the data, which can be useful for
reducing the impact of large values and handling right-skewed distributions.
- Box-Cox Transformation: A more general power transformation that optimizes the
transformation parameter lambda to achieve the best approximation of a normal distribution.
6. Feature Engineering:
- Polynomial Features: Generates higher-order polynomial features by combining existing
features through multiplication or exponentiation.
- Interaction Terms: Creates new features by capturing interactions between existing features
(e.g., multiplying two variables).
These are just a few examples of transformations that can be applied to data using functions or
mappings. The choice of transformation techniques depends on the nature of the data, the
problem at hand, and the requirements of the predictive modeling task. It is important to
carefully analyze and understand the data before applying any transformations to ensure their
effectiveness and suitability for the specific problem.
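As a small, hedged sketch of these transformations, the snippet below applies a log, square root, and Box-Cox transformation to a made-up right-skewed feature and generates polynomial/interaction features with scikit-learn; the sample data and parameter choices are assumptions for illustration only.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical right-skewed feature
x = np.array([1.0, 2.0, 3.0, 10.0, 50.0, 200.0])

log_x = np.log1p(x)              # log transformation (log1p also handles zeros)
sqrt_x = np.sqrt(x)              # square root transformation
boxcox_x, lam = stats.boxcox(x)  # Box-Cox transformation; lam is the fitted lambda

# Polynomial features and interaction terms for a two-column feature matrix
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)   # columns: x1, x2, x1^2, x1*x2, x2^2
print(log_x, sqrt_x, lam, X_poly, sep="\n")
```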
In data analytics, replacing values in a dataset is a common task, often required to handle missing
data, outliers, or to transform specific values. Python provides various ways to replace values in
a dataset. Let's explore a few techniques:
1. Replacing specific values:
The pandas `replace()` function replaces one value with another. Here's an example:

```python
import pandas as pd

# Sample data (inferred from the output shown below)
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Replace the value 4 with 0
df_replaced = df.replace(4, 0)
print(df_replaced)
```

Output:
```
   A   B
0  1   6
1  2   7
2  3   8
3  0   9
4  5  10
```
2. Replacing values using a condition or function:
Values can also be replaced conditionally, for example with `numpy.where()`. Here's an example:

```python
import pandas as pd
import numpy as np

# Sample data (inferred from the output shown below)
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Square every value in column 'A' except 4, which is left unchanged
# (the exact transformation is inferred from the printed output)
df['A'] = np.where(df['A'] != 4, df['A'] ** 2, df['A'])
print(df)
```

Output:
```
    A   B
0   1   6
1   4   7
2   9   8
3   4   9
4  25  10
```
3. Replacing missing values:
Missing values are often replaced with appropriate values to avoid issues during analysis.
Pandas provides the `fillna()` function to replace missing values. Here's an example:

```python
import pandas as pd
import numpy as np

# Sample data with missing values (inferred from the output shown below)
df = pd.DataFrame({'A': [1, np.nan, 3, 4, 5], 'B': [6, 7, np.nan, 9, 10]})

# Replace missing values with the mean of each column
df_filled = df.fillna(df.mean())
print(df_filled)
```

Output:
```
      A     B
0  1.00   6.0
1  3.25   7.0
2  3.00   8.0
3  4.00   9.0
4  5.00  10.0
```

In this example, missing values in columns 'A' and 'B' are replaced with the mean of each
column.
These are just a few examples of replacing values in a dataset using Python and pandas.
Depending on your specific requirements, you can also use other methods like `map()`, `apply()`,
or conditional statements to replace values.
Handling missing data is an important task in data analytics as missing values can impact the
accuracy and reliability of analysis and modeling. Python provides several libraries and
techniques to handle missing data. Here are some common approaches:
1. Detecting missing values:
The `isnull()` function returns a boolean DataFrame indicating which entries are missing:

```python
import pandas as pd
import numpy as np

# Sample data with missing values (inferred from the output shown below)
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [6, np.nan, 8, 9, 10]})

print(df.isnull())
```

Output:
```
       A      B
0  False  False
1  False   True
2   True  False
3  False  False
4  False  False
```
2. Dropping missing values:
The `dropna()` function removes rows (or columns) that contain missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [6, np.nan, 8, 9, 10]})

# Drop any row that contains at least one missing value
print(df.dropna())
```

Output:
```
     A     B
0  1.0   6.0
3  4.0   9.0
4  5.0  10.0
```
3. Filling missing values:
The `fillna()` function replaces missing values, for example with the mean of each column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [6, np.nan, 8, 9, 10]})

# Fill missing values with the column mean
print(df.fillna(df.mean()))
```

Output:
```
     A      B
0  1.0   6.00
1  2.0   8.25
2  3.0   8.00
3  4.0   9.00
4  5.0  10.00
```

In this example, missing values are filled with the mean of each respective column.
4. Imputation techniques:
Advanced techniques can be used to impute missing values based on patterns or relationships
in the data. Common imputation methods include mean imputation, median imputation, mode
imputation, or more sophisticated techniques like regression imputation or k-nearest neighbors
imputation. Libraries like scikit-learn and fancyimpute provide various imputation algorithms.
```python
from sklearn.impute import SimpleImputer
import numpy as np

# Impute missing values with the column mean (other strategies: 'median', 'most_frequent')
imputer = SimpleImputer(strategy='mean')
X = np.array([[1, 6], [2, np.nan], [np.nan, 8], [4, 9], [5, 10]])
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```
This post will explore the three most common types of data analytics
and one less known model. This information will help you gain better
insight into what your data says about your business so you can make
adjustments to meet your goals.
Descriptive analytics
Predictive analytics
Prescriptive analytics
Diagnostic analytics
The future of analytics
Even as the need for analytics experts grows, the market for self-
service tools continues to escalate as well. A report from Allied
Market Research expects the self-service BI market to reach $14.19
billion by 2026, and Gartner cites the growth of business-composed
data and analytics, that focuses on people, “shifting from IT to
business.”
Entertainment. Services like Netflix and Spotify can use association rules to
fuel their content recommendation engines. Machine learning models
analyze past user behavior data for frequent patterns, develop association
rules and use those rules to recommend content that a user is likely to
engage with, or organize content in a way that is likely to put the most
interesting content for a given user first.
How association rules work
Association rule mining, at a basic level, involves the use of machine learning models to
analyze data for patterns, or co-occurrences, in a database. It identifies frequent if-
then associations, which themselves are the association rules.
An association rule has two parts: an antecedent (if) and a consequent (then). An
antecedent is an item found within the data. A consequent is an item found in
combination with the antecedent.
Association rules are created by searching data for frequent if-then patterns and using
the criteria support and confidence to identify the most important
relationships. Support is an indication of how frequently the items appear in the data.
Confidence indicates the number of times the if-then statements are found true. A
third metric, called lift, can be used to compare confidence with expected confidence,
or how many times an if-then statement is expected to be found true.
Association rules are calculated from itemsets, which are made up of two or more
items. If rules were built from every possible itemset, there could be so many rules
that they would hold little meaning. For this reason, association rules are typically
created from itemsets that are well represented in the data.
Measures of the effectiveness of association rules
The strength of a given association rule is measured by two main parameters: support
and confidence. Support refers to how often a given rule appears in the database being
mined. Confidence refers to the amount of times a given rule turns out to be true in
practice. A rule may show a strong correlation in a data set because it appears very
often but may occur far less when applied. This would be a case of high support, but
low confidence.
Conversely, a rule might not particularly stand out in a data set, but continued analysis
shows that it occurs very frequently. This would be a case of high confidence and low
support. Using these measures helps analysts separate causation from correlation and
allows them to properly value a given rule.
A third parameter, known as the lift value, is the ratio of the rule's confidence to the
expected confidence, i.e., the ratio of the observed support to the support expected if the
items were independent. If the lift value is less than 1, there is a negative correlation
between the items; if it is greater than 1, there is a positive correlation; and if the ratio
equals 1, the items are independent and there is no correlation.
Here the 'if' element is called the antecedent, and the 'then' statement is called the
consequent. Relationships in which we find an association between two single items are
known as single cardinality. Association rule mining is all about creating such rules, and as
the number of items increases, the cardinality increases accordingly. To measure the
associations between thousands of data items, several metrics are used. These metrics
are given below:
o Support
o Confidence
o Lift
Support
Support is the frequency of A, or how frequently an item appears in the dataset. It is
defined as the fraction of the transactions T that contain the itemset X. It can be written as:
Support(X) = (Number of transactions containing X) / (Total number of transactions T)
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the
items X and Y occur together in the dataset given that X occurs. It is the ratio of the
number of transactions that contain both X and Y to the number of transactions that
contain X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Lift
Lift measures the strength of a rule and is defined by the formula:
Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It is the ratio of the observed support to the support expected if X and Y were
independent of each other. It has three possible ranges of values:
o Lift = 1: X and Y are independent; there is no association between them.
o Lift > 1: X and Y are positively correlated and likely to be bought together.
o Lift < 1: X and Y are negatively correlated and unlikely to be bought together.
The primary objective of the apriori algorithm is to create the association rule between
different objects. The association rule describes how two or more objects are related to
one another. Apriori algorithm is also called frequent pattern mining. Generally, you
operate the Apriori algorithm on a database that consists of a huge number of
transactions. Let's understand the Apriori algorithm with the help of an example: suppose
you go to Big Bazar and buy different products. Association rule mining helps the store
make it easier for customers to buy related products together and increases its sales
performance. In this tutorial, we will discuss the Apriori algorithm with examples.
Introduction
We take an example to understand the concept better. You must have noticed that a
pizza shop seller often sells a pizza, a soft drink, and breadsticks together as a combo,
and offers a discount to customers who buy the combo. Why does he do so? He thinks
that customers who buy pizza also buy soft drinks and breadsticks, so by making combos
he makes it easy for the customers and, at the same time, increases his sales performance.
Similarly, you go to Big Bazar, and you will find biscuits, chips, and Chocolate bundled
together. It shows that the shopkeeper makes it comfortable for the customers to buy
these products in the same place.
The above two examples are the best examples of Association Rules in Data Mining. It
helps us to learn the concept of apriori algorithms.
The Apriori algorithm works with three major components:
1. Support
2. Confidence
3. Lift
As discussed above, you need a huge database containing a large number of transactions.
Suppose you have 4,000 customer transactions in a Big Bazar, and you have to calculate
the Support, Confidence, and Lift for two products, say Biscuits and Chocolate, because
customers frequently buy these two items together.
Out of the 4,000 transactions, 400 contain Biscuits, 600 contain Chocolate, and 200 of
those transactions contain both Biscuits and Chocolate. Using this data, we will find the
support, confidence, and lift.
Support
Support refers to the default popularity of any product. It is found by dividing the number
of transactions containing that product by the total number of transactions. Hence:
Support(Biscuits) = 400/4000 = 10 percent
Support(Chocolate) = 600/4000 = 15 percent
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought
chocolates. It is found by dividing the number of transactions that contain both biscuits
and chocolates by the number of transactions that contain biscuits. Hence,
Confidence = (Transactions containing both Biscuits and Chocolate) / (Transactions
containing Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits also bought chocolates.
Lift
In the above example, lift refers to the increase in the likelihood of selling chocolates when
biscuits are sold. It is the confidence of the rule divided by the support of the consequent
(Chocolate):
Lift = Confidence / Support(Chocolate) = 50 / 15 ≈ 3.33
It means that customers who buy biscuits are about 3.3 times more likely to buy chocolates
than customers in general. If the lift value is below one, the items are unlikely to be bought
together; the larger the value, the better the combination.
Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}.
The database comprises six transactions where 1 represents the presence of the product
and 0 represents the absence of the product.
Transaction Rice Pulse Oil Milk
t1 1 1 1 0
t2 0 1 1 1
t3 0 0 0 1
t4 1 1 0 1
t5 1 1 1 0
t6 1 1 1 1
Step 1
Make a frequency table of all the products that appear in the transactions. Then keep only
those products whose support meets the 50 percent threshold. We get the following
frequency table.
Product Frequency (Number of transactions)
Rice (R) 4
Pulse (P) 5
Oil (O) 4
Milk (M) 4
The above table indicates the products frequently bought by the customers.
Step 2
Create pairs of the products, such as RP, RO, RM, PO, PM, and OM, and count how often
each pair occurs. You will get the following frequency table.
Itemset Frequency (Number of transactions)
RP 4
RO 3
RM 2
PO 4
PM 3
OM 2
Step 3
Apply the same 50 percent support threshold and keep only the pairs that meet it, i.e.,
those with a frequency of 3 or more: RP, RO, PO, and PM.
Step 4
Now, look for a set of three products that the customers buy together. We get the given
combination.
Step 5
Calculate the frequency of the two itemsets, and you will get the given frequency table.
Itemset Frequency (Number of transactions)
RPO 4
POM 3
If you apply the threshold assumption again, you can figure out that the customers'
frequent set of three products is RPO.
We have considered an easy example to discuss the apriori algorithm in data mining. In
reality, you find thousands of such combinations.
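In practice, frequent itemsets like these are usually mined with a library rather than by hand. The sketch below uses the third-party mlxtend package (an assumption; the text does not prescribe a particular library) on the six transactions from the table above.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

# The six Big Bazar transactions (1 = product present, 0 = absent)
data = pd.DataFrame(
    [[1, 1, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 0, 1],
     [1, 1, 0, 1],
     [1, 1, 1, 0],
     [1, 1, 1, 1]],
    columns=["Rice", "Pulse", "Oil", "Milk"],
).astype(bool)

# Frequent itemsets with a 50 percent support threshold
frequent_itemsets = apriori(data, min_support=0.5, use_colnames=True)
print(frequent_itemsets)
# Association rules (with confidence and lift) can then be derived from these
# itemsets, e.g. with mlxtend.frequent_patterns.association_rules.
```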
Hash-Based Itemset Counting
In hash-based itemset counting, a k-itemset whose corresponding hash bucket count is
below the threshold cannot be frequent and is excluded.
Transaction Reduction
In transaction reduction, a transaction that does not contain any frequent k-itemset is of
no value in subsequent scans and can be removed.
The primary steps to find association rules in data mining are given below. This two-step
approach is a better option for finding association rules than the brute-force method.
Step 1
In this article, we have already discussed how to create the frequency table and calculate
itemsets having a greater support value than that of the threshold support.
Step 2
To create association rules, you need to use a binary partition of the frequent itemsets.
You need to choose the ones having the highest confidence levels.
In the above example, you can see that the RPO combination was the frequent itemset.
Now, we find out all the rules using RPO.
You can see that there are six different combinations. Therefore, if a frequent itemset has
n elements, there will be 2^n − 2 candidate association rules.
FP-Growth (Frequent Pattern Growth) Algorithm
The root node of the FP tree represents null, while the lower nodes represent the itemsets.
The associations of the nodes with the lower nodes, that is, of the itemsets with other
itemsets, are maintained while forming the tree.
Let us see the steps followed to mine the frequent pattern using frequent pattern growth
algorithm:
#1) The first step is to scan the database to find the occurrences of the itemsets in the database.
This step is the same as the first step of Apriori. The count of 1-itemsets in the database is called
support count or frequency of 1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is
represented by null.
#3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the itemset in it. The itemset with the max count is taken at the top, the
next itemset with lower count and so on. It means that the branch of the tree is constructed with
transaction itemsets in descending order of count.
#4) The next transaction in the database is examined. The itemsets are ordered in descending
order of count. If any itemset of this transaction is already present in another branch (for example
in the 1st transaction), then this transaction branch would share a common prefix to the root.
This means that the common itemset is linked to the new node of another itemset in this
transaction.
#5) Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked according to
transactions.
#6) The next step is to mine the created FP Tree. For this, the lowest node is examined first along
with the links of the lowest nodes. The lowest node represents the frequency pattern length 1.
From this, traverse the path in the FP Tree. This path or paths are called a conditional pattern
base.
Conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring with
the lowest node (suffix).
#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
Example Of FP-Growth Algorithm
Support threshold=50%, Confidence= 60%
Table 1
Transaction List of items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3
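The remaining FP-tree construction and mining steps are not worked through here; as a cross-check, the same transactions can be mined with the FP-growth implementation in the third-party mlxtend package (an assumption, not a library named by the text).

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# The transactions from Table 1
transactions = [
    ["I1", "I2", "I3"],
    ["I2", "I3", "I4"],
    ["I4", "I5"],
    ["I1", "I2", "I4"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3", "I4"],
]

# One-hot encode the transactions and mine frequent itemsets with min_sup = 3/6 = 0.5
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent_itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent_itemsets)
```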
What Is a Regression?
Regression is a statistical method used in finance, investing, and other
disciplines that attempts to determine the strength and character of the
relationship between one dependent variable (usually denoted by Y) and a
series of other variables (known as independent variables).
Understanding Regression
Regression captures the correlation between variables observed in a data set
and quantifies whether those correlations are statistically significant or not.
The two basic types of regression are simple linear regression and multiple
linear regression, although there are non-linear regression methods for more
complicated data and analysis. Simple linear regression uses one
independent variable to explain or predict the outcome of the dependent
variable Y, while multiple linear regression uses two or more independent
variables to predict the outcome (while holding all others constant).
Regression can help finance and investment professionals as well as
professionals in other businesses. Regression can also help predict sales for
a company based on weather, previous sales, GDP growth, or other types of
conditions. The capital asset pricing model (CAPM) is an often-used
regression model in finance for pricing assets and discovering the costs of
capital.
In regression, we plot a graph between the variables that best fits the given data points;
using this plot, the machine learning model can make predictions about the data. In
simple words, "Regression shows a line or curve that passes through the data points on
the target-predictor graph in such a way that the vertical distance between the data points
and the regression line is minimum." The distance between the data points and the line
tells whether the model has captured a strong relationship or not.
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.
Linear Regression:
Linear regression models the relationship with the equation Y = aX + b, where Y is the
dependent (target) variable, X is the independent (predictor) variable, a is the slope of the
line, and b is the intercept.
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve
the classification problems. In classification problems, we have dependent
variables in a binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes
or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function, or logistic function, to model the data.
The function can be represented as:
f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it produces an S-shaped curve.
o It uses the concept of threshold levels: values above the threshold level are rounded up
to 1, and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)
The linear regression algorithm shows a linear relationship between a dependent variable
(y) and one or more independent variables (x), hence it is called linear regression. Since
linear regression shows a linear relationship, it finds how the value of the dependent
variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here, a0 is the intercept of the line, a1 is the linear regression coefficient (the slope of the
line), x is the independent variable, y is the dependent variable, and ε is the random error
term. The values for the x and y variables are the training data used for the linear
regression model representation.
Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best-fit line. To do
this, we use a cost function.
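As a small illustration, scikit-learn's `LinearRegression` finds the values of a0 and a1 that minimize the mean squared error cost; the x and y values below are made-up training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical training data
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

model = LinearRegression()
model.fit(x, y)

a0 = model.intercept_   # intercept
a1 = model.coef_[0]     # slope (regression coefficient)
y_pred = model.predict(x)

# The cost being minimized is the mean squared error between predictions and targets
print(f"y = {a0:.2f} + {a1:.2f}x, MSE = {mean_squared_error(y, y_pred):.4f}")
```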
o In logistic regression, y can be between 0 and 1 only, so to derive the logit form we divide
y by (1 − y):
y / (1 − y), which is 0 for y = 0 and infinity for y = 1.
o But we need a range from −∞ to +∞, so we take the logarithm of this expression, and the
equation becomes:
log[y / (1 − y)] = a0 + a1x
Example: building a logistic regression model on the Titanic dataset with scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Column names below assume the standard Kaggle Titanic dataset
titanic_data = pd.read_csv('titanic.csv')
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].mean())
titanic_data['Fare'] = titanic_data['Fare'].fillna(titanic_data['Fare'].mean())
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})

# Split the data into input features (X) and target variable (y)
X = titanic_data[['Pclass', 'Sex', 'Age', 'Fare']]
y = titanic_data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```
Example: a general logistic regression workflow on a CSV dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

data = pd.read_csv('dataset.csv')

# Split the data into input features (X) and target variable (y)
X = data.drop('target_variable', axis=1)
y = data['target_variable']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(confusion_mat)
```
Classification:
Naïve Bayes Classifier:
The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem
and used for solving classification problems. The name combines two terms:
o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple without depending
on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Problem: If the weather is sunny, then the Player should play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather attribute:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table for the Weather attribute:
Weather No Yes P(Weather)
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 ≈ 0.41
Since P(Yes|Sunny) > P(No|Sunny), the prediction is that the player should play on a sunny day.
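The same calculation can be reproduced with a few lines of plain Python using the counts from the tables above:

```python
# Counts taken from the frequency and likelihood tables above
p_sunny_given_yes = 3 / 10
p_sunny_given_no = 2 / 4
p_yes = 10 / 14
p_no = 4 / 14
p_sunny = 5 / 14

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(f"P(Yes|Sunny) = {p_yes_given_sunny:.2f}")   # ~0.60
print(f"P(No|Sunny)  = {p_no_given_sunny:.2f}")    # ~0.40 (0.41 with the rounded table values)
print("Prediction:", "Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")
```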
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions compared to many other algorithms.
o It is the most popular choice for text classification problems.
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier: It is widely used for spam filtering, sentiment
analysis, and text/document classification.
Decision Tree Classifier:
o Decision trees usually mimic human thinking ability while making a decision, so they are
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
any further; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether
he should accept the offer or Not. So, to solve this problem, the decision tree starts with
the root node (Salary attribute by ASM). The root node splits further into the next decision
node (distance from the office) and one leaf node based on the corresponding labels. The
next decision node further gets split into one decision node (Cab facility) and one leaf
node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined
offer).
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure (ASM). Using this measure, we can easily
select the best attribute for the nodes of the tree. There are two popular ASM techniques:
o Information Gain
o Gini Index
1. Information Gain:
Information gain measures the change in entropy after segmenting a dataset on an
attribute. It is calculated as:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Where entropy, a measure of the impurity of the samples, is:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Here S is the set of samples, P(yes) is the probability of 'yes', and P(no) is the probability of 'no'.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σ_j (P_j)²
where P_j is the probability of class j.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as pruning. There are mainly two types
of tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning
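A hedged scikit-learn sketch of these ideas is shown below: the `criterion` parameter selects the Gini index or entropy (information gain), `max_depth` limits tree size, and `ccp_alpha` enables cost-complexity pruning. The dataset choice (iris) is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" uses the Gini index; criterion="entropy" uses information gain.
# ccp_alpha > 0 applies cost-complexity pruning to shrink the learned tree.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, ccp_alpha=0.01)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```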
Assume this is a situation you’ve got into in your data science project:
You are working on a classification problem and have generated your set
of hypotheses, created features, and discussed the importance of variables.
Within an hour, stakeholders want to see the first cut of the model.
What will you do? You have hundreds of thousands of data points and several
variables in your training data set. In such a situation, if I were in your place, I
would have used ‘Naive Bayes,‘ which can be extremely fast relative to
other classification algorithms. It works on Bayes’ theorem of probability to
predict the class of unknown data sets.
In this article, I’ll explain the basics of Naive Bayes (NB) in machine learning
(ML), so that the next time you come across large data sets, you can bring this
classification algorithm in ML (which is a part of AI) to action. In addition, if you
are a newbie in Python or R, you should not be overwhelmed by the presence
of available codes in this article.
Moreover, it is worth noting that naive Bayes classifiers are among the simplest
Bayesian network models, yet they can achieve high accuracy levels when
coupled with kernel density estimation. This technique involves using a kernel
function to estimate the probability density function of the input data, allowing
the classifier to improve its performance in complex scenarios where the data
distribution is not well-defined. As a result, the naive Bayes classifier is a
powerful tool in machine learning, particularly in text classification, spam
filtering, and sentiment analysis, among others.
An NB model is easy to build and particularly useful for very large data sets.
Along with simplicity, Naive Bayes is known to outperform even highly
sophisticated classification methods.
Problem Statement
However, the collection, processing, and analysis of data have been largely
manual, and given the nature of human resources dynamics and HR KPIs, the
approach has been constraining HR. Therefore, it is surprising that HR
departments woke up to the utility of machine learning so late in the game. Here
is an opportunity to try predictive analytics in identifying the employees most
likely to get promoted.
Problem: Players will play if the weather is sunny. Is this statement correct?
Using the likelihood values for this example, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60,
which is the higher probability, so the prediction is that players will play when the weather
is sunny.
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification (NLP) and for
problems with multiple classes.
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in
multi-class prediction.
When the assumption of independence holds, the classifier performs better than other
machine learning models like logistic regression or decision trees, and requires less
training data.
It performs well with categorical input variables compared to numerical variables. For
numerical variables, a normal distribution is assumed (a bell curve, which is a strong
assumption).
Cons:
If a categorical variable has a category in the test data set that was not observed in the
training data set, the model will assign it a zero probability and will be unable to make a
prediction. This is often known as "zero frequency". To solve this, we can use a smoothing
technique; one of the simplest is Laplace estimation.
On the other side, Naive Bayes is also known as a bad estimator, so the
probability outputs from predict_proba are not to be taken too seriously.
Another limitation of this algorithm is the assumption of independent
predictors. In real life, it is almost impossible that we get a set of predictors
which are completely independent.
Python Code:
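A minimal, self-contained Naive Bayes example with scikit-learn is shown below; the dataset choice (iris) and the Gaussian variant are illustrative assumptions, not the article's original snippet.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```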
R Code:
require(e1071)  # Holds the Naive Bayes classifier
Train <- read.csv(file.choose())
Test <- read.csv(file.choose())
levels(Train$Item_Fat_Content)

# A minimal completion (the choice of Item_Fat_Content as the target is an
# assumption for illustration):
model <- naiveBayes(Item_Fat_Content ~ ., data = Train)
pred <- predict(model, Test)
table(pred)
Above, we looked at the basic NB Model. You can improve the power of this
basic model by tuning parameters and handling assumptions intelligently. Let’s
look at the methods to improve the performance of this model. I recommend
you go through this document for more details on Text classification using
Naive Bayes.
Conclusion
Further, I would suggest you focus more on data pre-processing and feature selection
before applying the algorithm. In a future post, I will discuss text and document
classification using Naive Bayes in more detail.
Scikit-learn:
Key features of the scikit-learn library include:
1. Consistent API: Scikit-learn provides a consistent API across different algorithms, making it
easy to switch between different models and experiment with different techniques. The API
follows a similar structure with fit-predict paradigm for training and making predictions.
3. Preprocessing and Feature Extraction: Scikit-learn provides a variety of tools for data
preprocessing, feature scaling, and feature extraction. It includes functions for handling missing
values, encoding categorical variables, scaling numerical features, and transforming data into a
suitable format for machine learning algorithms.
4. Model Evaluation and Validation: Scikit-learn offers utilities for model evaluation and
validation, including functions for cross-validation, hyperparameter tuning, and model selection.
It provides metrics for evaluating model performance such as accuracy, precision, recall, F1-
score, and many others.
5. Integration with Other Libraries: Scikit-learn integrates well with other libraries in the Python
ecosystem. It can be easily combined with libraries like pandas for data manipulation, matplotlib
and seaborn for data visualization, and NumPy and SciPy for advanced numerical computations.
Overall, Scikit-learn is a powerful and versatile library for machine learning in Python. Whether
you're a beginner or an experienced practitioner, it provides the necessary tools and algorithms to
solve a wide range of machine learning problems efficiently.
To install scikit-learn (also known as sklearn) in Python, you can use pip, which is a package
manager for Python. Here are the steps to install scikit-learn:
1. Open a terminal or command prompt.
2. Make sure you have pip installed. You can check by running the following command:
```
pip --version
```
3. If pip is not installed, you can install it by following the official pip installation instructions for
your operating system.
4. Once pip is installed, you can install scikit-learn by running the following command:
```
pip install scikit-learn
```
This command will download and install the latest stable version of scikit-learn along with its
dependencies.
5. After the installation is complete, you can verify if scikit-learn is installed properly by
importing it in a Python script or the Python interpreter:
```python
import sklearn
print(sklearn.__version__)
```
That's it! You have successfully installed scikit-learn in your Python environment. Now you can
start using scikit-learn for various machine learning tasks in your Python projects.
scikit-learn provides several built-in datasets that you can easily load and use for machine
learning tasks. These datasets are useful for practicing, prototyping, and testing machine learning
algorithms. Here are some of the commonly used datasets available in scikit-learn:
1. `load_iris`: This dataset contains measurements of sepal length, sepal width, petal length, and
petal width of 150 iris flowers, along with their corresponding species labels (setosa, versicolor,
and virginica). It is a popular dataset for classification tasks.
2. `load_digits`: This dataset consists of 8x8 images of digits (0-9). It is commonly used for digit
recognition tasks and has a total of 1,797 samples.
To load these datasets, you can use the corresponding functions provided by scikit-learn, such as
`load_iris()`, `load_digits()`, and `load_diabetes()`, or fetchers such as `fetch_california_housing()`.
(Note that `load_boston()`, which appears in older tutorials, has been removed from recent
scikit-learn releases.)
Here's an example of loading the iris dataset:
```python
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
X = iris.data      # feature matrix
y = iris.target    # class labels
print(X.shape, y.shape)
```
In this example, we use the `load_iris()` function to load the iris dataset. The `iris` object
contains the data and target variables. We assign the features to `X` and the target variable to `y`.
Finally, we print the shape of the data to verify its dimensions.
You can explore other datasets and their respective attributes by referring to the scikit-learn
documentation. Additionally, scikit-learn provides methods to split the datasets into training and
test sets, making it easier to evaluate and train machine learning models using these datasets.
Matplotlib is a popular plotting library in Python that is often used in conjunction with scikit-
learn for data visualization and analysis. It provides a wide range of functions and tools for
creating various types of plots, such as line plots, scatter plots, bar plots, histograms, and more.
Matplotlib can be integrated seamlessly with scikit-learn to visualize data, evaluate model
performance, and analyze results. Here are some common use cases of using Matplotlib with
scikit-learn:
1. Data Visualization: Matplotlib can be used to visualize the data before and after applying
machine learning algorithms. You can plot scatter plots to visualize the relationship between two
features, histograms to examine the distribution of a feature, or line plots to track the
performance of a model over iterations.
2. Model Evaluation: Matplotlib can help visualize the results of model evaluation. For example,
you can plot a confusion matrix to understand the classification performance of a model, plot
precision-recall curves or ROC curves to analyze the trade-off between precision and recall or
true positive rate and false positive rate, respectively.
3. Feature Importance: If you are using models that provide feature importance scores, such as
decision trees or random forests, you can use Matplotlib to plot the importance of each feature.
This can help you understand which features are most relevant in predicting the target variable.
4. Model Comparison: If you are comparing the performance of multiple models, you can use
Matplotlib to create bar plots or box plots to visualize and compare their evaluation metrics, such
as accuracy, F1-score, or mean squared error.
Here's a simple example of using Matplotlib with scikit-learn to visualize a scatter plot:
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

# Scatter plot of sepal length vs. sepal width, coloured by class label
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Iris dataset: sepal length vs. sepal width')
plt.show()
```
In this example, we load the iris dataset using `load_iris()` from scikit-learn. We assign the
features to `X` and the target variable to `y`. Then, we use `plt.scatter()` from Matplotlib to
create a scatter plot, where the x-axis represents the sepal length and the y-axis represents the
sepal width. The color of each point is determined by the target variable (class label) using the
`c` parameter. We add labels and a title to the plot using `plt.xlabel()`, `plt.ylabel()`, and
`plt.title()`. Finally, we display the plot using `plt.show()`.
Scikit-learn provides a wide range of algorithms and tools for regression and classification tasks.
Here's an overview of how to perform regression and classification using scikit-learn:
1. Prepare the data: Load your dataset and split it into features (X) and the target variable (y).
2. Split the data: Split the dataset into training and test sets using `train_test_split()` from scikit-
learn's `model_selection` module. This allows you to train your model on a portion of the data
and evaluate its performance on unseen data.
3. Choose an algorithm: Select a suitable regression algorithm (for example, linear regression)
based on the problem.
4. Create and train the model: Create an instance of the chosen regression algorithm, then use the
`fit()` method to train the model on the training data.
5. Make predictions: Use the trained model to make predictions on the test set using the
`predict()` method.
6. Evaluate the model: Assess the performance of the regression model by comparing the
predicted values with the actual target values. Calculate evaluation metrics such as mean squared
error (MSE), mean absolute error (MAE), or R-squared score using functions from scikit-learn's
`metrics` module.
1. Prepare the data: Load your dataset and split it into features (X) and the target variable (y).
2. Split the data: Split the dataset into training and test sets using `train_test_split()` from scikit-
learn's `model_selection` module.
3. Choose an algorithm: Select a suitable classification algorithm (for example, logistic regression)
based on the problem.
4. Create and train the model: Create an instance of the chosen classification algorithm, then use
the `fit()` method to train the model on the training data.
5. Make predictions: Use the trained model to make predictions on the test set using the
`predict()` method.
6. Evaluate the model: Evaluate the performance of the classification model by comparing the
predicted class labels with the true labels. Calculate evaluation metrics such as accuracy,
precision, recall, or F1-score using functions from scikit-learn's `metrics` module.
Here's an example code snippet for both regression and classification using scikit-learn:
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.datasets import fetch_california_housing, load_iris

# Regression example
# (load_boston was removed in scikit-learn 1.2, so the California housing data is used here)
housing = fetch_california_housing()
X_reg = housing.data
y_reg = housing.target
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42)

reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)
y_pred_reg = reg_model.predict(X_test_reg)
mse = mean_squared_error(y_test_reg, y_pred_reg)
print("Mean Squared Error:", mse)

# Classification example
iris = load_iris()
X_cls = iris.data
y_cls = iris.target
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42)

cls_model = LogisticRegression(max_iter=1000)
cls_model.fit(X_train_cls, y_train_cls)
y_pred_cls = cls_model.predict(X_test_cls)
accuracy = accuracy_score(y_test_cls, y_pred_cls)
print("Accuracy:", accuracy)
```
In this example, we load the California Housing dataset for regression (the Boston Housing
dataset used in older tutorials has been removed from recent scikit-learn releases) and the Iris
dataset for classification. We split the datasets into training and test sets using `train_test_split()`.
Then, we create instances of the Linear Regression and Logistic Regression models, fit them to the
training data, and make predictions on the test data. Finally, we evaluate the performance of the
models using the mean squared error for regression and accuracy for classification.
Sure! Here's an example of how to apply data preprocessing methods to the Iris dataset from
scikit-learn:
```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Load the data and split it into training and test sets
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preprocessing
# 1. Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Label encoding (the iris target is already numeric; shown here for illustration)
encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
```
In this example, we first load the Iris dataset using `load_iris()` from scikit-learn. We separate the
features (X) and the target variable (y). Then, we split the dataset into training and test sets using
`train_test_split()`.
1. Standardization: `StandardScaler` standardizes features by removing the mean and scaling them
to unit variance. `fit_transform()` is fit on the training features and transforms them, while
`transform()` applies the same scaling to the test features.
2. Label encoding: In case the target variable is categorical, we can use `LabelEncoder` from
scikit-learn's `preprocessing` module to encode the categorical labels into numerical values. The
`fit_transform()` method is used to fit the label encoder on the training labels and transform them
into encoded values. The `transform()` method is then used to encode the test labels using the
same encoder.
Note that different datasets may require different preprocessing steps, and you can explore other
preprocessing techniques provided by scikit-learn, such as one-hot encoding, handling missing
values, or feature scaling methods like min-max scaling. The choice of preprocessing methods
depends on the specific requirements of your machine learning task.