
Mid-1 ML


1) Productive use of machine learning:-

Productive use of machine learning is more than finding some data and then blindly applying
machine learning algorithms to it.
2) Various applications of diverse fields in machine learning:-

Diversity Of Data
Diversity in machine learning tries to decrease the redundancy in the training
data, the learned model, and the inference, and to provide more information for the
machine learning process. It can improve the performance of the model and plays
an important role in the machine learning process.

Big Data is characterized by huge volume, high velocity, and an extensive variety of data.
There are 3 types: structured data, semi-structured data, and unstructured data.

1. Structured data
Structured data is data whose elements are addressable for effective analysis. It has
been organized into a formatted repository, typically a database. It covers all data that
can be stored in an SQL database in a table with rows and columns. Such data has relational
keys and can easily be mapped into pre-designed fields.
Example: Relational data.
2. Semi-Structured data
Semi-structured data is information that does not reside in a relational database but
that has some organizational properties that make it easier to analyze. With some
processing, it can be stored in a relational database (this can be very hard for some
kinds of semi-structured data), but the semi-structured form exists to save space.
Example: XML data.
3. Unstructured data
Unstructured data is data that is not organized in a predefined manner or does not
have a predefined data model. There are alternative platforms for storing and
managing it. It is increasingly prevalent in IT systems and is used by organizations in a
variety of business intelligence and analytics applications.
Example: Word, PDF, Text, Media logs.
Properties: Structured data | Semi-structured data | Unstructured data

Technology
   Structured data: It is based on relational database tables.
   Semi-structured data: It is based on XML/RDF (Resource Description Framework).
   Unstructured data: It is based on character and binary data.

Transaction management
   Structured data: Matured transactions and various concurrency techniques.
   Semi-structured data: Transactions are adapted from DBMS and are not matured.
   Unstructured data: No transaction management and no concurrency.

Version management
   Structured data: Versioning over tuples, rows, and tables.
   Semi-structured data: Versioning over tuples or graphs is possible.
   Unstructured data: Versioned as a whole.

Flexibility
   Structured data: It is schema dependent and less flexible.
   Semi-structured data: It is more flexible than structured data but less flexible than
   unstructured data.
   Unstructured data: It is more flexible and there is an absence of schema.

Scalability
   Structured data: It is very difficult to scale the DB schema.
   Semi-structured data: Its scaling is simpler than that of structured data.
   Unstructured data: It is more scalable.

Robustness
   Structured data: Very robust.
   Semi-structured data: New technology, not very widespread.
   Unstructured data: -

Query performance
   Structured data: Structured queries allow complex joins.
   Semi-structured data: Queries over anonymous nodes are possible.
   Unstructured data: Only textual queries are possible.
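To make the difference between semi-structured and structured data concrete, the sketch below takes one XML record and maps it into a relational table using only Python's standard library. The record, the tag names (student, name, marks), and the table layout are made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical semi-structured record: the tags give it some organization,
# but no database schema is enforced on it.
xml_record = "<student id='7'><name>Asha</name><marks>82</marks></student>"

# Parse the XML and pull out the fields we want to store.
node = ET.fromstring(xml_record)
row = (int(node.get("id")), node.findtext("name"), int(node.findtext("marks")))

# Structured form: a table with pre-designed columns and a key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)")
conn.execute("INSERT INTO student VALUES (?, ?, ?)", row)

# Once the data is structured, it can be queried with SQL.
result = conn.execute("SELECT name, marks FROM student WHERE marks > 50").fetchall()
print(result)
conn.close()
```

The mapping step (XML fields to table columns) is the "processing" that semi-structured data needs before it fits a relational schema; records with irregular or deeply nested tags would make this step much harder.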
3) What are the metrics for assessing regression in machine learning:-
Machine learning is an effective tool for predicting numerical values, and regression is
one of its key applications. In regression analysis, accurate evaluation is crucial
for measuring the overall performance of predictive models. Many regression algorithms
are available, such as Linear Regression, Decision Trees, Random Forests, and
Support Vector Machines (SVM), among others. A typical statistic for judging the quality
of regression models is the Mean Squared Error (MSE). Commonly used metrics are:
• Mean Squared Error
• Root Mean Squared Error
• Mean Absolute Percentage Error
• R-squared
• COD
• Adjusted R-squared

1. True Values and Predicted Values:

In regression, we have two sets of values to compare: the actual target values
(true values) and the values predicted by our model (predicted values). The
performance of the model is assessed by measuring how close these two sets are.
2. Evaluation Metrics:
Regression metrics are quantitative measures used to evaluate the quality of a regression
model. Scikit-learn provides several metrics, each with its own strengths and limitations,
to assess how well a model fits the data.

Types of Regression Metrics:-

Some common regression metrics in scikit-learn, with examples:
• Mean Absolute Error (MAE)
• Mean Squared Error (MSE)
• R-squared (R²) Score
• Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
In statistics and machine learning, the Mean Absolute Error (MAE) is a
frequently used metric. It is the average of the absolute differences
between a dataset's actual values and predicted values.
Mathematical Formula
The formula to calculate MAE for a dataset with "n" data points is:

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i| \]

Where:
• x_i represents the actual or observed value for the i-th data point.
• y_i represents the predicted value for the i-th data point.
Example:-

from sklearn.metrics import mean_absolute_error

# Actual (true) target values and the values predicted by the model
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]

# MAE: average of the absolute differences
mae = mean_absolute_error(true_values, predicted_values)
print("Mean Absolute Error:", mae)

output:-
Mean Absolute Error: 0.22000000000000003

Mean Squared Error (MSE)

The Mean Squared Error (MSE) is a popular metric in statistics and machine learning.
It measures the average of the squared differences between a
dataset's actual values and predicted values. MSE is frequently used in regression
problems to assess how well predictive models work.
For a dataset containing 'n' data points, the MSE calculation formula is:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2 \]

where:
• x_i represents the actual or observed value for the i-th data point.
• y_i represents the predicted value for the i-th data point.
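Example:

from sklearn.metrics import mean_squared_error

# Reuses true_values and predicted_values from the MAE example
mse = mean_squared_error(true_values, predicted_values)
print("Mean Squared Error:", mse)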

output:-
Mean Squared Error: 0.057999999999999996

R-squared (R²) Score

A statistical metric frequently used to assess the goodness of fit of a regression
model is the R-squared (R²) score, also referred to as the coefficient of determination. It
quantifies the proportion of the dependent variable's variation that the model's
independent variables explain. R² is a useful statistic for evaluating the overall
effectiveness and explanatory power of a regression model.
The formula to calculate the R-squared score is as follows:

\[ R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}} \]

Where:
• R² is the R-squared score.
• SSR represents the sum of squared residuals between the predicted values and actual
values.
• SST represents the total sum of squares, which measures the total variance in the
dependent variable.
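Example:
from sklearn.metrics import r2_score

# Reuses true_values and predicted_values from the MAE example
r2 = r2_score(true_values, predicted_values)
print("R-squared (R²) Score:", r2)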

output:-
R-squared (R²) Score: 0.9588769143505389
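
The score can also be checked by hand using R² = 1 - SSR/SST. The sketch below recomputes SSR, SST, and R² in plain Python for the same five values; the variable names are chosen for illustration only.

```python
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]

n = len(true_values)
mean_true = sum(true_values) / n

# SSR: sum of squared residuals between actual and predicted values.
ssr = sum((x - y) ** 2 for x, y in zip(true_values, predicted_values))

# SST: total sum of squares around the mean of the actual values.
sst = sum((x - mean_true) ** 2 for x in true_values)

r2 = 1 - ssr / sst
print("SSR:", round(ssr, 4))
print("SST:", round(sst, 4))
print("R-squared:", round(r2, 4))
```

Because SSR is small relative to SST here, the score is close to 1, which agrees with the scikit-learn result.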

Root Mean Squared Error (RMSE)

RMSE stands for Root Mean Squared Error. It is a commonly used metric in regression
analysis and machine learning to measure the accuracy or goodness of fit of a predictive
model, especially when the predictions are continuous numerical values.
The RMSE quantifies how well the predicted values from a model align with the actual
observed values in the dataset. Here's how it works:
1. Calculate the Squared Differences: For each data point, subtract the predicted value from
the actual (observed) value, square the result, and sum up these squared differences.
2. Compute the Mean: Divide the sum of squared differences by the number of data points
to get the mean squared error (MSE).
3. Take the Square Root: To obtain the RMSE, simply take the square root of the MSE.

The formula for RMSE for a dataset with 'n' data points is as follows:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2} \]
Example:-
import numpy as np
from sklearn.metrics import mean_squared_error

# Sample data
true_prices = np.array([250000, 300000, 200000, 400000, 350000])
predicted_prices = np.array([240000, 310000, 210000, 380000, 340000])
# Calculate RMSE as the square root of the MSE
rmse = np.sqrt(mean_squared_error(true_prices, predicted_prices))
print("Root Mean Squared Error (RMSE):", rmse)

output:-
Root Mean Squared Error (RMSE): 12649.110640673518
COD
The Coefficient of Determination (COD) is another name for R-squared. It
assesses how well the model's predictions fit the data.

Wrap up
The situation at hand and the aims of the investigation dictate the most appropriate
metrics for regression models. Regression models are often evaluated using MSE, RMSE, MAE,
R-squared, adjusted R-squared, MAPE, and COD. To thoroughly assess the model's
performance, it is advised to employ a mix of these regression model metrics.
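
MAPE and adjusted R-squared are named above but not worked through. The sketch below computes both in plain Python from their standard definitions, MAPE = (100/n) Σ |(x_i - y_i) / x_i| and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors. The value p = 1 is an assumption made for illustration, since the sample values do not come from a fitted model.

```python
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]
n = len(true_values)
p = 1  # assumed number of predictors (hypothetical)

# MAPE: mean absolute error expressed as a percentage of each actual value.
# It is undefined when an actual value is zero.
mape = 100 / n * sum(abs((x - y) / x) for x, y in zip(true_values, predicted_values))

# Plain R-squared, needed for the adjusted version.
mean_true = sum(true_values) / n
ssr = sum((x - y) ** 2 for x, y in zip(true_values, predicted_values))
sst = sum((x - mean_true) ** 2 for x in true_values)
r2 = 1 - ssr / sst

# Adjusted R-squared penalizes the score for the number of predictors used.
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("MAPE (%):", round(mape, 2))
print("Adjusted R-squared:", round(adjusted_r2, 4))
```

Adjusted R-squared is never larger than R-squared, and the gap grows as more predictors are added without improving the fit.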

4) Computational learning theory and Occam's razor principle:-


If we have a larger hypothesis space, then we make learning harder (i.e., higher
sample complexity). If we want higher confidence in the classifier we produce,
sample complexity will also be higher. The first observation is called Occam's Razor
because it expresses a preference towards smaller hypothesis spaces.
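
Both effects can be seen in the standard sample-complexity bound for a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training data: m ≥ (1/ε)(ln|H| + ln(1/δ)), where ε is the allowed error and 1 - δ is the confidence. The bound is not stated in these notes; it is brought in here as an illustration, and the function name and the numbers are arbitrary.

```python
import math

def sample_complexity(hypothesis_count, epsilon, delta):
    """Number of examples sufficient for a consistent learner over a finite hypothesis space."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Larger hypothesis space -> more examples needed.
small_h = sample_complexity(hypothesis_count=2 ** 10, epsilon=0.1, delta=0.05)
large_h = sample_complexity(hypothesis_count=2 ** 20, epsilon=0.1, delta=0.05)

# Higher confidence (smaller delta) -> more examples needed.
high_conf = sample_complexity(hypothesis_count=2 ** 10, epsilon=0.1, delta=0.001)

print(small_h, large_h, high_conf)
```

The hypothesis space enters only through ln|H|, so squaring the number of hypotheses roughly doubles that term rather than squaring the number of examples.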

KNN classifier in machine learning


• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on the
Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts
the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a well-suited
category by using the K-NN algorithm.
• The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for
Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately. Instead it stores the dataset and, at the time of classification, performs an action
on the dataset.
• At the training phase the KNN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we
want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm,
as it works on a similarity measure. Our KNN model will compare the features of the new
image with those of the cat and dog images, and based on the most similar features it will put it in
either the cat or the dog category.
The working of K-NN can be explained on the basis of the algorithm below:

• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to each training data point.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
• Step-6: Our model is ready.
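
The steps above can be sketched in plain Python as follows. The toy points, the labels "A" and "B", and the function name knn_predict are made up for illustration; this is a minimal sketch, not a tuned implementation.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    # Step 2: Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the K nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: count the neighbors in each category.
    votes = Counter(label for _, label in nearest)
    # Step 5: assign the category with the most neighbors.
    return votes.most_common(1)[0][0]

train_points = [(1, 1), (1, 2), (2, 1), (2, 2), (6, 6), (6, 7), (7, 6), (7, 7)]
train_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Step 1: choose K, then classify a new point.
print(knn_predict(train_points, train_labels, new_point=(3, 3), k=5))
```

There is no training step in the code: all the work happens at prediction time, which is why K-NN is called a lazy learner.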

Suppose we have a new data point and we need to put it in the required category. Consider the
image below:

• Firstly, we choose the number of neighbors; here we choose k=5.
• Next, we calculate the Euclidean distance between the data points. The Euclidean distance
is the distance between two points, which we have already studied in geometry. For two points
A(x_1, y_1) and B(x_2, y_2) it can be calculated as:

\[ d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \]

• By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors
in category A and two nearest neighbors in category B. Consider the image below:

• Since three of the five nearest neighbors are from category A, which is the majority, this new
data point is assigned to category A.

Select the value of K

• There is no particular way to determine the best value for "K", so we need to try some values to
find the best among them. A commonly used value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to
outliers.
• Large values for K reduce the effect of noise, but they blur the boundaries between categories
and increase the computation needed.
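
One common way to try several values is to score each candidate K by leave-one-out cross-validation and keep the one with the best accuracy. The sketch below does this in plain Python on made-up data; the helper names, the data, and the candidate list are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    # Rank training points by distance to the query and vote among the K nearest.
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(points, labels, k):
    # Hold out each point in turn, predict it from the rest, and count the hits.
    hits = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, points[i], k) == labels[i]
    return hits / len(points)

points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3), (6, 6), (6, 7), (7, 6), (7, 7), (5, 5)]
# The labels of (3, 3) and (5, 5) are deliberately noisy.
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]

# Odd values of K avoid tied votes when there are two categories.
scores = {k: loo_accuracy(points, labels, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On a real dataset the same loop would normally run over a held-out validation set or k-fold splits, since leave-one-out becomes slow as the data grows.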

Linear Regression vs Logistic Regression


Linear Regression is a machine learning algorithm based on supervised learning.
Regression models a target prediction value based on independent variables. It is mostly used for
finding out the relationship between variables and for forecasting. Different regression models differ
in the kind of relationship they consider between the dependent and independent variables
and in the number of independent variables being used. Logistic regression is basically
a supervised classification algorithm. In a classification problem, the target variable (or output),
y, can take only discrete values for a given set of features (or inputs), X.
Sl.No.: Linear Regression | Logistic Regression

1.
   Linear Regression: Linear Regression is a supervised regression model.
   Logistic Regression: Logistic Regression is a supervised classification model.

2.
   Linear Regression: Equation of linear regression:
   \[ y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_i x_i \]
   Here, y = response variable, x_i = i-th predictor variable, a_i = average effect on y as
   x_i increases by 1.
   Logistic Regression: Equation of logistic regression:
   \[ y(x) = \frac{e^{a_0 + a_1 x_1 + a_2 x_2 + \dots + a_i x_i}}{1 + e^{a_0 + a_1 x_1 + a_2 x_2 + \dots + a_i x_i}} \]
   Here, y = response variable, x_i = i-th predictor variable, a_i = change in the log-odds
   of y as x_i increases by 1.

3.
   Linear Regression: We predict the value as a continuous number.
   Logistic Regression: We predict the value as 1 or 0.

4.
   Linear Regression: No activation function is used.
   Logistic Regression: An activation function is used to convert a linear regression equation
   to the logistic regression equation.

5.
   Linear Regression: No threshold value is needed.
   Logistic Regression: A threshold value is added.

6.
   Linear Regression: We calculate the Root Mean Square Error (RMSE) to predict the next
   weight value.
   Logistic Regression: We use precision to predict the next weight value.

7.
   Linear Regression: The dependent variable should be numeric and the response variable is
   continuous in value.
   Logistic Regression: The dependent variable consists of only two categories. Logistic
   regression estimates the odds of the outcome of the dependent variable given a set of
   quantitative or categorical independent variables.

8.
   Linear Regression: It is based on least squares estimation.
   Logistic Regression: It is based on maximum likelihood estimation.

9.
   Linear Regression: When we plot the training data, a straight line can be drawn that
   passes as close as possible to the maximum number of points.
   Logistic Regression: Any change in the coefficient leads to a change in both the direction and
   the steepness of the logistic function. Positive slopes result in an S-shaped curve
   and negative slopes result in a Z-shaped curve.

10.
   Linear Regression: It is used to estimate the dependent variable in case of a change in the
   independent variables. For example, predicting the price of houses.
   Logistic Regression: It is used to calculate the probability of an event. For example,
   classifying whether tissue is benign or malignant.

11.
   Linear Regression: It assumes a normal (Gaussian) distribution of the dependent variable.
   Logistic Regression: It assumes a binomial distribution of the dependent variable.

12.
   Linear Regression: Applications of linear regression: financial risk assessment,
   business insights, market analysis.
   Logistic Regression: Applications of logistic regression: medicine, credit scoring,
   hotel booking, gaming, text editing.