Mid-1 ML
By using this, we can find some data and then blindly apply machine learning algorithms to it.
2) Various applications of machine learning in diverse fields
Diversity Of Data
Diversity in machine learning tries to decrease redundancy in the training data, the learned model, and the inference, and to provide more information for the machine learning process. It can improve the performance of the model and plays an important role in the machine learning process.
Big Data includes huge volume, high velocity, and an extensive variety of data. There are 3
types: structured data, semi-structured data, and unstructured data.
1. Structured data
Structured data is data whose elements are addressable for effective analysis. It is organized into a repository, typically a database. It covers all data that can be stored in an SQL database in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields.
Example: Relational data.
2. Semi-Structured data
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database (though this can be very hard for some kinds of semi-structured data); semi-structured formats also exist to ease storage.
Example: XML data.
3. Unstructured data
Unstructured data is a data which is not organized in a predefined manner or does not
have a predefined data model .It has more alternative platforms for storing and
managing, it is increasingly prevalent in IT systems and is used by organizations in a
variety of business intelligence and analytics applications.
Example: Word, PDF, Text, Media logs.
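The three types above can be illustrated in a short Python sketch (the record contents here are hypothetical):

```python
import xml.etree.ElementTree as ET

# Structured: fixed rows and columns, as in an SQL table
structured_row = {"id": 1, "name": "Asha", "age": 30}

# Semi-structured: no rigid schema, but tags give it organizational
# properties that make it easier to analyze (XML here)
xml_record = "<person><name>Asha</name><age>30</age></person>"
name = ET.fromstring(xml_record).find("name").text

# Unstructured: free text with no predefined data model
unstructured_note = "Asha emailed the Q3 report on Tuesday; see the attached PDF."

print(structured_row["name"], name, len(unstructured_note.split()))
```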
The formula for MAE for data with ‘n’ data points is as follows:

MAE = (1/n) * Σ |xi − yi|

Where:
xi represents the actual or observed value for the i-th data point.
yi represents the predicted value for the i-th data point.
Example:- (illustrative arrays; the original example's inputs were not preserved)
from sklearn.metrics import mean_absolute_error
# Sample actual and predicted values
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 2.1, 2.7, 4.3, 5.3]
print("Mean Absolute Error:", round(mean_absolute_error(actual, predicted), 2))
output:-
Mean Absolute Error: 0.22
The formula for MSE for data with ‘n’ data points is as follows:

MSE = (1/n) * Σ (xi − yi)²

where:
xi represents the actual or observed value for the i-th data point.
yi represents the predicted value for the i-th data point.
Example: (illustrative arrays; the original example's inputs were not preserved)
from sklearn.metrics import mean_squared_error
# Sample actual and predicted values
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 2.1, 2.7, 4.3, 5.3]
print("Mean Squared Error:", round(mean_squared_error(actual, predicted), 3))
output:-
Mean Squared Error: 0.058
R-squared (R²) measures the proportion of the variance in the actual values that the model explains:

R² = 1 − ( Σ (xi − yi)² ) / ( Σ (xi − x̄)² ), where x̄ is the mean of the actual values.

output:-
R-squared (R²) Score: 0.9588769143505389
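The code for the R² output above was not preserved in these notes; a self-contained sketch using sklearn's r2_score with illustrative arrays (hypothetical values, so the score differs from the output above) looks like:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
actual = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = np.array([1.1, 2.1, 2.7, 4.3, 5.3])

r2 = r2_score(actual, predicted)
print("R-squared (R²) Score:", round(r2, 3))  # 0.971 for these arrays
```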
The formula for RMSE for data with ‘n’ data points is as follows:

RMSE = sqrt( (1/n) * Σ (xi − yi)² )
Example:-
import numpy as np
from sklearn.metrics import mean_squared_error

# Sample data
true_prices = np.array([250000, 300000, 200000, 400000, 350000])
predicted_prices = np.array([240000, 310000, 210000, 380000, 340000])

# Calculate RMSE as the square root of MSE
rmse = np.sqrt(mean_squared_error(true_prices, predicted_prices))
print("Root Mean Squared Error (RMSE):", rmse)
output:-
Root Mean Squared Error (RMSE): 12649.110640673518
COD
The Coefficient of Determination (COD) is simply another name for the R-squared score: it
measures how well the model’s predictions fit the data.
Wrap up
The situation at hand and the aims of the investigation dictate the most appropriate
metrics for regression models. Regression models are often evaluated using MSE, RMSE, MAE,
R-squared, adjusted R-squared, MAPE, and COD. To thoroughly assess a model’s
performance, it is advised to employ a mix of these regression metrics.
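As a sketch of that advice, the metrics discussed can be computed together on one set of predictions (the arrays below are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted values
actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
predicted = np.array([2.8, 5.4, 6.9, 9.3, 10.8])

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
r2 = r2_score(actual, predicted)

# Reporting several metrics together gives a fuller picture than any one alone
print("MAE: ", round(mae, 3))
print("MSE: ", round(mse, 3))
print("RMSE:", round(rmse, 3))
print("R²:  ", round(r2, 4))
```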
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
Firstly, we will choose the number of neighbors; here we choose K = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance
is the distance between two points, which we have already studied in geometry. For two points
(x1, y1) and (x2, y2) it can be calculated as:

d = sqrt( (x2 − x1)² + (y2 − y1)² )
By calculating the Euclidean distance we got the nearest neighbors: three nearest neighbors
in category A and two nearest neighbors in category B. Consider the below image:
Since three of the five nearest neighbors are from category A, the new data point must
belong to category A.
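The steps above (choose K, compute Euclidean distances, take a majority vote) can be sketched as follows; the points are hypothetical stand-ins for the categories A and B in the figures:

```python
import numpy as np

# Hypothetical 2-D points for category A and category B
category_a = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [2.5, 2.5]])
category_b = np.array([[6.0, 6.0], [6.5, 7.0], [7.0, 6.5], [7.5, 7.5]])

points = np.vstack([category_a, category_b])
labels = np.array(["A"] * len(category_a) + ["B"] * len(category_b))

new_point = np.array([2.5, 3.0])

# Step 1: choose the number of neighbors, K = 5
k = 5

# Step 2: Euclidean distance from the new point to every stored point
distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))

# Step 3: take the K nearest neighbors and use a majority vote
nearest = labels[np.argsort(distances)[:k]]
values, counts = np.unique(nearest, return_counts=True)
prediction = values[np.argmax(counts)]
print("Predicted category:", prediction)  # A: the A points outnumber B among the 5 nearest
```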
There is no particular way to determine the best value for "K", so we need to try some values to
find the best out of them. The most preferred value for K is 5.
A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of
outliers.
Large values for K reduce noise, but they can over-smooth the decision boundary and increase
the cost of computing distances to many neighbors.
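One way to see the effect of K: a hypothetical training set with a single mislabelled outlier, where K=1 follows the outlier but K=5 votes it down (sketch using sklearn's KNeighborsClassifier):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: an A cluster with one mislabelled
# "B" outlier sitting inside it
X = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [1.1, 1.3],  # category A
     [5.0, 5.0], [5.2, 4.8], [4.9, 5.1],              # category B
     [1.4, 1.2]]                                      # outlier labelled B
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

new_point = [[1.3, 1.1]]

# K=1 follows the single nearest neighbor, so the outlier wins
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("K=1 predicts:", knn1.predict(new_point)[0])  # B (misled by the outlier)

# K=5 takes a majority vote, which smooths the outlier away
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("K=5 predicts:", knn5.predict(new_point)[0])  # A
```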
3. In Linear Regression, we predict a continuous numeric value. In Logistic Regression, we predict the value as 1 or 0.
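The contrast in that row can be demonstrated with a short sketch (the data below is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data: one feature, two kinds of target
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_numeric = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9])  # continuous target
y_binary = np.array([0, 0, 0, 1, 1, 1])               # 0/1 target

# Linear Regression predicts a continuous numeric value
lin = LinearRegression().fit(X, y_numeric)
print("Linear prediction:", lin.predict([[3.5]])[0])  # a value between 3 and 4

# Logistic Regression predicts a class label, 1 or 0
log = LogisticRegression().fit(X, y_binary)
print("Logistic prediction:", log.predict([[5.0]])[0])
```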