
IML Practical Assignment


Department of Computer Engineering

Subject: Introduction to Machine Learning (4350702)

Semester: 5

Practical Assignment: (Unit 2 - Python libraries suitable for Machine Learning)


1. Explore any one machine learning tool. (like Weka, Tensorflow, Scikit-learn, Colab,
etc.)
This answer explores Scikit-learn, a popular machine learning library for Python.

Overview of Scikit-learn

Scikit-learn (often abbreviated as sklearn) is a versatile and user-friendly machine learning
library that provides simple and efficient tools for data mining and data analysis. It is built on
top of other scientific libraries such as NumPy, SciPy, and Matplotlib.

Key Features

1. Algorithms: Scikit-learn offers a wide range of machine learning algorithms for
classification, regression, clustering, and dimensionality reduction. Examples include:
o Classification: Support Vector Machines (SVM), Random Forests, K-Nearest
Neighbors (KNN), and Logistic Regression.
o Regression: Linear Regression, Ridge Regression, and Decision Trees.
o Clustering: K-Means, DBSCAN, and Hierarchical Clustering.
o Dimensionality Reduction: Principal Component Analysis (PCA) and t-Distributed
Stochastic Neighbor Embedding (t-SNE).
2. Preprocessing: It includes tools for data preprocessing like scaling, normalization,
and encoding categorical variables.
3. Model Selection: Scikit-learn provides utilities for splitting datasets, cross-validation,
and hyperparameter tuning, such as GridSearchCV and RandomizedSearchCV.
4. Evaluation Metrics: It includes various metrics to evaluate model performance, such
as accuracy, precision, recall, F1-score, and confusion matrix.
5. Pipeline: Scikit-learn supports pipelines for creating streamlined workflows that
include data preprocessing and model training (a minimal sketch follows this list).
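
As a quick illustration of the pipeline feature, here is a minimal sketch that chains a scaler and a classifier into one estimator; the dataset and parameter choices are illustrative, not prescribed by the assignment:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Chain preprocessing and model training into a single estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),  # step 1: standardize the features
    ('svm', SVC(kernel='rbf'))     # step 2: fit an SVM on the scaled features
])

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe.fit(X_train, y_train)         # fits the scaler, then the SVM, in one call
print(pipe.score(X_test, y_test))  # scales the test data, then scores the SVM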

Example Workflow

Here's a basic workflow using Scikit-learn for a classification problem:

1. Import Libraries:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

2. Load Data:

data = load_iris()
X = data.data
y = data.target

3. Split Data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

4. Preprocess Data:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

5. Train Model:

model = SVC(kernel='linear')
model.fit(X_train_scaled, y_train)

6. Make Predictions:

y_pred = model.predict(X_test_scaled)

7. Evaluate Model:

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Why Use Scikit-learn?

- Ease of Use: It has a clean and consistent API that makes it easy to use and integrate into existing Python code.
- Comprehensive Documentation: Excellent documentation and a large community make it easier to find support and resources.
- Efficiency: Scikit-learn is optimized for performance and can handle large datasets effectively.
- Integration: It integrates well with other Python libraries, making it a great choice for building end-to-end machine learning solutions.

Conclusion

Scikit-learn is a powerful and user-friendly tool that caters to both beginners and experienced
practitioners in machine learning. Its extensive library of algorithms, preprocessing tools, and
model evaluation methods make it a go-to choice for many machine learning tasks in Python.
2. Write a NumPy program to implement the following operations:
- to convert a list of numeric values into a one-dimensional NumPy array
- to create a 3x3 matrix with values ranging from 2 to 10
- to append values at the end of an array
- to create another shape from an array without changing its data (3x2 to 2x3)

import numpy as np

# 1. Convert a list of numeric values into a one-dimensional NumPy array
numeric_list = [1, 2, 3, 4, 5]
array_1d = np.array(numeric_list)
print("One-dimensional array:")
print(array_1d)

# 2. Create a 3x3 matrix with values ranging from 2 to 10
# np.arange generates values from 2 to 10, which reshape turns into a 3x3 matrix
values = np.arange(2, 11).reshape((3, 3))
print("\n3x3 matrix with values ranging from 2 to 10:")
print(values)

# 3. Append values at the end of an array
# Create an array to append
array_to_append = np.array([6, 7, 8])
appended_array = np.append(array_1d, array_to_append)
print("\nArray after appending values:")
print(appended_array)

# 4. Create another shape from an array without changing its data (from 3x2 to 2x3)
# First, create a 3x2 array
original_array = np.arange(6).reshape((3, 2))
print("\nOriginal 3x2 array:")
print(original_array)

# Reshape it to 2x3
reshaped_array = original_array.reshape((2, 3))
print("\nReshaped 2x3 array:")
print(reshaped_array)

Explanation:

1. Convert a List to a One-Dimensional NumPy Array:
o Use np.array() to convert the list numeric_list into a NumPy array.
2. Create a 3x3 Matrix with Values Ranging from 2 to 10:
o np.arange(2, 11) generates values from 2 to 10.
o .reshape((3, 3)) reshapes this range into a 3x3 matrix.
3. Append Values at the End of an Array:
o Use np.append() to add array_to_append to array_1d. Note that np.append returns a new array rather than modifying array_1d in place.
4. Reshape an Array without Changing its Data:
o Create a 3x2 array with np.arange(6).reshape((3, 2)).
o Use .reshape((2, 3)) to change its shape to 2x3 (see the note on views below).
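
A brief note on the last point: under normal NumPy semantics, reshape returns a view of the same underlying data whenever possible, so writing through the reshaped array also changes the original. A minimal sketch:

import numpy as np

original_array = np.arange(6).reshape((3, 2))
reshaped_array = original_array.reshape((2, 3))

# reshape reuses the same data buffer when it can (no copy is made),
# so a write through the view is visible in the original array
reshaped_array[0, 0] = 99
print(original_array[0, 0])  # prints 99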

3. Write a NumPy program to implement the following operations:
- to split an array of 14 elements into 3 arrays, each with 2, 4, and 8 elements, in the original order
- to stack arrays horizontally (column wise)
Here's a step-by-step NumPy program to achieve this:


import numpy as np

# Create an initial array of 14 elements
original_array = np.arange(1, 15)  # array with values [1, 2, 3, ..., 14]

# Split the array into three parts with sizes 2, 4, and 8
array1 = original_array[:2]   # First 2 elements
array2 = original_array[2:6]  # Next 4 elements
array3 = original_array[6:]   # Remaining 8 elements

# Stack the arrays horizontally; for 1-D arrays, np.hstack joins them end to end,
# so it works even though the three pieces have different lengths
stacked_array = np.hstack((array1, array2, array3))

# Print the results


print("Original Array:")
print(original_array)
print("\nArray 1 (2 elements):")
print(array1)
print("\nArray 2 (4 elements):")
print(array2)
print("\nArray 3 (8 elements):")
print(array3)
print("\nStacked Array (Horizontally):")
print(stacked_array)

Explanation:

1. Creating the Original Array:
o np.arange(1, 15) generates an array with values from 1 to 14.
2. Splitting the Array:
o array1 takes the first 2 elements.
o array2 takes the next 4 elements.
o array3 takes the remaining 8 elements.
3. Stacking Horizontally:
o np.hstack((array1, array2, array3)) joins 1-D arrays end to end, so it works even though the arrays have different lengths.
o np.column_stack, by contrast, treats each input as a column and requires all inputs to be the same length; calling it directly on these arrays would raise a ValueError. (An idiomatic one-call alternative for the split itself, np.split, is sketched after the padded example below.)

Here's how you can still build a column-wise stack by padding the shorter arrays to equal length:

import numpy as np

# Create an initial array of 14 elements


original_array = np.arange(1, 15)

# Split the array into three parts with sizes 2, 4, and 8


array1 = original_array[:2]
array2 = original_array[2:6]
array3 = original_array[6:]

# Printing arrays to show their separation


print("Original Array:")
print(original_array)
print("\nArray 1 (2 elements):")
print(array1)
print("\nArray 2 (4 elements):")
print(array2)
print("\nArray 3 (8 elements):")
print(array3)

# Optionally, pad the arrays to the same length so they can be stacked column-wise.
# The arrays are converted to float first, because NaN cannot be stored in an integer array.
max_length = max(len(array1), len(array2), len(array3))
array1_padded = np.pad(array1.astype(float), (0, max_length - len(array1)), constant_values=np.nan)
array2_padded = np.pad(array2.astype(float), (0, max_length - len(array2)), constant_values=np.nan)
array3_padded = np.pad(array3.astype(float), (0, max_length - len(array3)), constant_values=np.nan)

# Stack arrays horizontally


stacked_array = np.column_stack((array1_padded, array2_padded, array3_padded))

print("\nStacked Array (Horizontally, with padding):")


print(stacked_array)
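
For completeness, the splitting step can also be done in a single call with np.split, which cuts a 1-D array just before the given indices; a minimal sketch:

import numpy as np

original_array = np.arange(1, 15)

# np.split cuts before indices 2 and 6, giving pieces of length 2, 4, and 8
array1, array2, array3 = np.split(original_array, [2, 6])
print(array1)  # [1 2]
print(array2)  # [3 4 5 6]
print(array3)  # [ 7  8  9 10 11 12 13 14]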

4. Write a NumPy program to implement the following operations:
- to add, subtract, multiply, divide arguments element-wise
- to round elements of the array to the nearest integer
- to calculate the mean across a dimension in a 2D NumPy array
- to calculate the difference between neighboring elements, element-wise, of a given array

import numpy as np

# Sample arrays for demonstration


array1 = np.array([1.5, 2.5, 3.5, 4.5])
array2 = np.array([2.0, 2.0, 2.0, 2.0])

# 1. Element-wise operations: add, subtract, multiply, divide


addition = np.add(array1, array2)
subtraction = np.subtract(array1, array2)
multiplication = np.multiply(array1, array2)
division = np.divide(array1, array2)

# Print results of element-wise operations


print("Element-wise Addition:")
print(addition)
print("\nElement-wise Subtraction:")
print(subtraction)
print("\nElement-wise Multiplication:")
print(multiplication)
print("\nElement-wise Division:")
print(division)

# 2. Round elements of the array to the nearest integer


rounded_array = np.round(array1)

# Print rounded array


print("\nRounded Array:")
print(rounded_array)

# 3. Mean across dimensions in a 2D numpy array


matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Mean across each row


mean_rows = np.mean(matrix, axis=1)

# Mean across each column


mean_columns = np.mean(matrix, axis=0)
# Print mean values
print("\nMean Across Rows:")
print(mean_rows)
print("\nMean Across Columns:")
print(mean_columns)

# 4. Difference between neighboring elements, element-wise


difference_array = np.diff(array1)

# Print differences
print("\nDifference Between Neighboring Elements:")
print(difference_array)

Explanation:

1. Element-wise Operations:
o np.add, np.subtract, np.multiply, and np.divide perform element-wise
operations on the arrays array1 and array2.
2. Rounding Elements:
o np.round(array1) rounds each element of array1 to the nearest integer. Note that NumPy rounds halfway values to the nearest even integer, so 1.5 and 2.5 both become 2.0, while 3.5 and 4.5 both become 4.0.
3. Mean Across Dimensions:
o np.mean(matrix, axis=1) calculates the mean of each row.
o np.mean(matrix, axis=0) calculates the mean of each column.
4. Difference Between Neighboring Elements:
o np.diff(array1) calculates the difference between each pair of neighboring
elements in array1.
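
As a side note on the element-wise operations above, the same results can be written with the ordinary arithmetic operators, which NumPy overloads to call the same element-wise functions; a minimal sketch:

import numpy as np

array1 = np.array([1.5, 2.5, 3.5, 4.5])
array2 = np.array([2.0, 2.0, 2.0, 2.0])

# +, -, *, / dispatch to the same ufuncs as np.add, np.subtract, np.multiply, np.divide
print(array1 + array2)
print(array1 - array2)
print(array1 * array2)
print(array1 / array2)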

5. Write a NumPy program to implement the following operations:
- to find the maximum and minimum value of a given flattened array
- to compute the mean, standard deviation, and variance of a given array along the second axis

import numpy as np

# Sample data
flattened_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# 1. Find the maximum and minimum value of the flattened array


max_value = np.max(flattened_array)
min_value = np.min(flattened_array)

# Print max and min values


print("Maximum value of flattened array:")
print(max_value)
print("\nMinimum value of flattened array:")
print(min_value)

# 2. Compute the mean, standard deviation, and variance of the 2D array along the second axis
mean_along_axis_1 = np.mean(matrix, axis=1)
std_dev_along_axis_1 = np.std(matrix, axis=1)
variance_along_axis_1 = np.var(matrix, axis=1)

# Print the computed statistics


print("\nMean along the second axis (axis=1):")
print(mean_along_axis_1)
print("\nStandard Deviation along the second axis (axis=1):")
print(std_dev_along_axis_1)
print("\nVariance along the second axis (axis=1):")
print(variance_along_axis_1)

Explanation:

1. Finding Maximum and Minimum Values:
o np.max(flattened_array) computes the maximum value of flattened_array.
o np.min(flattened_array) computes the minimum value of flattened_array.
2. Computing Statistics Along the Second Axis:
o np.mean(matrix, axis=1) calculates the mean of each row (along the second axis).
o np.std(matrix, axis=1) calculates the standard deviation of each row.
o np.var(matrix, axis=1) calculates the variance of each row.

In the code:

- flattened_array is a simple 1D array used to find maximum and minimum values.
- matrix is a 2D array used to compute statistics along the rows (see the note on ddof below).
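
One detail worth knowing when comparing results across libraries: np.std and np.var use the population formulas (ddof=0) by default, whereas pandas defaults to the sample formulas (ddof=1). A small sketch with illustrative values:

import numpy as np

row = np.array([1, 2, 3, 4])

print(np.std(row))          # population standard deviation (ddof=0)
print(np.std(row, ddof=1))  # sample standard deviation, matching the pandas default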

6. Write a Pandas program to implement the following operations:
- to convert a NumPy array to a Pandas Series
- to convert the first column of a DataFrame to a Series
- to compute the mean and standard deviation of the data of a given Series
- to sort a given Series

import pandas as pd
import numpy as np

# Sample data
numpy_array = np.array([10, 20, 30, 40, 50])
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# 1. Convert a NumPy array to a Pandas Series


series_from_numpy = pd.Series(numpy_array)

# Print the Pandas Series created from NumPy array


print("Pandas Series from NumPy array:")
print(series_from_numpy)

# 2. Convert the first column of the DataFrame to a Series


first_column_series = df.iloc[:, 0]

# Print the Series created from the first column of the DataFrame
print("\nFirst column of DataFrame as a Pandas Series:")
print(first_column_series)

# 3. Create the mean and standard deviation of the data of a given Series
mean_of_series = first_column_series.mean()
std_dev_of_series = first_column_series.std()

# Print mean and standard deviation


print("\nMean of the Series:")
print(mean_of_series)
print("\nStandard Deviation of the Series:")
print(std_dev_of_series)

# 4. Sort the given Series


sorted_series = first_column_series.sort_values()

# Print the sorted Series


print("\nSorted Series:")
print(sorted_series)

Explanation:

1. Convert a NumPy Array to a Pandas Series:
o pd.Series(numpy_array) creates a Pandas Series from a NumPy array.
2. Convert the First Column of a DataFrame to a Series:
o df.iloc[:, 0] extracts the first column from the DataFrame as a Pandas Series. .iloc is used to access rows and columns by integer index.
3. Compute the Mean and Standard Deviation:
o first_column_series.mean() computes the mean of the Series.
o first_column_series.std() computes the sample standard deviation of the Series (pandas defaults to ddof=1, as noted in the previous exercise).
4. Sort the Given Series:
o first_column_series.sort_values() sorts the Series in ascending order.

7. Write a Pandas program to implement the following operations:
- to create a DataFrame from a dictionary and display it
- to sort the DataFrame by 'name' in ascending order
- to delete one specific column from the DataFrame
- to write a DataFrame to a CSV file using a tab separator

import pandas as pd

# Sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

# 1. Create a DataFrame from the dictionary and display it


df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# 2. Sort the DataFrame by 'name' in ascending order


df_sorted = df.sort_values(by='name')

# Print the sorted DataFrame


print("\nDataFrame sorted by 'name' in ascending order:")
print(df_sorted)

# 3. Delete the 'age' column from the DataFrame


df_dropped = df_sorted.drop(columns=['age'])

# Print the DataFrame after deleting the column


print("\nDataFrame after deleting the 'age' column:")
print(df_dropped)

# 4. Write the DataFrame to a CSV file using tab separator


df_dropped.to_csv('output_data.tsv', sep='\t', index=False)

print("\nDataFrame has been written to 'output_data.tsv' with tab separation.")

Explanation:

1. Create a DataFrame from a Dictionary and Display It:
o pd.DataFrame(data) creates a DataFrame from the provided dictionary data.
o print(df) displays the DataFrame.
2. Sort the DataFrame by 'name' in Ascending Order:
o df.sort_values(by='name') sorts the DataFrame based on the 'name' column.
o The result is assigned to df_sorted and then displayed.
3. Delete a Specific Column from the DataFrame:
o df_sorted.drop(columns=['age']) removes the 'age' column from the DataFrame.
o The result is assigned to df_dropped and displayed.
4. Write the DataFrame to a CSV File Using a Tab Separator:
o df_dropped.to_csv('output_data.tsv', sep='\t', index=False) writes the DataFrame to a file named 'output_data.tsv' with tab characters as separators. index=False prevents writing row indices to the file. (A quick read-back check follows.)
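
As a quick sanity check (assuming the file was written to the current working directory), the tab-separated file can be read back with the same separator:

import pandas as pd

# sep='\t' must match the separator used when the file was written
df_check = pd.read_csv('output_data.tsv', sep='\t')
print(df_check)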

8. Write a Pandas program to create a line plot of the opening, closing stock prices of
given company between two specific dates.

To create a line plot of the opening and closing stock prices for a company between two
specific dates using Pandas and Matplotlib, you'll need to follow these steps:

1. Load the stock price data into a Pandas DataFrame.
2. Filter the data for the specified date range.
3. Plot the opening and closing prices on a line plot.

For this example, let's assume you have stock price data in a CSV file with columns for Date,
Open, and Close. Here’s how you can implement this:

import pandas as pd
import matplotlib.pyplot as plt

# Load stock price data from a CSV file
# Make sure to replace 'stock_data.csv' with the path to your CSV file
# CSV file should have columns: Date, Open, Close
df = pd.read_csv('stock_data.csv')

# Convert the 'Date' column to datetime format


df['Date'] = pd.to_datetime(df['Date'])

# Set 'Date' as the index of the DataFrame
df.set_index('Date', inplace=True)

# Define the date range for filtering
start_date = '2023-01-01'
end_date = '2023-12-31'

# Filter the DataFrame for the date range


filtered_df = df.loc[start_date:end_date]

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(filtered_df.index, filtered_df['Open'], label='Opening Price', color='blue')
plt.plot(filtered_df.index, filtered_df['Close'], label='Closing Price', color='red')

# Adding titles and labels


plt.title('Stock Prices from {} to {}'.format(start_date, end_date))
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()

# Display the plot


plt.grid(True)
plt.show()

Explanation:

1. Load Stock Price Data:
o Use pd.read_csv('stock_data.csv') to read the stock price data from a CSV file into a DataFrame. Ensure that the CSV file contains Date, Open, and Close columns. (A sketch for generating a sample file follows the explanation.)
2. Convert 'Date' to Datetime Format:
o pd.to_datetime(df['Date']) converts the 'Date' column to datetime objects for easier manipulation.
3. Set 'Date' as Index:
o df.set_index('Date', inplace=True) sets the 'Date' column as the index of the DataFrame, making it easier to filter by date.
4. Filter Data for the Specified Date Range:
o df.loc[start_date:end_date] filters the DataFrame to include only rows within the specified date range.
5. Plotting:
o plt.plot() creates line plots for the opening and closing prices.
o plt.title(), plt.xlabel(), plt.ylabel(), and plt.legend() add a title, axis labels, and a legend to the plot.
o plt.grid(True) adds a grid to the plot for better readability.
o plt.show() displays the plot.
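
If you do not have a real stock_data.csv at hand, a small synthetic file with the assumed Date/Open/Close columns can be generated as below; the prices are random values made up purely so that the program above runs end to end:

import numpy as np
import pandas as pd

# Generate one year of made-up daily prices with the column names assumed above
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
rng = np.random.default_rng(42)
open_prices = 100 + rng.normal(0, 1, len(dates)).cumsum()
close_prices = open_prices + rng.normal(0, 0.5, len(dates))

sample_df = pd.DataFrame({'Date': dates, 'Open': open_prices, 'Close': close_prices})
sample_df.to_csv('stock_data.csv', index=False)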

9. Write a Pandas program to create a plot of the Open, High, Low, Close, Adjusted Close prices and Volume of a given company between two specific dates.
To create a comprehensive plot of multiple stock prices (Open, High, Low, Close, Adjusted
Close) and Volume between two specific dates using Pandas and Matplotlib, follow these
steps:

1. Load the stock price data into a Pandas DataFrame.
2. Filter the data for the specified date range.
3. Plot the different stock price metrics and volume.

Below is an example program that demonstrates these steps:

import pandas as pd
import matplotlib.pyplot as plt

# Load stock price data from a CSV file


# Replace 'stock_data.csv' with the path to your actual CSV file
# CSV file should have columns: Date, Open, High, Low, Close, Adj Close, Volume
df = pd.read_csv('stock_data.csv')

# Convert the 'Date' column to datetime format


df['Date'] = pd.to_datetime(df['Date'])

# Set 'Date' as the index of the DataFrame


df.set_index('Date', inplace=True)

# Define the date range for filtering


start_date = '2023-01-01'
end_date = '2023-12-31'

# Filter the DataFrame for the date range


filtered_df = df.loc[start_date:end_date]

# Plotting
fig, ax1 = plt.subplots(figsize=(14, 7))

# Plot Open, High, Low, Close, and Adjusted Close prices on the primary y-axis
ax1.plot(filtered_df.index, filtered_df['Open'], label='Open', color='blue')
ax1.plot(filtered_df.index, filtered_df['High'], label='High', color='green')
ax1.plot(filtered_df.index, filtered_df['Low'], label='Low', color='red')
ax1.plot(filtered_df.index, filtered_df['Close'], label='Close', color='orange')
ax1.plot(filtered_df.index, filtered_df['Adj Close'], label='Adjusted Close', color='purple')

# Adding titles and labels for the first y-axis


ax1.set_title('Stock Prices and Volume from {} to {}'.format(start_date, end_date))
ax1.set_xlabel('Date')
ax1.set_ylabel('Price')
ax1.legend(loc='upper left')

# Create a secondary y-axis to plot Volume
ax2 = ax1.twinx()
ax2.plot(filtered_df.index, filtered_df['Volume'], label='Volume', color='gray',
         linestyle='--', alpha=0.6)
ax2.set_ylabel('Volume', color='gray')
ax2.legend(loc='upper right')

# Adding grid
ax1.grid(True)

# Display the plot

plt.show()

Explanation:

1. Load Stock Price Data:
o pd.read_csv('stock_data.csv') reads the CSV file into a DataFrame. Ensure the CSV file has columns: Date, Open, High, Low, Close, Adj Close, and Volume.
2. Convert 'Date' to Datetime Format:
o pd.to_datetime(df['Date']) converts the 'Date' column to datetime objects for easy filtering.
3. Set 'Date' as Index:
o df.set_index('Date', inplace=True) sets the 'Date' column as the DataFrame index, simplifying date-based filtering.
4. Filter Data for the Specified Date Range:
o df.loc[start_date:end_date] filters the DataFrame to include data between the specified dates.
5. Plotting:
o Create a primary y-axis plot for the Open, High, Low, Close, and Adjusted Close prices using ax1.plot().
o Add a secondary y-axis for the Volume data using ax1.twinx() and ax2.plot() to create a dual-axis plot.
o Add titles, labels, and legends to improve readability and clarity. (A bar-chart variant for Volume is sketched below.)
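
Volume is often easier to read as bars than as a dashed line. Assuming the same filtered_df and ax1 as in the program above, the secondary-axis section can be replaced with:

# Alternative: draw Volume as semi-transparent bars on the secondary axis
ax2 = ax1.twinx()
ax2.bar(filtered_df.index, filtered_df['Volume'], color='gray', alpha=0.3, label='Volume')
ax2.set_ylabel('Volume', color='gray')
ax2.legend(loc='upper right')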

10. Write a Pandas program to implement the following operations:
- to find and drop the missing values from the given dataset
- to remove the duplicates from the given dataset

Here's how you can accomplish these tasks using Pandas:

1. Find and Drop Missing Values
2. Remove Duplicates

Below is a complete Pandas program demonstrating these operations:

import pandas as pd

# Sample dataset with missing values and duplicates


data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Charlie'],
    'Age': [25, None, 35, 25, 40, None],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Houston', 'Chicago']
}

# Create a DataFrame from the sample data


df = pd.DataFrame(data)

# Display the original DataFrame


print("Original DataFrame:")
print(df)

# 1. Find and drop missing values


# Dropping rows with any missing values
df_dropped_missing = df.dropna()

# Alternatively, to drop missing values in specific columns:


# df_dropped_missing = df.dropna(subset=['Age'])

# Print DataFrame after dropping missing values


print("\nDataFrame after dropping missing values:")
print(df_dropped_missing)

# 2. Remove duplicates
# Dropping duplicate rows based on all columns
df_dropped_duplicates = df_dropped_missing.drop_duplicates()

# Alternatively, to drop duplicates based on specific columns:


# df_dropped_duplicates = df_dropped_missing.drop_duplicates(subset=['Name'])

# Print DataFrame after removing duplicates


print("\nDataFrame after removing duplicates:")
print(df_dropped_duplicates)

Explanation:

1. Find and Drop Missing Values:
o df.dropna() removes any rows that contain missing values (NaN) in any column.
o To drop missing values in specific columns only, you can use df.dropna(subset=['ColumnName']).
2. Remove Duplicates:
o df_dropped_missing.drop_duplicates() removes duplicate rows based on all columns.
o To remove duplicates based on specific columns, you can use df_dropped_missing.drop_duplicates(subset=['ColumnName']).

Additional Notes:

- Missing Values Handling: sometimes, instead of dropping missing values, you might choose to fill them with a specific value or use interpolation. For example, df.fillna(value=0) replaces missing values with 0 (a short sketch follows).
- Duplicates Handling: removing duplicates helps ensure that the dataset is unique, which is important for accurate analysis and model training.
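
Following up on the note about filling instead of dropping, here is a minimal sketch that fills the missing Age values with the column mean rather than removing those rows:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Charlie'],
    'Age': [25, None, 35, 25, 40, None],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Houston', 'Chicago']
}
df = pd.DataFrame(data)

# Fill missing ages with the mean of the observed ages instead of dropping the rows
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)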

11. Write a Pandas program to filter all columns where all entries present, check which
rows and columns has a NaN and finally drop rows with any NaNs from the given
dataset.

The program below achieves the following:

1. Filter Columns Where All Entries Are Present (i.e., No NaNs).
2. Check Which Rows and Columns Have NaNs.
3. Drop Rows with Any NaNs from the Dataset.

Here's a complete program that demonstrates these steps:

import pandas as pd

# Sample dataset with missing values


data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, None, 35, 40, None],
    'City': ['New York', 'Los Angeles', 'Chicago', None, 'Houston'],
    'Salary': [50000, 60000, None, 70000, 80000]
}

# Create a DataFrame from the sample data


df = pd.DataFrame(data)

# Display the original DataFrame


print("Original DataFrame:")
print(df)

# 1. Filter columns where all entries are present (i.e., no NaNs)


columns_with_no_nans = df.columns[df.notna().all()].tolist()
df_no_nans_columns = df[columns_with_no_nans]

# Print DataFrame with columns having all entries present


print("\nDataFrame with columns having all entries present:")
print(df_no_nans_columns)

# 2. Check which rows and columns have NaNs


rows_with_nans = df[df.isna().any(axis=1)]
columns_with_nans = df.columns[df.isna().any()].tolist()

# Print rows with NaNs


print("\nRows with NaNs:")
print(rows_with_nans)

# Print columns with NaNs


print("\nColumns with NaNs:")
print(columns_with_nans)

# 3. Drop rows with any NaNs


df_dropped_rows = df.dropna()

# Print DataFrame after dropping rows with any NaNs


print("\nDataFrame after dropping rows with any NaNs:")
print(df_dropped_rows)

Explanation:

1. Filter Columns Where All Entries Are Present:
o df.notna().all() returns a boolean Series where True indicates columns with no NaNs.
o df.columns[df.notna().all()] gets the column names that have all non-NaN entries.
o df[columns_with_no_nans] filters the DataFrame to include only these columns.
2. Check Which Rows and Columns Have NaNs:
o df.isna().any(axis=1) identifies rows with any NaN values.
o df[df.isna().any(axis=1)] filters the DataFrame to show only those rows.
o df.columns[df.isna().any()] gets the names of columns that contain NaNs. (A sketch for counting NaNs per column and per row follows.)
3. Drop Rows with Any NaNs:
o df.dropna() removes any rows with one or more NaN values.
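
A handy complement to step 2 is counting, rather than just locating, the missing values. Assuming the same sample data as above:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, None, 35, 40, None],
    'City': ['New York', 'Los Angeles', 'Chicago', None, 'Houston'],
    'Salary': [50000, 60000, None, 70000, 80000]
}
df = pd.DataFrame(data)

# Count missing values per column and per row
print(df.isna().sum())        # NaNs in each column
print(df.isna().sum(axis=1))  # NaNs in each row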

12. Write a Python program using Scikit-learn to print the keys, number of rows and columns, feature names, and the description of the given data.

from sklearn.datasets import load_iris  # You can replace this with any dataset you want to use

# Load the dataset
data = load_iris()  # You can also use load_wine(), load_diabetes(), or any other dataset

# Print the keys of the dataset


print("Keys of the dataset:")
print(data.keys())

# Get the number of rows and columns


X = data.data # Features
y = data.target # Target

num_rows, num_columns = X.shape
print("\nNumber of rows and columns:")
print(f"Rows: {num_rows}, Columns: {num_columns}")

# Print the feature names


print("\nFeature names:")
print(data.feature_names)

# Print the dataset description


print("\nDataset description:")
print(data.DESCR)

Explanation:

1. Load the Dataset:
o load_iris() loads the Iris dataset. You can replace this with other datasets like load_wine(), load_diabetes(), etc., from Scikit-learn. (Note that load_boston() was removed from recent versions of Scikit-learn.)
2. Print the Keys of the Dataset:
o data.keys() shows the keys in the dataset dictionary. Typically, these include data, target, feature_names, DESCR, etc.
3. Get the Number of Rows and Columns:
o data.data contains the feature matrix. X.shape gives the shape of this matrix, where the first value is the number of rows and the second value is the number of columns.
4. Print the Feature Names:
o data.feature_names provides the names of the features in the dataset.
5. Print the Dataset Description:
o data.DESCR provides a detailed description of the dataset, including information about the features and any additional notes.

Note:

- Scikit-learn datasets come with useful attributes like data, target, feature_names, and DESCR, which provide a comprehensive view of the dataset.
- Make sure to install Scikit-learn if it's not already installed. You can do this via pip:
pip install scikit-learn

13. Write a Python program to implement the K-Nearest Neighbour supervised machine learning algorithm for a given dataset.

To implement the K-Nearest Neighbors (KNN) supervised machine learning algorithm using
Scikit-learn, you need to follow these steps:

1. Load and Prepare the Dataset: Load the dataset and split it into training and test
sets.
2. Create and Train the KNN Model: Initialize the KNN model and train it on the
training data.
3. Evaluate the Model: Predict on the test set and evaluate the model’s performance.

Here's a complete Python program that demonstrates these steps using the Iris dataset as an
example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset


data = load_iris()
X = data.data # Features
y = data.target # Target

# Split the dataset into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Initialize the K-Nearest Neighbors classifier


# You can specify the number of neighbors, e.g., n_neighbors=5
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model


knn.fit(X_train, y_train)

# Predict on the test set


y_pred = knn.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the results


print("Accuracy of the KNN model:")
print(accuracy)
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

Explanation:

1. Load and Prepare the Dataset:
o load_iris() loads the Iris dataset, which is a common dataset for classification tasks.
o X contains the feature matrix, and y contains the target labels.
o train_test_split() splits the dataset into training and test sets. test_size=0.3 means 30% of the data is used for testing, and random_state=42 ensures reproducibility.
2. Create and Train the KNN Model:
o KNeighborsClassifier(n_neighbors=5) initializes the KNN classifier with 5 neighbors. You can adjust n_neighbors based on your needs (a cross-validation sketch for choosing k follows the notes below).
o knn.fit(X_train, y_train) trains the KNN model on the training data.
3. Evaluate the Model:
o knn.predict(X_test) generates predictions on the test set.
o accuracy_score(y_test, y_pred) calculates the accuracy of the model.
o confusion_matrix(y_test, y_pred) provides a confusion matrix, which shows the number of correct and incorrect predictions.
o classification_report(y_test, y_pred) provides a detailed report including precision, recall, and F1-score for each class.

Notes:

- Ensure you have Scikit-learn installed. You can install it using pip if necessary:

pip install scikit-learn
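
The choice n_neighbors=5 above is a common default rather than a tuned value. A minimal sketch of choosing k by cross-validation on the training data (KNN also benefits from feature scaling, omitted here for brevity):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Evaluate several candidate values of k with 5-fold cross-validation
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")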

14. Write a Python program to implement a machine learning algorithm for given
dataset. (It is recommended to assign different machine learning algorithms group
wise – micro project)

To implement various machine learning algorithms on a given dataset, you should follow a
structured approach. For this example, I will demonstrate how to implement three common
machine learning algorithms using Scikit-learn on a dataset:

1. Logistic Regression
2. Decision Tree Classifier
3. Support Vector Machine (SVM)

We'll use the Iris dataset for this demonstration. This dataset is often used for classification
tasks and comes built-in with Scikit-learn.

Here's a Python program that covers these steps:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset


data = load_iris()
X = data.data # Features
y = data.target # Target

# Split the dataset into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train Logistic Regression model


log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# Predict and evaluate Logistic Regression model


y_pred_log_reg = log_reg.predict(X_test)
print("Logistic Regression Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log_reg))
print("Classification Report:\n", classification_report(y_test, y_pred_log_reg))

# Initialize and train Decision Tree Classifier model


decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)

# Predict and evaluate Decision Tree Classifier model


y_pred_decision_tree = decision_tree.predict(X_test)
print("\nDecision Tree Classifier Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_decision_tree))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_decision_tree))
print("Classification Report:\n", classification_report(y_test, y_pred_decision_tree))

# Initialize and train Support Vector Machine (SVM) model


svm = SVC()
svm.fit(X_train, y_train)

# Predict and evaluate Support Vector Machine model


y_pred_svm = svm.predict(X_test)
print("\nSupport Vector Machine Results:")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))

Explanation:

1. Load the Dataset:
o load_iris() loads the Iris dataset. You can replace it with another dataset if needed.
2. Split the Dataset:
o train_test_split() is used to divide the data into training and test sets.
3. Logistic Regression:
o LogisticRegression(max_iter=200) initializes the logistic regression model.
The max_iter parameter ensures the model has enough iterations to converge.
o fit() trains the model on the training data.
o predict() generates predictions on the test data.
o accuracy_score(), confusion_matrix(), and classification_report()
evaluate the model’s performance.
4. Decision Tree Classifier:
o DecisionTreeClassifier() initializes the decision tree model.
o fit() trains the model on the training data.
o predict(), accuracy_score(), confusion_matrix(), and
classification_report() are used similarly to evaluate the model.
5. Support Vector Machine (SVM):
o SVC() initializes the support vector classifier.
o fit() trains the model.
o predict(), accuracy_score(), confusion_matrix(), and
classification_report() evaluate the SVM model.

Additional Notes:

- Hyperparameter Tuning: for each algorithm, you can perform hyperparameter tuning to improve performance, for example with GridSearchCV (a minimal sketch follows).
- Feature Scaling: for SVM, it's often beneficial to scale features using StandardScaler. This step is omitted in the program above for simplicity but is worth considering for real-world applications.
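
Tying the two notes together, here is a minimal sketch of scaling plus hyperparameter tuning for the SVM; the parameter grid is illustrative, not exhaustive:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pipeline: scale the features, then fit an SVM; grid-search C and the kernel
pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
param_grid = {'svm__C': [0.1, 1, 10], 'svm__kernel': ['linear', 'rbf']}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))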
