
Lab Manual

Name: Laiba Naveed

Roll No: 2K22-BSCS-156

Section: Red
Lab 02
Python Basics for Data Science and
Analysis
Variable Declaration and Printing
Variable declaration in Python is a straightforward process that allows you to store data
for future use. A variable is created by assigning a value to a name using the
assignment operator =. Here’s how you can declare different types of variables:
# Integer variable
age = 25

# Float variable
height = 5.9

# String variable
name = "Alice"

# Boolean variable
is_student = True

Printing Variables
To display the value of a variable, the print() function is used. This function outputs the
variable's value to the console, facilitating debugging and verification of data.
print(age) # Output: 25
print(height) # Output: 5.9
print(name) # Output: Alice
print(is_student) # Output: True

Naming Conventions
When declaring variables, follow these common practices for naming:
• Use descriptive names (e.g., total_price instead of x).
• Use lowercase letters and underscores to separate words.
• Avoid starting variable names with numbers and using reserved keywords.
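For example:
# Descriptive, lowercase names separated by underscores
total_price = 19.99
items_in_cart = 3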
Understanding variable declaration and the printing of variable values is essential for
effective programming, especially in debugging scenarios where verifying the data
values can save time and enhance code functionality.
Checking Data Types
In Python, checking the data type of a variable is crucial for understanding the kind of
data you are dealing with, which can influence how you manipulate and perform
operations on that data. The built-in type() function allows you to determine the data
type of any variable. Below are common data types you might encounter, along with
examples:

Examples of Data Types


# Integer
num_int = 10
print(type(num_int)) # Output: <class 'int'>

# Float
num_float = 10.5
print(type(num_float)) # Output: <class 'float'>

# String
greeting = "Hello, World!"
print(type(greeting)) # Output: <class 'str'>

# List
fruits = ["apple", "banana", "cherry"]
print(type(fruits)) # Output: <class 'list'>

# Dictionary
person = {"name": "Alice", "age": 30}
print(type(person)) # Output: <class 'dict'>

Significance of Understanding Data Types


Knowing the data types of your variables is essential because:
• Error Prevention: Helps avoid errors related to incompatible operations (e.g.,
trying to add a string to an integer).
• Optimized Performance: Certain data types can lead to more efficient code
execution.
• Data Analysis: In data science, accurately identifying data types is crucial for
proper data manipulation and analysis techniques.
Ultimately, mastering data types contributes to more robust and error-free programming.
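As a small illustration of the error-prevention point, mixing a string and an integer raises a TypeError unless the value is converted first:
count = 3
# "Total: " + count            # Would raise a TypeError (cannot concatenate str and int)
print("Total: " + str(count))  # Converting with str() first works: Total: 3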

Unpacking List Elements


In Python, list unpacking is a convenient way to assign elements of a list to multiple
variables in a single statement. This method enhances code readability and efficiency,
making it particularly useful in scenarios where you want to extract data from a
collection.
Basic Example
Consider the following list of coordinates:
coordinates = [10, 20, 30]
x, y, z = coordinates

In this example, x, y, and z are assigned the values 10, 20, and 30, respectively. This
eliminates the need for indexing and can save time when working with multiple
elements.

Advantages of List Unpacking


• Improved Readability: By directly assigning values from a list to variables, the
code becomes cleaner and easier to understand.
• Fewer Lines of Code: Using unpacking reduces the lines of code needed to
extract values, streamlining your codebase.
• Dynamic Assignment: If you're working with lists of varying lengths, using an
asterisk can help. For instance:
first, *rest = [1, 2, 3, 4, 5]

Here, first will be 1, and rest will be [2, 3, 4, 5].

Use Cases
List unpacking is especially beneficial when handling data structures like tuples returned
from functions or when processing results from mathematical computations, allowing for
clearer, more concise code. It promotes efficient coding practices essential for scalable
data analysis in Python.
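For instance, a function that returns multiple values as a tuple can be unpacked directly into separate variables (a minimal sketch; min_max is a hypothetical helper defined here for illustration):
def min_max(values):
    return min(values), max(values)

lowest, highest = min_max([4, 7, 1, 9])  # lowest = 1, highest = 9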

Arithmetic Operations
Arithmetic operations are fundamental in Python and encompass the basic
mathematical computations: addition, subtraction, multiplication, and division. Each
operation is supported by an operator, enabling various numeric types, including
integers and floats.

Basic Operations
1. Addition (+)

sum_int = 5 + 7 # Output: 12 (integer)


sum_float = 5.5 + 2.5 # Output: 8.0 (float)

2. Subtraction (-)

difference_int = 10 - 3 # Output: 7 (integer)


difference_float = 9.5 - 4.0 # Output: 5.5 (float)

3. Multiplication (*)
product_int = 3 * 4 # Output: 12 (integer)
product_float = 7.0 * 3.5 # Output: 24.5 (float)

4. Division (/)

division_int = 10 / 2 # Output: 5.0 (float)


division_float = 5.0 / 2.0 # Output: 2.5 (float)

Operator Precedence
Python follows specific rules for operator precedence, determining the order in which
operations are performed. The hierarchy is as follows:
• Parentheses ()
• Exponentiation **
• Multiplication *, Division /, Floor Division //, and Modulo %
• Addition + and Subtraction -
For example:
result = 5 + 2 * 3 # Result: 11 (2 * 3 is calculated first)
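A few more examples covering the remaining operators in the hierarchy:
power = 2 ** 3 * 4       # Result: 32 (2 ** 3 is evaluated first)
quotient = 17 // 5       # Result: 3 (floor division discards the remainder)
remainder = 17 % 5       # Result: 2 (modulo keeps the remainder)
grouped = (5 + 2) * 3    # Result: 21 (parentheses override precedence)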

Understanding these operations and their precedence is crucial for effective


programming and accurate calculations within scripts.

Assignment Operators
Assignment operators in Python are used to assign values to variables. The simplest
form of an assignment operator is the = operator, which assigns the right-hand value to
the left-hand variable. In addition to standard assignment, Python also provides a set of
compound assignment operators that combine arithmetic operations with assignment
in a more concise way.

Standard Assignment
x = 10 # Assigns the value 10 to x

Compound Assignment Operators


Here are the common compound assignment operators and their usage:
1. Addition Assignment (+=)

This operator adds the right operand to the left operand and assigns the sum to
the left operand.
x = 5
x += 3 # Equivalent to x = x + 3; Now x = 8

2. Subtraction Assignment (-=)


This operator subtracts the right operand from the left operand and assigns the
result to the left operand.
x = 10
x -= 4 # Equivalent to x = x - 4; Now x = 6

3. Multiplication Assignment (*=)

This operator multiplies the left operand by the right operand and assigns the
product to the left operand.
x = 3
x *= 2 # Equivalent to x = x * 2; Now x = 6

4. Division Assignment (/=)

This operator divides the left operand by the right operand and assigns the
quotient to the left operand.
x = 12
x /= 3 # Equivalent to x = x / 3; Now x = 4.0

Summary of Behavior
Using compound assignment operators not only simplifies your code but also enhances
readability by reducing redundancy. It is crucial for beginners to understand these
operators as they form the basis for manipulating variables effectively in Python
programs.

Logical Operators
Logical operators in Python allow developers to combine multiple conditions within
conditional statements. The three primary logical operators are and, or, and not.
Understanding these operators is essential for creating complex decision-making
mechanisms in your code.

Logical Operators Overview


Operator   Description                                                                                    Example
and        Returns True if both operands are true; otherwise, it returns False.                          a > 5 and b < 10
or         Returns True if at least one of the operands is true; returns False only if both are false.   a > 5 or b < 10
not        Reverses the truth value of its operand (True if the operand is false, and vice versa).       not (a > 5)
Usage in Conditional Statements
Logical operators are commonly used in conditional statements to create more robust
conditions. Here are a few examples:
age = 20
is_student = True

# Using logical 'and'
if age > 18 and is_student:
    print("Eligible for student discount.")

# Using logical 'or'
if age < 18 or age > 65:
    print("Eligible for youth or senior citizen discount.")

# Using logical 'not'
if not is_student:
    print("Not a student.")

Common Use Cases and Best Practices


• Combining Conditions: Use and to ensure all conditions must be true, and or to
allow flexibility in your conditions.
• Negating Conditions: Use not for scenarios where you want to execute a block
when a condition is false.
When writing conditional logic, aim for clarity. Using parentheses can enhance
readability, particularly with complex conditions:
if (age > 18 and is_student) or (age < 12):
    print("Special conditions apply.")

By implementing these operators effectively, you enhance the decision-making


capabilities of your Python programs, ultimately resulting in cleaner, more efficient code.

Taking Input (Simulated)


In Python, user input is captured using the input() function. This function prompts the
user for data, which is then read as a string regardless of its content. It's a powerful
feature that enhances interactivity in programs.

Basic Usage
To collect a simple string input, you can use the following example:
user_name = input("Enter your name: ")
print(f"Hello, {user_name}!")
Input Types
While inputs are received as strings, you might need to convert them into other data
types, such as integers or floats. Here's how to handle different types of inputs:
1. String Input

favorite_color = input("What's your favorite color? ")  # e.g., "blue"

2. Integer Input To convert a string input into an integer:

age = int(input("Enter your age: ")) # Output: e.g., 25

3. Float Input For decimal values, you can use the float() function:

height = float(input("Enter your height in meters: "))  # e.g., 1.75

Considerations for Input Conversion


It's crucial to ensure that the input can be correctly converted to the desired type. You
should consider handling errors for invalid inputs using try and except statements:
try:
    age = int(input("Enter your age: "))
except ValueError:
    print("Please enter a valid number.")
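To keep prompting until the user enters a valid number, the conversion can be wrapped in a loop (a minimal sketch building on the example above):
while True:
    try:
        age = int(input("Enter your age: "))
        break  # Exit the loop once the conversion succeeds
    except ValueError:
        print("Please enter a valid number.")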

By mastering user input in Python, you open avenues for creating dynamic and user-
centered applications that respond effectively to real-time data.

String Operations
String operations in Python are fundamental for manipulating textual data. Python
provides various methods for working with strings, including concatenation, slicing, and
built-in string methods like upper() and lower(). Mastering these techniques is crucial for
effective data manipulation.

Concatenation
String concatenation is the process of combining two or more strings into one. This can
be achieved using the + operator:
first_name = "Alice"
last_name = "Johnson"
full_name = first_name + " " + last_name # Output: "Alice Johnson"

Slicing
String slicing allows you to extract a portion of a string using indices. For example:
message = "Hello, World!"
greeting = message[0:5] # Output: "Hello"

In this example, slicing retrieves characters from index 0 to 4.

String Methods
Python strings come equipped with many built-in methods for string manipulation:
• upper(): Converts all characters to uppercase.

print("hello".upper()) # Output: "HELLO"

• lower(): Converts all characters to lowercase.

print("WORLD".lower()) # Output: "world"

• strip(): Removes leading and trailing whitespace.

spaced_string = " Hello, World! "


print(spaced_string.strip()) # Output: "Hello, World!"

These operations are pivotal when cleaning and preparing text data for analysis,
contributing to more effective data processing tasks.
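These methods can also be chained when cleaning text, for example:
raw_text = "  Data Science  "
clean_text = raw_text.strip().lower()
print(clean_text)  # Output: "data science"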

String Slicing
String slicing in Python is a powerful feature that enables you to extract specific portions
of strings using indices. This flexibility allows for effective manipulation of textual data,
which is vital in data analysis and programming.

Basic Concept of String Slicing


A string in Python is indexed from 0. Here’s a simple example to illustrate slicing:
text = "Programming"
substring = text[0:4] # Output: "Prog"

In this example, characters from index 0 up to, but not including, index 4 are extracted.

Negative Indexing
Python also supports negative indexing, allowing you to reference positions from the
end of the string. For example:
text = "Programming"
last_three = text[-3:] # Output: "ing"

This code retrieves the last three characters of the string, demonstrating how negative
indices work.
Modifying Strings with Slices
While strings in Python are immutable, you can create modified versions by replacing
specific slices. Here’s how you can replace parts of a string:
text = "Hello World"
modified = text[:6] + "Python" # Output: "Hello Python"

In this instance, the substring from the beginning up to, but not including, index 6 ("Hello ",
including its trailing space) is concatenated with a new string.

Summary of Slicing Syntax


The general syntax for slicing is string[start:end:step]:
• start: The starting index (inclusive).
• end: The ending index (exclusive).
• step: The interval between each index (optional).
For instance:
text = "Hello World"
slice_example = text[::2] # Output: "HloWrd"

This example picks every second character from the string, showcasing the versatility of
slicing in data manipulation.

Conditional Statements
Conditional statements in Python allow you to execute different blocks of code based on
certain conditions. The primary conditional statements are if, elif (else if), and else.
These statements serve as the backbone of decision-making in programming.

Basic Structure
Here’s the syntax for a basic if statement:
if condition:
    # Execute this block

Example:
age = 18

if age >= 18:
    print("You are an adult.")

elif and else


You can extend if statements using elif and else for more complex conditions:
if condition1:
    # Execute if condition1 is true
elif condition2:
    # Execute if condition1 is false and condition2 is true
else:
    # Execute if both conditions are false

Example:
number = 0

if number > 0:
    print("Positive number")
elif number < 0:
    print("Negative number")
else:
    print("Zero")

Nesting Conditional Statements


For more complex logic, you can nest conditional statements. This means placing an if
statement inside another if statement.
if condition1:
    if condition2:
        # Execute this block if both conditions are true

Example:
age = 20
is_student = True

if age >= 18:
    if is_student:
        print("Eligible for student discount.")
    else:
        print("Not a student.")

Summary
Conditional statements are vital for controlling the flow of your program based on
dynamic inputs and conditions, enabling the execution of different code paths according
to varying scenarios. Understanding how to use these effectively enhances your
programming capabilities and facilitates the development of more sophisticated
applications.
Lab 03
NumPy Array Creation
NumPy is a powerful library in Python that facilitates numerical computing, providing
support for large, multi-dimensional arrays and matrices. It allows for efficient data
storage and computation, making it a crucial tool for data analysis in data science.

Creating NumPy Arrays


NumPy arrays can be easily created from standard Python lists or by using built-in
functions. Here are some examples of how to create NumPy arrays:
1. From a Python List

import numpy as np
my_list = [1, 2, 3, 4]
array_from_list = np.array(my_list)

This creates a NumPy array from a Python list.

2. Using arange() The arange() function generates arrays with evenly spaced
values within a given range.
array_range = np.arange(0, 10, 2) # Output: array([0, 2, 4, 6, 8])

3. Using linspace() The linspace() function returns evenly spaced numbers over a
specified range.
array_linspace = np.linspace(0, 1, 5)  # Output: array([0.  , 0.25, 0.5 , 0.75, 1.  ])

Benefits of Using NumPy Arrays


NumPy arrays offer several advantages over traditional Python lists:
• Performance: Arrays are faster for numerical computations as they are
implemented in C and optimized for performance.
• Memory Efficiency: NumPy uses contiguous memory storage, resulting in lower
memory overhead.
• Functionality: NumPy provides a vast library of mathematical functions that
operate on arrays efficiently, allowing for element-wise operations.
With these features, NumPy arrays significantly enhance data processing capabilities in
Python, making them essential for data scientists and analysts.
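As a small illustration of element-wise operations, an arithmetic expression applies to every element of an array without an explicit loop:
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
discounted = prices * 0.9   # Element-wise multiplication
print(discounted)           # Output: [ 9. 18. 27.]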
NumPy Indexing and Slicing
Indexing and slicing are fundamental operations when working with NumPy arrays,
facilitating the retrieval of specific elements or subsets of data. Understanding these
concepts helps in efficient data manipulation and analysis.

Indexing NumPy Arrays


NumPy allows you to access array elements using straightforward indexing techniques.
For example, consider the following array:
import numpy as np
arr = np.array([10, 20, 30, 40, 50])

To access the first element:


first_element = arr[0] # Output: 10

When accessing multi-dimensional arrays, indexing can specify the row and column:
matrix = np.array([[1, 2, 3], [4, 5, 6]])
element = matrix[1, 2] # Output: 6 (second row, third column)

Slicing NumPy Arrays


Slicing allows for the selection of a range of elements. The syntax follows the format
array[start:stop:step], similar to Python lists:
slice_arr = arr[1:4] # Output: array([20, 30, 40])

Key concepts to note:


• Inclusive Start: The slice begins at the index specified (inclusive).
• Exclusive Stop: The slice ends at the index specified (exclusive).
• Step: You can specify a step to skip elements in the slice, e.g., arr[::2] selects
every second element.

Differences from Traditional Slicing


• Efficiency: NumPy slicing is more efficient than traditional list slicing because
NumPy operates on contiguous data in memory.
• Broadcasting: NumPy supports broadcasting, allowing operations on differently
sized arrays (see the sketch below), whereas Python lists require explicit iteration.
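A minimal sketch of broadcasting, where a one-dimensional array is added to every row of a two-dimensional array:
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])
offsets = np.array([10, 20, 30])
print(matrix + offsets)
# Output:
# [[11 22 33]
#  [14 25 36]]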
Understanding indexing and slicing in NumPy not only streamlines data retrieval but
also enhances the overall efficiency of data analysis operations.
Handling Missing Data in Pandas
Handling missing data is crucial in data analysis as it can significantly affect results.
Pandas, a powerful data manipulation library in Python, offers convenient methods to
handle missing values efficiently, primarily through dropna() and fillna().

Using dropna()
The dropna() method removes rows or columns containing missing values. By default, it
drops any rows with at least one missing value:
import pandas as pd

data = {
'Name': ['Alice', 'Bob', None, 'David'],
'Age': [25, None, 30, 22]
}
df = pd.DataFrame(data)

cleaned_df = df.dropna()
print(cleaned_df)

Implications: While effective, using dropna() can lead to loss of valuable data,
especially in smaller datasets. Therefore, it should be employed with caution.

Using fillna()
On the other hand, the fillna() method is used to replace missing values with a specified
constant, mean, or other statistical measures. For example:
filled_df = df.fillna({"Name": "Unknown", "Age": df['Age'].mean()})
print(filled_df)

Implications: This method preserves data integrity by substituting missing values, but
it's essential to choose replacement values that maintain the dataset's
representativeness.

Summary
Choosing between these methods depends on the context of the analysis. dropna() is
suitable for smaller datasets where the loss of data is negligible, while fillna() is
advantageous for maintaining data volume and leveraging existing patterns within the
dataset. Proper handling of missing data is fundamental to producing accurate and
reliable analysis results.
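Before choosing either method, it is often helpful to quantify how much data is missing; a short sketch using the df created above:
# Count missing values in each column
print(df.isnull().sum())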

Data Cleaning Techniques


Data cleaning is a critical step in data preprocessing, particularly when using Pandas.
This process ensures data integrity, reduces errors, and prepares data for analysis.
Here are some common techniques:
Handling Duplicates
Removing duplicate records is vital for maintaining accurate datasets. Use the
drop_duplicates() method to eliminate duplicate entries:
df = df.drop_duplicates()

This function allows you to specify columns to check for duplicates and remove them
effectively.

Renaming Columns
Clear and consistent naming of columns enhances clarity. The rename() function lets
you rename columns easily:
df.rename(columns={'old_name': 'new_name'}, inplace=True)

This practice ensures that all team members and stakeholders understand the dataset
structure.

Transforming Data Types


Data type consistency is crucial for accurate analyses. Use the astype() method to
convert columns to the desired types:
df['age'] = df['age'].astype(int)
df['date'] = pd.to_datetime(df['date'])

This method can prevent errors during operations and ensure data is manipulated
correctly.

Summary of Techniques
Technique Description
Remove Duplicates Eliminates duplicate records from the dataset.
Rename Columns Enhances readability by giving descriptive names.
Transform Data Types Ensures consistent data types for accurate analysis.

Implementing these techniques is essential for maintaining data integrity and preparing
datasets for meaningful analysis.

Min-Max Scaling (Normalization)


Min-max scaling, also known as normalization, is a data preprocessing technique used
to transform features to lie within a specific range, typically [0, 1]. This is particularly
important in many machine learning algorithms which are sensitive to the scale of input
data, as it ensures that no single feature dominates others due to differing scales.
Mathematical Formula
The formula for min-max normalization is expressed as:
[ X' = \frac{(X - X_{min})}{(X_{max} - X_{min})} ]
Where:
• (X) is the original data point,
• (X_{min}) and (X_{max}) are the minimum and maximum values of the dataset,
respectively,
• (X') is the normalized value.

Implementation in Python
To implement min-max scaling using Python, you can utilize libraries such as Pandas or
NumPy. Below is an example using NumPy:
import numpy as np

data = np.array([10, 20, 30, 40, 50])


normalized_data = (data - np.min(data)) / (np.max(data) - np.min(data))
print(normalized_data) # Output: [0. 0.25 0.5 0.75 1. ]

In this implementation:
• We calculate the minimum and maximum in the array.
• We apply the min-max formula to each element, transforming them into a scale
of 0 to 1.
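If scikit-learn is available, the same transformation can be performed with its MinMaxScaler (a sketch assuming the same data array as above; the scaler expects a 2-D input, hence the reshape):
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)  # Reshape to a single column
scaler = MinMaxScaler()
normalized = scaler.fit_transform(data)
print(normalized.ravel())  # Output: [0.   0.25 0.5  0.75 1.  ]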

Usage in Data Processing


Min-max scaling is especially beneficial when dealing with algorithms like k-nearest
neighbors and neural networks, where feature scaling improves convergence speed
and model accuracy.

Standardization (Z-score Scaling)


Standardization, or Z-score scaling, is a technique used to normalize data by adjusting
the values so that they have a mean (average) of 0 and a standard deviation of 1. This
transformation is particularly important in statistical analysis and machine learning, as it
ensures that no feature dominates due to its scale.

Mathematical Formula
The Z-score is calculated using the following formula:
[ Z = \frac{(X - \mu)}{\sigma} ]
Where:
• (X) is the original data point,
• (\mu) is the mean of the dataset,
• (\sigma) is the standard deviation of the dataset,
• (Z) is the Z-score of the data point.

Implementation in Python
To implement standardization in Python, one can utilize libraries such as scikit-learn or
pandas. Here’s how to perform Z-score scaling using scikit-learn:
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])


scaler = StandardScaler()

standardized_data = scaler.fit_transform(data)
print(standardized_data)

In this code snippet, StandardScaler() calculates the mean and standard deviation for
each feature, applying the Z-score scaling to the dataset.

When to Use Z-score Scaling


Z-score scaling is particularly beneficial in scenarios where:
• Feature distribution: Features have varying distributions, making it necessary to
standardize them for more accurate results.
• Machine Learning: Algorithms such as Support Vector Machines (SVM) and K-
means clustering rely on distances between data points, making scale-sensitive
transformations essential.
• Outlier Value Reduction: Standardization mitigates the influence of outliers,
providing a more robust analysis.
By utilizing Z-score scaling, you enhance the effectiveness of your data analysis and
machine learning models.

Label Encoding (Categorical Data)


Label encoding is a technique employed to convert categorical variables into numerical
values. This transformation is vital in machine learning, as many algorithms work best
with numeric data. The method replaces each category with a unique integer, enabling
models to interpret the data efficiently.

How to Perform Label Encoding


To demonstrate label encoding, consider the following categorical variable representing
colors:
import pandas as pd

data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}


df = pd.DataFrame(data)

# Label encoding
df['Color_Encoded'] = df['Color'].astype('category').cat.codes
print(df)

In this example:
• The astype('category') method converts the string categories into a categorical
type.
• The .cat.codes attribute retrieves the corresponding integer codes, producing a
DataFrame where each color is replaced by a unique integer. The codes follow the
alphabetical order of the categories, so Blue becomes 0, Green becomes 1, and Red becomes 2.

Use Cases for Label Encoding


Label encoding is particularly advantageous in several scenarios:
• Tree-based Models: Decision trees and their derivatives can inherently manage
categorical data, making label encoding beneficial.
• Ordinal Relationships: When categories have a defined order (e.g., "Low",
"Medium", "High"), label encoding can effectively represent the relationships
numerically.
However, one must be cautious about using label encoding for nominal categorical
variables, as it may mislead algorithms into interpreting unintended relationships. In
such cases, alternative techniques like one-hot encoding may be more appropriate.
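For comparison, a minimal sketch of one-hot encoding the same Color column using pandas' get_dummies:
# Each color becomes its own binary column, avoiding any implied ordering
one_hot = pd.get_dummies(df['Color'], prefix='Color')
print(one_hot)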

Mapping Categorical Data (Education Levels)


Categorical data often requires conversion into numerical format for analysis, especially
in machine learning algorithms. One effective method of achieving this is label
encoding, where each unique category is assigned a unique integer. This approach is
particularly useful for ordinal data or when the relationships among categories are
significant.

Example: Encoding Education Levels


Consider a dataset that includes education levels such as "High School", "Bachelor's",
"Master's", and "PhD". Here's how we can apply label encoding using Pandas:
import pandas as pd

data = {'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']}


df = pd.DataFrame(data)

# Label encoding
df['Education_Encoded'] = df['Education'].astype('category').cat.codes
print(df)

Output
The DataFrame df will now look as follows. Note that .cat.codes assigns integers in the
alphabetical order of the categories, so 'Bachelor' receives 0 and 'High School' receives 1:

Education     Education_Encoded
High School   1
Bachelor      0
Master        2
PhD           3
Bachelor      0

Important Considerations
1. Ordinal vs. Nominal: If the categories have a natural order (e.g., High School <
Bachelor's < Master's < PhD), label encoding is appropriate. However, if the
categories are nominal (e.g., colors), one-hot encoding—creating binary columns
for each category—should be preferred to avoid introducing unintended ordinal
relationships.
2. Model Compatibility: Many machine learning algorithms can directly interpret
label-encoded values, so understanding your model's requirements is essential
when preprocessing your data.
By effectively mapping categorical data, you enhance the model's ability to recognize
patterns and relationships, leading to more accurate predictions in data analysis.
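Because .cat.codes assigns integers alphabetically rather than in the natural order of education levels, an explicit mapping can be used when the ordinal relationship matters (a sketch using the df from above; the dictionary values are an assumed ordering):
# Map each level to its position in the assumed natural order
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df['Education_Ordinal'] = df['Education'].map(education_order)
print(df)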
Lab 04
Data Preprocessing & Visualization using
Seaborn
Data preprocessing is a critical step before visualizing data with Seaborn, a powerful
library that enables attractive and informative statistical graphics in Python. The
following outlines the essential steps for cleaning your data and visualizing it effectively.

Data Cleaning Steps


1. Handling Missing Values: Missing data can skew analysis results. Use fillna() to
substitute missing entries with appropriate values (mean, median, or a
placeholder).
import pandas as pd

df = pd.read_csv('data.csv')
df.fillna(df.mean(numeric_only=True), inplace=True)  # Impute numeric columns with their mean

2. Removing Duplicates: Duplicate entries can distort results. Utilize


drop_duplicates() to ensure the dataset contains only unique records.
df.drop_duplicates(inplace=True)

3. Data Type Conversion: Ensure that each column has the correct data type for
accurate analysis and visualization.
df['column_name'] = df['column_name'].astype('category')

Visualization Techniques with Seaborn


After preprocessing, utilize Seaborn to visualize data. Here are a few commonly used
plots:
• Scatter Plot: Great for observing relationships between two numerical variables.
import seaborn as sns
sns.scatterplot(x='feature1', y='feature2', data=df)

• Bar Plot: Ideal for comparing categorical data across different groups.
sns.barplot(x='category', y='values', data=df)

• Box Plot: Useful for visualizing the distribution of a dataset and identifying
outliers.
sns.boxplot(x='category', y='values', data=df)

By properly cleaning your data and leveraging Seaborn's capabilities, you can derive
meaningful insights and enhance data-driven decision-making in your analyses.
Lab 05
Simple Linear Regression (House Size vs Price)
Simple linear regression is a statistical method used to model the relationship between
a single independent variable and a dependent variable. In this section, we will focus on
predicting house prices based on their sizes, measured in square feet.

Understanding the Concept


The fundamental goal is to find a linear relationship represented as:
[ \text{Price} = \beta_0 + \beta_1 \times \text{Size} + \epsilon ]
• (\beta_0): Y-intercept of the regression line.
• (\beta_1): Slope of the line; this indicates how much the price changes with each
square foot increase in size.
• (\epsilon): Error term accounting for variability not explained by the linear
relationship.

Implementing Simple Linear Regression in Python


To implement simple linear regression in Python, we will use the scikit-learn library.
Below is an example demonstrating the model training and evaluation:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sample data
data = {'Size': [1500, 2000, 2500, 3000, 3500],
'Price': [300000, 400000, 500000, 600000, 700000]}
df = pd.DataFrame(data)

# Splitting the dataset


X = df[['Size']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Training the model


model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
Key Steps Explained
1. Data Preparation: A dataframe is created with house sizes and their respective
prices.
2. Train-Test Split: The dataset is divided into training and testing subsets.
3. Model Training: A linear regression model is trained using the training data.
4. Prediction and Evaluation: Predictions are made on the test data, and the
mean squared error is calculated to assess model performance.
By following this process, you create a model that effectively predicts house prices
based on their sizes.
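To relate the fitted model back to the formula above, the learned intercept and slope can be inspected (a short sketch using the model trained above):
print(f'Intercept (beta_0): {model.intercept_}')
print(f'Slope (beta_1): {model.coef_[0]}')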

Multiple Linear Regression


Multiple linear regression is a statistical technique used to predict the dependent
variable, such as price, based on multiple independent variables. In this case, we can
model house prices using several features: size, the number of bedrooms, and the age
of the house.

Regression Model Formulation


The multiple linear regression model can be mathematically represented as:
[ \text{Price} = \beta_0 + \beta_1 \times \text{Size} + \beta_2 \times \text{Bedrooms} + \beta_3 \times \text{Age} + \epsilon ]
• (\beta_0): Intercept of the regression equation.
• (\beta_1), (\beta_2), and (\beta_3): Coefficients that represent the change in
price for a unit change in corresponding features.
• (\epsilon): Error term that accounts for the variability not explained by the model.

Example Code Implementation


To implement multiple linear regression in Python, let's consider a dataset containing
house features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sample dataset
data = {
'Size': [1500, 2000, 2500, 3000, 3500],
'Bedrooms': [3, 4, 3, 5, 4],
'Age': [10, 15, 5, 20, 12],
'Price': [300000, 400000, 500000, 600000, 700000]
}

df = pd.DataFrame(data)
# Preparing data
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Training the multiple linear regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model


mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Explanation of the Code


1. Data Preparation: A dataframe is created with the relevant features and target
prices.
2. Train-Test Split: The dataset is divided into training and testing subsets.
3. Model Training: A multiple linear regression model is trained on the training
data.
4. Prediction and Evaluation: The model predicts prices based on the test data,
and the mean squared error is computed to evaluate its performance.
This example showcases a foundational understanding of how multiple linear regression
can be utilized to make predictions based on multiple features.
Lab 06
Logistic Regression for Student Pass Prediction
Logistic regression is a statistical method used for predicting binary outcomes,
particularly where the result can be categorized into two distinct classes. This makes it a
powerful tool for educational analysis, where we often seek to predict outcomes, such
as whether a student will pass or fail based on various input variables.

Concept Overview
In logistic regression, the relationship between independent variables (such as study
hours, attendance, and prior grades) and the binary dependent variable (pass or fail) is
modeled using the logistic function. The general equation can be expressed as:
[ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n)}} ]
Where:
• (P(Y = 1)) is the probability of the student passing,
• (X_i) are the independent variables,
• (\beta_i) are the coefficients determined during the model training.

Application Example
Consider a dataset of students where we include features like study_hours,
attendance_percentage, and previous_grade. After applying logistic regression, we can
estimate the probability of each student passing.
For example:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = {
'Study_Hours': [5, 10, 15, 20],
'Attendance': [70, 85, 90, 95],
'Previous_Grade': [60, 60, 70, 80],
'Pass': [0, 0, 1, 1]
}

df = pd.DataFrame(data)
X = df[['Study_Hours', 'Attendance', 'Previous_Grade']]
y = df['Pass']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]

Outcome Interpretation
The output predictions indicate whether each student is predicted to pass or fail, while
the probability scores give insight into the confidence of these predictions. Logistic
regression provides a clear framework for decision-making in educational contexts,
helping educators identify at-risk students and tailor interventions effectively.

Titanic Survival Prediction using Logistic Regression
Logistic regression is a widely used statistical method for binary classification problems,
such as predicting survival outcomes. In the case of the Titanic survival prediction
model, we utilize passenger data to determine the likelihood that an individual survived
the disaster based on various features.

Dataset Overview
The dataset contains several important features, such as:
• Passenger Class (Pclass): The socio-economic status of the passenger (1st,
2nd, or 3rd class).
• Sex: Gender of the passenger.
• Age: Passenger’s age.
• SibSp: Number of siblings or spouses aboard.
• Parch: Number of parents or children aboard.
• Fare: Ticket price.

Feature Selection
Feature selection is critical to improving model accuracy. The key features often
include:
• Sex and Age: Past patterns show that females and younger passengers had
higher survival rates.
• Pclass: First-class passengers had a higher chance of survival.
• SibSp and Parch: Family relationships aboard the ship may influence survival.

Model Building
Using libraries such as pandas for data manipulation and scikit-learn for building the
logistic regression model:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load dataset
data = pd.read_csv('titanic.csv')

# Data preprocessing
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].fillna(0)
y = data['Survived']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Logistic Regression Model


model = LogisticRegression()
model.fit(X_train, y_train)

Model Evaluation
To evaluate the model's effectiveness, we can calculate metrics like accuracy,
precision, and recall:
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f'Accuracy: {accuracy}')

Due to the inherent imbalance within survival rates, it is important to utilize additional
evaluation methods, such as confusion matrices or ROC-AUC scores, to ensure the
model is reliable and robust in predicting outcomes in varied scenarios.
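A minimal sketch of these additional metrics, building on the variables defined above:
from sklearn.metrics import confusion_matrix, roc_auc_score

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, predictions))

# ROC-AUC is computed from predicted probabilities of the positive class
probabilities = model.predict_proba(X_test)[:, 1]
print(f'ROC-AUC: {roc_auc_score(y_test, probabilities)}')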
Lab 07
Decision Tree Classifier for 'Bought Product'
Prediction
Decision trees are a popular machine learning model used for classification and
regression tasks. They function by splitting the data into subsets based on the value of
input features, creating a tree-like model of decisions. Each internal node of the tree
represents a feature, the branches represent decision rules, and the leaf nodes
represent the outcome or target variable.

Example: Predicting Product Purchases


Let's step through an example where we predict whether a product was bought based
on features such as age, gender, and income.
1. Dataset Preparation: A representative dataset would include records of
customer information along with their purchasing behavior.
import pandas as pd

data = {
'Age': [23, 45, 31, 35, 60, 25],
'Gender': ['M', 'F', 'F', 'M', 'F', 'M'],
'Income': [50000, 60000, 80000, 45000, 70000, 30000],
'Bought': [1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1})  # Encoding categorical data

2. Model Training: The decision tree classifier can be trained using the features
Age, Gender, and Income to predict the Bought outcome.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = df[['Age', 'Gender', 'Income']]


y = df['Bought']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

3. Prediction and Interpretation: Upon completion of training, predictions can be


made on the test set.
predictions = model.predict(X_test)
To understand how the model arrived at particular predictions, one can visualize the
decision tree, revealing how splits based on certain input features influenced the
classification outcome. Each path from the root to the leaf node illustrates a series of
decisions leading to a predicted purchase outcome. This interpretability makes decision
trees particularly valuable in fields where understanding the decision-making process is
crucial.

Decision Tree Classifier for 'Go/No-Go' Decision Based on Nationality and Rank
Decision trees are a vital tool for making binary classifications based on input features,
particularly in scenarios requiring clear decision paths. In this section, we will explore an
example of using a decision tree classifier to predict 'Go/No-Go' decisions based on
attributes like nationality and rank.

Concept Overview
A decision tree works by iteratively splitting the dataset into subsets based on feature
values. Each node represents a decision point, where the data is divided according to
the best attribute to separate classes effectively. This structure allows for both
visualization and interpretability of the decision-making process.

Example Implementation
Here's how you can implement a decision tree classifier using Python:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Sample data
data = {
'Nationality': ['US', 'US', 'FR', 'FR', 'US', 'CN', 'CN', 'FR'],
'Rank': [1, 2, 2, 1, 3, 1, 3, 1],
'Decision': ['Go', 'No-Go', 'Go', 'Go', 'No-Go', 'Go', 'No-Go', 'Go']
}

df = pd.DataFrame(data)

# Encoding categorical variables


df['Nationality'] = df['Nationality'].map({'US': 0, 'FR': 1, 'CN': 2})

# Splitting the dataset


X = df[['Nationality', 'Rank']]
y = df['Decision']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=42)

# Training the Decision Tree model


model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Making predictions
predictions = model.predict(X_test)

# Visualizing the Decision Tree


plt.figure(figsize=(10, 6))
plot_tree(model, feature_names=['Nationality', 'Rank'], class_names=['Go',
'No-Go'], filled=True)
plt.title("Decision Tree Visualization")
plt.show()

Interpretation
In this example:
• We prepare a simple dataset where each entry reflects the nationality, rank, and
the corresponding decision ('Go' or 'No-Go').
• The nationality is encoded into numerical values for model training.
• The decision tree is trained on this data, resulting in a visual representation that
outlines the decision logic.
This method provides clear insights into how different ranks and nationalities influence
the 'Go/No-Go' decision's outcome. By analyzing the decision paths highlighted in the
tree, stakeholders can better understand which features are most influential, aiding in
strategic decision-making.
