Section: Red
Lab 02
Python Basics for Data Science and Analysis
Variable Declaration and Printing
Variable declaration in Python is a straightforward process that allows you to store data
for future use. A variable is created by assigning a value to a name using the
assignment operator =. Here’s how you can declare different types of variables:
# Integer variable
age = 25
# Float variable
height = 5.9
# String variable
name = "Alice"
# Boolean variable
is_student = True
Printing Variables
To display the value of a variable, the print() function is used. This function outputs the
variable's value to the console, facilitating debugging and verification of data.
print(age) # Output: 25
print(height) # Output: 5.9
print(name) # Output: Alice
print(is_student) # Output: True
Naming Conventions
When declaring variables, follow these common practices for naming:
• Use descriptive names (e.g., total_price instead of x).
• Use lowercase letters and underscores to separate words.
• Avoid starting variable names with numbers and using reserved keywords.
Understanding variable declaration and the printing of variable values is essential for
effective programming, especially in debugging scenarios where verifying the data
values can save time and enhance code functionality.
Checking Data Types
In Python, checking the data type of a variable is crucial for understanding the kind of
data you are dealing with, which can influence how you manipulate and perform
operations on that data. The built-in type() function allows you to determine the data
type of any variable. Below are common data types you might encounter, along with
examples:
# Float
num_float = 10.5
print(type(num_float)) # Output: <class 'float'>
# String
greeting = "Hello, World!"
print(type(greeting)) # Output: <class 'str'>
# List
fruits = ["apple", "banana", "cherry"]
print(type(fruits)) # Output: <class 'list'>
# Dictionary
person = {"name": "Alice", "age": 30}
print(type(person)) # Output: <class 'dict'>
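List Unpacking
Python also supports unpacking, which assigns the elements of a list to multiple variables in a single statement. A minimal sketch (the variable and list names are illustrative):
coordinates = [10, 20, 30]
x, y, z = coordinates  # Each element is assigned in order
print(x, y, z)  # Output: 10 20 30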
In this example, x, y, and z are assigned the values 10, 20, and 30, respectively. This
eliminates the need for indexing and can save time when working with multiple
elements.
Use Cases
List unpacking is especially beneficial when handling data structures like tuples returned
from functions or when processing results from mathematical computations, allowing for
clearer, more concise code. It promotes efficient coding practices essential for scalable
data analysis in Python.
Arithmetic Operations
Arithmetic operations are fundamental in Python and encompass the basic
mathematical computations: addition, subtraction, multiplication, and division. Each
operation is supported by an operator, enabling various numeric types, including
integers and floats.
Basic Operations
1. Addition (+)
2. Subtraction (-)
3. Multiplication (*)
product_int = 3 * 4 # Output: 12 (integer)
product_float = 7.0 * 3.5 # Output: 24.5 (float)
4. Division (/)
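The remaining operators follow the same pattern; a brief sketch (note that / always returns a float in Python 3):
total = 5 + 2        # Output: 7
difference = 9 - 4   # Output: 5
quotient = 10 / 4    # Output: 2.5 (float, even for whole-number operands)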
Operator Precedence
Python follows specific rules for operator precedence, determining the order in which
operations are performed. The hierarchy is as follows:
• Parentheses ()
• Exponentiation **
• Multiplication *, Division /, Floor Division //, and Modulo %
• Addition + and Subtraction -
For example:
result = 5 + 2 * 3 # Result: 11 (2 * 3 is calculated first)
Assignment Operators
Assignment operators in Python are used to assign values to variables. The simplest
form of an assignment operator is the = operator, which assigns the right-hand value to
the left-hand variable. In addition to standard assignment, Python also provides a set of
compound assignment operators that combine arithmetic operations with assignment
in a more concise way.
Standard Assignment
x = 10 # Assigns the value 10 to x
Addition Assignment (+=)
This operator adds the right operand to the left operand and assigns the sum to the left operand.
x = 5
x += 3 # Equivalent to x = x + 3; Now x = 8
Multiplication Assignment (*=)
This operator multiplies the left operand by the right operand and assigns the product to the left operand.
x = 3
x *= 2 # Equivalent to x = x * 2; Now x = 6
Division Assignment (/=)
This operator divides the left operand by the right operand and assigns the quotient to the left operand. Note that true division always produces a float.
x = 12
x /= 3 # Equivalent to x = x / 3; Now x = 4.0
Summary of Behavior
Using compound assignment operators not only simplifies your code but also enhances
readability by reducing redundancy. It is crucial for beginners to understand these
operators as they form the basis for manipulating variables effectively in Python
programs.
Logical Operators
Logical operators in Python allow developers to combine multiple conditions within
conditional statements. The three primary logical operators are and, or, and not.
Understanding these operators is essential for creating complex decision-making
mechanisms in your code.
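For example, all three operators in action:
is_sunny = True
is_warm = False
print(is_sunny and is_warm)  # False: both conditions must be True
print(is_sunny or is_warm)   # True: at least one condition is True
print(not is_sunny)          # False: negates the boolean value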
Taking User Input
Python's built-in input() function collects data from the user at runtime, always returning it as a string.
Basic Usage
To collect a simple string input, you can use the following example:
user_name = input("Enter your name: ")
print(f"Hello, {user_name}!")
Input Types
While inputs are received as strings, you might need to convert them into other data
types, such as integers or floats. Here's how to handle different types of inputs:
1. String Input: The default; input() returns the entered text as a string with no conversion needed.
2. Integer Input: For whole numbers, wrap the input in the int() function.
3. Float Input: For decimal values, you can use the float() function, as shown in the sketch below.
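A minimal sketch of these conversions (the prompts are illustrative):
name = input("Enter your name: ")             # String by default
age = int(input("Enter your age: "))          # Convert to integer
height = float(input("Enter your height: "))  # Convert to float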
By mastering user input in Python, you open avenues for creating dynamic and user-
centered applications that respond effectively to real-time data.
String Operations
String operations in Python are fundamental for manipulating textual data. Python
provides various methods for working with strings, including concatenation, slicing, and
built-in string methods like upper() and lower(). Mastering these techniques is crucial for
effective data manipulation.
Concatenation
String concatenation is the process of combining two or more strings into one. This can
be achieved using the + operator:
first_name = "Alice"
last_name = "Johnson"
full_name = first_name + " " + last_name # Output: "Alice Johnson"
Slicing
String slicing allows you to extract a portion of a string using indices. For example:
message = "Hello, World!"
greeting = message[0:5] # Output: "Hello"
String Methods
Python strings come equipped with many built-in methods for string manipulation:
• upper(): Converts all characters to uppercase.
• lower(): Converts all characters to lowercase.
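For example:
text = "Data Science"
print(text.upper())  # Output: DATA SCIENCE
print(text.lower())  # Output: data science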
These operations are pivotal when cleaning and preparing text data for analysis,
contributing to more effective data processing tasks.
String Slicing
String slicing in Python is a powerful feature that enables you to extract specific portions
of strings using indices. This flexibility allows for effective manipulation of textual data,
which is vital in data analysis and programming.
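Basic Slicing
A slice takes the form string[start:stop]; for example:
word = "Data Science"
sliced = word[0:4]  # Output: "Data"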
In this example, characters from index 0 up to, but not including, index 4 are extracted.
Negative Indexing
Python also supports negative indexing, allowing you to reference positions from the
end of the string. For example:
text = "Programming"
last_three = text[-3:] # Output: "ing"
This code retrieves the last three characters of the string, demonstrating how negative
indices work.
Modifying Strings with Slices
While strings in Python are immutable, you can create modified versions by replacing
specific slices. Here’s how you can replace parts of a string:
text = "Hello World"
modified = text[:6] + "Python" # Output: "Hello Python"
In this instance, the substring from the beginning up to, but not including, index 6 ("Hello ", with its trailing space) is concatenated with a new string.
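Slicing with a Step
A third slice parameter sets the step size, for example selecting alternate characters:
text = "Programming"
every_second = text[::2]  # Output: "Pormig"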
This example picks every second character from the string, showcasing the versatility of
slicing in data manipulation.
Conditional Statements
Conditional statements in Python allow you to execute different blocks of code based on
certain conditions. The primary conditional statements are if, elif (else if), and else.
These statements serve as the backbone of decision-making in programming.
Basic Structure
Here’s the syntax for a basic if statement:
if condition:
    # Execute this block
Example:
age = 18
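if age >= 18:
    print("You are an adult")  # Runs because the condition is True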
Example:
number = 0
if number > 0:
    print("Positive number")
elif number < 0:
    print("Negative number")
else:
    print("Zero")
Example:
age = 20
is_student = True
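# Combining both conditions with and (illustrative)
if age >= 18 and is_student:
    print("Eligible for a student discount")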
Summary
Conditional statements are vital for controlling the flow of your program based on
dynamic inputs and conditions, enabling the execution of different code paths according
to varying scenarios. Understanding how to use these effectively enhances your
programming capabilities and facilitates the development of more sophisticated
applications.
Lab 03
NumPy Array Creation
NumPy is a powerful library in Python that facilitates numerical computing, providing
support for large, multi-dimensional arrays and matrices. It allows for efficient data
storage and computation, making it a crucial tool for data analysis in data science.
1. Using np.array() The np.array() function converts a Python list (or other sequence) into a NumPy array:
import numpy as np
my_list = [1, 2, 3, 4]
array_from_list = np.array(my_list)
2. Using arange() The arange() function generates arrays with evenly spaced
values within a given range.
array_range = np.arange(0, 10, 2) # Output: array([0, 2, 4, 6, 8])
3. Using linspace() The linspace() function returns evenly spaced numbers over a
specified range.
array_linspace = np.linspace(0, 1, 5)  # Output: array([0., 0.25, 0.5, 0.75, 1.])
Array Indexing
When accessing multi-dimensional arrays, indexing can specify the row and column:
matrix = np.array([[1, 2, 3], [4, 5, 6]])
element = matrix[1, 2] # Output: 6 (second row, third column)
Handling Missing Values
Using dropna()
The dropna() method removes rows or columns containing missing values. By default, it
drops any rows with at least one missing value:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 30, 22]
}
df = pd.DataFrame(data)
cleaned_df = df.dropna()
print(cleaned_df)
Implications: While effective, using dropna() can lead to loss of valuable data,
especially in smaller datasets. Therefore, it should be employed with caution.
Using fillna()
On the other hand, the fillna() method is used to replace missing values with a specified
constant, mean, or other statistical measures. For example:
filled_df = df.fillna({"Name": "Unknown", "Age": df['Age'].mean()})
print(filled_df)
Implications: This method preserves data integrity by substituting missing values, but
it's essential to choose replacement values that maintain the dataset's
representativeness.
Summary
Choosing between these methods depends on the context of the analysis. dropna() is
suitable for smaller datasets where the loss of data is negligible, while fillna() is
advantageous for maintaining data volume and leveraging existing patterns within the
dataset. Proper handling of missing data is fundamental to producing accurate and
reliable analysis results.
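Removing Duplicates
Duplicate records can distort analysis. Pandas provides the drop_duplicates() method; a minimal sketch (the subset column is illustrative):
df.drop_duplicates(subset=['Name'], inplace=True)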
This function allows you to specify columns to check for duplicates and remove them
effectively.
Renaming Columns
Clear and consistent naming of columns enhances clarity. The rename() function lets
you rename columns easily:
df.rename(columns={'old_name': 'new_name'}, inplace=True)
This practice ensures that all team members and stakeholders understand the dataset
structure.
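Transforming Data Types
Converting columns to consistent types is equally important; a minimal sketch using astype() (the column name is illustrative):
df['Age'] = df['Age'].astype(int)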
This method can prevent errors during operations and ensure data is manipulated
correctly.
Summary of Techniques
Technique              Description
Remove Duplicates      Eliminates duplicate records from the dataset.
Rename Columns         Enhances readability by giving descriptive names.
Transform Data Types   Ensures consistent data types for accurate analysis.
Implementing these techniques is essential for maintaining data integrity and preparing
datasets for meaningful analysis.
Min-Max Scaling
Min-max scaling rescales each value into the range 0 to 1 using the formula (x - min) / (max - min).
Implementation in Python
To implement min-max scaling using Python, you can utilize libraries such as Pandas or NumPy. Below is an example using NumPy:
import numpy as np
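# Sample data (values are illustrative)
data = np.array([10, 20, 30, 40, 50])
# Apply the min-max formula: (x - min) / (max - min)
min_val = data.min()
max_val = data.max()
scaled = (data - min_val) / (max_val - min_val)
print(scaled)  # Output: [0.   0.25 0.5  0.75 1.  ]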
In this implementation:
• We calculate the minimum and maximum in the array.
• We apply the min-max formula to each element, transforming them into a scale
of 0 to 1.
Z-Score Standardization
Mathematical Formula
The Z-score is calculated using the following formula:
\[ Z = \frac{X - \mu}{\sigma} \]
Where:
• \(X\) is the original data point,
• \(\mu\) is the mean of the dataset,
• \(\sigma\) is the standard deviation of the dataset,
• \(Z\) is the Z-score of the data point.
Implementation in Python
To implement standardization in Python, one can utilize libraries such as scikit-learn or
pandas. Here’s how to perform Z-score scaling using scikit-learn:
from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample feature matrix (values are illustrative)
data = np.array([[10.0], [20.0], [30.0], [40.0]])
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
In this code snippet, StandardScaler() calculates the mean and standard deviation for
each feature, applying the Z-score scaling to the dataset.
Label Encoding Categorical Data
Label encoding converts string categories into integer codes. Consider a column of colors:
import pandas as pd
# Sample data (values are illustrative)
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})
# Label encoding
df['Color_Encoded'] = df['Color'].astype('category').cat.codes
print(df)
In this example:
• The astype('category') method converts the string categories into a categorical
type.
• The .cat.codes attribute retrieves the corresponding integer codes, resulting in a
DataFrame where Red, Blue, and Green are replaced by unique integers (e.g., 0,
1, 2).
For ordinal categories such as education level, pass an explicit category order so the integer codes respect the ranking:
# Sample ordinal data (values are illustrative)
df = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']})
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
# Label encoding with an explicit order
df['Education_Encoded'] = pd.Categorical(df['Education'], categories=education_order, ordered=True).codes
print(df)
Output
The DataFrame df will now look as follows:
Education     Education_Encoded
High School   0
Bachelor      1
Master        2
PhD           3
Bachelor      1
Important Considerations
1. Ordinal vs. Nominal: If the categories have a natural order (e.g., High School < Bachelor's < Master's < PhD), label encoding is appropriate. However, if the categories are nominal (e.g., colors), one-hot encoding, which creates a binary column for each category, should be preferred to avoid introducing unintended ordinal relationships (see the sketch after this list).
2. Model Compatibility: Many machine learning algorithms can directly interpret
label-encoded values, so understanding your model's requirements is essential
when preprocessing your data.
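For comparison, a minimal one-hot encoding sketch using pandas' get_dummies() (sample values are illustrative):
import pandas as pd
colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
one_hot = pd.get_dummies(colors['Color'], prefix='Color')
print(one_hot)  # Three binary columns: Color_Blue, Color_Green, Color_Red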
By effectively mapping categorical data, you enhance the model's ability to recognize
patterns and relationships, leading to more accurate predictions in data analysis.
Lab 04
Data Preprocessing & Visualization using
Seaborn
Data preprocessing is a critical step before visualizing data with Seaborn, a powerful
library that enables attractive and informative statistical graphics in Python. The
following outlines the essential steps for cleaning your data and visualizing it effectively.
1. Loading the Data: Read the dataset into a DataFrame.
import pandas as pd
import seaborn as sns
df = pd.read_csv('data.csv')
2. Handling Missing Values: Fill gaps before plotting; for example, replace numeric missing values with the column mean.
df.fillna(df.mean(numeric_only=True), inplace=True)
3. Data Type Conversion: Ensure that each column has the correct data type for
accurate analysis and visualization.
df['column_name'] = df['column_name'].astype('category')
Visualizing with Seaborn
After cleaning, choose a plot type suited to the question at hand:
• Bar Plot: Ideal for comparing categorical data across different groups.
sns.barplot(x='category', y='values', data=df)
• Box Plot: Useful for visualizing the distribution of a dataset and identifying
outliers.
sns.boxplot(x='category', y='values', data=df)
By properly cleaning your data and leveraging Seaborn's capabilities, you can derive
meaningful insights and enhance data-driven decision-making in your analyses.
Lab 05
Simple Linear Regression (House Size vs Price)
Simple linear regression is a statistical method used to model the relationship between
a single independent variable and a dependent variable. In this section, we will focus on
predicting house prices based on their sizes, measured in square feet.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Sample data
data = {'Size': [1500, 2000, 2500, 3000, 3500],
        'Price': [300000, 400000, 500000, 600000, 700000]}
df = pd.DataFrame(data)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(df[['Size']], df['Price'], test_size=0.2, random_state=42)
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
Key Steps Explained
1. Data Preparation: A dataframe is created with house sizes and their respective
prices.
2. Train-Test Split: The dataset is divided into training and testing subsets.
3. Model Training: A linear regression model is trained using the training data.
4. Prediction and Evaluation: Predictions are made on the test data, and the
mean squared error is calculated to assess model performance.
By following this process, you create a model that effectively predicts house prices
based on their sizes.
Multiple Linear Regression (Size, Bedrooms, Age vs Price)
Multiple linear regression extends the simple model to several independent variables, such as the size, number of bedrooms, and age of a house.
# Sample dataset
data = {
    'Size': [1500, 2000, 2500, 3000, 3500],
    'Bedrooms': [3, 4, 3, 5, 4],
    'Age': [10, 15, 5, 20, 12],
    'Price': [300000, 400000, 500000, 600000, 700000]
}
df = pd.DataFrame(data)
# Preparing data
X = df[['Size', 'Bedrooms', 'Age']]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
Lab 06
Logistic Regression (Student Pass/Fail Prediction)
Concept Overview
In logistic regression, the relationship between independent variables (such as study
hours, attendance, and prior grades) and the binary dependent variable (pass or fail) is
modeled using the logistic function. The general equation can be expressed as:
\[ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}} \]
Where:
• \(P(Y = 1)\) is the probability of the student passing,
• \(X_i\) are the independent variables,
• \(\beta_i\) are the coefficients determined during the model training.
Application Example
Consider a dataset of students where we include features like study_hours,
attendance_percentage, and previous_grade. After applying logistic regression, we can
estimate the probability of each student passing.
For example:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = {
    'Study_Hours': [5, 10, 15, 20],
    'Attendance': [70, 85, 90, 95],
    'Previous_Grade': [60, 60, 70, 80],
    'Pass': [0, 0, 1, 1]
}
df = pd.DataFrame(data)
X = df[['Study_Hours', 'Attendance', 'Previous_Grade']]
y = df['Pass']
# Train-test split and model training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]
Outcome Interpretation
The output predictions indicate whether each student is predicted to pass or fail, while
the probability scores give insight into the confidence of these predictions. Logistic
regression provides a clear framework for decision-making in educational contexts,
helping educators identify at-risk students and tailor interventions effectively.
Predicting Titanic Survival with Logistic Regression
Dataset Overview
The dataset contains several important features, such as:
• Passenger Class (Pclass): The socio-economic status of the passenger (1st,
2nd, or 3rd class).
• Sex: Gender of the passenger.
• Age: Passenger’s age.
• SibSp: Number of siblings or spouses aboard.
• Parch: Number of parents or children aboard.
• Fare: Ticket price.
Feature Selection
Feature selection is critical to improving model accuracy. The key features often
include:
• Sex and Age: Past patterns show that females and younger passengers had
higher survival rates.
• Pclass: First-class passengers had more survival chances.
• SibSp and Parch: Family relationships aboard the ship may influence survival.
Model Building
Using libraries such as pandas for data manipulation and scikit-learn for building the
logistic regression model:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load dataset
data = pd.read_csv('titanic.csv')
# Data preprocessing
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].fillna(0)
y = data['Survived']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = LogisticRegression(max_iter=1000)  # Higher max_iter helps convergence
model.fit(X_train, y_train)
Model Evaluation
To evaluate the model's effectiveness, we can calculate metrics like accuracy,
precision, and recall:
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f'Accuracy: {accuracy}')
Due to the inherent imbalance within survival rates, it is important to utilize additional
evaluation methods, such as confusion matrices or ROC-AUC scores, to ensure the
model is reliable and robust in predicting outcomes in varied scenarios.
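For instance, a minimal sketch using scikit-learn's metrics module (continuing the variables above):
from sklearn.metrics import confusion_matrix, roc_auc_score
print(confusion_matrix(y_test, predictions))
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))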
Lab 07
Decision Tree Classifier for 'Bought Product' Prediction
Decision trees are a popular machine learning model used for classification and
regression tasks. They function by splitting the data into subsets based on the value of
input features, creating a tree-like model of decisions. Each internal node of the tree
represents a feature, the branches represent decision rules, and the leaf nodes
represent the outcome or target variable.
1. Data Preparation: Build a small dataset and encode the categorical feature numerically.
import pandas as pd
data = {
    'Age': [23, 45, 31, 35, 60, 25],
    'Gender': ['M', 'F', 'F', 'M', 'F', 'M'],
    'Income': [50000, 60000, 80000, 45000, 70000, 30000],
    'Bought': [1, 0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1})  # Encoding categorical data
2. Model Training: The decision tree classifier can be trained using the features
Age, Gender, and Income to predict the Bought outcome.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X = df[['Age', 'Gender', 'Income']]
y = df['Bought']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
Decision Tree Classifier for 'Go/No-Go' Decisions
Concept Overview
A decision tree works by iteratively splitting the dataset into subsets based on feature
values. Each node represents a decision point, where the data is divided according to
the best attribute to separate classes effectively. This structure allows for both
visualization and interpretability of the decision-making process.
Example Implementation
Here's how you can implement a decision tree classifier using Python:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Sample data
data = {
    'Nationality': ['US', 'US', 'FR', 'FR', 'US', 'CN', 'CN', 'FR'],
    'Rank': [1, 2, 2, 1, 3, 1, 3, 1],
    'Decision': ['Go', 'No-Go', 'Go', 'Go', 'No-Go', 'Go', 'No-Go', 'Go']
}
df = pd.DataFrame(data)
# Encode categorical features numerically
df['Nationality'] = df['Nationality'].map({'US': 0, 'FR': 1, 'CN': 2})
df['Decision'] = df['Decision'].map({'No-Go': 0, 'Go': 1})
X = df[['Nationality', 'Rank']]
y = df['Decision']
# Train-test split and model training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Visualize the trained tree's decision logic
plot_tree(model, feature_names=['Nationality', 'Rank'], class_names=['No-Go', 'Go'], filled=True)
plt.show()
# Making predictions
predictions = model.predict(X_test)
Interpretation
In this example:
• We prepare a simple dataset where each entry reflects the nationality, rank, and
the corresponding decision ('Go' or 'No-Go').
• The nationality is encoded into numerical values for model training.
• The decision tree is trained on this data, resulting in a visual representation that
outlines the decision logic.
This method provides clear insights into how different ranks and nationalities influence
the 'Go/No-Go' decision's outcome. By analyzing the decision paths highlighted in the
tree, stakeholders can better understand which features are most influential, aiding in
strategic decision-making.