Automation With Python Using Excel

#_ Automation With Python & Excel [ Use Cases ]

1. Introduction

Excel is widely used for data representation and analysis.

Repetitive tasks in Excel can be time-consuming. That's where Python
comes into play: scripting those tasks automates them and saves a
great deal of time.

2. Background

The main library for automating Excel with Python is openpyxl, which
reads and writes Excel (.xlsx) files.

How does it work? At a high level, when you're working with Excel via
openpyxl, you're actually interacting with objects in memory. For
instance, a "Workbook" object represents an Excel file, while a
"Worksheet" object represents an individual sheet.

3. Setting Up

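First, install openpyxl with pip:

pip install openpyxl

The later examples also use pandas, scikit-learn, statsmodels,
requests, matplotlib, and Pillow. Install them the same way when you
reach those sections:

pip install pandas scikit-learn statsmodels requests matplotlib pillow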

4. Thinking About Automation

Identify repetitive tasks: Automation starts by identifying a
repetitive task. Example: You may have to format new data the same way
every week.

Break tasks into steps: Understand the step-by-step process you'd
normally do manually.

By: Waleed Mousa


Translate to code: Once you've identified the manual steps, you'll
convert these into Python code.

5. Real-World Example: Summarizing Monthly Sales

Scenario: You get a monthly Excel sheet with sales data. You want to
calculate the total sales and average sales for the month, then add
this info to the sheet.

Manual steps:

1. Open the file.
2. Identify the range of sales data.
3. Calculate the total and average.
4. Write the total and average at the end of the column.

Python Automation:

import openpyxl

# Step 1: Open the file
wb = openpyxl.load_workbook('monthly_sales.xlsx')
sheet = wb.active

# Step 2: Identify the range of sales data
# (row 1 is assumed to be the header; sales are in column 2)
last_row = sheet.max_row
sales_data = [sheet.cell(row=i, column=2).value for i in range(2, last_row + 1)]

# Step 3: Calculate the total and average
total_sales = sum(sales_data)
avg_sales = total_sales / len(sales_data)

# Step 4: Write the total and average at the end of the column
sheet.cell(row=last_row + 1, column=1, value="Total Sales:")
sheet.cell(row=last_row + 1, column=2, value=total_sales)
sheet.cell(row=last_row + 2, column=1, value="Average Sales:")
sheet.cell(row=last_row + 2, column=2, value=avg_sales)

# Save changes
wb.save('monthly_sales_summary.xlsx')
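
The script above assumes every cell in the sales column holds a
number. A blank or text cell would make sum() raise a TypeError. Here
is a minimal sketch of a guard, assuming you simply want to skip such
cells. The helper name summarize is hypothetical and not part of
openpyxl.

```python
def summarize(values):
    """Return (total, average) over the numeric entries of `values`.

    Hypothetical helper: skips None, text, and booleans, which is what
    sheet.cell(...).value may hold for blank or non-numeric cells.
    """
    numbers = [v for v in values
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if not numbers:
        return 0, None  # nothing to average
    total = sum(numbers)
    return total, total / len(numbers)


# Example with one blank cell and one text cell mixed in
total_sales, avg_sales = summarize([1200, None, 800.5, "n/a", 999.5])
print(total_sales, avg_sales)
```

In the script, you would pass sales_data to this helper instead of
calling sum() and len() directly.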



Advanced Python Automation Using Excel

1. Creating Multiple Worksheets Based on Categories

Scenario: Imagine you have a main worksheet with a list of customers,
their purchases, and the category of items they bought. You want to
create separate worksheets for each category and list the respective
customers there.

import openpyxl

# Load Workbook and active sheet
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active

# Create a dictionary to hold data by category
category_data = {}

# Assuming column 1: Customers, column 2: Purchase Amount, column 3: Category
for row in range(2, sheet.max_row + 1):
    category = sheet.cell(row=row, column=3).value
    if category not in category_data:
        category_data[category] = []
    category_data[category].append((sheet.cell(row=row, column=1).value,
                                    sheet.cell(row=row, column=2).value))

# Create separate worksheets for each category
for category, data in category_data.items():
    new_sheet = wb.create_sheet(title=category)
    for idx, (customer, purchase) in enumerate(data, 1):
        new_sheet.cell(row=idx, column=1, value=customer)
        new_sheet.cell(row=idx, column=2, value=purchase)

wb.save('sales_data_by_category.xlsx')
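
Category values come straight from cells, so they may be empty, too
long, or contain characters Excel does not allow in sheet names. Excel
limits sheet names to 31 characters and rejects \ / ? * [ ] and :.
Here is a minimal sketch of a cleanup step. The helper
safe_sheet_title and its fallback name are hypothetical.

```python
import re

# Characters Excel rejects in worksheet names
INVALID_TITLE_CHARS = re.compile(r'[\\/*?:\[\]]')
MAX_TITLE_LENGTH = 31


def safe_sheet_title(category, fallback="Uncategorized"):
    """Turn a cell value into a string Excel accepts as a sheet name."""
    text = "" if category is None else str(category)
    text = INVALID_TITLE_CHARS.sub("_", text).strip()
    # Empty results fall back to a placeholder; long ones are truncated
    return (text or fallback)[:MAX_TITLE_LENGTH]


print(safe_sheet_title("Toys/Games"))
print(safe_sheet_title(None))
```

You could then call wb.create_sheet(title=safe_sheet_title(category)).
Note that two different categories may map to the same cleaned name,
so check for collisions if your data is messy.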

2. Conditional Formatting

Scenario: You want to highlight sales greater than a certain value,
e.g., $5000.

import openpyxl
from openpyxl.styles import PatternFill

# Load Workbook and sheet
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active

# Highlight sales greater than 5000
# (this writes a fixed fill; it does not update if the value later
# changes in Excel)
highlight_fill = PatternFill(start_color="FFFF00", end_color="FFFF00",
                             fill_type="solid")
for row in range(2, sheet.max_row + 1):
    cell = sheet.cell(row=row, column=2)
    # Skip empty or text cells, which cannot be compared with a number
    if isinstance(cell.value, (int, float)) and cell.value > 5000:
        cell.fill = highlight_fill

wb.save('highlighted_sales_data.xlsx')

3. Integrating Pandas for Data Analysis

Scenario: Compute and append month-over-month growth for a series of
monthly sales data.

import pandas as pd

# Read data into a DataFrame
df = pd.read_excel('monthly_sales.xlsx')

# Calculate month-over-month growth
df['MoM Growth'] = df['Sales'].pct_change()

# Save the DataFrame back to Excel
df.to_excel('sales_with_growth.xlsx', index=False)
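
To see what pct_change() computes, here is a plain-Python sketch of
the same formula: (current - previous) / previous. The function name
mom_growth is hypothetical. Returning None where no growth can be
computed is a choice made for this sketch; pandas uses NaN for the
first row.

```python
import math


def mom_growth(sales):
    """Month-over-month growth as fractions (0.1 means +10%)."""
    if not sales:
        return []
    growth = [None]  # the first month has no previous month
    for previous, current in zip(sales, sales[1:]):
        if previous in (None, 0) or current is None:
            growth.append(None)  # growth is undefined here
        else:
            growth.append((current - previous) / previous)
    return growth


result = mom_growth([100, 110, 99])
print(result)
```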

4. Pivot Tables and Data Summarization

Scenario: You have data on products sold, their categories, and the
sales figures. You want to summarize sales by category.

import pandas as pd

# Read data into a DataFrame
df = pd.read_excel('product_sales.xlsx')

# Create a pivot table
pivot = df.pivot_table(index='Category', values='Sales', aggfunc='sum')

# Save the pivot table and the detailed data to separate worksheets
with pd.ExcelWriter('product_sales_summary.xlsx') as writer:
    pivot.to_excel(writer, sheet_name="Summary")
    df.to_excel(writer, sheet_name="Detailed Data")
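
For readers new to pivot tables, the summary above boils down to "add
up Sales for each Category". Here is a rough plain-Python equivalent.
The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def sales_by_category(rows):
    """Sum the sales amount per category.

    `rows` is an iterable of (category, amount) pairs.
    """
    totals = defaultdict(float)
    for category, amount in rows:
        totals[category] += amount
    # Sort by category name so the output order is predictable
    return dict(sorted(totals.items()))


sample_rows = [("Toys", 120.0), ("Books", 80.0), ("Toys", 30.0)]
summary = sales_by_category(sample_rows)
print(summary)
```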

5. Merging Multiple Excel Files

Scenario: You have multiple monthly sales Excel files and you want to
merge them into a yearly file.

import pandas as pd
import glob

# Gather all Excel files in the directory, sorted so the months are
# read in a predictable order
all_files = sorted(glob.glob('sales_*.xlsx'))

# Read and concatenate all files into a single DataFrame
all_data = pd.concat([pd.read_excel(file) for file in all_files],
                     ignore_index=True)

# Save the concatenated data to a new file
all_data.to_excel('yearly_sales_data.xlsx', index=False)
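
One caveat: the pattern sales_*.xlsx also matches files such as
sales_data.xlsx or sales_chart.xlsx from the other examples. Here is a
minimal sketch of a stricter filter. It assumes the monthly files are
named like sales_2023_01.xlsx, which is a hypothetical convention and
not something the scenario specifies.

```python
import re

# Assumed naming convention: sales_<year>_<month>.xlsx
MONTHLY_FILE = re.compile(r"^sales_(\d{4})_(\d{2})\.xlsx$")


def monthly_files(filenames):
    """Keep only monthly sales files, ordered by year and month."""
    matched = []
    for name in filenames:
        match = MONTHLY_FILE.match(name)
        if match:
            year, month = int(match.group(1)), int(match.group(2))
            matched.append(((year, month), name))
    return [name for _, name in sorted(matched)]


candidates = ["sales_2023_02.xlsx", "sales_data.xlsx",
              "sales_2023_01.xlsx", "sales_chart.xlsx"]
selected = monthly_files(candidates)
print(selected)
```

You could apply it to the glob result: monthly_files(glob.glob('sales_*.xlsx')).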

6. Automating Charts and Graphs

Scenario: You have monthly sales figures, and you want to generate a
line chart for visual representation.

import openpyxl
from openpyxl.chart import LineChart, Reference

wb = openpyxl.load_workbook('monthly_sales.xlsx')
sheet = wb.active

# Create a new line chart object
chart = LineChart()
chart.title = "Monthly Sales"
chart.style = 13  # Use a pre-defined style
chart.x_axis.title = 'Month'
chart.y_axis.title = 'Sales ($)'
chart.y_axis.majorGridlines = None

# Set data and categories for the chart
# (the data range starts at row 1 so the header becomes the series title)
data = Reference(sheet, min_col=2, min_row=1, max_col=2,
                 max_row=sheet.max_row)
categories = Reference(sheet, min_col=1, min_row=2, max_row=sheet.max_row)
chart.add_data(data, titles_from_data=True)
chart.set_categories(categories)

# Add the chart to the sheet and position it
sheet.add_chart(chart, "D5")

wb.save("sales_chart.xlsx")

7. Handling Excel Filters

Scenario: You want to automatically apply filters to a range of data for
easier manual review.

import openpyxl

wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active

# Apply filter to entire data range
sheet.auto_filter.ref = sheet.dimensions

wb.save('filtered_sales_data.xlsx')

8. Data Validation

Scenario: You're preparing a template for sales input and you want to
ensure that only valid data is entered (e.g., sales figures between 1
and 10,000).

import openpyxl
from openpyxl.worksheet.datavalidation import DataValidation

wb = openpyxl.Workbook()
sheet = wb.active

# Create a data validation rule
validation = DataValidation(type="whole", operator="between", formula1=1,
                            formula2=10000)
validation.errorTitle = "Invalid entry"
validation.error = "Sales figure should be between 1 and 10,000."

# Apply the validation to a range
validation.add('B2:B1000')
sheet.add_data_validation(validation)

wb.save('sales_template.xlsx')
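
Excel checks a validation rule when someone types into a cell, so
values written by a script are not checked against it. If you also
fill the template from Python, a small check that mirrors the rule can
help. This is a sketch only; is_valid_sales is a hypothetical helper.

```python
def is_valid_sales(value, low=1, high=10_000):
    """Mirror the 'whole number between low and high' rule in Python."""
    if isinstance(value, bool):
        return False  # True/False are ints in Python but not sales figures
    if isinstance(value, float):
        if not value.is_integer():
            return False  # 12.5 is not a whole number
        value = int(value)
    return isinstance(value, int) and low <= value <= high


for candidate in (1, 10_000, 0, 12.5, "500", None):
    print(candidate, is_valid_sales(candidate))
```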

9. Conditional Styling Based on Cell Values

Scenario: You want to change the background color of cells based on
their values (e.g., sales over 10,000 get a green background).

import openpyxl
from openpyxl.styles import PatternFill

wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active

green_fill = PatternFill(start_color="00FF00", end_color="00FF00",
                         fill_type="solid")

for row in range(2, sheet.max_row + 1):
    cell = sheet.cell(row=row, column=2)
    # Skip empty or text cells, which cannot be compared with a number
    if isinstance(cell.value, (int, float)) and cell.value > 10000:
        cell.fill = green_fill

wb.save('color_coded_sales.xlsx')

10. Integrating External APIs

Scenario: You have a list of addresses, and you want to retrieve
latitude and longitude using a geocoding service and store the values
in the Excel file.

import openpyxl
import requests

wb = openpyxl.load_workbook('addresses.xlsx')
sheet = wb.active

API_ENDPOINT = "https://geocode.search.hereapi.com/v1/geocode"
API_KEY = "YOUR_API_KEY"  # Replace with your actual API key

for row in range(2, sheet.max_row + 1):
    address = sheet.cell(row=row, column=1).value
    # The timeout keeps the script from hanging on a stalled request
    response = requests.get(API_ENDPOINT,
                            params={"q": address, "apiKey": API_KEY},
                            timeout=10).json()

    # Assuming the API response is valid and contains lat/lon information
    lat = response['items'][0]['position']['lat']
    lon = response['items'][0]['position']['lng']

    sheet.cell(row=row, column=2, value=lat)
    sheet.cell(row=row, column=3, value=lon)

wb.save('addresses_with_lat_lon.xlsx')

Note: Ensure you handle possible exceptions and rate-limiting when
dealing with external APIs.
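
One way to act on that note is sketched below. It assumes the response
has the items/position shape used in the loop above. The helper names,
the retry count, and the delays are all hypothetical choices. The
fetch argument stands in for the requests.get(...).json() call, so the
sketch runs without network access.

```python
import time


def extract_lat_lon(payload):
    """Return (lat, lon) from a geocoding response, or None if absent."""
    try:
        position = payload["items"][0]["position"]
        return position["lat"], position["lng"]
    except (KeyError, IndexError, TypeError):
        return None  # no match, or an unexpected response shape


def geocode_with_retry(fetch, address, attempts=3, delay=1.0,
                       sleep=time.sleep):
    """Call fetch(address), retrying with a growing pause on errors."""
    for attempt in range(attempts):
        try:
            return extract_lat_lon(fetch(address))
        except Exception:  # with requests, catch requests.RequestException
            if attempt == attempts - 1:
                return None  # give up after the last attempt
            sleep(delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None


# Stand-in for a real API call: fails once, then succeeds
calls = []

def flaky_fetch(address):
    calls.append(address)
    if len(calls) < 2:
        raise RuntimeError("temporary failure")
    return {"items": [{"position": {"lat": 52.5, "lng": 13.4}}]}

coords = geocode_with_retry(flaky_fetch, "Berlin", sleep=lambda s: None)
print(coords, len(calls))
```

Rows where the helper returns None can be left blank in the sheet and
retried later.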

11. Time Series Forecasting

Scenario: Predicting future sales based on past data.

You can utilize libraries like statsmodels to automate the creation of
time series forecasts, and then save the forecasted results in Excel.

import openpyxl
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Read sales data into a DataFrame
df = pd.read_excel('sales_data.xlsx', index_col='Date', parse_dates=True)

# Train a time series model and forecast the next 12 months
model = ExponentialSmoothing(df['Sales'], trend='add', seasonal='add',
                             seasonal_periods=12)
fit = model.fit()
forecast = fit.forecast(12)

# Add forecast to Excel
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active
# Fix the first free row once: sheet.max_row grows as cells are written
first_new_row = sheet.max_row + 1
# forecast is a pandas Series; its index holds the forecast dates
# (assuming the Date column has a regular monthly frequency)
for offset, (date, value) in enumerate(forecast.items()):
    sheet.cell(row=first_new_row + offset, column=1, value=date)
    sheet.cell(row=first_new_row + offset, column=2, value=float(value))

wb.save('sales_forecast.xlsx')

12. Automating Descriptive Statistics

Scenario: For each column of data in an Excel file, compute and save
descriptive statistics (mean, median, standard deviation).

import openpyxl
import pandas as pd

df = pd.read_excel('data.xlsx')
# describe() reports count, mean, std, min, quartiles, and max;
# the 50% row is the median
desc_stats = df.describe()

# Save to Excel
with pd.ExcelWriter('data_summary.xlsx') as writer:
    df.to_excel(writer, sheet_name='Original Data')
    desc_stats.to_excel(writer, sheet_name='Descriptive Statistics')

13. Data Normalization and Standardization

Scenario: Normalize and standardize numerical columns for further
analysis.

import openpyxl
import pandas as pd

df = pd.read_excel('data.xlsx')

# Keep only the numeric columns; text or date columns cannot be scaled
numeric = df.select_dtypes(include='number')

# Normalize data (0-1 scaling)
df_normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# Standardize data (z-score scaling)
df_standardized = (numeric - numeric.mean()) / numeric.std()

# Save both to Excel
with pd.ExcelWriter('processed_data.xlsx') as writer:
    df_normalized.to_excel(writer, sheet_name='Normalized Data')
    df_standardized.to_excel(writer, sheet_name='Standardized Data')
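
The two formulas are easier to check on a tiny example. The sketch
below applies them to one column in plain Python. The function names
are hypothetical. Returning zeros for a constant column is a choice
made here, whereas the pandas expressions above would produce NaN in
that case.

```python
import math
import statistics


def min_max_scale(values):
    """Rescale values to the 0-1 range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Standardize values: (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1), pandas' default
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


column = [10, 20, 30]
print(min_max_scale(column))
print(z_scores(column))
```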

14. Principal Component Analysis (PCA) for Dimension Reduction

Scenario: Reduce the dimensions of a dataset for visualization or
further analysis.

Using sklearn, you can automate PCA and save the reduced data to Excel.

import openpyxl
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_excel('high_dim_data.xlsx')
pca = PCA(n_components=2) # Reduce to 2 dimensions for simplicity
principal_components = pca.fit_transform(df)
df_pca = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

df_pca.to_excel('reduced_data.xlsx', index=False)

15. Clustering for Data Segmentation

Scenario: Group data points into clusters based on similarities.



Use sklearn to automate K-means clustering and save cluster labels to
Excel.

import openpyxl
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_excel('data_for_clustering.xlsx')
kmeans = KMeans(n_clusters=3) # Assuming 3 clusters for this example
df['Cluster'] = kmeans.fit_predict(df)

df.to_excel('clustered_data.xlsx', index=False)

16. Automated Outlier Detection

Scenario: Detect outliers in a dataset based on the Z-score method.

import openpyxl
import pandas as pd

df = pd.read_excel('data.xlsx')
df['Z-Score'] = ((df['Column_Name'] - df['Column_Name'].mean())
                 / df['Column_Name'].std())
# Outliers are typically defined as values more than 3 standard
# deviations from the mean
df['Is_Outlier'] = df['Z-Score'].abs() > 3

df.to_excel('data_with_outliers.xlsx', index=False)
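
A caveat worth knowing: with the sample standard deviation, the
largest possible |z| in n rows is (n - 1) / sqrt(n), so a column with
ten or fewer rows can never produce a value above 3. The plain-Python
sketch below shows this with made-up numbers. The function name
flag_outliers is hypothetical.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return one True/False flag per value using the z-score rule."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std, as pandas' .std() uses
    if std == 0:
        return [False] * len(values)
    return [abs((v - mean) / std) > threshold for v in values]


# 21 rows: the extreme value is flagged
readings = [10] * 20 + [1000]
flags = flag_outliers(readings)
print(flags[-1], sum(flags))

# 4 rows: the same kind of extreme value cannot reach |z| > 3
small_flags = flag_outliers([1, 2, 3, 100])
print(any(small_flags))
```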

17. Feature Engineering

Scenario: Generate polynomial features for regression analysis.

import openpyxl
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_excel('data_for_regression.xlsx')

poly = PolynomialFeatures(degree=2)
polynomial_features = poly.fit_transform(df)
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
feature_names = poly.get_feature_names_out(df.columns)

df_poly = pd.DataFrame(polynomial_features, columns=feature_names)
df_poly.to_excel('polynomial_features.xlsx', index=False)

18. Data Imputation

Scenario: Fill missing values in a dataset.

import openpyxl
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_excel('data_with_missing_values.xlsx')

# Use mean imputation for simplicity
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

df_imputed.to_excel('data_without_missing_values.xlsx', index=False)

19. Text Data Preprocessing

Scenario: Clean and preprocess a column containing text data.

import openpyxl
import pandas as pd
import re

df = pd.read_excel('text_data.xlsx')

# A simple preprocessing function to clean text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    # Collapse whitespace last, so removed characters leave no double spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

df['Cleaned_Text'] = df['Text_Column'].apply(clean_text)

df.to_excel('cleaned_text_data.xlsx', index=False)

20. Encoding Categorical Variables

Scenario: Convert categorical variables into numerical format.

import openpyxl
import pandas as pd

df = pd.read_excel('data_with_categories.xlsx')

# Convert categorical column to numerical using one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Category_Column'], drop_first=True)

df_encoded.to_excel('encoded_data.xlsx', index=False)

21. Automating Data Visualization

Scenario: Generate histograms for numerical columns.

import openpyxl
import pandas as pd
from openpyxl.drawing.image import Image  # requires the Pillow package

df = pd.read_excel('data.xlsx')
ax = df.hist(bins=50)  # plotting requires matplotlib

# Save the plots as images and then insert them into Excel
fig = ax[0][0].get_figure()
fig.savefig('histograms.png')

wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.active
img = Image('histograms.png')
sheet.add_image(img, 'D5')  # Place the image at cell D5

wb.save('data_with_histograms.xlsx')

22. Correlation Analysis

Scenario: Calculate correlations between variables and save the matrix
to Excel.

import openpyxl
import pandas as pd

df = pd.read_excel('data.xlsx')

# numeric_only=True skips text columns, which would otherwise raise an
# error in recent pandas versions
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix.to_excel('correlation_matrix.xlsx', index=True)
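
By default, corr() computes the Pearson correlation coefficient for
each pair of columns. As a reference for what one entry of the matrix
means, here is a plain-Python sketch of that formula for two columns.
The function name pearson and the sample values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ad_spend = [1, 2, 3, 4]
rising = pearson(ad_spend, [2, 4, 6, 8])    # moves together
falling = pearson(ad_spend, [8, 6, 4, 2])   # moves in opposite directions
print(rising, falling)
```

Values near 1 or -1 indicate a strong linear relationship, and values
near 0 indicate little or none.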

23. Automating Data Splitting

Scenario: Split data into training and test sets for model validation.

import openpyxl
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_excel('data_for_modeling.xlsx')

train, test = train_test_split(df, test_size=0.2)

with pd.ExcelWriter('split_data.xlsx') as writer:
    train.to_excel(writer, sheet_name='Training Data', index=False)
    test.to_excel(writer, sheet_name='Test Data', index=False)
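
As written, the split shuffles the rows differently on every run.
train_test_split accepts a random_state argument to make the result
repeatable. The plain-Python sketch below illustrates the idea with a
fixed seed. The function split_rows and its rounding rule are
assumptions for illustration, not scikit-learn's implementation.

```python
import math
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_rows(range(10))
print(len(train_rows), len(test_rows))

# Running it again with the same seed gives the same split
assert split_rows(range(10)) == (train_rows, test_rows)
```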

Using Python with Excel for data science tasks provides a bridge
between traditional spreadsheet-driven analysis and more advanced,
automated analysis. For analysts familiar with Excel but new to
programming, this combination can serve as an excellent transition to
the world of data science and machine learning.
