Automation With Python Using Excel
1. Introduction
2. Background
When automating Excel with Python, the most commonly used library is
openpyxl, which can read and write Excel (.xlsx) files.
How does it work? At a high level, when you're working with Excel via
openpyxl, you're actually interacting with objects in memory. For
instance, a "Workbook" object represents an Excel file, while a
"Worksheet" object represents an individual sheet.
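As a minimal illustration of these objects (a sketch that never touches a file on disk):

```python
import openpyxl

# A Workbook object stands in for the whole .xlsx file;
# each Worksheet object is one sheet inside it.
wb = openpyxl.Workbook()      # create a new workbook in memory
sheet = wb.active             # its default Worksheet
sheet["A1"] = "Hello"         # write a value into cell A1
value = sheet["A1"].value     # read it back
```

Calling wb.save('file.xlsx') at any point serializes these in-memory objects to disk.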
3. Setting Up
Scenario: You get a monthly Excel sheet with sales data. You want to
calculate the total sales and average sales for the month, then add
this info to the sheet.
Manual steps: open the file, sum the sales column, divide by the number
of entries to get the average, and type both results beneath the data.
Python Automation:
import openpyxl

# Step 1: Load the workbook and select the active sheet
wb = openpyxl.load_workbook('monthly_sales.xlsx')
sheet = wb.active

# Step 2: Collect the sales figures (assumed to sit in column B, under a header row)
sales = [row[0] for row in sheet.iter_rows(min_row=2, min_col=2, max_col=2, values_only=True) if row[0] is not None]

# Step 3: Compute the total and the average
total_sales = sum(sales)
avg_sales = total_sales / len(sales)
last_row = sheet.max_row

# Step 4: Write the total and average at the end of the column
sheet.cell(row=last_row + 1, column=1, value="Total Sales:")
sheet.cell(row=last_row + 1, column=2, value=total_sales)
sheet.cell(row=last_row + 2, column=1, value="Average Sales:")
sheet.cell(row=last_row + 2, column=2, value=avg_sales)

# Save changes
wb.save('monthly_sales_summary.xlsx')
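For comparison, the same total and average can be computed with pandas; the frame below is hypothetical sample data standing in for the monthly sheet:

```python
import pandas as pd

# Hypothetical sample data standing in for the monthly sheet
df = pd.DataFrame({"Month": ["Jan", "Feb", "Mar"], "Sales": [1200, 950, 1430]})

total_sales = df["Sales"].sum()
avg_sales = df["Sales"].mean()
```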
import openpyxl

# Sort the sales rows by category (assumed to be in column A) and save a copy
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active

rows = sorted(sheet.iter_rows(min_row=2, values_only=True), key=lambda r: str(r[0]))
for i, row_values in enumerate(rows, start=2):
    for j, value in enumerate(row_values, start=1):
        sheet.cell(row=i, column=j, value=value)

wb.save('sales_data_by_category.xlsx')
2. Conditional Formatting
import openpyxl
from openpyxl.formatting.rule import CellIsRule
from openpyxl.styles import PatternFill

wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active

# Highlight sales figures (assumed column B) above 5,000;
# the threshold and cell range are assumptions
highlight = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')
rule = CellIsRule(operator='greaterThan', formula=['5000'], fill=highlight)
sheet.conditional_formatting.add('B2:B100', rule)

wb.save('highlighted_sales_data.xlsx')
Scenario: You have data on products sold, their categories, and the
sales figures. You want to summarize sales by category.
import pandas as pd

df = pd.read_excel('sales_data.xlsx')

# Summarize sales by category (the column names 'Category' and 'Sales' are assumptions)
summary = df.groupby('Category')['Sales'].sum().reset_index()
summary.to_excel('sales_summary_by_category.xlsx', index=False)
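The core of the summary is a groupby; a self-contained sketch with in-memory data (the column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B", "A"], "Sales": [100, 50, 25]})
summary = df.groupby("Category")["Sales"].sum()
```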
Scenario: You have multiple monthly sales Excel files and you want to
merge them into a yearly file.
import pandas as pd
import glob

# Collect all monthly files (the filename pattern is an assumption)
files = sorted(glob.glob('sales_*.xlsx'))
yearly = pd.concat((pd.read_excel(f) for f in files), ignore_index=True)
yearly.to_excel('yearly_sales.xlsx', index=False)
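The heart of the merge is pd.concat; a tiny self-contained illustration with two in-memory "monthly" frames:

```python
import pandas as pd

jan = pd.DataFrame({"Sales": [100, 200]})
feb = pd.DataFrame({"Sales": [300]})
yearly = pd.concat([jan, feb], ignore_index=True)
```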
Scenario: You have monthly sales figures, and you want to generate a
line chart for visual representation.
import openpyxl
from openpyxl.chart import LineChart, Reference

wb = openpyxl.load_workbook('monthly_sales.xlsx')
sheet = wb.active

# Assumed layout: months in column A, sales in column B, headers in row 1
chart = LineChart()
chart.title = "Monthly Sales"
data = Reference(sheet, min_col=2, min_row=1, max_row=sheet.max_row)
months = Reference(sheet, min_col=1, min_row=2, max_row=sheet.max_row)
chart.add_data(data, titles_from_data=True)
chart.set_categories(months)
sheet.add_chart(chart, "D2")

wb.save("sales_chart.xlsx")
import openpyxl

wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active

# Delete rows whose sales figure (assumed column B) falls below a threshold;
# iterate bottom-up so row indices stay valid while deleting
for row in range(sheet.max_row, 1, -1):
    value = sheet.cell(row=row, column=2).value
    if value is not None and value < 1000:
        sheet.delete_rows(row)

wb.save('filtered_sales_data.xlsx')
8. Data Validation
Scenario: You're preparing a template for sales input and you want to
ensure that only valid data is entered (e.g., sales figures between 1
and 10,000).
import openpyxl
from openpyxl.worksheet.datavalidation import DataValidation

wb = openpyxl.Workbook()
sheet = wb.active
sheet['A1'] = 'Sales'

# Only accept whole numbers between 1 and 10,000 in the sales column
dv = DataValidation(type="whole", operator="between", formula1="1", formula2="10000", showErrorMessage=True)
dv.error = "Sales figures must be between 1 and 10,000."
sheet.add_data_validation(dv)
dv.add('A2:A1000')

wb.save('sales_template.xlsx')
import openpyxl
from openpyxl.styles import PatternFill

wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active

# Color-code the sales column (assumed column B): green above 5,000, red below 1,000
green = PatternFill(start_color='C6EFCE', end_color='C6EFCE', fill_type='solid')
red = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')
for row in range(2, sheet.max_row + 1):
    cell = sheet.cell(row=row, column=2)
    if cell.value is None:
        continue
    if cell.value > 5000:
        cell.fill = green
    elif cell.value < 1000:
        cell.fill = red

wb.save('color_coded_sales.xlsx')
import openpyxl
import requests

wb = openpyxl.load_workbook('addresses.xlsx')
sheet = wb.active

API_ENDPOINT = "https://geocode.search.hereapi.com/v1/geocode"
API_KEY = "YOUR_API_KEY"  # Replace with your actual API key

# Assumed layout: addresses in column A; latitude and longitude written to B and C
for row in range(2, sheet.max_row + 1):
    address = sheet.cell(row=row, column=1).value
    if not address:
        continue
    response = requests.get(API_ENDPOINT, params={"q": address, "apiKey": API_KEY})
    items = response.json().get("items", [])
    if items:
        position = items[0]["position"]
        sheet.cell(row=row, column=2, value=position["lat"])
        sheet.cell(row=row, column=3, value=position["lng"])

wb.save('addresses_with_lat_lon.xlsx')
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Assumed layout: a 'Sales' column of monthly figures
df = pd.read_excel('monthly_sales.xlsx')

model = ExponentialSmoothing(df['Sales'], trend='add').fit()
forecast = model.forecast(3)  # predict the next 3 months

forecast.to_frame(name='Forecast').to_excel('sales_forecast.xlsx')
Scenario: For each column of data in an Excel file, compute and save
descriptive statistics (mean, median, standard deviation).
import pandas as pd

df = pd.read_excel('data.xlsx')
desc_stats = df.describe()

# Save to Excel
with pd.ExcelWriter('data_summary.xlsx') as writer:
    df.to_excel(writer, sheet_name='Original Data')
    desc_stats.to_excel(writer, sheet_name='Descriptive Statistics')
Using sklearn, you can automate PCA and save the reduced data to Excel.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_excel('high_dim_data.xlsx')  # assumes all columns are numeric

pca = PCA(n_components=2)  # Reduce to 2 dimensions for simplicity
principal_components = pca.fit_transform(df)

df_pca = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
df_pca.to_excel('reduced_data.xlsx', index=False)
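One check worth doing after a reduction like this is how much of the original variance the kept components explain, via explained_variance_ratio_. A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# five columns, two of which are nearly collinear with x
data = np.hstack([x, 2 * x + 0.01 * rng.normal(size=(200, 1)), rng.normal(size=(200, 3))])

pca = PCA(n_components=2)
pca.fit(data)
retained = pca.explained_variance_ratio_.sum()  # fraction of variance kept
```

If this fraction is low, two components are discarding real structure and n_components should be raised.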
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_excel('data_for_clustering.xlsx')  # assumes numeric feature columns

kmeans = KMeans(n_clusters=3, n_init=10)  # Assuming 3 clusters for this example
df['Cluster'] = kmeans.fit_predict(df)

df.to_excel('clustered_data.xlsx', index=False)
import pandas as pd

df = pd.read_excel('data.xlsx')

# Outliers are typically defined as values more than 3 standard deviations from the mean
df['Z-Score'] = (df['Column_Name'] - df['Column_Name'].mean()) / df['Column_Name'].std()
df['Is_Outlier'] = df['Z-Score'].abs() > 3

df.to_excel('data_with_outliers.xlsx', index=False)
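One caveat: with only a handful of rows, the sample z-score can never exceed (n-1)/sqrt(n), so very small sheets will rarely flag anything at the |z| > 3 cutoff. A self-contained check on synthetic data:

```python
import pandas as pd

# 29 typical values plus one extreme value (synthetic data)
s = pd.Series([10] * 29 + [1000])
z = (s - s.mean()) / s.std()
is_outlier = z.abs() > 3
```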
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_excel('data_for_regression.xlsx')  # assumes numeric feature columns

# Add squared and interaction terms for each feature
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(df)
df_poly = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(df.columns))

df_poly.to_excel('polynomial_features.xlsx', index=False)  # output filename is an assumption
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_excel('data_with_missing_values.xlsx')

# Fill missing values with each column's mean (assumes numeric columns)
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

df_imputed.to_excel('data_without_missing_values.xlsx', index=False)
import pandas as pd
import re

df = pd.read_excel('text_data.xlsx')

# A simple cleaning function; the exact rules (lowercase, strip punctuation) are assumptions
def clean_text(text):
    text = str(text).lower()
    return re.sub(r'[^a-z0-9\s]', '', text).strip()

df['Cleaned_Text'] = df['Text_Column'].apply(clean_text)
df.to_excel('cleaned_text_data.xlsx', index=False)
import pandas as pd

df = pd.read_excel('data_with_categories.xlsx')

# One-hot encode the categorical columns (pd.get_dummies is one common choice)
df_encoded = pd.get_dummies(df)

df_encoded.to_excel('encoded_data.xlsx', index=False)
import openpyxl
from openpyxl.drawing.image import Image
import pandas as pd

df = pd.read_excel('data.xlsx')
ax = df.hist(bins=50)

# Save the plots as an image, then insert it into the workbook (image support requires Pillow)
fig = ax[0][0].get_figure()
fig.savefig('histograms.png')

wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.active
img = Image('histograms.png')
sheet.add_image(img, 'D5')  # Place the image at cell D5
wb.save('data_with_histograms.xlsx')  # output filename is an assumption
import pandas as pd

df = pd.read_excel('data.xlsx')

correlation_matrix = df.corr(numeric_only=True)
correlation_matrix.to_excel('correlation_matrix.xlsx', index=True)
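Each cell of the matrix is the Pearson correlation between a pair of columns, ranging from -1 to 1; perfectly linearly related columns score 1. A quick sanity check on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]})
corr = df.corr()
```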
Scenario: Split data into training and test sets for model validation.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_excel('data_for_modeling.xlsx')

# Hold out 20% of the rows for testing
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

train_df.to_excel('train_data.xlsx', index=False)  # output filenames are assumptions
test_df.to_excel('test_data.xlsx', index=False)
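A quick self-contained illustration of the split proportions (synthetic data; random_state pins the shuffle for reproducibility):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10)})
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```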
Using Python with Excel for data science tasks provides a bridge
between traditional spreadsheet-driven analysis and more advanced,
automated analysis. For analysts familiar with Excel but new to
programming, this combination can serve as an excellent transition to
the world of data science and machine learning.