Excel With Python Performing Advanced Operations
Contents
Excel with Python: Performing Advanced Operations
1. Course Overview
2. Working with Fonts and Styles
3. Working with Borders and Colors
4. Applying Number Formats
5. Applying Conditional Formatting
6. Using Advanced Conditional Formatting
7. Working with Images
8. Working with Formulae
9. More Operations Using Formulae
10. Using Absolute and Relative Cell References
11. Programmatically Constructing Absolute References
12. Using VLOOKUP
13. Working with Named Ranges
14. Working with Pivot Tables
15. Using Pandas for Pivoting
16. Leveraging Multi-level Indexing in Pandas
17. Course Summary
/conversion/tmp/activity_task_scratch/549519172.docx
14-Oct-21 549519172.docx 2
and other aesthetics in Python. You will work with the Python library openpyxl;
examine data analysis, the use of pivot tables, and the locking of cell references by
using the $ operator; and learn how to perform complex data analysis operations
using pivot tables, sorting and filtering, and formulae with both absolute and relative
cell references to enable efficient copy and paste. You will learn to control the workbook
appearance using conditional formatting and styles. Finally, this course demonstrates
how to leverage the Python Pandas library to read a spreadsheet and to group and
analyze data.
1. Course Overview
Topic title: Course Overview. Your host for this session is Vitthal Srinivasan, a
software engineer and big data expert.
Python and Microsoft Excel are both incredibly popular and powerful technologies
that have fundamentally altered the way in which we analyze, visualize, and model
data. Microsoft Excel is more than a spreadsheet technology. It is also probably the
best prototyping tool for data analysis bar none and can also be thought of as an
interactive functional programming environment. In other words, a forerunner of
Python. Microsoft Excel has an object model that can be leveraged to create and
modify workbooks programmatically.
The oldest such technology was VBA, or Visual Basic for Applications, which
integrates tightly with Excel. However, Python is fast emerging as an extremely
popular choice for spreadsheet automation owing to its ease of use, and the vibrant
ecosystem of libraries it supports. In this course, we move on to more complex
operations on Excel workbooks, including the use of conditional formatting, named
ranges, and merged cells. Data analysis and the use of pivot tables, as well as the
locking of cell references using the dollar operator, are covered as well.
By the end of this course, you will be able to perform complex data analysis
operations using pivoting and formulae leveraging absolute and relative cell
references, and control workbook appearance using conditional formatting and
styles. You will also be able to leverage Pandas, the popular Python library for data
analysis, to group your data and pivot it using one or more columns.
Let's keep going with our exploration of formatting, alignment, and other aesthetics
in Python.
This will be a series of examples on styling of cells. The data that we will be using in
these examples is on screen now. It's fairly simple, made up dummy data. Let's
switch back to Python and launch right in.
The first order of business is to open this workbook using the load_workbook
command.
Because this is a pre-existing workbook, we've got to specify the file path. This is in
the datasets folder, and the name of the file is student_data.xlsx. So far, each time
we've loaded a workbook we have immediately obtained a handle to the active
worksheet. Now, let's try something a little different.
Let's first get a list of all of the sheet names in our workbook. Notice how we make
use of the sheetnames property. This returns a list. That list has one element for
each sheet in the workbook. In this instance, we have just the one sheet.
Let's go ahead and get a handle to that eponymous sheet by indexing into our
workbook object.
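A minimal sketch of the steps just described. So that it runs without the course's datasets folder, it first writes a small stand-in file to a temporary directory; the sheet title and the cell values are made up for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Build a tiny stand-in file so the sketch runs without the course dataset.
# The sheet title and the values are assumptions, not the real data.
path = os.path.join(tempfile.mkdtemp(), "student_data.xlsx")
seed = Workbook()
seed.active.title = "student_data"
seed.active.append(["Name", "Score"])
seed.active.append(["Asha", 73])
seed.save(path)

work_book = load_workbook(path)
sheet_names = work_book.sheetnames   # a list with one title per sheet
sheet = work_book[sheet_names[0]]    # index by title, like a dictionary key
print(sheet_names)
print(sheet["B2"].value)
```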
Notice how we use the name of the sheet along with square brackets, just as we
would to index into a dictionary given a key. We store a reference to the worksheet
in a variable and go ahead and perform various operations with that sheet. Let's go
ahead and import a bunch of functionality from the styles namespace of the
openpyxl library. So let's import Font, Color, Alignment, Border, Side, and colors.
He enters the following code, code starts: from openpyxl.styles import Font, Color,
Alignment, Border, Side, colors. Code ends.
Let's start out by instantiating a font. This Font object takes in a single parameter,
bold, which is set to True.
So we store this in a variable called bold_font. Next, let's go ahead and instantiate a
big_red_text font. Once again, we make use of the Font object.
This time we specify a color parameter which is set to the value colors.RED. The colors
module defines named color constants; any RGB hex code can be used as a color as
well. We also specify the font size to be 20. Now that we've instantiated a couple of
font objects, let's do the same with an alignment object.
We've already worked with alignment objects a little bit in the previous example.
Let's start out with a simple center_aligned_text object. Here we specify the
horizontal alignment equal to center. Next up, let's go ahead and instantiate a side
object. This is going to be used to set the line boundaries between different cells.
So in this case, we specify a single input argument, that's the border_style, set to
double. So by this point, we have a couple of font objects, an alignment object, and a
border side. Let's go ahead and create a border object.
Notice how we make use of the double border side which we instantiated a moment
ago in order to assemble the top, right, bottom, and left boundaries of this border.
We can see that we are now performing pretty low level cell-based manipulation in
Excel using openpyxl, and this ought to seem very familiar to anyone who's worked
with Excel programmatically, maybe using VBA.
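The objects built up so far could look roughly like the sketch below. The variable name double_border_side is an assumption, and a hex string stands in for colors.RED so that the sketch does not depend on which named constants a given openpyxl release defines.

```python
from openpyxl.styles import Alignment, Border, Font, Side

bold_font = Font(bold=True)
# "FF0000" is plain red; used here in place of colors.RED (an assumption).
big_red_text = Font(color="FF0000", size=20)
center_aligned_text = Alignment(horizontal="center")

# One Side object describes a line style; a Border assembles four of them.
double_border_side = Side(border_style="double")
square_border = Border(
    top=double_border_side,
    right=double_border_side,
    bottom=double_border_side,
    left=double_border_side,
)
```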
Now that we've gone ahead and instantiated a whole bunch of objects of different
types, we can finally put some of these into practice. Let's go ahead and apply the
fonts and the alignments and the border to different cells in our worksheet. As
always, we will use the cell indexing that's made available to us to access individual
cells in our worksheet.
So we choose cell B2 in our worksheet and we set the font property of this cell to be
the bold font. Likewise, let's choose cell B3 and set its font property to be the
big_red_text. Let's pick cell C4 and set its alignment to be equal to the
center_aligned_text. And finally, let's select cell location C5 and set its border
property to be equal to the square_border object that we just instantiated up above.
Once we are done with this operation, let's go ahead and save our workbook.
Let's make use of the work_book.save method and specify the file name, which is
styled.xlsx in the folder named workbooks. As always, when we switch over to Excel,
we need to close and then reopen our Excel spreadsheet in order to make sure that
all of the latest changes have been picked up from disk.
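As a hedged sketch of this step, the code below applies one style property to each of the four cells and saves the result. The workbook is built in memory with filler values, and it is saved to a temporary folder rather than to workbooks/styled.xlsx, so that it runs on its own.

```python
import os
import tempfile

from openpyxl import Workbook
from openpyxl.styles import Alignment, Border, Font, Side

work_book = Workbook()            # in-memory stand-in for the student workbook
sheet = work_book.active
for row in range(1, 8):           # filler values so cells B2..C5 hold something
    sheet.append([row, f"student {row}", 69 + row])

double_side = Side(border_style="double")
sheet["B2"].font = Font(bold=True)
sheet["B3"].font = Font(color="FF0000", size=20)
sheet["C4"].alignment = Alignment(horizontal="center")
sheet["C5"].border = Border(top=double_side, right=double_side,
                            bottom=double_side, left=double_side)

# Nothing reaches disk until save() is called.
out_path = os.path.join(tempfile.mkdtemp(), "styled.xlsx")
work_book.save(out_path)
```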
And when we do this, we see that all of the formatting changes that we applied have
indeed been reflected successfully. Let's start by looking at cell B2, that has a bold
font. Cell B3 has big red text. Cell C4 which contains the number 73 is center aligned
in contrast to all of the other cells in this worksheet. And finally, cell C5 has a square
border. As an aside, notice that cell C5 does not actually have the cursor within it. So
the thick sides that we see represent the border. The cursor is present in cell M15. So
we've now applied a variety of individual formatting options to different cells. Let's
try and combine all of these. Let's see what happens if we apply multiple styling
elements to the same cell. Let's switch back to Python and select cell B7.
We then set the alignment of the cell to be center aligned text. We set its font to be
big red text. And we also for good measure, add in a border, which is the square
border. Let's go ahead and save this workbook as usual in order to flush the changes
out to the Excel file on disk.
When we reopen the Excel file, we can see that cell B7 does indeed look very
different. By the way, we can double-click on any column in order to auto size that
column to fit the largest element within it.
So let's double-click on column B in order to fit a cell width to its contents. And when
we do this we can see that cell B7 does indeed have all three of the styles applied. It
is center aligned, the text font is set to a big red font, and it has a thick double border
on all four sides. We've now demonstrated the use of some of the most common
formatting options in Excel. Let's exit out, head back to Python, and do something
even a little more jazzy.
Let's start by importing the NamedStyle class from the styles namespace of
openpyxl.
So we start by instantiating a NamedStyle with the name header and store this in a
variable called custom_style. Once we have this custom style, let's go ahead and
tweak some of its properties.
So we start by setting the font of this custom style to be bold. Then let's go ahead
and add in a border.
This border is only one sided so it has a bottom border, with the border style equal
to thin. So this is going to be a one sided border with a line down below and that is
going to be a thin line. Next, let's go ahead and add an alignment property to our
custom style.
Note that we add here a horizontal as well as a vertical align property, both of those
are set to be equal to center. Now that we have this custom style object, the beauty
of it is that we can go ahead and apply this custom style to any rows or cells that we
wish to. Let's go ahead and extract the header row from our sheet.
We do so using a slightly different syntax than we've used so far. Notice how we
index into the sheet using an integer, that integer is equal to one. So far, we have
indexed into sheets using strings to represent cell locations like A1 or B2. But this is the
first time that we are indexing into a sheet with an integer. The return value here is a
single row. And that row is at index one.
It's always a little confusing to keep track of where indexing works when one is using
a combination of Python and Excel. That's because Python has indices starting from
0, but Excel has indices starting from 1. In any case, this command is going to give us
a handle to the first row. Next, let's go ahead and apply our custom style to every cell
in the header row.
He enters the following code, code starts: for cell in header_row: cell.style =
custom_style. Code ends.
Doing this is easy enough, we iterate over the header row, get each constituent cell
and apply the custom style to it. Once we are done with this iteration, let's go ahead
and save our workbook.
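Put together, the named style and the header-row loop might look like the sketch below. The header labels and the data row are made up, and the workbook is created in memory so that the example is self-contained.

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment, Border, Font, NamedStyle, Side

work_book = Workbook()
sheet = work_book.active
sheet.append(["ID", "Name", "Score"])   # made-up header row
sheet.append([1, "Asha", 73])

custom_style = NamedStyle(name="header")
custom_style.font = Font(bold=True)
custom_style.border = Border(bottom=Side(border_style="thin"))
custom_style.alignment = Alignment(horizontal="center", vertical="center")

header_row = sheet[1]                   # integer index: row 1, the first row
for cell in header_row:
    cell.style = custom_style           # registers the style on first use

print([cell.style for cell in header_row])
```

Reading cell.style back gives the style's name rather than the NamedStyle object, which is why the list printed here contains strings.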
And as always, let's switch back and go ahead and reopen the workbook to see
whether the changes have reflected.
And indeed they have. Our header row now shows up in the custom style that we
just created. So the text is in a bold font. It is center aligned both horizontally and
vertically. And it has a single bottom border, which is a thin line. Let's round out this
example by working with PatternFill.
He switches to the StylingCells file in the Jupyter Notebook. He enters the following
code, code starts: from openpyxl.styles.fills import PatternFill. Code ends.
So let's import PatternFill from the styles.fills namespace. Let's instantiate a new
NamedStyle. Let's call it highlight.
And let's go ahead and set the fill property of this style to be the PatternFill.
Notice how PatternFill takes in a foreground color, as well as a pattern type. The
value of the foreground color argument is set to colors.Color followed by a hex code.
This hex code is d7abcc. The pattern type also takes a string value which is
lightHorizontal. Now that we have the style object setup, let's go ahead and apply it
to all of the data in our spreadsheet.
He enters the following code, code starts: for cell in sheet['A']: cell.style =
one_more_style. Code ends.
So let's index our sheet using the single column index A. openpyxl is smart enough to
figure out that this means we are interested in the column A. We then iterate over
all of the cells in column A and apply this style to each of them. One interesting bit here to note
is that these iterations or cell ranges will only be performed within the actual
dimensions of the spreadsheet. So it's not literally going to be every cell in column A
that's going to be patterned in this way. It's only going to be those cells which
actually contain some data, i.e., those which are within the data dimension. Iterative
operations like this one can become quite expensive as the size of the data in our
worksheet increases. In any case, let's go ahead and save this workbook and reopen
it in Microsoft Excel.
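A small sketch of the highlight style and the column loop, using an in-memory workbook with five made-up rows. It also shows the point about dimensions: indexing with "A" yields only as many cells as the sheet has rows of data.

```python
from openpyxl import Workbook
from openpyxl.styles import NamedStyle, colors
from openpyxl.styles.fills import PatternFill

work_book = Workbook()
sheet = work_book.active
for i in range(1, 6):                     # five made-up data rows
    sheet.append([f"student {i}", 60 + i])

one_more_style = NamedStyle(name="highlight")
one_more_style.fill = PatternFill(fgColor=colors.Color("d7abcc"),
                                  patternType="lightHorizontal")

column_a = sheet["A"]     # stops at sheet.max_row, not at Excel's row limit
for cell in column_a:
    cell.style = one_more_style

print(len(column_a), sheet.max_row)
```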
And sure enough, we see that all of the cells in column A now have this styling
applied. We can see that there is a light red horizontal lining through the interior of
each of these cells. Notice something interesting? The only cells that have actually
been selected within column A are in cell locations A1 through A13. In case you're
wondering why A13 was selected, even though it does not contain any data at this
point in time, that's because we happened to hit Enter while we were in the last
row. And that's why cell A13 was included by Excel while calculating the data-bearing
cells in column A.
Let's keep going with this idea of manipulating cell formatting from openpyxl.
He opens the sales_record file in the Microsoft Excel window and a message is
displayed, which reads: Dataset source:
http://eforexcel.com/wp/wp-content/uploads/2017/07/100-Sales-Records.zip.
We'll now take this to a whole different level working with conditional formatting
from Python. In this example, we will be making use of some simple dummy
ecommerce data. This is taken from eforexcel.com. This is a handy resource which
has got a bunch of useful links and other information for people interested in
working with Excel. So we have ecommerce data here, representing sales in different
regions, countries, with information about item type, sales channels, and so on.
We are going to make use of openpyxl in order to format this data and make it easier
for us to interpret. To begin with, let's apply a simple bit of formatting. Let's add
comma separators to the three columns over on the extreme right, total revenue,
total cost, and total profit. Let's exit out of Excel and switch back over into Python.
While exiting, let's not save any of the changes that we just made.
He enters the following code: import openpyxl. He enters the following code, code
starts: work_book = openpyxl.load_workbook("datasets/sales_record.xlsx"). Code
ends.
This is in the datasets folder, and it's called sales_record.xlsx. Once we have a handle
to the workbook, the next item on the agenda is getting a handle to the active
worksheet. We can do this using the active property on the work_book.
Let's store this reference in a variable called sheet and then use the max_row
property of sheet in order to get a sense of how much data the sheet contains.
max_row is 101 and max_column is 13.
He enters the following code: sheet.max_row. He runs the code. The output reads:
101.
Remember that columns in Excel are indexed starting from 1, so column A is equal to
column 1.
He enters the following code: sheet.max_column and executes it. The output reads:
13.
13 tells us that we have the data up to column M. Let's now go ahead and apply the
comma separators using a number format. This number format is something that we
are going to apply to every cell in the range K2 through M101. The structure of this
for loop ought to be quite familiar to us.
He enters the following code, code starts: for row in sheet['K2:M101']: for cell in
row: cell.number_format = '#,##0'. Code ends.
Notice how we pass in the string K2:M101 as an index into the sheet object. We then
iterate over all of the rows in this range, that's the outer for loop. There is also an
inner for loop in which we iterate over each of the cells in each row. Finally, in the
body of the inner for loop, we go ahead and apply the number format. The
important bit to note about that number format is the comma.
This means that we want commas as thousand separators. In case you've not worked
with number formats in Excel, you should know that the hash symbol is used to
represent numbers which may or may not be present depending on the magnitude
of the number. However, the 0 is used to represent numbers which we absolutely
want Excel to display. That's all there is to it. Let's go ahead and save our workbook.
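The same nested loop, as a runnable sketch on a three-row stand-in sheet. The header labels and the numbers are made up; only the range shape (columns K through M) follows the course data.

```python
from openpyxl import Workbook

work_book = Workbook()
sheet = work_book.active
sheet.append([f"col{i}" for i in range(1, 14)])            # header in A1:M1
sheet.append([0] * 10 + [1234567.8, 934567.2, 300000.6])   # K2:M2
sheet.append([0] * 10 + [2500.0, 1800.0, 700.0])           # K3:M3

print(sheet.max_row, sheet.max_column)   # rows and columns that hold data

for row in sheet["K2:M3"]:               # outer loop: one tuple per row
    for cell in row:                     # inner loop: each cell in that row
        cell.number_format = "#,##0"     # thousands separators, no decimals
```

The number format changes only how Excel displays the value; the stored number is untouched. As a rough Python analogy, f"{1234567.8:,.0f}" produces the string 1,234,568.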
And now when we resize columns K, L, and M, we find that these columns do indeed
possess comma separators for all of their values. That was a good start, let's keep
building on that. Let's exit out of Excel, go back to Python and now apply some pretty
heavy duty conditional formatting. This requires a lot of import statements.
He enters the following code, code starts: from openpyxl.styles import PatternFill,
colors from openpyxl.styles.differential import DifferentialStyle from
openpyxl.formatting.rule import Rule. Code ends.
I'll pause for a moment. Take in how we import PatternFill, the colors module,
DifferentialStyle, and then also the Rule class. The first operation
we're going to try and accomplish is to highlight, in a bright yellow color, all rows
where the total profit is less than a certain threshold. To do this, let's start
out by creating a PatternFill object, which we store as yellow_background.
Next comes a DifferentialStyle object. This has the one input argument which we need
to specify, which is the fill, and we set its value to be equal to that
yellow_background. So we created a PatternFill object, and we passed that into a
DifferentialStyle object. Let's now take this one step further and create a rule. That
rule in turn will need to take in a type.
The type here is expression. The rule also takes the differential style, which has got to
be applied to all data rows which satisfy the expression. We'll see in a moment how
there are other rules which do not require us to specify an expression. But for now,
let's keep going.
We've yet to tell this rule what formula we would like it to evaluate. So let's go
ahead and do that.
And this here is really interesting. We are setting the formula property of our rule to
be a list. Within that list, we could potentially have multiple expressions. We just
choose to have one. That expression is expressed in the form of a simple predicate.
That predicate compares the value of the cell M1 to a fixed number. Effectively,
we are checking whether this particular row has total profit less than $70,000. But
what's interesting is the use of the dollar sign in that cell formula.
Note how there is a $ before the M but there is no $ before the 1. This means that
when this rule is applied to different data rows, Excel is going to have to be smart
enough to figure out that M should remain unchanged. So the column reference to
M is absolute, but the row reference is relative. So as the rule is applied to different
rows, that 1 is going to be replaced by the corresponding row number. The next step
is to find all of the data that we've got to apply this rule to. To do this, let's make use
of the sheet's calculate_dimension method.
This tells us the entire range within the sheet which contains data. And that here, is
A1 through M101. So we can finally make the one call which ties all of this together.
He runs the code. The output reads: 'A1:M101'. He enters the following code, code
starts: sheet.conditional_formatting.add(sheet.calculate_dimension(), rule). Code
ends.
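The pieces of this rule fit together roughly as in the sketch below. The fill's hex code and the variable name diff_style are assumptions, since the course's exact arguments are not shown in the transcript, and the sheet is a three-row stand-in built in memory.

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import Rule
from openpyxl.styles import PatternFill
from openpyxl.styles.differential import DifferentialStyle

work_book = Workbook()
sheet = work_book.active
sheet.append([f"col{i}" for i in range(1, 13)] + ["Total Profit"])  # A1:M1
sheet.append([0] * 12 + [50000])    # below the threshold: highlighted
sheet.append([0] * 12 + [90000])    # above the threshold: left alone

# "FFFF00" (yellow) is an assumed colour code.
yellow_background = PatternFill(bgColor="FFFF00")
diff_style = DifferentialStyle(fill=yellow_background)
rule = Rule(type="expression", dxf=diff_style)
rule.formula = ["$M1<70000"]        # column M locked, row number relative

data_range = sheet.calculate_dimension()
sheet.conditional_formatting.add(data_range, rule)
print(data_range)
```

The formula is written for the top-left cell of the range. Because the range starts at row 1, the row number in $M1 shifts to 2, 3, and so on as Excel evaluates the rule for each later row, while the $ keeps every cell in a row looking at column M.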
Let's save our workbook, and open it up once again in Microsoft Excel and check
whether all of the rows which satisfy the rule are indeed highlighted in yellow. And
the answer is yes. We can immediately see that all of the rows in our data where the
value in column M, that is the total profit, is less than $70,000, all such rows are
highlighted in bright yellow. We have successfully made use of some pretty complex
conditional formatting from openpyxl.
Let's keep going from where we left off. What we had so far is a way to highlight all
items where the total profits are less than $70,000. This was a little binary. Let's go
ahead and apply something a little more subtle.
This is going to allow us to have a gradation for each row based on the quantity of
the profit. The code to do this is actually a lot simpler than the custom rule we had
created a moment ago.
Once we've instantiated this color scale rule, let's go ahead and open up our openpyxl
workbook. Of course, we need a handle to the active sheet, and then, we go ahead
and index the same three columns once again.
He enters the following code, code starts: for row in sheet['K2:M101']: for cell in
row: cell.number_format = '#,##0'. Code ends.
All we're doing here is setting the comma separators for these three columns. This is
just a rehash of what we did before, but now comes the new way.
Let's go ahead and add conditional formatting to the data in the range M2:M101.
That conditional formatting also requires a rule, and the rule is the color scale rule
which we just set up a moment ago. Remember, the way we've set up this rule, the
smallest value should be yellow and the largest value should be red. Let's go ahead
and save this workbook and reopen it in Microsoft Excel.
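A two-colour scale of this kind can be sketched as follows. The transcript does not show the rule's arguments, so the min and max anchors and both hex codes are assumptions chosen to match the description (smallest value yellow, largest value red); the profit figures are made up.

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import ColorScaleRule

work_book = Workbook()
sheet = work_book.active
sheet["M1"] = "Total Profit"
for row, profit in enumerate([1200, 45000, 300000, 980000], start=2):
    sheet.cell(row=row, column=13, value=profit)    # column 13 is M

# Assumed arguments: yellow at the minimum, red at the maximum.
color_scale_rule = ColorScaleRule(start_type="min", start_color="FFFF00",
                                  end_type="max", end_color="FF0000")
sheet.conditional_formatting.add("M2:M5", color_scale_rule)
```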
And immediately, we can see that our conditional formatting has worked.
The values in the total profit column, that's column M, do indeed range between red
and yellow. And we can also see that the small values are yellow and the large values
are red. We have successfully applied a color scale to a specific column using
openpyxl. Let's keep building on this. Let's exit out of the spreadsheet, go back to
Python and add an even more subtle rule.
This one is also a ColorScaleRule, but we've configured the input arguments quite
differently. Now, there are three classes of input arguments, the start, the mid and
the end. For each of these three classes, there is a type, a value and a color. So for
instance, for the start, the type is percentile and the value is zero, so that's the zeroth
percentile, and the start color is specified as a hex code.
Similarly, the mid type is also percentile, the mid value is 50. And there is a hex code
associated with this 50th percentile as well. Finally, we have an end type which is at the
90th percentile and an associated end color. Instantiating this rule was the hardest
part.
The rest is pretty similar to the two rules we had before. We've got to add
conditional formatting to a range. That range is M2:M101 and we've also got to
specify the rule that we've just configured. Next, let's go on and save our workbook
so that the changes are written out to file and reopen that workbook in Microsoft
Excel.
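A sketch of a three-point rule with the percentile settings described above. The three hex codes (pink, yellow, dark green) are assumptions picked to resemble the hues mentioned; the course's actual codes are not given in the transcript.

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import ColorScaleRule

work_book = Workbook()
sheet = work_book.active
sheet["M1"] = "Total Profit"
for row, profit in enumerate([1200, 45000, 300000, 980000], start=2):
    sheet.cell(row=row, column=13, value=profit)

# start / mid / end each take a type, a value and a colour.
three_color_rule = ColorScaleRule(
    start_type="percentile", start_value=0, start_color="FFC0CB",   # pink
    mid_type="percentile", mid_value=50, mid_color="FFFF00",        # yellow
    end_type="percentile", end_value=90, end_color="006400",        # dark green
)
sheet.conditional_formatting.add("M2:M5", three_color_rule)
```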
And when we look at the data in column M, once again we can see that the
conditional formatting has indeed worked. There are three distinct sets of hues.
There are the pink hues for the really small values of profit. There are the yellow
hues for the intermediate values of profit. And there are the deep green hues for the
really profitable items. So in this way, we have successfully applied a three phase
rule of conditional formatting to the data in our spreadsheet from openpyxl.
Let's keep going with our exploration of conditional formatting in Excel. We've
worked with conditional formatting with color gradations. Let's now take a look at
some icon set rules. Just for variety, let's change up our dataset. We are now going
to make use of the zomato_reviews dataset which we used earlier on in this course,
this is available off of Kaggle. This is a restaurant review dataset, and so it's got a
column in there for the rating. Which lends itself really nicely to the use of icon sets.
These are a great way to give your Excel workbook a really polished appearance.
So let's plunge right in. In openpyxl we import the IconSetRule from the
openpyxl.formatting.rule namespace. We then go ahead and load the workbook. This
is datasets/zomato_reviews.xlsx.
As usual, we also need to get a handle to the active worksheet. And now for the
interesting bit. We are going to instantiate an IconSetRule object. Notice how there
are a bunch of input arguments that we need to specify. First is the icon_style which
we specify to be 4Arrows.
The next is the type which is num or numeric. And the third is the set of values which
are going to demarcate the boundaries. These boundaries represent the values at
which the icons used will change. In other words, these demarcate the different
categories within our data. Let's go ahead and make use of this IconSetRule. To do
so, we've got to apply the IconSetRule to a data range. To find the appropriate data
range, let's make use of the max_row property of our sheet. This is a pretty large
dataset.
It has 9,552 rows. Let's go ahead and apply this conditional formatting rule to the
data range G2 to G9552.
And exactly as before, we've got to add two input arguments. The data range, as well
as the rule which we want applied in order to format the data range. Let's go ahead
and save this workbook so that the changes are written out to Excel.
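The icon set rule could be set up along these lines. The boundary values [1, 2, 3, 4] are an assumption, since the transcript does not state the cut-offs used, and the ratings are made-up stand-ins for the review data in column G.

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import IconSetRule

work_book = Workbook()
sheet = work_book.active
sheet["G1"] = "Rating"
for row, rating in enumerate([1.8, 2.9, 3.6, 4.7], start=2):
    sheet.cell(row=row, column=7, value=rating)     # column 7 is G

# Four boundaries for four icons; these particular values are assumed.
icon_set_rule = IconSetRule(icon_style="4Arrows", type="num",
                            values=[1, 2, 3, 4])

# Build the range from max_row instead of hard-coding the last row.
data_range = f"G2:G{sheet.max_row}"
sheet.conditional_formatting.add(data_range, icon_set_rule)
print(data_range)
```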
And if we go back and re-open the Excel spreadsheet and scroll over to the right, we
can see that column G now looks really elegant. There is now a different icon for
each category.
Notice how there are four icons. There are the up and down yellow arrows, the down
red arrows and the up green arrows. All of these correspond to different values of
the restaurant rating. Icon set rules really lend a pretty elegant appearance to an
Excel spreadsheet. And you should try and use them at every available opportunity.
Now, you might notice that these particular icons are not all that appropriate for
ratings. They are actually more appropriate for directions or for quantities which are
changing. So let's go ahead and try one last type of conditional formatting.
Both the start and the end type are num, that's for numeric. Finally, we can also
specify the color that we want our bars to be. Here we've gone with the color red.
Let's now go ahead and apply this conditional formatting rule to the same data, that
is G2 to G9552. This is the data_bar_rule.
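A hedged sketch of a data bar rule with numeric start and end types and a red bar colour. The start and end values (1 and 5) are assumptions, as the transcript does not show them, and the ratings are made up.

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import DataBarRule

work_book = Workbook()
sheet = work_book.active
sheet["G1"] = "Rating"
for row, rating in enumerate([1.0, 2.5, 3.2, 4.0], start=2):
    sheet.cell(row=row, column=7, value=rating)

# Assumed bounds: bars scale between a rating of 1 and a rating of 5.
data_bar_rule = DataBarRule(start_type="num", start_value=1,
                            end_type="num", end_value=5,
                            color="FF0000")
sheet.conditional_formatting.add(f"G2:G{sheet.max_row}", data_bar_rule)
```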
And then when we head over to Excel and reopen it, the formatting for the data in
column G has changed. It's now got these red bars rather than the four icon arrows
as we had in the previous example. We can also see that the length of the bars is
proportionate to the numeric value. So for instance, a value of 1 is represented by
the shortest possible bar and a value of 4 is represented by the longest possible bar.
This gets us to the end of our exploration of conditional formatting.
Hopefully you've learned enough to realize that there is a lot that we can accomplish
with conditional formatting from openpyxl.
Let's jump into a quick example which demonstrates how we can work with images
which are embedded in Microsoft Excel spreadsheets. The image that we will be
making use of here is a very common one, it is the Python logo. Let's plunge right
into a simple example. Let's open up a Jupyter Notebook.
Begin by importing Workbook and Image. These are the two relevant classes from
openpyxl. Notice that we import Image from the openpyxl.drawing.image
namespace.
He enters the following code, code starts: from openpyxl import Workbook from
openpyxl.drawing.image import Image. Code ends.
Let's go ahead and instantiate a workbook. This is a brand new empty workbook that
we are constructing in memory.
Let's get a handle to the active worksheet. Let's also create an image object. For this
we need to specify the file path to the image file.
Once we have all of this preliminary setup in place, it's actually really simple to add
an image into our worksheet. We simply invoke the add_image function on the
worksheet object.
The input arguments that we need to pass include the image object as well as the
cell location where we would like that image to be embedded. Here, we'd like the
image embedded at the cell location, C11. So let's execute this, save our workbook
so that the changes are written out to file.
If we now go back and reopen this workbook in Microsoft Excel, we can confirm that
the Python logo does indeed appear embedded at cell location C11. In Excel, we can
move this image around, we can drag and drop it to change its dimensions. Let's see
how some of that could be accomplished from openpyxl. So let's close out the
workbook and head back to Python. This time we will modify some of the properties
of the image object. To begin with, notice that the image object has properties called
width and height. By default these reflect the size of the image file, here 225 pixels each.
He enters the following code: img.width, img.height. He runs the code. The output
reads: (225, 225).
So we will read the image in from file again and instantiate a new image object called
large image. Next, in order to actually resize the image, let's set the width and the
height properties of our large image to 400 pixels.
Once that's done, let's go ahead and add this large image into our worksheet at cell
location L11.
As always, we need to save this workbook to make sure these changes are written to
disk.
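A hedged sketch of the resizing step is shown below. The variable name large_img, the file names, and the generated placeholder image are assumptions made for illustration; only the width and height properties and the add_image call come from the walkthrough.

```python
import os
import tempfile

from openpyxl import Workbook
from openpyxl.drawing.image import Image
from PIL import Image as PILImage  # Pillow, needed by openpyxl for images

# Assumption: a generated placeholder stands in for the Python logo file.
work_dir = tempfile.mkdtemp()
logo_path = os.path.join(work_dir, "python_logo.png")  # hypothetical name
PILImage.new("RGB", (225, 225), "white").save(logo_path)

workbook = Workbook()
sheet = workbook.active

img = Image(logo_path)
sheet.add_image(img, "C11")        # original-size copy at C11

large_img = Image(logo_path)       # read the same file in again
large_img.width = 400              # resize by setting the properties
large_img.height = 400
sheet.add_image(large_img, "L11")  # larger copy at L11

output_path = os.path.join(work_dir, "images_resized.xlsx")  # hypothetical
workbook.save(output_path)
```

Creating a second Image object, instead of resizing the first, keeps the two embedded copies independent of each other.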
Reopen Excel, and now indeed, we find that there are two embedded versions of the
image. One of these is much smaller than the other. The large image does indeed
appear at cell location L11. We have successfully demonstrated working with
embedded images in Excel using openpyxl.
Let's say we have just a little bit of simple data in an Excel workbook and we'd like to
calculate the sum of this data.
This is simple enough to accomplish: we select all of the values and check the Sum in
the status bar at the bottom. We can see the sum here is 54. In addition to
the sum, we can also see that the average is 10.8 and the count is equal to 5. Let's
now go ahead and do this a little more explicitly. So we will invoke the sum formula
on the five values that we have selected.
So we go to any cell and type out the equals sign, which is the standard way to
begin a formula, followed by the function name. The function that we use here is
SUM, and then comes the range that we wish to sum over. And the sum of A1
through A5 is indeed 54. This was just a really simple example of using formulae in
Excel. There are a large number of other formulae that we could use as well. For
instance, we could calculate the product of these five numbers. That is done
using the PRODUCT function, which gives us 87318.
We can also count, and here we intentionally include a number of blank cells. The COUNT
function counts only the cells that contain numbers, so the blank cells are ignored.
So there are five non-blank values in all of the cells that we have selected. Similarly,
we can use the average worksheet function.
And as we remember, the average of these five numbers is 10.8. Formulae like those
which we have demonstrated so far are known as worksheet functions in Excel,
because we type them out in individual cells of our worksheet and they operate
on other cells within the same worksheet. Let's see how we can recreate this using
openpyxl. So let's exit out of Excel; there's no particular reason for us to save this
workbook. So we hit Don't Save and switch back to Jupyter, where we begin by
importing the FORMULAE constant from openpyxl.utils. Now, a couple of important
notes about worksheet formulae from openpyxl.
He opens the ApplyingFormulae file in the Jupyter Notebook. The following code is
displayed, code starts: from openpyxl.utils import FORMULAE. Code ends.
The first note is that these are never going to be evaluated by openpyxl. So we can
specify formulae, but we've got to do so using strings. The other bit worth noting is
that openpyxl has a large set of formulae which it knows about. All of these are
available by printing out the variable FORMULAE, this is all caps.
He enters the following code: FORMULAE. He runs the code. The output displays a
list of all formulae.
And when we print it out to screen, we see that this is of type frozenset, which
means that this is an immutable set. We can't change the values contained within it.
It is also possible to use other formulae not included in this list, but those must be
prefixed, and the prefix that we've got to add is _xlfn. (underscore, xlfn, dot). With those
minor caveats out of the way, let's go ahead and demonstrate the use of
worksheet functions.
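The two notes above can be checked with a few lines. This is only a sketch: formula_name is a hypothetical helper introduced here for illustration, and NOT_A_REAL_FUNCTION is a made-up name used to show the prefixing path.

```python
from openpyxl.utils import FORMULAE

# FORMULAE is an immutable set of the function names openpyxl knows about.
print(type(FORMULAE).__name__)
print("SUM" in FORMULAE, "VLOOKUP" in FORMULAE)


def formula_name(name):
    """Hypothetical helper: add the _xlfn. prefix to names not in FORMULAE."""
    return name if name in FORMULAE else "_xlfn." + name


known = formula_name("SUM")                    # listed, so left unchanged
unknown = formula_name("NOT_A_REAL_FUNCTION")  # not listed, so prefixed
print(known, unknown)
```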
So we have the sheet, followed by the indexing operator with the cell address in the
form of a string; that's on the left-hand side of each of these assignments. And on the
right-hand side are the values which we had just been working with. In this way,
in cell A1, we have the value 21. There are five values, as before: 21, 11, 7,
9 and 6. Notice how useful it is to be able to access individual cells in the worksheet
using string-based addressing. Okay, let's now move to the use of the worksheet
formulae. We would now like to store the sum of these particular values.
For that, we are going to invoke the SUM worksheet function over the cell range
A1 through A5 and store this in the cell location D2. Notice how this formula for cell
D2 begins with an equals sign. This is a way for us to tell Excel that this is a
worksheet formula. In addition to this, we have a little bit of self-explanatory text
in cell C2. Notice how this text is simply the word SUM followed by a colon.
Because there's no equals sign in front of it, Excel knows that this is not a formula
and that it is simply a string literal. Okay, let's go ahead and save this workbook and
switch over to Excel and open it.
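A minimal sketch of this step follows. The values and cell addresses are the ones described above; the output file name in the commented-out save call is an assumption.

```python
from openpyxl import Workbook

workbook = Workbook()
sheet = workbook.active

# Write the five values into A1 through A5 using string-based addressing.
for row_number, value in enumerate([21, 11, 7, 9, 6], start=1):
    sheet[f"A{row_number}"] = value

sheet["C2"] = "SUM:"         # no leading "=", so this is a plain string
sheet["D2"] = "=SUM(A1:A5)"  # leading "=", so Excel treats it as a formula

# openpyxl stores the formula as text; Excel evaluates it on opening.
print(sheet["D2"].value)

# workbook.save("formulae.xlsx")  # hypothetical file name
```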
We can now see that the sum does indeed appear in cell D2, in fact, this is the sum
formula.
We can see in the formula bar up top that the SUM formula has been invoked over
cells A1 through A5. That's exactly what we had specified in our Python program. In
addition, the cell C2 contains the string SUM followed by a colon. This was a
pretty simple Hello, World demonstration of worksheet formulae. Let's exit out of
this workbook, head back into Python and add a few more.
Let's go ahead and calculate the product of these numbers by invoking the PRODUCT
function over the cell range A1 through A5.
And store the result in cell location D3. And then let's go ahead and do the same
thing with the COUNT function and store the result in cell D4.
Notice, as before, that we are intentionally counting some blank cells. So we
count the cells A1 through A9. And remember, we expect only the non-blank cells to
actually show up in the count. And the last worksheet function which we wish to
demonstrate is AVERAGE.
So we calculate the average of cells A1 through A5 and place the result in cell D5. At
this point, we are good to go. Let's save our workbook and head back over into
Excel and see what we've got there.
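The four worksheet functions can be written in one pass, as in the sketch below. Placing COUNT in D4, in line with the other results in column D, is an assumption. Because openpyxl never evaluates formulae, the sketch also cross-checks the expected results in plain Python.

```python
import math

from openpyxl import Workbook

values = [21, 11, 7, 9, 6]

workbook = Workbook()
sheet = workbook.active
for row_number, value in enumerate(values, start=1):
    sheet[f"A{row_number}"] = value

# Each formula is stored as a string; Excel computes it when the file opens.
formulae = {
    "D2": "=SUM(A1:A5)",
    "D3": "=PRODUCT(A1:A5)",
    "D4": "=COUNT(A1:A9)",   # A6 through A9 are blank and are not counted
    "D5": "=AVERAGE(A1:A5)",
}
for address, formula in formulae.items():
    sheet[address] = formula

# Plain-Python cross-check of what Excel should display.
expected = {
    "SUM": sum(values),
    "PRODUCT": math.prod(values),
    "COUNT": len(values),
    "AVERAGE": sum(values) / len(values),
}
print(expected)
```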
And indeed, we see that the SUM, PRODUCT, COUNT and AVERAGE have all been
calculated exactly as we wanted them to. Each of these appears in the form of a
worksheet function. And we can see that they also correspond to the semantics that
we obtained in Excel the first time around. For instance, the count, which was
calculated over the range A1 through A9, returns the value 5. And that's because
there are only five non-blank cells in that range.
We've now demonstrated creating data within an Excel workbook and then
calculating or invoking worksheet formulae from within openpyxl. Let's go the other
way. Let's also verify that we can use worksheet formulae
from within Excel on data which has been written from within openpyxl. So we exit
out of this workbook and back into our Python code. Let's go ahead and create a list.
Let's call that list header and it contains the values cake, quantity, price, and
revenue. We'll then go ahead and create a data variable. This is a list of lists. Each of
the constituent lists here is going to represent one row of data. Do note, however,
that our data does not have the revenue in there.
He enters the following code, code starts: data = [['Chocolate', 18, 5],
['Cheesecake', 13, 4.5], ['Tres Leches', 16, 5.5], ['Carrot', 8, 4], ['Red Velvet', 9, 4.5]].
Code ends.
So for instance, the first row contains the cake type which is chocolate. It has the
quantity which is 18 and it has the price which is 5. But we have intentionally omitted
the Revenue field. That is something which we will calculate directly from within
Excel. Now that we've created this list of lists, let's go ahead and add a new sheet
into our workbook. Let's call that sheet CakeSales.
We also specify the index within all of the sheets in the workbook where we'd like
this sheet to appear. We'd like this to show up first, in other words at index 0. Hit
Shift + Enter to run this code, and we see that we get back a handle to the new worksheet.
Remember now that we are working with an external Excel workbook, and so the
operations which we perform have side effects. Were we to run this code again,
it would attempt to create yet another CakeSales sheet. In any case, let's get a
handle to the worksheet CakeSales.
We can do this by indexing into our workbook with the name of the sheet within
square brackets and quotes. So we get a handle to the CakeSales worksheet and
store that in a variable. Now that we have the sheet, let's go ahead and write the
header and the data into it.
So we first invoke the append method to write the header. And then we use a list
comprehension to walk through each row within our data.
And at the end of this process, if we switch back to Excel, we should find that we
have the header along with all of the data, but with the Revenue column blank. So
let's go ahead and save our workbook and switch over to Excel to see if this indeed is
the case.
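Put together, the sheet creation and data writing might look like the sketch below. The capitalisation of the header labels is an assumption, and a plain for loop is used in place of the list comprehension mentioned above, since the loop runs only for its side effect of appending rows.

```python
from openpyxl import Workbook

header = ["Cake", "Quantity", "Price", "Revenue"]
data = [
    ["Chocolate", 18, 5],
    ["Cheesecake", 13, 4.5],
    ["Tres Leches", 16, 5.5],
    ["Carrot", 8, 4],
    ["Red Velvet", 9, 4.5],
]

workbook = Workbook()
workbook.create_sheet("CakeSales", 0)     # index 0 makes it the first sheet
cake_sales_sheet = workbook["CakeSales"]  # handle obtained by sheet name

cake_sales_sheet.append(header)           # row 1: the header
for row in data:                          # rows 2 to 6: one cake per row
    cake_sales_sheet.append(row)

# The Revenue column is intentionally left blank for Excel to fill in.
print(cake_sales_sheet.max_row, cake_sales_sheet["D2"].value)
```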
And lo and behold it is. We have the header information, and we have each of the
rows in our data, and the revenue column is indeed blank. Let's now go ahead and
manually calculate the revenue using worksheet functions this time from within
Excel.
So the revenue for the first row is 5 times 18 which is 90, and in the same way we
can copy this formula down and populate the revenue column for every row in our
data. We have demonstrated the use of worksheet functions from within openpyxl,
and we have also shown that it's possible to write data into an Excel workbook and
then manipulate it using worksheet functions directly from within Excel.
Let's now lead into an example where we demonstrate the use of the dollar symbol
and of absolute and relative referencing from within openpyxl. Let's pick up right
from where we left off. We ended with a calculation of the revenue as a
multiplication of the price and quantity. Now, let's examine any one of these revenue
formulae, for instance, the one in cell D5.
By the way, a handy way to examine which cells influence the result of a particular
cell is to press the function key F2 while you are within that cell. That
highlights all of the cells whose values drive, or are precedents of, the value in the
current cell. Here we can clearly see that the formula is B5*C5. The important bit to
note is that there is no dollar symbol in this formula. There is also no dollar symbol in
the formula in cell D6.
Now in case you have not worked with the dollar symbol and are not aware of what
absolute and relative referencing and addressing mean, let's just drive this home
with an example. Let's copy all of the cells D1 through D6. And then let's go ahead
and paste them to another location in our spreadsheet. So we right click and hit
Copy, scroll down and then hit Paste. When we paste these cells to the cell location
starting at D10, the first cell which contains the value, the literal value revenue, that
cell pastes just fine. However, the cells which contain the formulae, that is D11
through D15, now all contain zeros. This probably is not what we had in
mind.
Let's investigate by hitting F2, and we can see that the formula is now being
calculated as the product of B11 and C11. So what we are seeing here is relative cell
addressing at work. The original formula, the one in cell D2, was expressed as the product
of B2 and C2, and we copied this formula down into cell D11. That formula was
updated automatically by Excel so that it became the product of B11 and C11. In
this instance, this clearly is not what we had in mind. We want the formula to
continue to refer to the references B2 and C2. So let's go ahead and clear the cells
which we just pasted in there by right-clicking and hitting the Clear Contents button.
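The same relative shift can be reproduced in Python. The walkthrough does the copy and paste by hand in Excel; the sketch below instead uses openpyxl's Translator class, which is not part of the course material, to mimic what Excel does to a relative formula when it is moved from D2 to D11.

```python
from openpyxl.formula.translate import Translator

original = "=B2*C2"  # the relative formula that lives in cell D2

# Translate the formula as if it were copied from D2 and pasted into D11.
pasted = Translator(original, origin="D2").translate_formula("D11")

print(pasted)  # both row numbers move down by nine rows
```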
Now the way in which we are going to accomplish what we had in mind is to make
appropriate use of the dollar symbol. In Excel, the dollar symbol, when
applied to cell addresses, has a special meaning. It's a way of telling Excel that when
that particular formula is copied, you would like any cell address components which are
prefixed with the dollar symbol to not be updated while copying. This makes
the address an absolute cell reference rather than a relative cell reference. So we're going to
have to go back up into the formula in cell D2 and painstakingly insert dollar symbols
before each of the row and the column identifiers. Please note that the dollar symbol
applies separately to the row number and the column letter.
So if we want both the row and the column to be locked, we've got to insert
two dollar symbols, just like you see here. So the absolute address of B2 is $B$2.
Once again, this is something that we have to do for every term in the formula. So
we again painstakingly go ahead and add dollar symbols to each component of C2 as
well.
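As a small illustration of the same idea in code, openpyxl ships a helper, absolute_coordinate, that inserts the dollar symbols for us. Its use here is an addition to the walkthrough, and the variable names are hypothetical. The sketch also shows, again with the Translator class, that an absolute formula is left unchanged when it is moved.

```python
from openpyxl.formula.translate import Translator
from openpyxl.utils.cell import absolute_coordinate

quantity_ref = absolute_coordinate("B2")  # adds "$" before column and row
price_ref = absolute_coordinate("C2")
absolute_formula = f"={quantity_ref}*{price_ref}"

# Moving the formula from D2 to D11 leaves the locked references alone.
pasted = Translator(absolute_formula, origin="D2").translate_formula("D11")

print(absolute_formula, pasted)
```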
So by the time we are done with this, the absolute version of this formula is now
$B$2 multiplied by $C$2. Getting the use of the dollar symbol right is fairly tricky, and
it's also really painstaking, because the moment you add dollar symbols to your cell
references, you no longer can copy paste those formulas easily. This means that we
now have to go through each of our revenue cells and update the formulae, adding
four dollar symbols per formula. As you can see, this takes quite a bit of doing but
sooner or later we're done with it.
And at the end of it all, all of our revenue formulae, that is, those in cells D2 to D6, are
now in absolute addressing form. Now it's safe for us to go ahead and copy these
formulae. If you copy them and paste them into a different portion of the sheet, the
formulae continue to refer to the contents of cells B2 and C2 respectively, and in this
way we have successfully made use of absolute cell referencing from within
Microsoft Excel. Let's take this one step further and see how this can be done from
Python using openpyxl.
Let's go ahead and exit out of Microsoft Excel and switch back over to Python. This is the
same Jupyter notebook we had been working with previously. And for that reason,
we already have references to the workbook as well as to the active sheet.
The variable that refers to the active sheet is cake_sales_sheet. Because we already have a handle
to it, we need not re-initialize it. Let's just go ahead and get the maximum row in the
sheet.
For this, we make use of the max_row property of the worksheet object. This by
default is going to be a numeric value. We cast it to a string by enclosing it within
a call to the str function. This is a built-in Python function. We can print it out to
screen and ensure that this is indeed a variable of type string. We know because the
digit 6 is enclosed within a pair of single quotes. Now let's go ahead and
programmatically recreate the formulae which calculated the revenue as the
multiplication of the price and the quantity.
He enters the following code, code starts: for row in cake_sales_sheet ['D2:D' +
max_row_str]: for cell in row: cell.value = '=$B${0}*$C${0}'.format(cell.row). Code
ends.
The interesting bit is that this programmatic recreation is going to make use of the
dollar symbol. This is a great way to save effort. Because as you could see a moment
ago, adding those dollar symbols ahead of every row and every column just really got
to be a lot of work, in addition to being notoriously error prone. Because we already
had the max_row_str which contained the string 6, we can now easily iterate over all
of the cells between D2 and D6. This is the outer for loop. The loop variable here is
called row. We then go ahead and manually iterate over each cell in each row.
This is the inner for loop. And we set the value of the cell to be a formula. That
formula contains dollar symbols ahead of B as well as ahead of C. Those are the
column identifiers. But in addition, there are two more dollar symbols ahead of the
cell row number. And this is where we splice in the row number using the format
method. In this way, we are constructing a nice, absolute formula string, and
setting each of the cells D2, D3, D4, D5 and D6 to a formula that uses absolute cell references. While
we're at it, for good measure, let's also calculate the total cake sales. This we do in
the cell location at max_row + 2.
So this is going to be in cell C8. And right next to it in cell D8, we have a formula for
the actual total sales, the sum of all of the revenues that we've calculated right up
above.
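The whole sequence, from max_row through the revenue formulae to the total, can be sketched as below. The sheet is rebuilt in memory so that the example is self-contained; the header labels and the label text placed in C8 are assumptions.

```python
from openpyxl import Workbook

workbook = Workbook()
cake_sales_sheet = workbook.active
cake_sales_sheet.append(["Cake", "Quantity", "Price", "Revenue"])
for row in [["Chocolate", 18, 5], ["Cheesecake", 13, 4.5],
            ["Tres Leches", 16, 5.5], ["Carrot", 8, 4],
            ["Red Velvet", 9, 4.5]]:
    cake_sales_sheet.append(row)

max_row = cake_sales_sheet.max_row  # numeric: 6 (header plus five rows)
max_row_str = str(max_row)          # string form for building addresses

# Outer loop: each row of the range D2:D6; inner loop: each cell in it.
for row in cake_sales_sheet["D2:D" + max_row_str]:
    for cell in row:
        cell.value = "=$B${0}*$C${0}".format(cell.row)

# Total two rows below the data: label in C8, formula in D8.
total_row = str(max_row + 2)
cake_sales_sheet["C" + total_row] = "Total Cake Sales:"  # assumed label
cake_sales_sheet["D" + total_row] = "=SUM(D2:D" + max_row_str + ")"
```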
We are now ready to save this workbook so that all of these changes are reflected to
the Excel file on disk. Let's open up the workbook and check whether everything
looks as expected.
Let's place our cursor in cell D2 and hit the F2 key. We can see
immediately that the formula there now contains absolute cell references. We
have been successful in adding the dollar symbols to B2 as well as C2. We
can quickly check the other cells as well, and it is indeed correct. Each one of the
revenue calculations now contains only absolute cell references. Finally, let's place
our cursor in cell D8. And we can see that it has indeed picked up correctly the sum
of all of the revenues in the rows up above. Let's just add a little finishing touch here.
Let's set the formatting of all of these cells so that it contains a dollar sign ahead of it.
He enters the following code, code starts: for row in cake_sales_sheet['C2:D' +
max_row_str]: for cell in row: cell.number_format = '$#,##0.00'. Code ends.
So let's exit out of this workbook, switch back to Python, and go ahead and iterate
over all of the cells containing dollar amounts. So that is all of the cells in the range
from C2 to D6. And go ahead and specify that the number format for that cell should
include a dollar sign up front. It will also include two decimal places for the cents at the end. In
addition, there will also be comma separators.
Note how the hash symbols are used to represent digits which may or may not
exist, while zeros are used to represent digits which we absolutely want
displayed on screen. So this is our way of saying that we definitely want a dollar
symbol, comma separators, as well as two digits after the decimal point. Let's also go
ahead and apply the same number formatting to cell D8.
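A short sketch of the formatting step follows. It uses a cut-down two-row sheet, so the cell ranges differ from the C2:D6 and D8 of the walkthrough. The last lines are a rough Python analogue of the format code, added only to illustrate what the hash symbols, comma and zeros ask for.

```python
from openpyxl import Workbook

workbook = Workbook()
sheet = workbook.active
sheet.append(["Cake", "Quantity", "Price", "Revenue"])
sheet.append(["Chocolate", 18, 5, "=$B$2*$C$2"])
sheet.append(["Cheesecake", 13, 4.5, "=$B$3*$C$3"])
sheet["D5"] = "=SUM(D2:D3)"

currency_format = "$#,##0.00"  # dollar sign, comma separators, two decimals

# Apply the format to every cell that holds a dollar amount.
for row in sheet["C2:D3"]:
    for cell in row:
        cell.number_format = currency_format
sheet["D5"].number_format = currency_format

# Rough Python analogue of how Excel would render 1234.5 with this format.
preview = f"${1234.5:,.2f}"
print(preview)
```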
This shows, by the way, how painstaking it is to manually create an Excel workbook.
One has to keep track of all of the different cells that one is reading from or writing
to, and make all of these changes one by one. In any case, we are now done with
the formatting.
Let's save the workbook, switch over back to Excel, and confirm that the formatting
appears as we wanted it to. And indeed it does. So we have successfully
demonstrated the use of absolute and relative references in formulae, as well as of some custom
number formatting from openpyxl in Python.
Let's continue working with named ranges and with some advanced features of Excel
formulae. So let's pick up: we have here a simple spreadsheet which has two tabs,
one called Products and the other called Data. In the Data tab, we have a named
range called fx_rates. If we choose this named range from the cell address bar, over
on the top left, the corresponding cells within that named range are highlighted. And
those are the cells N3 to O11. That data has two columns. The first column
represents a currency, and the second represents the spot rate against the US dollar.
Likewise, let's go ahead and create a new named range in the other tab of our
spreadsheet, that's the Products tab. So let's navigate over. Let's select the data
which we would like to combine into our named range. And over in the address bar on
the top left, we type out the name of this named range, which in this case is
products.
A table is displayed with five columns and six rows. The column headers are
Product, USD, EUR, INR, and JPY. The first product name called H37 Vacuum
Cleaner is located in cell A3.
So in this way, we have assigned the named range products to the cell range A3 to B8
in the Products tab of our spreadsheet. We now have two named ranges, fx_rates
and products, and selecting either of these causes the corresponding cells to be
highlighted.
Okay, let's now go ahead and make use of this named range. One of the great
advantages of named ranges is that we can pass them in as absolute cell ranges into
formulae such as the VLOOKUP function. So let's say, for instance, that we'd like to
calculate the price in Euros of each of these products. We would multiply the price in
USD by the corresponding US dollar to Euro exchange rate.
And how would we get that exchange rate? Well, that's simple. We would perform a
VLOOKUP on the fx_rates named range, which we have over on the other tab.
And that's precisely what you see on screen now. The price in US dollars is in
cell B3. We multiply that by the spot exchange rate between the USD and the Euro.
And that is in turn the result of the VLOOKUP formula.
Now the first input argument into the VLOOKUP function is the name of the currency
which we wish to look up. Here, that is the string EUR, for Euro. The second is the
cell range to be looked up, which is our named range fx_rates. The third argument is
the column number, that is 2. And the last argument, which is almost always
FALSE in the case of VLOOKUP, simply asks for an exact match.
So this formula tells us that the price in Euros of the H37 vacuum cleaner is 227.7
Euros, which corresponds to a price in US dollars of $249. And, of course, we would
go ahead and copy this formula after locking the cell range corresponding to the
currency name. This is something which we will go ahead and recreate in Python.
So let's exit out of Excel, switch back to Python and begin by importing openpyxl as
usual.
As always, the first order of business is to load the workbook that we wish to work
with. We've got to pass in the input file path.
So this is datasets/products.xlsx. And once we've done that, we need a handle to the
worksheet. Here, the worksheet we're interested in is called Products. So we index
into the workbook object with the name of the worksheet, Products, and store the
reference in a variable called sheet. Now the beauty of named ranges is they are
effectively addressable locations on a worksheet, which means that we can get a
named range by using the defined_names property of our workbook.
So we index into the list of all of the defined names in our workbook. We pass in the
name of the named range, which is fx_rates, and we store the corresponding object
in a variable called fx_range.
If we print this object out to screen, we can see that it has a bunch of interesting
properties. If you've worked with VBA, all of this ought to seem pretty familiar. So the
name is fx_rates. There is no comment associated with this named range. There is
no custom menu, shortcut, description or help item. Notice, however, that
we can tell what underlying cell locations this named range refers to. This is in the
field called attr_text. And that particular attribute has the exact cell location: the
worksheet name, Data, and cells N3 to O11.
So if, for some reason, you need to figure out the exact underlying cell locations
corresponding to a named range, there is a way to figure that out by parsing this
particular attribute. Now, a named range is nothing but a collection of individual
cells. And those individual cells can be obtained by iterating over an
iterable called destinations. So if we print out fx_range.destinations, we see that we
get a generator object.
He enters the following code: fx_range.destinations. He runs the code. The output
reads: <generator object DefinedName.destinations at 0x11406a570>.
Let's now go ahead and access each of the individual cells within that generator
object, using this for loop. So notice how we are iterating over the contents of
fx_range.destinations.
He enters the following code, code starts: cells = [ ] for title, coord in
fx_range.destinations : ws = work_book[title] cells.append(ws[coord]) cells. Code
ends. He runs the code. The output displays contents of the fx_range.
Each value in this iterable is a tuple consisting of the worksheet title, as well as the
coordinates within that particular worksheet. So we get the title as well as the coords
and then index into the workbook using the title, store the resulting worksheet in a
variable called ws. And then go ahead and access the underlying cells by indexing into
that worksheet using the coord variable. And this finally gives us the individual
cells, which we append to our cells list. And then we print out the value of the cells
list, and we can see that it consists of individual cells, each of which is in a
worksheet called Data, and the corresponding addresses or coord values range from
N3 down to O11. And in this way, we have created a list which has each of the
individual cells in this named range.
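The lookup and iteration can be tried without the products.xlsx file by defining the name in memory, as in the sketch below. Several things here are assumptions: the workbook is built from scratch rather than loaded, the try/except covers the fact that, to the best of my knowledge, the way to register a defined name differs between openpyxl releases, and the inner loop flattens the range into individual cells rather than appending the whole block as the transcript's code does.

```python
from openpyxl import Workbook
from openpyxl.workbook.defined_name import DefinedName

# Assumption: an in-memory stand-in for the Data tab of products.xlsx.
work_book = Workbook()
data_sheet = work_book.active
data_sheet.title = "Data"

fx_name = DefinedName("fx_rates", attr_text="Data!$N$3:$O$11")
try:
    work_book.defined_names.add(fx_name)     # newer openpyxl releases
except AttributeError:
    work_book.defined_names.append(fx_name)  # older openpyxl releases

fx_range = work_book.defined_names["fx_rates"]
print(fx_range.attr_text)  # sheet name plus the absolute cell range

cells = []
for title, coord in fx_range.destinations:  # (sheet title, range) tuples
    ws = work_book[title]
    for row in ws[coord]:   # a range yields rows of cells
        cells.extend(row)   # flatten into individual cells

print(len(cells), cells[0].coordinate, cells[-1].coordinate)
```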
Now remember that our end objective is to populate this named range with the
prices of all of these products in currencies other than US dollar. So first let's find the
maximum row number using the max_row property of our sheet, and we see that it is
row number 8.
At this point the code is going to get a little involved. We are going to have a series of
nested for loops. Each of these nested for loops is going to calculate the prices for
one currency.
He enters the following code, code starts: for row in sheet['C3:C' + max_row_str]:
for cell in row: cell.value = '=$B${0}*VLOOKUP($C$2, fx_rates, 2,
FALSE)'.format(cell.row) cell.number_format = '#,##0.00'. Code ends.
For instance on screen now, you see a nested for loop which is going to populate the
data range with prices in euros. Let's walk through this code carefully.
After we are done with this, we will have similar for loops for columns D, E, and so on
for the currencies other than euro. Once we understand this for column C, which has
the prices in euros, the other for loops ought to make sense as well. We will now go
ahead and iterate over the sheet for all the rows and all the cells that we care about.
And for each of these cells we are going to explicitly place in there, the VLOOKUP.
This is the same VLOOKUP formula, which we had used in Excel a moment ago. This
code is fairly complex, so let's make sure that we understand it.
The outer for loop is iterating over each row. Notice how this starts from the cell
location C3 and ends with cell location C8. So this row variable is going to consist of
the rows 3 to 8. The inner for loop consists of every cell within the corresponding
row.
Now the value of every cell is going to be obtained by multiplying the price in US
dollars, which is always going to be in column B, by the corresponding
currency conversion factor. Now notice how we carefully construct the first part of
the formula, which has the price in US dollars as the cell location $B followed by $,
followed by the cell's row number. Also, notice how the VLOOKUP is always
performed on the currency name which is in cell C2; that's the string EUR. And we are
going to look up the corresponding USD rate in the named range fx_rates,
in column number 2, and with an exact match; that's why the last argument is
FALSE. And finally, once we've calculated the corresponding price in euros and stored
it in cell.value, we will also nicely format the cell using cell.number_format in order
to display it with two decimal places. Now, this gets us to the end of the
calculation of the prices in euros. Let's go ahead and repeat very similar code for INR
and JPY.
He enters the following code, code starts: for row in sheet['D3:D' + max_row_str]:
for cell in row: cell.value = '=$B${0}*VLOOKUP($D$2, fx_rates, 2,
FALSE)'.format(cell.row) cell.number_format = '#,##0.00'. Code ends.
Notice that the code is virtually identical. All that changes is the column that we are
operating on. So for the euro prices the column was C, for INR it's D, and for
Japanese Yen it is E.
He enters the following code, code starts: for row in sheet['E3:E' + max_row_str]:
for cell in row: cell.value = '=$B${0}*VLOOKUP($E$2, fx_rates, 2,
FALSE)'.format(cell.row) cell.number_format = '#,##0.00'. Code ends.
Now it's not great practice to copy and paste code like this. But here, what I'd like to get
across to you is the manner in which we are constructing the formulae. And I
thought it would just get a little too complicated if we tried to generalize the currency
code as well. By this point, we are finally done with all of our calculations. So we can
save our workbook and toggle back over to Excel.
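For readers who do want the generalized version, one possible sketch is shown below. It is not the course's code: it folds the three copied loops into a single loop over the column letters. The sheet is an in-memory stand-in for the Products tab, and every product row except the H37 Vacuum Cleaner is a made-up placeholder. The fx_rates named range is not defined in this stand-in, so the formulae are only stored as strings here.

```python
from openpyxl import Workbook

# Assumption: an in-memory stand-in for the Products tab.
work_book = Workbook()
sheet = work_book.active
sheet.title = "Products"
for address, label in zip(["A2", "B2", "C2", "D2", "E2"],
                          ["Product", "USD", "EUR", "INR", "JPY"]):
    sheet[address] = label
sheet["A3"] = "H37 Vacuum Cleaner"
sheet["B3"] = 249
for row_number in range(4, 9):
    sheet[f"A{row_number}"] = f"Placeholder product {row_number}"  # made up
    sheet[f"B{row_number}"] = 100                                  # made up

max_row_str = str(sheet.max_row)

# One pass per currency column; the currency code sits in row 2 of each.
for column in ("C", "D", "E"):
    for row in sheet[f"{column}3:{column}{max_row_str}"]:
        for cell in row:
            cell.value = "=$B${0}*VLOOKUP(${1}$2, fx_rates, 2, FALSE)".format(
                cell.row, column)
            cell.number_format = "#,##0.00"
```

The second placeholder in the format string carries the column letter, which is the only thing that changed between the three copied loops.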
And we can see at this point that all of the currency conversions have indeed been
performed correctly. We can examine the individual calculations by clicking into the
cells. And we can see that in each case, a multiplication is being carried out. The USD
price is being multiplied by the corresponding FX rate. That FX rate is obtained as a
result of a VLOOKUP. The VLOOKUP is done using the name of the currency which is
always in row 2. So the euro is in C2, INR is in D2 and JPY is in E2. And the end result
is that we multiply the price in US dollars by the corresponding spot rate.
That spot rate was from the named range fx_rates. Now notice that in this Excel
workbook which we have created from within Python, we have yet to create the
named range called products. That is something which we had done in the manually
created Excel at the start of the example. Let's go ahead and complete that from
within Python. So we switch back from Excel to Python and go ahead and create a
named range in our workbook using the work_book.create_named_range method.
The first input argument is the name of the named range, which is products. The
second corresponds to the sheet on which we want this named range to reside. And
the third has the absolute cell locations. Note how we've got to encode these as
absolute cell references. So that is A3 to B8. Notice also how we've got to add the
dollar symbols in there. This has to do with absolute and relative referencing in Excel.
And now if you go ahead and save this workbook and then toggle back into Excel, we
can find that there are now two named ranges.
So in addition to the previously created fx_rates named range, there is also the new
products named range in cells A3 to B8, exactly as we had expected. We have
successfully demonstrated working with named ranges from openpyxl, using named
ranges in VLOOKUP formulae in Excel as well as in Python, and finally, creating
new named ranges from within Python.
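The dollar-symbol encoding of the range can be produced programmatically. Here is a minimal sketch in plain Python, assuming simple references such as A3 or A3:B8 with no sheet prefix. The helper name to_absolute is hypothetical and is not part of openpyxl.

```python
import re


def to_absolute(cell_range: str) -> str:
    """Convert a range such as 'A3:B8' to its absolute form '$A$3:$B$8'."""

    def lock(cell: str) -> str:
        # Accept references with or without existing dollar symbols.
        match = re.fullmatch(r"\$?([A-Za-z]+)\$?(\d+)", cell)
        if match is None:
            raise ValueError(f"not a cell reference: {cell!r}")
        column, row = match.groups()
        return f"${column.upper()}${row}"

    return ":".join(lock(cell) for cell in cell_range.split(":"))


products_ref = to_absolute("A3:B8")
print(products_ref)  # $A$3:$B$8
```

The resulting string is what would be passed as the third argument when creating the products named range.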
Let's now move on to a slightly more substantial example, in which we'll make use of
pivoting functionality from within Microsoft Excel. Now, in most of the examples in
this set of courses, we have relied on the openpyxl library. But it turns out that when
it comes to pivoting, there really is no better tool than Pandas. So that's what we'll
be using in this particular example. So let's plunge right in. We are going to make use
of some dummy e-commerce data. The source for this data set is the URL that you
see on screen now.
The data is in tabular format, and it includes region, country, item type, sales
channel, and a host of other information coming down to and including revenue, unit
sales, and profits. Let's go ahead and explore this data from within Microsoft Excel.
This has become really easy to do, in particular because recent versions of Excel
actually suggest pivot tables which help us to slice our data. In order to use this
functionality, we simply click on the Insert menu item up top, that in turn gives us a
host of possible options. The second option, from the left, is called Recommended
Pivot Tables. Let's just go ahead and click on that and see what Microsoft Excel gives
us. And when we click this, we see that Excel automates the process of pivot table
creation.
On the right side of the window, a PivotTable Fields pane is displayed. It is divided
into five sections: FIELD NAME, Filters, Columns, Rows, and Values.
So we are no longer prompted for whether we'd like to insert a new worksheet or for
the choice of the rows and the columns. Microsoft has just gone ahead and inserted
a new sheet, it's called Sheet2. And within the sheet, it has gone ahead and created a
placeholder pivot table. Of course, we can go ahead and play around and change the
configuration of this pivot table if it does not meet our requirements. The default
pivot table has product names as the row labels and regions as the columns. We can
change that by getting rid of the regions and replacing them with the count of the
countries.
He right-clicks the Columns section from within the PivotTable Fields pane and
selects the Remove Field option from the shortcut menu. The Row Labels and Count
of Country columns are displayed in the sheet.
This gives us a very different pivot table in which each row tells us how many
countries recorded sales for that particular item. So for instance, in this data set, we
can see that Baby Food was sold in seven countries and Beverages were sold in eight
countries. And if we sum up the total country counts across all of the product heads,
we see that the grand total is 100. Pivot tables are an extremely powerful feature of
Excel. So you should be sure to explore this functionality if you have not worked with
them a whole lot. For instance, if we'd like to change the configuration of our pivot
table, we can simply go ahead and remove a field.
He right-clicks the Values section and selects the Remove Field option from the
shortcut menu. The Row Labels column is displayed in the sheet.
And that will get rid of the country count. We can then also go ahead and remove
the last remaining row field.
He right-clicks the Rows section and selects the Remove Field option from the
shortcut menu.
And that gives us a completely empty pivot table. In the old days, if you pivoted your
data, you'd be prompted for whether you'd like a new sheet. And if you hit yes, you
would get to exactly this point. That is, to an empty pivot table where you then have
to pick the fields that you wanted as the rows, the columns, and as the individual cell
values. Let's rebuild our pivot table using the menus over on the right.
He selects the Country checkbox in the FIELD NAME section. The field name
"Country" now appears in the Rows section in the PivotTable Fields pane. The Row
Labels column is displayed in the sheet.
Let's first select the country as our field name. This causes the row labels to appear,
this is in column A, and we now have one row for each country in our e-commerce
data. Next, let's go ahead and scroll through all of the possible fields, again, on the
right, and select the number of units sold.
He selects the Units_Sold checkbox in the FIELD NAME section. The field name
"Units_Sold" now appears in the Values section in the PivotTable Fields pane. The
Sum of Units_Sold column is displayed in the sheet.
This causes, by default, the sum of that particular field, that is, the sum of the units
sold, to appear as a column in our pivot table. Just a couple of quick points here. The
first is that we use an aggregation function, here the sum, and that can be
changed to the min, max, count, or other aggregates. Another point worth
noting is the similarity between a pivot table and a GROUP BY query in SQL. In
fact, the two are semantically very similar. Now, in exactly the same manner,
let's go ahead and add yet another column, this one for the total revenue. Once
again, this represents a group by, so we're going to get the total revenue per
country.
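The similarity between a pivot table and a group-by can be seen in pandas, the library used for the Python half of this example. The sketch below is illustrative only: the rows are made up and merely stand in for the real data set, and the string "sum" plays the role of the Sum aggregation in the Excel pivot table.

```python
import pandas as pd

# Made-up rows standing in for the real e-commerce data.
sales_df = pd.DataFrame({
    "Country": ["Myanmar", "Zambia", "Myanmar", "Kenya"],
    "Units_Sold": [100, 4085, 50, 300],
    "Total_Revenue": [1000, 40850, 750, 3000],
})

value_columns = ["Units_Sold", "Total_Revenue"]

# Group-by: one row per country, each value column summed.
grouped = sales_df.groupby("Country")[value_columns].sum()

# Pivot table: Country as the row field, sum as the aggregation.
pivoted = pd.pivot_table(
    sales_df, index="Country", values=value_columns, aggfunc="sum"
)

print(grouped)
print(pivoted[value_columns])  # same numbers, columns put in the same order
```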
He selects the Total_Revenue checkbox in the FIELD NAME section. The field name
"Total_Revenue" now appears in the Values section in the PivotTable Fields pane.
The Sum of Total_Revenue column is displayed in the sheet.
And let's add one last column, this one to represent the total profit.
He selects the Total_Profit checkbox in the FIELD NAME section. The field name
"Total_Profit" now appears in the Values section in the PivotTable Fields pane. The
Sum of Total_Profit column is displayed in the sheet.
So at the end of this process, our pivot table has rows corresponding to each
country. And the corresponding values for each row include the sum of the units
sold, the total revenue, and the total profit in that country. Now, since we are going
to go ahead and recreate this entire pivot table using Python, let's just remember
some of these values to serve as sanity checks. So for instance, the total number of
units sold in Zambia is 4,085. We'll use this to verify the correctness of our Python
calculations later on in this example. We've now done enough in Microsoft Excel, so
let's exit out of this workbook, we don't particularly need to save this pivot table.
And let's switch back over into Python.
The first library that we are going to import now is pandas. Pandas, of
course, is an extremely powerful and popular library for working with tabular
data, that is, data that's organized in rows and columns. And in fact, pandas even has
helper methods for reading in Excel files.
So let's go ahead and use the pandas.read_excel function, passing in the path to
our dataset. The result is, of course, going to be a pandas
data frame. We save that in a variable called sales_df, where df stands for data
frame. Let's go ahead and use the head method. This is a quick way to check out
the first few rows in a pandas data frame.
He enters the following code: sales_df.head(). He runs the code. The output
displays the data in a tabular format.
You can see here that we have headers, and the header names correspond to the first
row of the data in our Excel spreadsheet. That's because pandas was smart enough
to figure out the column names. It is also possible to control all of this with greater
granularity, but for now this looks just fine. Once we've sanity checked that our data
has been read in correctly, let's go ahead and try another experiment.
Let's sort our data frame based on a set of columns. This is very easy to do in
pandas. We simply invoke the sort_values method on our sales data frame, and we
pass in a list of columns that we wish to sort on. To do this, we specify the value of
the by parameter, and that's a list with the column names for region, country, and item
type. What's more, after sorting our data frame by these three columns, we go
ahead and invoke the head function. And this time, we'd like to examine the top ten
rows. Because the head function is being invoked on the sorted version, the top ten
rows are going to represent the output, sorted by those three columns.
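The sort described here can be sketched as follows. This is a minimal sketch on a small made-up data frame, not the real data set: the column names Region, Country, and Item_Type are assumptions about how the headers are spelled, and only the Zambia figure of 4,085 units is taken from the example.

```python
import pandas as pd

# Made-up rows standing in for the data read with pandas.read_excel.
sales_df = pd.DataFrame({
    "Region": ["Sub-Saharan Africa", "Asia", "Asia", "Sub-Saharan Africa", "Asia"],
    "Country": ["Zambia", "Myanmar", "Bangladesh", "Kenya", "Myanmar"],
    "Item_Type": ["Snacks", "Clothes", "Snacks", "Fruits", "Baby Food"],
    "Units_Sold": [4085, 100, 250, 300, 50],
})

# Sort by region first, then country within region, then item type.
sorted_df = sales_df.sort_values(by=["Region", "Country", "Item_Type"])

# head(10) shows at most ten rows; this sample has only five.
print(sorted_df.head(10))
```

Note that sort_values returns a new data frame and leaves sales_df itself unchanged.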
Now, the last region in alphabetical order is Sub-Saharan Africa, within which we
have the countries, again sorted in alphabetical order. So at the very end we see
Zambia, that's the last row, and the last item type there is Snacks. And here's our first
sanity check. You can see that the number of units sold there is 4,085. That's the
number which we had noted to ourselves when we constructed our pivot table
in Excel. So far, we've just used pandas to read in the raw data. Let's now go ahead
and actually perform some pivoting operations on it.
In order to do so, we need to write just a little more code. Let's also import numpy,
which is a handy library to have, and then invoke the pivot_table function on our
data frame. So, we invoke pd.pivot_table, followed by various input arguments.
The first of these is the data frame, sales_df, which we would actually like to pivot.
Then we've got to include a list of columns that we want to index on; those
are the region and the country. So effectively, the region and the country will
become the rows in our data. Each row will consist of the region, the country, and
then a set of values. What values? Well, that's what we pass in using the values
parameter, and that once again is a list of column names. The column names are
Units_Sold, Total_Revenue, and Total_Profit.
So note how we have invoked the pandas pivot_table function with the index
argument pointing to a list, region and country, and the values argument pointing to
units sold, total revenue, and total profit. All five of these strings correspond to column names in
our original pandas data frame sales_df. But wait, there's one last important
argument for us to understand, and that is the aggregation function. Here, we've
used np.sum, that's numpy.sum, as the aggregation function, which we would like
applied to the group of values that feeds into each individual cell.
This corresponds to the aggregation function sum that we had in the Excel pivot
table, and this effectively represents the use of a function object. Once this is done,
we store the result in a variable called table. And we invoke the head command to
check out the first 10 rows. Let's go ahead and hit Shift+Enter.
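A call of this shape might look like the sketch below. It runs on a small made-up data frame rather than the real data set, and it passes the string "sum" where the narration uses np.sum, since recent pandas versions recommend the string form. Only the Zambia figure of 4,085 units comes from the example; every other number is illustrative.

```python
import pandas as pd

# Made-up rows standing in for sales_df.
sales_df = pd.DataFrame({
    "Region": ["Asia", "Asia", "Asia", "Sub-Saharan Africa", "Sub-Saharan Africa"],
    "Country": ["Myanmar", "Bangladesh", "Myanmar", "Zambia", "Kenya"],
    "Units_Sold": [100, 250, 50, 4085, 300],
    "Total_Revenue": [1000, 2500, 750, 40850, 3000],
    "Total_Profit": [200, 500, 150, 8170, 600],
})

table = pd.pivot_table(
    sales_df,
    index=["Region", "Country"],  # these become the row labels
    values=["Units_Sold", "Total_Revenue", "Total_Profit"],
    aggfunc="sum",  # same aggregation as Sum in the Excel pivot table
)

print(table.head(10))
```

The two Myanmar rows collapse into a single row whose values are the sums of the originals.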
And we can see that the output of this command looks very, very similar to our Excel
pivot table. Notice how the first index column is the Region, and the second is the
Country. So these correspond to the two fields that we had in our index argument
to the pandas pivot_table function. Those were the row names. The columns
correspond to the total profit, the total revenue, and the units sold. And each cell
value represents an aggregation, that aggregation in this particular pivot table is the
sum. This is an exact replica created in pandas of the pivot table we had in Excel.
Now, we can also perform various operations on this resulting pivot table. Let's say
for instance that we were only interested in the rows for region equal to Asia.
If you've worked with pandas before, you might recall the significance of the loc attribute
of a data frame. This references the index labels, or in other words, the values
which go into the row names. Here, we have two index levels, region and country.
So we filter for the first of those index levels to be Asia. And the second
is anything at all. So for the second we simply have a placeholder, represented by the
colon symbol. So this command is going to return a subsection of the original pivot
table, where we have the data for all countries within Asia.
Notice how pandas is smart enough to figure out that we no longer need two index
levels, because the first of these, Asia, is shared by all of the countries in this data. So
this represents a nice subsection of the original pivot table. We can also filter on a
combination of country and region.
So for instance, to get the row corresponding to the country Myanmar in the
region Asia, we simply specify the values of both index levels.
The result is no longer a pivot table; rather, it's simply one row of a pivot table,
representing the total profit, total revenue, and units sold for Myanmar.
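Both kinds of lookup can be sketched on a small stand-in for the pivot table. The table below is built by hand with made-up numbers, and it uses table.loc["Asia"], which selects on the first index level only, as a simple way of expressing "Asia, then any country".

```python
import pandas as pd

# Hand-built stand-in for the pivot table, indexed by Region and Country.
index = pd.MultiIndex.from_tuples(
    [
        ("Asia", "Bangladesh"),
        ("Asia", "Myanmar"),
        ("Sub-Saharan Africa", "Kenya"),
        ("Sub-Saharan Africa", "Zambia"),
    ],
    names=["Region", "Country"],
)
table = pd.DataFrame(
    {
        "Total_Profit": [500, 350, 600, 8170],
        "Total_Revenue": [2500, 1750, 3000, 40850],
        "Units_Sold": [250, 150, 300, 4085],
    },
    index=index,
)

# All countries in Asia: the Region level is dropped from the result.
asia = table.loc["Asia"]
print(asia)

# One specific row: supply both index levels as a tuple.
myanmar = table.loc[("Asia", "Myanmar")]
print(myanmar)
```

The first result is a data frame indexed by country alone, while the second is a single row returned as a pandas Series.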
If we would like to examine all of the regions, or the row names, for all rows in our
original table, that is simple enough as well. We simply use the get_level_values
method on the index property of our data frame. Now remember that this index
has two levels, region and country. Because we only care about the first
of these, that's the region, we've got to pass in zero. And the output of this
command is simply going to represent the region field of every row in our original
pivot table.
If you wanted only the unique row names, well that is easy enough as well. We
simply invoke the unique method on this list which we just got.
That's what's been done on screen now. We have a variable called unique regions.
This ends up with a list of all of the unique values within the level values of the first
field of the index. And we can see that this still retains the metadata.
So this command knows that it's talking about the index column with the name
region, and the values include Asia, Australia and Oceania, Central America, right
down through Sub-Saharan Africa. If we now wanted to get a list of unique countries,
we could simply tweak the command we ran a moment ago.
This time we pass in one rather than zero to get_level_values, again
followed by the unique command. And once again, remember that this is
because our table, pivot table, had two index columns, the first of which was the
region and the second of which was the country.
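These index operations can be sketched on a small hand-built stand-in for the pivot table. The variable names unique_regions and unique_countries are assumed spellings of the names spoken in the narration, and the data is made up.

```python
import pandas as pd

# Hand-built stand-in for the pivot table, indexed by Region and Country.
index = pd.MultiIndex.from_tuples(
    [
        ("Asia", "Bangladesh"),
        ("Asia", "Myanmar"),
        ("Sub-Saharan Africa", "Kenya"),
        ("Sub-Saharan Africa", "Zambia"),
    ],
    names=["Region", "Country"],
)
table = pd.DataFrame({"Units_Sold": [250, 150, 300, 4085]}, index=index)

# Level 0 is Region: one entry per row, so values repeat.
regions = table.index.get_level_values(0)

# unique() drops the repeats but keeps the level name as metadata.
unique_regions = regions.unique()

# Level 1 is Country.
unique_countries = table.index.get_level_values(1).unique()

print(unique_regions)
print(unique_countries)
```

Both results are pandas Index objects rather than plain lists, which is why the level name survives.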
The values within the unique countries variable are taken in order so they start with
Bangladesh and end with Zambia. Next, we're going to do something interesting. We
are going to iterate over each region in our unique regions variable, and for each of
those regions, we are going to print out a pivot table. That pivot table will include all
of the countries in that region as rows, and will include the total profits, total
revenues, and units sold as the cell values.
Now that sounds complicated, but it takes exactly two lines of code to accomplish it.
We use a list comprehension to walk over the regions and then print the table.loc for
that particular region.
He runs the code. The output displays the pivot table for each region.
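A per-region walk of this kind might look like the following sketch. It uses a plain for loop instead of the list comprehension mentioned in the narration, since the loop is run only for its printing side effect, and it works on a small hand-built stand-in for the pivot table with made-up numbers.

```python
import pandas as pd

# Hand-built stand-in for the pivot table, indexed by Region and Country.
index = pd.MultiIndex.from_tuples(
    [
        ("Asia", "Bangladesh"),
        ("Asia", "Myanmar"),
        ("Sub-Saharan Africa", "Kenya"),
        ("Sub-Saharan Africa", "Zambia"),
    ],
    names=["Region", "Country"],
)
table = pd.DataFrame(
    {"Total_Profit": [500, 350, 600, 8170], "Units_Sold": [250, 150, 300, 4085]},
    index=index,
)

unique_regions = table.index.get_level_values(0).unique()

# Print one sub-table per region, with the countries as rows.
for region in unique_regions:
    print(region)
    print(table.loc[region])

# The same lookups, kept in a dictionary for later use.
per_region = {region: table.loc[region] for region in unique_regions}
```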
And you can see one by one that we have pivot tables one for each region. The first
is for all of the countries in Asia. And as we scroll down, the last of these is for Sub-
Saharan Africa. And now that we've accomplished all of this right here within Python,
let's go ahead and write it all out to Excel. The way we're going to do this is to
have one umbrella workbook.
Within that workbook, we are going to have one tab for each region. And then within
that tab, we're going to have rows for each country. And that is exactly what's
implemented on screen. We use the pandas.ExcelWriter class and then
iterate over the regions and write out to the Excel workbook the corresponding pivot
table using the to_excel method on our temp data frame. At the end of this for loop,
we've got to remember to actually save our values out to file. That's done using the
writer.save method down below.
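The one-tab-per-region export can be sketched as below. This is a sketch under stated assumptions, not the on-screen code: it needs an Excel engine such as openpyxl to be installed, the file name sales_pivot_tables.xlsx is hypothetical, and the table is a small hand-built stand-in. It uses ExcelWriter as a context manager, which saves the file when the block exits; pandas 2.0 and later no longer provide writer.save.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hand-built stand-in for the pivot table, indexed by Region and Country.
index = pd.MultiIndex.from_tuples(
    [
        ("Asia", "Bangladesh"),
        ("Asia", "Myanmar"),
        ("Sub-Saharan Africa", "Kenya"),
        ("Sub-Saharan Africa", "Zambia"),
    ],
    names=["Region", "Country"],
)
table = pd.DataFrame({"Units_Sold": [250, 150, 300, 4085]}, index=index)

# Hypothetical output location inside a fresh temporary directory.
output_path = Path(tempfile.mkdtemp()) / "sales_pivot_tables.xlsx"

# Leaving the with-block saves and closes the workbook.
with pd.ExcelWriter(output_path) as writer:
    for region in table.index.get_level_values(0).unique():
        temp_df = table.loc[region]
        # Excel limits sheet names to 31 characters, so truncate long ones.
        temp_df.to_excel(writer, sheet_name=region[:31])

# Read the sheet names back to confirm one tab per region.
with pd.ExcelFile(output_path) as workbook:
    sheet_names = workbook.sheet_names
print(sheet_names)
```

The truncation matters for the longer region names in a data set like this one, which can exceed the sheet-name limit.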
We can now switch back over from Python to Excel, and we can see our sales pivot
tables. We can see that we have a fairly complex workbook with one tab for each
region. Within each region, we have one row for each country. And notice how the
cell representation closely mirrors what we had in Pandas. You have the name of the
aggregate function, which is sum, followed by the column names, total profit, total
revenue, and units sold. And just to be absolutely sure, we've gotten this calculation
right, let's scroll to the Sub-Saharan Africa data and scroll down to the country
Zambia.
And there, yes indeed we find that the number of units sold is equal to 4,085. You
might recall that this is the number we had noted down as a checksum when we
implemented this in Excel at the start of this example. And in this way we have
successfully demonstrated working with pivot tables in Excel as well as in Pandas.
Once again, this is one of the examples in this series where we did not work with
openpyxl, and that's because Pandas is simply too popular and too powerful to use
anything else.
By this point, you possess the skills needed to perform complex data analysis
operations using pivoting and formulae, leveraging absolute and relative cell
references, and to control workbook appearance using conditional formatting and
styles. You can easily leverage Pandas, the popular Python library for data analysis,
to group your data and pivot it using one or more columns. You also understand the
dollar operator and can use it to lock and unlock specific parts of your
worksheet formulae, so that those formulae can be copied and pasted as effectively as
possible.
18. Test