Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
21 views

Date String Manipulations With Python

This document discusses manipulating and visualizing date string data in Python. It shows how to convert date strings to datetime objects to allow grouping and ordering of data by time segments like weeks and quarters. It demonstrates converting a date column in a lightning strike dataset to datetimes, then creating new columns for week, month, quarter and year. Finally, it creates bar charts plotting lightning strikes by week for 2018 and by quarter over three years to understand patterns in the data.

Uploaded by

Mostafa Fathi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views

Date String Manipulations With Python

This document discusses manipulating and visualizing date string data in Python. It shows how to convert date strings to datetime objects to allow grouping and ordering of data by time segments like weeks and quarters. It demonstrates converting a date column in a lightning strike dataset to datetimes, then creating new columns for week, month, quarter and year. Finally, it creates bar charts plotting lightning strikes by week for 2018 and by quarter over three years to understand patterns in the data.

Uploaded by

Mostafa Fathi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

Date string manipulations with Python

As a data professional, you can expect

to work with date time objects and date strings. In this video,

we'll continue coding in Python and practice converting,

manipulating, and grouping data. By the end of this video, we'll create

a widely used data visualization, a bar graph that tells

a story with your data. Working with date strings will often

require breaking them down into smaller pieces. Breaking date strings into days, months,

and years allows you to group and order the other data in different ways so

that you can analyze it. Manipulating date and time strings

is a foundational skill in EDA. In this video,

you will learn to convert date strings in the NOAA lightning strike

dataset into datetime objects. We will discuss how to combine

these data objects into different groups by segments of

time such as quarters and weeks. Let's open a Python notebook,

and I'll show you what I mean. Let's begin by importing

Python libraries and packages. To start, import Matplotlib and

Pandas which you've used before. To review,

Pyplot is a very helpful package for creating visualizations like bar,

line, and pie charts. Pandas is one of the more popular packages

in data science because it's specific focus is a series of functions and

commands that help you work with datasets. The last package, Seaborn,

may be new to you. Seaborn is a visualization

library that is easier to use and produces nicer looking charts. For this video, we'll use the NOAA

lightning strike data for the years 2016, 2017, and 2018 to group lightning

strikes by different timeframes. This will help us understand total

lightning strikes by week and quarter. As I mentioned at the beginning of this

video, when manipulating date strings, the best thing to do is to break

down the date information like day, month, and year into parts. This allows us to group the data into
any time series grouping we want. Luckily, there's an easy way to do that

which is to create a datetime object. Now, as you'll recall

from a previous video, this NOAA dataset has three

columns giving us the date, number of lightning strikes, and

latitude and longitude of the strike. For us to manipulate the date column and

its data, we'll first need to convert

it into a datetime data type. We do that by simply

coding df ('date') and making that equal pd.to_datetime with df ('date') input in parentheses. Doing this
conversion

gives us the quickest and most direct path to manipulating

the date string in the date column which currently is in the format of

a four digit year followed by a dash, then the two digit month, a dash,

and lastly the two digit day. Okay, this is the exciting part. Because our dates are converted

into Panda's datetime objects, we can create any sort of

date grouping we want. Let's say for example, we want to group

the lightning strike data by both week and quarter. All we would need to do is

create some new columns. You'll see here we're creating four new

columns, week, month, quarter, and year. With the first line of code,

we're creating a column week by taking

the data in the date column and using the function strftime. This function from

the daytime package formats are datetime data into new strings. In this case, we want the year followed
by

a dash, and the number of weeks out of 52. If we want that string, we need to code it as %Y-W%V. The
percent sign is the command which

tells the datetime format to use the year data in the string. The W implies this is a week,

and the V stands for value, as in a sequential

number running from 1 to 52. The final string output for

the column data will be in this format, 2016-W27. The next line of code gives

us the new month column. The argument is then written as a %Y-%m. This will output the four

digit year followed by a dash, then the two digit month. Essentially, we're removing the last two
digit date from the original date string. Next, we will create a column for

quarters. In this case, a quarter is three months. Many corporations divide their

financial year into quarters. So knowing how to divide data into

quarter years is a very useful skill. In this case,

it only takes one line of code. We'll call the new column quarter,

and we'll use our date column with two underscore

period to create the quarter column. The datetime package has a pre-made code

for dividing datetime into quarters. In the two underscore period argument

field, we only need to place the letter q. After that, we can use the function

strftime to complete the string. For the argument, we put %Y-Q%q. The first Q is placed into the string
to

indicate we are talking about quarters. The percent sign followed by

the lower case q indicates the Pandas that we want the date

formatted into quarters. Our final column will be

the easiest to code of them all. The year column is created by taking

our original date column data and creating a string that includes

only the argument percent sign Y. This creates a column of data

with it with only the year in it. Now that we have formatted some strings,

let's quickly review our work by using the head function we learned

in the previous video. When we run this code,

our four new columns are there, week, month, quarter, and year. They are all formatted

just as we discussed. We can use these new strings

to learn more about the data. For example, let's say we want to group

the number of lightning strikes by weeks. An organization whose employees primarily

work outdoors might be interested in knowing the week to week likelihood

of dealing with lightning strikes. In order to do that,

we'll want to plot a chart. We've reviewed a couple of

charts coded in Python by now. Next, let's code a chart with

a lightning strike data. For plotting the number of lightning


strikes per week, let's use a bar chart. Our graph would be a bit confusing

using all three years of data. So, let's just use the 2018 data and limit our chart to 52 weeks

rather than 156 weeks. We can do this by creating a column

that groups the data by year and then orders it by week. We will then learn more about

the structuring function in another video. For now,

let's focus on plotting this bar chart. We'll use the plt.bar function to plot. Within our argument field, we
select

the x-axis which is our week column, then the y-axis or height,

which we input as a number of strikes. Next, we'll fill in some of

the details of our chart. Using plt.plot,

we will place arguments in the x-label, y-label, and title functions. The arguments are week number,

number of lightning strikes, and number of lightning strikes per week

(2018), respectively. This renders a graph, but

the x-axis labels are all joined together. So, we have a chart, but

the x-axis is difficult to read. So let's fix that. We can do that with

the plt.xticks function. For the rotation, we can put 45, and for

the fontsize, let's scale it down to 8. After we use plt.show,

the x-axis labels are much cleaner. Given our bar chart illustrating

lightning strikes per month in 2018, you could conclude that a group

planning outdoor activities for weeks 32 to 34 might want

a backup plan to move indoors. Of course, this is a broad

generalization to make on behalf of every North American

location in the dataset. But for our purposes and in general, it is a good understanding

of our dataset to have. For our last visualization,

let's plot lightning strikes by quarter. For our visualization,

it will be far easier to work with numbers in millions

such as 25.2 million rather than 25, 154, 365, for example. Let's create a column that divides

the total number of strikes by one million. We do this by typing df_by_quarter, and entering the
relevant column
in the arguments field. In this case, we want number of strikes. Next, we add on .div to

get our division function. Lastly for the argument field,

we enter 1000000. When we run this cell, we have a column that provides the number

of lightning strikes in millions. Next, we'll group the number

of strikes by quarter using the groupby and

reset_index functions. This code divides the number of strikes

into quarters for all three years. Each number is rounded

to the first decimal. The letter m represents one million. As you'll soon discover, this calculation

will help with the visualization. You'll learn more about these

functions in another video. We will plot our chart using

the same format as before. We use the plt.bar with

our x being from our df_by_quarter dataframe,

with quarter in the argument field. For the height, we put the number_of_strikes

column in the argument field. It would be helpful if each

quarter had the total lightning strike count at the top of each bar. To do that, we need to define our own

function, which we will call, addlabels. Let's type addlabels,

then input our two column axes, quarter, and number of strikes

separated by columns and brackets. At the end,

we use the format we created earlier, number_of_strikes formatted to label

the number_of_strikes_by_quarter. To finish the bar chart, we label

the x and y-axis and add the title. Before we show the data visualization, there are a few small things we
want to

add just to make it more friendly to read. Let's set our length and

height to 15 by 5. Next, let's make the bar

labels cleaner by defining those numbers and centering the text. Our bar chart now gives us the number
of

strikes by quarter from 2016 to 2018. To make the information easier to digest,

let's do one more visualization. Here is the code for

plotting a bar chart that groups the total number of strikes


year over year by quarter. Review the code carefully and

consider what each function an argument does in order to create this

final polished bar chart. Each year has assigned its own color to

highlight the differences in quarters. And now we have our chart. Coming up, you'll learn more about
the

different methods for structuring data. I'll see you there.

You might also like