Web Scraping Weather Data Using Python - by Abhishek Khatri - Medium
Web scraping is the process of extracting useful information from a web page and formatting the data in the required format for further analysis and use.
import pandas as pd
import bs4
import requests
import re
from datetime import datetime
from tqdm import tqdm
Let’s move ahead and start extracting the data, but before we do, let’s first see what our data looks like:
Note: We are going to extract data for nearly a decade, say from January 2009 till October 2018. Indexing will be done based on the date for which data is available. Looking at the image above, we will extract 19 columns, from Average Temperature through Maximum Heat Index, and our index will be the date on which these readings were recorded.
Since we are extracting data for a decade, we will start our code with a dates
variable.
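The article does not show the cell that builds dates; judging from the output below, it is a list of 'YYYYMM' strings, which can be produced like this (a sketch, not necessarily the author's original cell):

```python
# Build "YYYYMM" strings from January 2009 through October 2018.
dates = [str(year) + str(month).zfill(2)
         for year in range(2009, 2019)
         for month in range(1, 13)
         if not (year == 2018 and month > 10)]
```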
dates[0:5]
The cell above gives the following output: ['200901', '200902', '200903', '200904', '200905']
Now we will start by creating two empty lists, one for our data and the other for the index field, and run a loop over the elements of dates. We will also create a url variable that holds the page address for each iteration:
df_list = []
index = []
for k in tqdm(range(len(dates))):
    url = ("http://www.estesparkweather.net/archive_reports.php?date="
           + dates[k])
    page = requests.get(url)
requests.get() takes the URL and downloads the content of the web page. The downloaded page is stored as a string in the Response object's text attribute (page.text). If the request succeeds, the following will output 200:
page.status_code
The next step is parsing our data using BeautifulSoup, a module for extracting information from HTML pages. Some basic knowledge of HTML tags is helpful before using it. We'll create a BeautifulSoup object and apply its methods to extract the data we need:
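The parsing cell itself is not shown in the article; it presumably creates the soup object from page.text, roughly as below. The example uses a small literal snippet in place of page.text so it runs standalone:

```python
import bs4

# In the article this would be: soup = bs4.BeautifulSoup(page.text, "html.parser")
# A small literal snippet stands in for page.text here.
html = "<table><tr><td>Average temperature</td><td>37.8</td></tr></table>"
soup = bs4.BeautifulSoup(html, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```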
In the block above we use the HTML parser; BeautifulSoup also supports other parsers, such as lxml and html5lib.
Next, we use the find_all method to locate all the <table> tags. For more information on this and related methods, see the BeautifulSoup documentation.
table = soup.find_all('table')
type(table)
#bs4.element.ResultSet
Next, we'll create a list of lists containing the text of all the rows under the <table> tags:
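The cell that builds parsed_data is not shown in the article; one way to reproduce the step it describes, sketched here on a two-table snippet standing in for the archive page:

```python
import bs4

# Two small tables stand in for the archive page's per-day <table> tags.
html = ("<table><tr><td>Jan 1</td></tr>\n<tr><td>Average temperature 37.8</td></tr></table>"
        "<table><tr><td>Jan 2</td></tr>\n<tr><td>Average temperature 35.2</td></tr></table>")
soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find_all("table")

# One inner list of text lines per <table> tag (one table per day's report).
parsed_data = [row.text.splitlines() for row in table]
print(parsed_data[0])  # ['Jan 1', 'Average temperature 37.8']
```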
parsed_data = parsed_data[:-9]
In the last line of the cell above, we remove the trailing rows that are not required for extraction; they contain junk information that can be ignored. The first element of our list will then be:
parsed_data[0]
Next, we remove the empty entries for a cleaner view, using slicing to keep every third element starting from index 2:
for l in range(len(parsed_data)):
parsed_data[l] = parsed_data[l][2:len(parsed_data[l]):3]
Finally, we use a regex to extract the numerical values from each string and store the data in the required format. This block runs inside the k loop above, once per month:

for i in range(len(parsed_data)):
    c = ['.'.join(re.findall(r"\d+", str(parsed_data[i][j].split()[:5])))
         for j in range(len(parsed_data[i]))]
    df_list.append(c)
    index.append(dates[k] + c[0])
The code above runs for all the months in the defined period, and we now have two lists, one containing the data and the other the indexes.
Now, if we check the length of the first element of df_list, the output will be 20, and the elements of the list will be: ['1', '37.8', '35', '12.7', '29.7', '26.4', '36.8', '274', '0.00', '0.00', '0.00', '40.1', '34.5', '44', '27', '29.762', '29.596', …]
However, it has been observed that for 8 data points the length of the list is 22, and one such list looks like: ['', '31.3', '54', '14.2', '29.986', '8.0', '12.3', '319', '0.575', …]
Before creating a DataFrame we need to get rid of these 8 data points. Their first element is empty, so their index entries consist of only the 6-character YYYYMM part, which lets us filter them out as follows:
f_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
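Note that f_index filters only the index list; the data list needs the same filter so the two stay aligned. A sketch with toy stand-in values (f_data is a name assumed here, not from the article):

```python
# Toy stand-ins for the scraped lists; in the article these come from the loop above.
index = ['2009011', '200902', '2009023']        # '200902' is a junk entry (6 chars)
df_list = [['1', '37.8'], [''], ['2', '35.0']]  # the junk row has an empty first element

# Apply the same length test to both lists so data and index stay aligned.
f_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
f_data = [df_list[i] for i in range(len(index)) if len(index[i]) > 6]
print(f_index)  # ['2009011', '2009023']
```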
After removing the junk data points, we convert the indexes to the %Y-%m-%d date format using the following piece of code:

final_index = [datetime.strptime(str(f_index[i]), '%Y%m%d').strftime('%Y-%m-%d')
               for i in range(len(f_index))]
Final Output
With this, we are done extracting the data; the columns can be further type-cast using .astype(float). Thanks for reading!
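The final assembly is not shown in the article; a minimal sketch, assuming filtered lists like f_data and final_index above. The toy values and column names below are illustrative only; the real frame has 19 columns, Average Temperature through Maximum Heat Index:

```python
import pandas as pd

# Illustrative stand-ins for the filtered lists built earlier.
f_data = [['1', '37.8', '35'], ['2', '36.5', '33']]      # day-of-month + readings
final_index = ['2009-01-01', '2009-01-02']
col_names = ['Average temperature', 'Average humidity']  # assumed names

# Drop the leading day-of-month element (c[0]) and cast the readings to float.
df = pd.DataFrame([row[1:] for row in f_data], index=final_index, columns=col_names)
df = df.astype(float)
print(df.shape)  # (2, 2)
```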