Web Scraping Weather Data Using Python - by Abhishek Khatri - Medium
Web scraping is the process of extracting useful information from a web page and formatting the data in the required format for further analysis and use.
import pandas as pd
import bs4
import requests
import re
from datetime import datetime
from tqdm import tqdm
Let’s move ahead and start extracting the data, but before we do, let’s first see what our data looks like:
Note: We are going to extract data for nearly a decade, say from January 2009 till October 2018. Indexing will be done based on the date for which data is available. Looking at the image above, we will extract 19 columns, from Average Temperature through Maximum Heat Index, and our index will be the date on which these readings were recorded.
Since we are extracting data for a decade, we will start our code with a dates
variable.
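The article does not show the cell that builds dates; judging from the output below, it is a list of 'YYYYMM' strings, which can be produced like this (a sketch, not necessarily the author's original cell):

```python
# Build "YYYYMM" strings from January 2009 through October 2018.
dates = [str(year) + str(month).zfill(2)
         for year in range(2009, 2019)
         for month in range(1, 13)
         if not (year == 2018 and month > 10)]
```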
dates[0:5]
The cell above gives the following output: ['200901', '200902', '200903', '200904', '200905']
Now we will start by creating two empty lists, one for our data and the other for the index field, and run a loop over the elements of dates. We will also create a url variable that holds the page address for each iteration:
df_list = []
index = []
for k in tqdm(range(len(dates))):
    url = ("http://www.estesparkweather.net/archive_reports.php?date="
           + dates[k])
    page = requests.get(url)
requests.get() takes the URL and downloads the content of the web page. The downloaded page is stored as a string in the Response object's text attribute (page.text). If the request succeeds, the following will output 200:
page.status_code
The next step is parsing our data using BeautifulSoup, a module for extracting information from HTML pages. Some basic knowledge of HTML tags is helpful before using it. We'll create a BeautifulSoup object and apply its methods to extract the data we need:
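The parsing cell itself is not shown in the article; it presumably creates the soup object from page.text, roughly as below. The example uses a small literal snippet in place of page.text so it runs standalone:

```python
import bs4

# In the article this would be: soup = bs4.BeautifulSoup(page.text, "html.parser")
# A small literal snippet stands in for page.text here.
html = "<table><tr><td>Average temperature</td><td>37.8</td></tr></table>"
soup = bs4.BeautifulSoup(html, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```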
In the block above we use the HTML parser; BeautifulSoup also supports other parsers, such as lxml and html5lib.
Next, we use the find_all method to locate all the <table> tags. For more information on this and related methods, see the BeautifulSoup documentation.
table = soup.find_all('table')
type(table)
#bs4.element.ResultSet
Next, we'll create a list of lists containing the text of all the rows under the <table> tags:
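The cell that builds parsed_data is not shown in the article; one way to reproduce the step it describes, sketched here on a two-table snippet standing in for the archive page:

```python
import bs4

# Two small tables stand in for the archive page's per-day <table> tags.
html = ("<table><tr><td>Jan 1</td></tr>\n<tr><td>Average temperature 37.8</td></tr></table>"
        "<table><tr><td>Jan 2</td></tr>\n<tr><td>Average temperature 35.2</td></tr></table>")
soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find_all("table")

# One inner list of text lines per <table> tag (one table per day's report).
parsed_data = [row.text.splitlines() for row in table]
print(parsed_data[0])  # ['Jan 1', 'Average temperature 37.8']
```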
parsed_data = parsed_data[:-9]
In the last line of the cell above, we remove the trailing rows that are not required for extraction; they contain junk information that can be ignored. The first element of our list will then be:
parsed_data[0]
Next, we remove the empty entries for a cleaner view, using slicing to keep every third element starting from index 2:
for l in range(len(parsed_data)):
parsed_data[l] = parsed_data[l][2:len(parsed_data[l]):3]
Finally, we use a regex to extract the numerical values from each string and store the data in the required format. This block runs inside the k loop above, once per month:

for i in range(len(parsed_data)):
    c = ['.'.join(re.findall(r"\d+", str(parsed_data[i][j].split()[:5])))
         for j in range(len(parsed_data[i]))]
    df_list.append(c)
    index.append(dates[k] + c[0])
The code above runs for all the months in the defined period, and we now have two lists, one containing the data and the other the indexes.
Now, if we check the length of the first element of df_list, the output will be 20, and the elements of the list will be: ['1', '37.8', '35', '12.7', '29.7', '26.4', '36.8', '274', '0.00', '0.00', '0.00', '40.1', '34.5', '44', '27', '29.762', '29.596', …]
However, it has been observed that for 8 data points the length of the list is 22, and one such list looks like: ['', '31.3', '54', '14.2', '29.986', '8.0', '12.3', '319', '0.575', …]
Before creating a DataFrame we need to get rid of these 8 data points. Their first element is empty, so their index entries consist of only the 6-character YYYYMM part, which lets us filter them out as follows:
f_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
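Note that f_index filters only the index list; the data list needs the same filter so the two stay aligned. A sketch with toy stand-in values (f_data is a name assumed here, not from the article):

```python
# Toy stand-ins for the scraped lists; in the article these come from the loop above.
index = ['2009011', '200902', '2009023']        # '200902' is a junk entry (6 chars)
df_list = [['1', '37.8'], [''], ['2', '35.0']]  # the junk row has an empty first element

# Apply the same length test to both lists so data and index stay aligned.
f_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
f_data = [df_list[i] for i in range(len(index)) if len(index[i]) > 6]
print(f_index)  # ['2009011', '2009023']
```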
After removing the junk data points, we convert the indexes to the %Y-%m-%d date format using the following piece of code:

final_index = [datetime.strptime(str(f_index[i]), '%Y%m%d').strftime('%Y-%m-%d')
               for i in range(len(f_index))]
Final Output
With this, we are done extracting the data; the columns can be further type-cast using .astype(float). Thanks for reading!
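The final assembly is not shown in the article; a minimal sketch, assuming filtered lists like f_data and final_index above. The toy values and column names below are illustrative only; the real frame has 19 columns, Average Temperature through Maximum Heat Index:

```python
import pandas as pd

# Illustrative stand-ins for the filtered lists built earlier.
f_data = [['1', '37.8', '35'], ['2', '36.5', '33']]      # day-of-month + readings
final_index = ['2009-01-01', '2009-01-02']
col_names = ['Average temperature', 'Average humidity']  # assumed names

# Drop the leading day-of-month element (c[0]) and cast the readings to float.
df = pd.DataFrame([row[1:] for row in f_data], index=final_index, columns=col_names)
df = df.astype(float)
print(df.shape)  # (2, 2)
```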