Read HTML File in Python Using Pandas
Last Updated: 24 Apr, 2025
In Python, Pandas is a powerful library commonly used for data manipulation and analysis. While it's primarily used for working with structured data such as CSV files, Excel spreadsheets, and databases, it's also capable of reading HTML files and extracting tabular data from them. In this article, we'll explore how to read an HTML file in Python using Pandas, along with practical examples and explanations.
Read HTML Files in Python Using Pandas
Below are the possible approaches to read HTML files in Python using Pandas:
- Using read_html() Function
- Using BeautifulSoup with read_html()
- Using requests with read_html()
- Using lxml parser with read_html()
Read HTML Files Using read_html() Function
This approach directly uses the read_html() function provided by pandas. This function is specifically designed to parse HTML tables and return a list of DataFrames corresponding to the tables found in the HTML content. It's a convenient method when dealing with simple HTML files containing tabular data.
Python3
import pandas as pd


def read_html_with_read_html(file_path):
    # read_html() returns a list of DataFrames; take the first table
    df = pd.read_html(file_path)[0]
    return df


# File path
html_file_path = 'data/geeks_for_geeks.html'

# Read HTML file using read_html() function
df = read_html_with_read_html(html_file_path)

# Display DataFrame
print("Approach 1 Output:")
print(df)
HTML
<!DOCTYPE html>
<html>
<head>
<title>Table Example</title>
</head>
<body>
<table border="1">
<tr>
<th>Name</th>
<th>Topic</th>
<th>Difficulty</th>
</tr>
<tr>
<td>Introduction to Python</td>
<td>Python</td>
<td>Beginner</td>
</tr>
<tr>
<td>Data Structures</td>
<td>Algorithms</td>
<td>Intermediate</td>
</tr>
<tr>
<td>Machine Learning Basics</td>
<td>Machine Learning</td>
<td>Advanced</td>
</tr>
</table>
</body>
</html>
Output:
Approach 1 Output:
                      Name             Topic    Difficulty
0   Introduction to Python            Python      Beginner
1          Data Structures        Algorithms  Intermediate
2  Machine Learning Basics  Machine Learning      Advanced
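Because read_html() returns a list with one DataFrame per table, indexing with [0] only picks the right table when the one you want happens to come first. The following is a minimal sketch of that behaviour, assuming a made-up two-table HTML snippet (not the article's file) and that a parser such as lxml is installed. The snippet is wrapped in StringIO, and the match parameter keeps only tables whose text matches the given string.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML snippet with two tables (illustrative data only)
HTML = """
<table>
  <tr><th>Name</th><th>Topic</th></tr>
  <tr><td>Data Structures</td><td>Algorithms</td></tr>
</table>
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Python Basics</td><td>John Doe</td></tr>
</table>
"""

# One DataFrame is returned per <table> element
all_tables = pd.read_html(StringIO(HTML))
print(len(all_tables))

# match= keeps only the tables containing text that matches the pattern
author_tables = pd.read_html(StringIO(HTML), match="Author")
print(author_tables[0])
```

Selecting by content this way is usually more robust than a fixed index when a page may gain or lose tables over time.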
Read HTML Files Using BeautifulSoup with read_html()
In this approach, we first use the BeautifulSoup library to parse the HTML file and extract tables from it. BeautifulSoup provides more flexibility in navigating and extracting specific elements from HTML documents. We then pass the extracted tables to the read_html() function to convert them into DataFrames.
Python3
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def read_html_with_beautiful_soup(file_path):
    # Read the HTML file and parse it with BeautifulSoup
    with open(file_path, 'r') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # Find all tables in the HTML
    tables = soup.find_all('table')

    # Join the table markup and wrap it in StringIO: pandas 2.1+
    # deprecates passing literal HTML strings to read_html()
    html = ''.join(str(table) for table in tables)
    df = pd.read_html(StringIO(html))[0]
    return df


# File path
html_file_path = 'data/geeks_for_geeks.html'

# Read HTML file using BeautifulSoup with read_html()
df = read_html_with_beautiful_soup(html_file_path)

# Display DataFrame
print("Approach 2 Output:")
print(df)
HTML
<!DOCTYPE html>
<html>
<head>
<title>Programming Languages</title>
</head>
<body>
<table border="1">
<tr>
<th>Code</th>
<th>Language</th>
<th>Difficulty</th>
</tr>
<tr>
<td>HTML</td>
<td>HTML/CSS</td>
<td>Beginner</td>
</tr>
<tr>
<td>Python</td>
<td>Python</td>
<td>Intermediate</td>
</tr>
<tr>
<td>JavaScript</td>
<td>JavaScript</td>
<td>Advanced</td>
</tr>
</table>
</body>
</html>
Output:
Approach 2 Output:
         Code    Language    Difficulty
0        HTML    HTML/CSS      Beginner
1      Python      Python  Intermediate
2  JavaScript  JavaScript      Advanced
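The extra flexibility of BeautifulSoup pays off when a page contains several tables and only one of them holds the data. The sketch below assumes a hypothetical page in which the data table carries id="languages" (the id, the markup, and the values are invented for illustration). It selects that table alone and passes only its markup to read_html().

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page with a layout table and a data table
HTML = """
<table id="nav"><tr><td>Home</td><td>About</td></tr></table>
<table id="languages">
  <tr><th>Code</th><th>Language</th></tr>
  <tr><td>HTML</td><td>HTML/CSS</td></tr>
  <tr><td>Python</td><td>Python</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Pick exactly the table we want instead of relying on its position
table = soup.find("table", id="languages")

# Hand only that table's markup to pandas
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```

For a simple attribute filter like this one, read_html() also accepts an attrs dictionary (for example attrs={"id": "languages"}), so BeautifulSoup is mainly worth the extra step when the selection logic is more involved.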
Read HTML Files Using requests with read_html()
This approach fetches the HTML content from a URL using the requests library and then passes that content to the read_html() function for parsing. It is useful when the HTML is available online, and it makes it easy to automate data retrieval from remote sources. The trade-offs are that it requires an internet connection, depends on the availability of an external server, and carries potential security risks when the data comes from an untrusted source.
Python3
from io import StringIO

import pandas as pd
import requests


def read_html_with_requests(file_url):
    # Fetch HTML content using requests; fail early on HTTP errors
    response = requests.get(file_url, timeout=10)
    response.raise_for_status()

    # Wrap the downloaded text in StringIO: pandas 2.1+ deprecates
    # passing literal HTML content to read_html()
    df = pd.read_html(StringIO(response.text))[0]
    return df


# File URL
html_file_url = 'https://media.geeksforgeeks.org/wp-content/uploads/20240213175028/geeks_for_geeks.html'

# Read HTML file using requests with read_html()
df = read_html_with_requests(html_file_url)

# Display DataFrame
print("Approach 3 Output:")
print(df)
HTML
<!DOCTYPE html>
<html>
<head>
<title>Topics in Different Categories</title>
</head>
<body>
<table border="1">
<tr>
<th>Category</th>
<th>Topic</th>
<th>Difficulty</th>
</tr>
<tr>
<td>Data Structures</td>
<td>Algorithms</td>
<td>Beginner</td>
</tr>
<tr>
<td>Web Development</td>
<td>HTML/CSS</td>
<td>Intermediate</td>
</tr>
<tr>
<td>Machine Learning</td>
<td>Python</td>
<td>Advanced</td>
</tr>
</table>
</body>
</html>
Output:
Approach 3 Output:
           Category       Topic    Difficulty
0   Data Structures  Algorithms      Beginner
1   Web Development    HTML/CSS  Intermediate
2  Machine Learning      Python      Advanced
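Since remote fetching depends on the network and on an external server, it helps to make failures explicit. The sketch below is one possible way to do that, not a definitive implementation: fetch_tables and its session parameter are hypothetical names introduced here, the timeout value is an arbitrary choice, and the optional session argument exists only so that a requests.Session (or a stand-in object for offline testing) can be supplied.

```python
from io import StringIO

import pandas as pd
import requests


def fetch_tables(url, session=None, timeout=10.0):
    """Download a page and return every table in it as a DataFrame.

    `session` is an optional object with a requests-style get() method;
    when omitted, the module-level requests.get() is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    # read_html() raises ValueError if the page contains no tables
    return pd.read_html(StringIO(response.text))
```

A caller would typically wrap fetch_tables(url) in a try block that catches requests.RequestException (network and HTTP problems) and ValueError (no tables found), and would treat the returned data as untrusted input until it has been validated.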
Read HTML Files Using lxml parser with read_html()
In this approach, we pass flavor='lxml' to the read_html() function so that the lxml parser handles the HTML file. The lxml parser is known for its speed and its ability to handle large HTML files efficiently, which makes this approach suitable when performance is a concern. Its trade-offs are that the lxml library must be installed separately and that it offers less control over parsing than BeautifulSoup. Note that read_html() already tries lxml first when no flavor is given, so setting it explicitly mainly makes the choice of parser visible and rules out a fallback to another parser.
Python3
import pandas as pd


# Approach 4: Using lxml parser with read_html()
def read_html_with_lxml(file_path):
    # Read HTML file into DataFrame using read_html() with 'lxml' parser
    df = pd.read_html(file_path, flavor='lxml')[0]
    return df


# File path
html_file_path = 'data/geeks_for_geeks.html'

# Read HTML file using lxml parser with read_html()
df = read_html_with_lxml(html_file_path)

# Display DataFrame
print("Approach 4 Output:")
print(df)
HTML
<!DOCTYPE html>
<html>
<head>
<title>Book Information</title>
</head>
<body>
<table border="1">
<tr>
<th>Title</th>
<th>Author</th>
<th>Difficulty</th>
</tr>
<tr>
<td>Python Basics</td>
<td>John Doe</td>
<td>Beginner</td>
</tr>
<tr>
<td>Data Analysis</td>
<td>Jane Smith</td>
<td>Intermediate</td>
</tr>
<tr>
<td>Machine Learning Algorithms</td>
<td>David Johnson</td>
<td>Advanced</td>
</tr>
</table>
</body>
</html>
Output:
Approach 4 Output:
                         Title         Author    Difficulty
0                Python Basics       John Doe      Beginner
1                Data Analysis     Jane Smith  Intermediate
2  Machine Learning Algorithms  David Johnson      Advanced
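To get a feel for how the lxml flavor behaves on a larger input, the sketch below builds a synthetic table in memory and times a single parse. The row count and the data are arbitrary assumptions made for illustration, and the elapsed time will vary by machine, so no particular figure should be expected.

```python
import time
from io import StringIO

import pandas as pd

# Build a synthetic table with many rows (illustrative data only)
N_ROWS = 5_000
rows = "".join(
    f"<tr><td>{i}</td><td>item-{i}</td></tr>" for i in range(N_ROWS)
)
html = f"<table><tr><th>id</th><th>label</th></tr>{rows}</table>"

# Time one parse with the lxml flavor
start = time.perf_counter()
df = pd.read_html(StringIO(html), flavor="lxml")[0]
elapsed = time.perf_counter() - start

print(df.shape)
print(f"parsed in {elapsed:.3f} s")
```

Repeating the measurement with a different flavor on the same input is a simple way to check whether the parser choice matters for a given workload.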
Conclusion
In conclusion, Pandas provides multiple methods to read HTML files in Python, offering flexibility based on specific requirements. The read_html() function is a straightforward option for parsing simple HTML tables. Alternatively, utilizing BeautifulSoup allows for more control over complex HTML structures. For web-based content, the integration of requests enables fetching HTML from URLs, while the use of the lxml parser enhances performance with large datasets.