Convert XML structure to DataFrame using BeautifulSoup - Python
Last Updated: 21 Mar, 2024
In this article, we convert an XML structure into a pandas DataFrame using BeautifulSoup, a Python library for parsing HTML and XML documents that is commonly used to scrape web pages. To install it, run:
pip install beautifulsoup4
We will extract the data from an XML file with this library and then convert the extracted data into a DataFrame. To build the DataFrame, we also need the pandas library.
Pandas library: pandas is a Python library used for data manipulation and analysis. To install it, run:
pip install pandas
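Note: The ‘xml’ parser used in this article is provided by the lxml library. If BeautifulSoup reports that it cannot find a parser and asks whether you need to install a parser library, use the command
pip install lxml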
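Before moving on, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and the list of names assumes the 'xml' parser option used later, which relies on lxml.

```python
import importlib.util

def missing_packages(names):
    """Return the import names from 'names' that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names differ from pip names: beautifulsoup4 is imported as bs4.
required = ["bs4", "lxml", "pandas"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Any name reported as missing can be installed with pip before running the steps below.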
Step-by-step implementation:
Step 1: Import the libraries.
Python3
from bs4 import BeautifulSoup
import pandas as pd
First, we import the libraries used in the program: the BeautifulSoup class from the bs4 module, and the pandas library under the alias ‘pd’.
Step 2: Read the xml file.
Python3
with open("gfg.xml",'r') as file:
    contents = file.read()
Here, we open our XML file named ‘gfg.xml’ in read mode ‘r’ using the open(“filename”, “mode”) function and bind it to the variable ‘file’. We then read the contents of the file with the read() function. The with statement closes the file automatically once the block ends.
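The article does not show the contents of gfg.xml. The sketch below writes a small stand-in file and reads it back inside a with block. The file name gfg_sample.xml, the catalog and book wrapper elements, and the two book records are assumptions made up for illustration; only the six child tag names come from the steps that follow.

```python
from pathlib import Path

# Hypothetical stand-in for gfg.xml: the wrapper elements and the values
# are invented; only the six child tag names match the tutorial.
SAMPLE_XML = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Author One</author>
    <title>First Book</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>A short description of the first book.</description>
  </book>
  <book id="bk102">
    <author>Author Two</author>
    <title>Second Book</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A short description of the second book.</description>
  </book>
</catalog>
"""

sample_path = Path("gfg_sample.xml")  # hypothetical file name
sample_path.write_text(SAMPLE_XML, encoding="utf-8")

# The with block closes the file even if read() raises an exception.
with open(sample_path, "r", encoding="utf-8") as file:
    contents = file.read()

print(len(contents), "characters read")
```

Pointing the later steps at this file instead of gfg.xml should be enough to try them without the original data.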
Step 3: Parse the XML content.
Python3
soup = BeautifulSoup(contents,'xml')
Here, we pass the file data stored in the ‘contents’ variable to the BeautifulSoup constructor, along with the parser to use, which is ‘xml’. The parsed document is stored in the ‘soup’ variable.
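The parsing step can be tried on an in-memory string, as in the minimal sketch below. The one-record XML snippet and the isbn tag name are hypothetical and stand in for gfg.xml; the 'xml' option needs lxml to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-record document, used instead of gfg.xml.
contents = (
    "<catalog><book><author>Author One</author>"
    "<price>44.95</price></book></catalog>"
)

# "xml" selects lxml's XML parser, so lxml must be installed.
soup = BeautifulSoup(contents, "xml")

# find() returns the first matching tag, or None when nothing matches.
first_author = soup.find("author")
print(first_author.get_text())  # text inside the first <author> tag
print(soup.find("isbn"))        # this tag is absent from the snippet
```

Checking for None before calling get_text() avoids an AttributeError when a tag is absent.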
Step 4: Searching the data.
Here, we extract the data with the find_all() function, which returns every tag in the document whose name matches the one passed to it.
Python3
authors = soup.find_all('author')
titles = soup.find_all('title')
prices = soup.find_all('price')
pubdate = soup.find_all('publish_date')
genres = soup.find_all('genre')
des = soup.find_all('description')
Example:
authors = soup.find_all('author')
The result is stored in the authors variable. find_all(‘author’) returns every author tag in the XML file as a list-like object, so authors holds one entry per author tag. Each entry is a tag object rather than plain text; the text is read from it in the next step. The other statements work the same way.
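To see what find_all() hands back, the sketch below runs it on a hypothetical two-record snippet (not the real gfg.xml) and inspects the result; the isbn tag name is likewise made up to show the no-match case.

```python
from bs4 import BeautifulSoup

# Hypothetical two-record document, used instead of gfg.xml.
contents = """<catalog>
  <book><author>Author One</author><title>First Book</title></book>
  <book><author>Author Two</author><title>Second Book</title></book>
</catalog>"""
soup = BeautifulSoup(contents, "xml")

authors = soup.find_all("author")

print(len(authors))            # number of <author> tags found
print(authors[0])              # a tag prints together with its markup
print(authors[0].get_text())   # only the text inside the tag
print(soup.find_all("isbn"))   # no match gives an empty list
```

Because the result behaves like a list, it supports len(), indexing, and iteration.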
Step 5: Get text data from xml.
Python3
data = []
for i in range(0,len(authors)):
    rows = [authors[i].get_text(),titles[i].get_text(),
            genres[i].get_text(),prices[i].get_text(),
            pubdate[i].get_text(),des[i].get_text()]
    data.append(rows)
We now have all the data extracted from the XML file in separate lists, one per tag. Next, we need to combine the values that belong to the same book. We run a for loop in which the text of each tag for a particular book is collected into one list named ‘rows’, and each such row is appended to another list named ‘data’.
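Indexing parallel lists works only when every book contains all of the tags; if one book lacks a tag, the later entries of that list shift and rows no longer line up. The sketch below is one possible alternative, assuming each record is wrapped in a book element (the article does not state this). The helper text_or_none and the sample snippet are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the second book has no <price> tag.
contents = """<catalog>
  <book><author>Author One</author><title>First Book</title><price>44.95</price></book>
  <book><author>Author Two</author><title>Second Book</title></book>
</catalog>"""
soup = BeautifulSoup(contents, "xml")

def text_or_none(parent, tag_name):
    """Return the text of the first tag_name descendant, or None if absent."""
    tag = parent.find(tag_name)
    return tag.get_text() if tag is not None else None

fields = ["author", "title", "price"]
data = []
# Assumes each record is wrapped in a <book> element.
for book in soup.find_all("book"):
    data.append([text_or_none(book, name) for name in fields])

print(data)
```

Searching inside each record keeps a missing value in its own row as None instead of misaligning the rows that follow.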
Step 6: Print the dataframe.
Finally, we have the combined data for each book as a separate row. Now we need to convert this list into a DataFrame.
Python3
df = pd.DataFrame(data,columns = ['Author','Book Title',
                                  'Genre','Price','Publish Date',
                                  'Description'])
# Only the Price column is numeric; the other columns hold text
df['Price'] = df['Price'].astype(float)
print(df)
Output:
DataFrame
Here, we convert the ‘data’ list into a DataFrame using the pd.DataFrame() constructor, passing the list and the names of the columns we want. Since get_text() returns strings and most columns hold text, we convert only the ‘Price’ column to float with astype(float). Passing dtype = float to the constructor would try to convert every column, which can fail on text values depending on the pandas version.
Now the data from the XML file has been extracted with BeautifulSoup into a DataFrame stored as ‘df’. To see the DataFrame, we print it with print(df).
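Values read with get_text() are strings, so numeric and date columns usually need an explicit conversion. The sketch below shows one way to do this with pd.to_numeric and pd.to_datetime; the two rows are hypothetical and only mirror the shape of the ‘data’ list from Step 5.

```python
import pandas as pd

# Hypothetical rows in the same shape as the 'data' list built in Step 5.
data = [
    ["Author One", "First Book", "Computer", "44.95", "2000-10-01", "First description."],
    ["Author Two", "Second Book", "Fantasy", "5.95", "2000-12-16", "Second description."],
]
columns = ["Author", "Book Title", "Genre", "Price", "Publish Date", "Description"]
df = pd.DataFrame(data, columns=columns)

# Convert only the columns that are not meant to stay as text.
df["Price"] = pd.to_numeric(df["Price"])
df["Publish Date"] = pd.to_datetime(df["Publish Date"])

print(df.dtypes)
```

Converting column by column leaves the text columns untouched and makes it clear which values are expected to be numbers or dates.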
Below is the full implementation:
Python3
# Python program to convert xml
# structure into dataframes using beautifulsoup
# Import libraries
from bs4 import BeautifulSoup
import pandas as pd
# Open the XML file and read its contents;
# the with statement closes the file afterwards
with open("gfg.xml", 'r') as file:
    contents = file.read()
# Parse the contents with the XML parser (requires lxml)
soup = BeautifulSoup(contents, 'xml')
# Extracting the data
authors = soup.find_all('author')
titles = soup.find_all('title')
prices = soup.find_all('price')
pubdate = soup.find_all('publish_date')
genres = soup.find_all('genre')
des = soup.find_all('description')
data = []
# Loop to store the data in a list named 'data'
for i in range(0, len(authors)):
    rows = [authors[i].get_text(), titles[i].get_text(),
            genres[i].get_text(), prices[i].get_text(),
            pubdate[i].get_text(), des[i].get_text()]
    data.append(rows)
# Converting the list into dataframe
df = pd.DataFrame(data, columns=['Author',
                                 'Book Title', 'Genre',
                                 'Price', 'Publish Date',
                                 'Description'])
# Only the Price column is numeric, so convert it on its own
df['Price'] = df['Price'].astype(float)
print(df)
Output:
DataFrame