Parsing and converting HTML documents to XML format using Python Last Updated : 23 Aug, 2021 Comments Improve Suggest changes Like Article Like Report In this article, we are going to see how to parse and convert HTML documents to XML format using Python. It can be done in these ways: Using Ixml module.Using Beautifulsoup module.Method 1: Using the Python lxml library In this approach, we will use Python's lxml library to parse the HTML document and write it to an encoded string representation of the XML tree.The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that as it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. Installation: pip install lxml We need to provide the path to open the HTML document to read and parse it using the html.fromstring(str) function, returning a single element/document tree. This function parses a document from the given string. This always creates a correct HTML document, which means the parent node is <html>, and there is a body and possibly a head. htmldoc = html.fromstring(inp.read()) And write the parsed HTML element/document tree to an encoded string representation of its XML tree using the etree.tostring() function. out.write(etree.tostring(htmldoc)) Html file used: input. Code: Python3 # Import the required library from lxml import html, etree # Main Function if __name__ == '__main__': # Provide the path of the html file file = "input.html" # Open the html file and Parse it, # returning a single element/document. with open(file, 'r', encoding='utf-8') as inp: htmldoc = html.fromstring(inp.read()) # Open a output.xml file and write the # element/document to an encoded string # representation of its XML tree. with open("output.xml", 'wb') as out: out.write(etree.tostring(htmldoc)) Output: Method 2: Using the BeautifulSoup In this approach, we will use the BeautifulSoup module to parse the raw HTML document using html.parser and modify the parsed document and write it to an XML file. Provide the path to open the HTML file and read the HTML file and Parse it using BeautifulSoup's html.parser, returning an object of the parsed document. BeautifulSoup(inp, 'html.parser') To remove the DocType HTML, we need to first get the string representation of the document using soup.prettify() and then split the document by lines using splitlines(), returning a list of lines. soup.prettify().splitlines() Code: Python3 # Import the required library from bs4 import BeautifulSoup # Main Function if __name__ == '__main__': # Provide the path of the html file file = "input.html" # Open the html file and Parse it # using Beautiful soup's html.parser. with open(file, 'r', encoding='utf-8') as inp: soup = BeautifulSoup(inp, 'html.parser') # Split the document by lines and join the lines # from index 1 to remove the doctype Html as it is # present in index 0 from the parsed document. lines = soup.prettify().splitlines() content = "\n".join(lines[1:]) # Open a output.xml file and write the modified content. with open("output.xml", 'w', encoding='utf-8') as out: out.write(content) Output: Comment More infoAdvertise with us Next Article Parsing and converting HTML documents to XML format using Python anilabhadatta Follow Improve Article Tags : Python Python-XML Practice Tags : python Similar Reads How to Convert Excel to XML Format in Python? Python proves to be a powerful language when the requirement is to convert a file from one format to the other. It supports tools that can be employed to easily achieve the functionality. In this article, we'll find out how we will convert from an Excel file to Extensible terminology (XML) files wit 3 min read Create XML Documents using Python Extensible Markup Language(XML), is a markup language that you can use to create your own tags. It was created by the World Wide Web Consortium (W3C) to overcome the limitations of HTML, which is the basis for all Web pages. XML is based on SGML - Standard Generalized Markup Language. It is used for 3 min read Parse XML using Minidom in Python DOM (document object model) is a cross-language API from W3C i.e. World Wide Web Consortium for accessing and modifying XML documents. Python enables you to parse XML files with the help of xml.dom.minidom, which is the minimal implementation of the DOM interface. It is simpler than the full DOM API 1 min read Convert XML to CSV in Python XML (Extensible Markup Language) is widely used to store and transport structured data. However, for tasks like data analysis or spreadsheet editing, CSV (Comma-Separated Values) is far more user-friendly and easier to work with.To make XML data easier to process, we often convert it to a CSV file. 2 min read Find the title tags from a given html document using BeautifulSoup in Python Let's see how to Find the title tags from a given html document using BeautifulSoup in python. so we can find the title tag from html document using BeautifulSoup find() method. The find function takes the name of the tag as string input and returns the first found match of the particular tag from t 1 min read Convert XML structure to DataFrame using BeautifulSoup - Python Here, we are going to convert the XML structure into a DataFrame using the BeautifulSoup package of Python. It is a python library that is used to scrape web pages. To install this library, the command is pip install beautifulsoup4 We are going to extract the data from an XML file using this library 4 min read How to Parse and Modify XML in Python? XML stands for Extensible Markup Language. It was designed to store and transport data. It was designed to be both human- and machine-readable. Thatâs why, the design goals of XML emphasize simplicity, generality, and usability across the Internet. Note: For more information, refer to XML | Basics H 4 min read Convert HTML source code to JSON Object using Python In this post, we will see how we can convert an HTML source code into a JSON object. JSON objects can be easily transferred, and they are supported by most of the modern programming languages. We can read JSON from Javascript and parse it as a Javascript object easily. Javascript can be used to make 3 min read Parsing XML with DOM APIs in Python The Document Object Model (DOM) is a programming interface for HTML and XML(Extensible markup language) documents. It defines the logical structure of documents and the way a document is accessed and manipulated. Parsing XML with DOM APIs in python is pretty simple. For the purpose of example we wil 2 min read How to append new data to existing XML using Python ElementTree Extensible Markup Language (XML) is a widely used format for storing and transporting data. In Python, the ElementTree module provides a convenient way to work with XML data. When dealing with XML files, it is common to need to append new data to an existing XML document. This can be achieved effici 4 min read Like