
Extract URL from HTML Link Using Python Regular Expression
In this article, we are going to learn how to extract a URL from an HTML link using Python regular expressions. URL stands for Uniform Resource Locator; it identifies the location of a resource on the Internet.
A URL consists of parts such as a protocol, a domain name, a port number, and a path. A URL can be parsed and processed using a regular expression. To use regular expressions in Python, we import the built-in "re" module. Following is an example of a URL -
URL: https://www.tutorialspoint.com/courses
If we parse the above URL, we can find the website name (the hostname without the "www." prefix) and the protocol -
Hostname: tutorialspoint.com
Protocol: https
Regular Expressions
In Python, a regular expression is a search pattern used to find matching text within strings.
The re module provides several functions; four commonly used ones are listed below -
- search() finds the first match anywhere in the string.
- match() matches the pattern only at the beginning of the string.
- findall() returns a list of all non-overlapping matches.
- sub() replaces each part of the string that matches the pattern with a new string.
To find every occurrence of a pattern, such as all URLs in a piece of text, we use the re.findall() function from the re module.
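As a quick illustration of how these four functions differ, the sketch below applies each of them to the same string. The sample text, the simplified pattern, and the variable names are assumptions made for this illustration only.

```python
import re

# Sample text and a simplified URL pattern (both hypothetical)
text = 'Visit https://www.tutorialspoint.com or http://example.com today'
pattern = r'https?://[\w.-]+'

# search() returns the first match found anywhere in the string
first = re.search(pattern, text).group()
print(first)

# match() only succeeds when the pattern matches at the start of the string;
# here the text starts with 'Visit', so the result is None
at_start = re.match(pattern, text)
print(at_start)

# findall() returns every non-overlapping match as a list
all_urls = re.findall(pattern, text)
print(all_urls)

# sub() replaces each match with a new string
replaced = re.sub(pattern, '<url>', text)
print(replaced)
```

Note that search() and match() return a match object (or None), while findall() returns a list of strings and sub() returns a new string.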
Examples to Extract URLs
The examples below show different ways to extract URLs, and parts of URLs, from an HTML link -
Example: Extract All URLs
The following example extracts URLs from an HTML string. The regex pattern matches web addresses that start with http or https, and the re.findall() function scans the text and returns a list of the matching links. The pattern is written as a raw string (r'...') so that Python passes its backslashes to the regex engine unchanged.
# Import the re module
import re

# HTML text that contains the links
text = '<p>Hello World: </p><a href="http://tutorialspoint.com">More Courses</a><a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>'

# Find every substring that looks like an http or https URL
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)

# Print the result
print("Original string:", text)
print("Urls:", urls)
Output
This output will be displayed when the program runs -
Original string: <p>Hello World: </p><a href="http://tutorialspoint.com">More Courses</a><a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>
Urls: ['http://tutorialspoint.com', 'https://www.tutorialspoint.com/market/index.asp']
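The pattern above finds anything that looks like an http or https address, wherever it appears in the text. If only the targets of links are wanted, one alternative is to capture the value of the href attribute instead. The following is a minimal sketch under that assumption; the pattern and the variable names are illustrative and not part of the original example.

```python
import re

# Same HTML text as in the example above
html = ('<p>Hello World: </p>'
        '<a href="http://tutorialspoint.com">More Courses</a>'
        '<a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>')

# Capture whatever sits between the quotes of an href attribute.
# Both double-quoted and single-quoted attribute values are accepted.
href_pattern = r'href\s*=\s*["\']([^"\']+)["\']'

# With one capture group, findall() returns only the captured text
links = re.findall(href_pattern, html, flags=re.IGNORECASE)
print(links)
```

Because the capture group holds the attribute value, this approach also returns relative links such as "/courses" that do not start with http. Regular expressions remain a fragile way to read HTML in general, so this is best suited to small, well-formed snippets.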
Example: Extract Protocol and Hostname
The following example extracts the protocol and the host name from a given URL. It uses two regex patterns: one captures the protocol (the word before "://") and the other captures the domain name that follows "www.". In each case, re.findall() returns a list containing the captured text.
# Import the re module
import re

# Define the URL here
website = 'https://www.tutorialspoint.com/'

# Find the protocol
object1 = re.findall(r'(\w+)://', website)
print(object1)

# Find the host name (the dot after www is escaped so it matches a literal dot)
object2 = re.findall(r'://www\.([\w\-.]+)', website)
print(object2)
Output
After running the program, you will get this result -
['https']
['tutorialspoint.com']
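The host name pattern in this example only matches URLs that contain "www.". As a possible variation, the sketch below makes that prefix optional and captures the protocol and host name in a single pattern. The helper name split_url and the second test URL are hypothetical.

```python
import re

def split_url(url):
    """Return (protocol, hostname) for a URL, or None if it does not match.

    Hypothetical helper for illustration; 'www.' is optional and is not
    included in the returned hostname.
    """
    m = re.match(r'(\w+)://(?:www\.)?([\w\-.]+)', url)
    return m.groups() if m else None

with_www = split_url('https://www.tutorialspoint.com/')
without_www = split_url('http://tutorialspoint.com/courses')
print(with_www)
print(without_www)
```

The (?:www\.)? part is a non-capturing group, so it is matched when present but does not appear in the result.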
Example: Parse URL Parts
The following program parses a complete URL that includes a path. The pattern captures the protocol (http/https), the domain name, the page name, and the file extension in four groups. Because the pattern contains more than one group, re.findall() returns a list of tuples, one tuple per match.
# Import the re module
import re

# URL to parse
url = 'http://www.tutorialspoint.com/index.html'

# Find all capture groups (the dot before the extension is escaped)
parts = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', url)
print(parts)
Output
When you run the program, it will show this output -
[('http', 'www.tutorialspoint.com', 'index', 'html')]
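When a pattern has several groups, naming them can make the result easier to read than a positional tuple. The sketch below rewrites the same pattern with named groups; the group names are assumptions chosen for this illustration. It also cross-checks the protocol and host name against urlparse() from the standard urllib.parse module.

```python
import re
from urllib.parse import urlparse

url = 'http://www.tutorialspoint.com/index.html'

# Same pattern as above, with each group given a (hypothetical) name
pattern = re.compile(
    r'(?P<protocol>\w+)://'
    r'(?P<host>[\w\-.]+)/'
    r'(?P<page>\w+)\.(?P<extension>\w+)'
)

match = pattern.match(url)
# groupdict() maps each group name to the text it captured
named_parts = match.groupdict() if match else {}
print(named_parts)

# Cross-check two of the parts with the standard library's URL parser
parsed = urlparse(url)
print(parsed.scheme, parsed.netloc, parsed.path)
```

For general-purpose URL handling, urlparse() covers more cases (ports, query strings, fragments) than a hand-written pattern; the regex version is shown here because it is the subject of this article.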