Extract URL from HTML Link Using Python Regular Expression



In this article, we are going to learn how to extract a URL from an HTML link using Python regular expressions. URL is an abbreviation of Uniform Resource Locator; it identifies the location of a resource on the Internet.

A URL consists of a protocol (scheme), a domain name, an optional port number, a path, and other components. A URL can be parsed and processed using a regular expression. To use regular expressions in Python, we import the built-in "re" module. Following is an example of a URL -

URL: https://www.tutorialspoint.com/courses
If we parse the above URL, we can find the hostname and the protocol:
Hostname: www.tutorialspoint.com (domain name: tutorialspoint.com)
Protocol: https
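
As a cross-check for the regex results later in this article, the same parts can be read with the standard library's urllib.parse module. This is only a minimal sketch and not the regex approach the article focuses on; note that the full hostname it reports includes the www prefix.

```python
from urllib.parse import urlsplit

# Split the example URL into its components
parts = urlsplit('https://www.tutorialspoint.com/courses')

print(parts.scheme)    # https
print(parts.hostname)  # www.tutorialspoint.com
print(parts.path)      # /courses
```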

Regular Expressions

In Python, a regular expression is a search pattern used to find matching text within strings.

Python's re module provides several functions for working with regular expressions. Four commonly used ones are: search(), which finds the first match anywhere in the string; match(), which matches only at the beginning of the string; findall(), which returns all non-overlapping matches as a list; and sub(), which replaces every match of the pattern with a new string.
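
The difference between these four functions is easiest to see side by side. The following is a small sketch; the sample sentence and the simplified pattern (protocol plus host only) are assumptions made for illustration.

```python
import re

# Sample text and a simplified pattern (both chosen for illustration)
text = 'Visit https://www.tutorialspoint.com/courses or http://tutorialspoint.com today'
pattern = r'https?://[\w\-\.]+'

# search() returns a match object for the first match anywhere in the string
first_match = re.search(pattern, text)
print(first_match.group())

# match() succeeds only if the pattern matches at the start of the string
at_start = re.match(pattern, text)
print(at_start)

# findall() returns every non-overlapping match as a list of strings
all_matches = re.findall(pattern, text)
print(all_matches)

# sub() replaces every match with the given replacement string
replaced = re.sub(pattern, '<URL>', text)
print(replaced)
```

Here match() returns None because the text starts with "Visit" rather than with a URL, while search() and findall() still find the links.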

To extract every occurrence of a pattern, such as each URL in a piece of text, we use the re.findall() function from the re module.

Examples to Extract URLs

Now let us see different examples to extract URLs from an HTML link in the section below -

Example: Extract All URLs

The following example extracts URLs from an HTML string. We use a regex pattern that matches web addresses starting with http:// or https://. The re.findall() function scans the text and returns a list of the matching links.

# Import re module
import re

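# HTML text that contains the links
text = '<p>Hello World: </p><a href="http://tutorialspoint.com">More Courses</a><a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>'
# Raw string (r'...') so the backslashes reach the regex engine unchanged.
# Note: inside [$-_@.&+], $-_ is a character range that also covers / and :
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)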

# print the result
print("Original string: ",text)
print("Urls:",urls)

Output

This output will be displayed when the program runs -

Original string:  <p>Hello World: </p><a href="http://tutorialspoint.com">More Courses</a><a href="https://www.tutorialspoint.com/market/index.asp">Even More Courses</a>
Urls: ['http://tutorialspoint.com', 'https://www.tutorialspoint.com/market/index.asp']
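
The pattern above finds only absolute http/https addresses. If the goal is the value of each href attribute, including relative links, one possible sketch is to capture whatever sits between the quotes. The sample HTML and the relative link in it are assumptions, and the pattern handles only double-quoted attributes.

```python
import re

# Sample HTML with one absolute and one relative link (illustrative)
html = '<a href="http://tutorialspoint.com">More Courses</a><a href="/market/index.asp">Even More Courses</a>'

# Capture the text between the double quotes of each href attribute
href_pattern = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)
links = href_pattern.findall(html)
print(links)
```

For this input the result is ['http://tutorialspoint.com', '/market/index.asp']. For irregular real-world HTML, a dedicated HTML parser is usually more robust than a regular expression.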

Example: Extract Protocol and Hostname

The following example extracts the protocol and host name from a given URL. We use two regex patterns: one identifies the protocol (http or https) and the other captures the domain name that follows the www. prefix. The re.findall() function returns a list of the matching components.

# Import re module
import re  

# Define the URL here
website = 'https://www.tutorialspoint.com/'

# To find the protocol
protocol = re.findall(r'(\w+)://', website)
print(protocol)

# To find the host name (the dot after www is escaped to match a literal dot)
hostname = re.findall(r'://www\.([\w\-\.]+)', website)
print(hostname)

Output

After running the program, you will get this result -

['https']
['tutorialspoint.com']
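
The host name pattern above requires the www prefix, so it returns an empty list for a URL such as http://tutorialspoint.com/courses. A possible variation, shown here only as a sketch, makes the prefix optional with a non-capturing group; the second sample URL is an assumption for illustration.

```python
import re

# (?:www\.)? optionally consumes a leading "www." without capturing it
host_pattern = re.compile(r'://(?:www\.)?([\w\-\.]+)')

hosts = []
for website in ('https://www.tutorialspoint.com/', 'http://tutorialspoint.com/courses'):
    found = host_pattern.findall(website)
    hosts.append(found)
    print(found)
```

Both URLs produce ['tutorialspoint.com'].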

Example: Parse URL Parts

The following program splits a general URL into its parts. We use a pattern with four capture groups: the protocol (http/https), the domain name, the page name, and the file extension. Because the pattern contains more than one group, re.findall() returns a list of tuples, one tuple of captured groups per match.

# Import re module
import re

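# URL to parse
url = 'http://www.tutorialspoint.com/index.html'

# Find all capture groups: protocol, host, page name, file extension
# (the dot before the extension is escaped so it matches a literal dot)
parts = re.findall(r'(\w+)://([\w\-\.]+)/(\w+)\.(\w+)', url)
print(parts)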

Output

When you run the program, it will show this output -

[('http', 'www.tutorialspoint.com', 'index', 'html')]
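
When the tuple positions become hard to remember, named groups are one alternative. The sketch below reuses the same pattern shape with re.search() and groupdict(); the group names are chosen here for illustration.

```python
import re

url = 'http://www.tutorialspoint.com/index.html'

# Same four groups as above, each given a name with (?P<name>...)
url_pattern = re.compile(
    r'(?P<protocol>\w+)://(?P<host>[\w\-\.]+)/(?P<page>\w+)\.(?P<extension>\w+)'
)

match = url_pattern.search(url)
# groupdict() maps each group name to the text it captured
named_parts = match.groupdict() if match else {}
print(named_parts)
```

For this URL the dictionary contains protocol 'http', host 'www.tutorialspoint.com', page 'index', and extension 'html'.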
Updated on: 2025-05-26T18:04:33+05:30
