Parsing and Processing URL using Python - Regex
Last Updated: 02 Sep, 2020
Prerequisite: Regular Expression in Python
A URL (Uniform Resource Locator) consists of several parts, such as the protocol, the domain name, the port number and the path. These parts can be extracted with regular expressions, which Python provides through the re library.
Example:
URL: https://www.geeksforgeeks.org/courses
When we parse the above URL, we find:
Hostname: geeksforgeeks.org (without the leading www.)
Protocol: https
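We use the re.findall() function of the re library to search for the required pattern in the URL.
Syntax: re.findall(regex, string)
Return: all non-overlapping matches of the pattern in the string, as a list. If the pattern has no capture group, each item is the whole match; with one group, each item is the text of that group; with several groups, each item is a tuple of strings.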
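The shape of the list returned by re.findall() depends on how many capture groups the pattern contains. The short sketch below illustrates this; the sample string and the variable names are made up for the illustration.

```python
import re

# Made-up sample text containing two URLs.
text = 'http://a.example and https://b.example'

# No capture group: each list item is the whole match.
no_group = re.findall(r'\w+://', text)
print(no_group)    # ['http://', 'https://']

# One capture group: each item is the text of that group.
one_group = re.findall(r'(\w+)://', text)
print(one_group)   # ['http', 'https']

# Two capture groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)://([\w.]+)', text)
print(two_groups)  # [('http', 'a.example'), ('https', 'b.example')]
```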
Now, let's see the examples:
Example 1: In this example, we extract the protocol and the hostname from the given URL.
- Regular expression for extracting the protocol group: '(\w+)://'.
- Regular expression for extracting the hostname group: '://www\.([\w\-\.]+)'.
Meta characters used:
- \w: Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
- +: One or more occurrences of the preceding character or group.
- \.: Matches a literal dot; an unescaped . would match any character.
Code:
Python3
# import library
import re
# url link
s = 'https://www.geeksforgeeks.org/'
# finding the protocol
obj1 = re.findall(r'(\w+)://', s)
print(obj1)
# finding the hostname, which may
# contain dashes or dots
obj2 = re.findall(r'://www\.([\w\-\.]+)', s)
print(obj2)
Output:
['https']
['geeksforgeeks.org']
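The hostname pattern in Example 1 only matches URLs whose host begins with www. One possible variation, which is not part of the original example, makes that prefix optional with a non-capturing group, (?:www\.)?. The helper name host_without_www and the second sample URL are hypothetical.

```python
import re

# "(?:www\.)?" optionally consumes a leading "www." without capturing it.
HOST_RE = re.compile(r'://(?:www\.)?([\w\-\.]+)')

def host_without_www(url):
    """Hypothetical helper: return the hostname with any leading 'www.' removed,
    or None when the string has no '://' followed by a hostname."""
    match = HOST_RE.search(url)
    return match.group(1) if match else None

print(host_without_www('https://www.geeksforgeeks.org/'))  # geeksforgeeks.org
print(host_without_www('https://docs.python.org/3/'))      # docs.python.org
```

re.search() is used here instead of re.findall() because only the first match is of interest and a missing match can be reported as None.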
Example 2: A URL of a different type, such as 'file://localhost:4040/zip_file', can carry a port number. Here the port number '4040' follows the ':' sign and consists of digits, so it is matched with '(:(\d+))'. Because not every URL includes a port number, the group is made optional with the '?' notation: '(:(\d+))?'.
Meta characters Used:
- \d: Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
- +: One or more occurrences of the preceding character or group.
- ?: Zero or one occurrence of the preceding character or group.
Code:
Python3
# import library
import re
# url link
s = 'file://localhost:4040/zip_file'
# finding the protocol capture group
obj1 = re.findall(r'(\w+)://', s)
print(obj1)
# finding the hostname, which may
# contain dashes or dots
obj2 = re.findall(r'://([\w\-\.]+)', s)
print(obj2)
# finding the hostname together with
# the optional port number
obj3 = re.findall(r'://([\w\-\.]+)(:(\d+))?', s)
print(obj3)
Output:
['file']
['localhost']
[('localhost', ':4040', '4040')]
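The last result contains both ':4040' and '4040' because the outer and the inner group each capture text. As a sketch of one alternative, the outer group can be made non-capturing and the remaining groups can be named, so that the parts are read by name. The group names protocol, host and port are assumptions chosen for this illustration.

```python
import re

URL_RE = re.compile(
    r'(?P<protocol>\w+)://'   # protocol, e.g. "file"
    r'(?P<host>[\w\-\.]+)'    # hostname: word characters, dashes, dots
    r'(?::(?P<port>\d+))?'    # optional ":port"; only the digits are captured
)

samples = ('file://localhost:4040/zip_file', 'https://www.geeksforgeeks.org/')
results = [URL_RE.search(url).groupdict() for url in samples]
for parts in results:
    print(parts)
# {'protocol': 'file', 'host': 'localhost', 'port': '4040'}
# {'protocol': 'https', 'host': 'www.geeksforgeeks.org', 'port': None}
```

When the URL has no port, the port group does not take part in the match and is reported as None.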
Example 3: For a more general URL, the following pattern can be used; it also captures the elements of the path, here a file name and its extension.
Python3
# import library
import re
# url
s = 'http://www.example.com/index.html'
# searching for all capture groups
obj = re.findall(r'(\w+)://([\w\-\.]+)/(\w+)\.(\w+)', s)
print(obj)
Output:
[('http', 'www.example.com', 'index', 'html')]
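The pattern in Example 3 expects the path to be exactly one name followed by an extension, so it finds nothing for deeper paths. The sketch below contrasts it with a hypothetical looser variant that captures the whole path and then splits it into segments; the variable names and the sample URL are assumptions made for the illustration.

```python
import re

# Pattern from Example 3, written as a raw string with the dot escaped:
# it expects exactly "/name.ext" after the host.
strict = r'(\w+)://([\w\-\.]+)/(\w+)\.(\w+)'
# Hypothetical variant: capture everything after the host up to "?", "#"
# or whitespace as one optional path group.
loose = r'(\w+)://([\w\-\.]+)(/[^?#\s]*)?'

url = 'http://www.example.com/docs/guide/index.html'

strict_result = re.findall(strict, url)
print(strict_result)  # []

loose_result = re.findall(loose, url)
print(loose_result)   # [('http', 'www.example.com', '/docs/guide/index.html')]

# Split the captured path into its non-empty segments.
path = loose_result[0][2]
segments = [part for part in path.split('/') if part]
print(segments)       # ['docs', 'guide', 'index.html']
```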