Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
25 views

Extracting text from PDF files with Python_ A comprehensive guide - Modo leitor

Extracting text from PDF files with Python

Uploaded by

Régis Vargas
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views

Extracting text from PDF files with Python_ A comprehensive guide - Modo leitor

Extracting text from PDF files with Python

Uploaded by

Régis Vargas
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Personalizar aparência

Extracting text from PDF files with Python: A


comprehensive guide

Photo by Giorgio Trovato on Unsplash

Introduction

In the age of Large Language Models (LLMs) and their wide-ranging


applications, from simple text summarisation and translation to
predicting stock performance based on sentiment and financial report
topics, the importance of text data has never been greater.

There are many types of documents that share this kind of


unstructured information, from web articles and blog posts to
handwritten letters and poems. However, a significant portion of this
text data is stored and transferred in PDF format. More specifically, it
has been found that over 2 billion PDFs are opened in Outlook each
year, while 73 million new PDF files are saved in Google Drive and
email daily (2).

Developing, therefore, a more systematic way to process these


documents and extract information from them would give us the
ability to have an automated flow and better understand and utilise
this vast volume of textual data. And for this task, of course, our best
friend could be none other than Python.

However, before we start our process, we need to specify the different


types of PDFs that are around these days, and more specifically, the
three most frequently appearing:

1. Programmatically generated PDFs: These PDFs are created on a


computer using either W3C technologies such as HTML, CSS, and
Javascript or another software like Adobe Acrobat. This type of file
can contain various components, such as images, text, and links,
which are all searchable and easy to edit.
2. Traditional scanned documents: These PDFs are created from non-
electronic mediums through a scanner machine or a mobile app.
These files are nothing more than a collection of images stored
together in a PDF file. Saying that, the elements appearing in these
images, like the text, or links can’t be selected or searched.
Essentially, the PDF serves as a container for these images.
3. Scanned documents with OCR: In this case, Optical Character
Recognition (OCR) software is employed after scanning the
document to identify the text within each image in the file,
converting it into searchable and editable text. Then the software
adds a layer with the actual text to the image, and that way you can
select it as a separate component when browsing the file. (3)

Even though nowadays more and more machines have OCR systems
installed in them that identify the text from scanned documents, there
are still documents that contain full pages in an image format. You’ve
probably seen that when you read a great article and try to select a
sentence, but instead you select the whole page. This can be a result
of a limitation in the specific OCR machine or its complete absence.
That way, in order not to leave this information undetected in this
article, I tried to create a process that also considers these cases and
takes the most out of our precious and information-rich PDFs.

The Theoretical Approach

With all these different types of PDF files in mind and the various
items that compose them, it’s important to perform an initial analysis
of the layout of the PDF to identify the proper tool needed for each
component. More specifically, based on the findings of this analysis,
we will apply the appropriate method for extracting text from the PDF,
whether it’s text rendered in a corpus block with its metadata, text
within images, or structured text within tables. In the scanned
document without OCR, the approach that identifies and extracts text
from images will perform all the heavy lifting. The output of this
process will be a Python dictionary containing information extracted
for each page of the PDF file. Each key in this dictionary will present
the page number of the document, and its corresponding value will be
a list with the following 5 nested lists containing:

1. The text extracted per text block of the corpus


2. The format of the text in each text block in terms of font family and
size
3. The text extracted from the images on the page
4. The text extracted from tables in a structured format
5. The complete text content of the page
Image by the author

That way, we can achieve a more logical separation of the extracted


text per source component, and it can sometimes help us to more
easily retrieve information that usually appears in the specific
component (e.g., the company name in a logo image). In addition, the
metadata extracted from the text, like the font family and size, can be
used to easily identify text headers or highlighted text of greater
importance that will help us further separate or post-process the text
in multiple different chunks. Lastly, retaining the structured table
information in a way that an LLM can understand will enhance
significantly the quality of inferences made about relationships within
the extracted data. Then these results can be composed as an output
the all the textual information that appeared on each page.

You can see a flowchart of this approach in the images below.


Image by the author

Installation of all the necessary libraries

Before we start this project, though, we should install the necessary


libraries. We assume that you have Python 3.10 or above installed on
your machine. Otherwise, you can install it from here. Then let’s install
the following libraries:

PyPDF2: To read the PDF file from the repository path.

pip install PyPDF2

Pdfminer: To perform the layout analysis and extract text and format
from the PDF. (the .six version of the library is the one that supports
Python 3)

pip install pdfminer.six


Pdfplumber: To identify tables in a PDF page and extract the
information from them.

pip install pdfplumber

Pdf2image: To convert the cropped PDF image to a PNG image.

pip install pdf2image

PIL: To read the PNG image.

pip install Pillow

Pytesseract: To extract the text from the images using OCR


technology

This is a little trickier to install because first, you need to install


Google Tesseract OCR, which is an OCR machine based on an LSTM
model to identify line recognition and character patterns.

You can install this on your machine if you are a Mac user through
Brew from your terminal, and you are good to go.

brew install tesseract

For Windows users, you can follow these steps to install the link.
Then, when you download and install the software, you need to add
their executable paths to Environment Variables on your computer.
Alternatively, you can run the following commands to directly include
their paths in the Python script using the following code:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program
Files\Tesseract-OCR\tesseract.exe'

Then you can install the Python library

pip install pytesseract

Lastly, we will import all the libraries at the beginning of our script.

# To read the PDF


import PyPDF2
# To analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages,
extract_text
from pdfminer.layout import LTTextContainer, LTChar,
LTRect, LTFigure
# To extract text from tables in PDF
import pdfplumber
# To extract the images from the PDFs
from PIL import Image
from pdf2image import convert_from_path
# To perform OCR to extract text from images
import pytesseract
# To remove the additional created files
import os

So now we are all set. Let’s move to the fun part.

Document’s Layout Analysis with Python

Image by the author

For the preliminary analysis, we used the PDFMiner Python library to


separate the text from a document object into multiple page objects
and then break down and examine the layout of each page. PDF files
inherently lack structured information, such as paragraphs, sentences,
or words as seen by the human eye. Instead, they understand only the
individual characters of the text along with their position on the page.
That way, the PDFMiner tries to reconstruct the content of the page
into its individual characters along with their position in the file. Then,
by comparing the distances of those characters from others it
composes the appropriate words, sentences, lines, and paragraphs of
text. (4) To achieve that, the library:

Separates the individual pages from the PDF file using the high-level
function extract_pages() and converts them into LTPage objects.

Then for each LTPage object, it iterates from each element from top
to bottom and tries to identify the appropriate component as either:

LTFigure which represents the area of the PDF that can present
figures or images that have been embedded as another PDF
document in the page.
LTTextContainer which represents a group of text lines in a
rectangular area is then analysed further into a list of LTTextLine
objects. Each one of them represents a list of LTChar objects,
which store the single characters of text along with their metadata.
(5)
LTRect represents a 2-dimensional rectangle that can be used to
frame images, and figures or create tables in an LTPage object.

Therefore, based on this reconstruction of the page and the


classification of its elements either into LTFigure, which contains the
images or figures of the page, LTTextContainer, which represents the
textual information of the page, or LTRect, which will be a strong
indication of the presence of a table, we can apply the appropriate
function to better extract the information.

for pagenum, page in


enumerate(extract_pages(pdf_path)):

# Iterate the elements that composed a page


for element in page:

# Check if the element is a text element


if isinstance(element, LTTextContainer):
# Function to extract text from the text
block
pass
# Function to extract text format
pass

# Check the elements for images


if isinstance(element, LTFigure):
# Function to convert PDF to Image
pass
# Function to extract text with OCR
pass

# Check the elements for tables


if isinstance(element, LTRect):
# Function to extract table
pass
# Function to convert table content into a
string
pass

So now that we understand the analysis part of the process, let’s


create the functions needed to extract the text from each component.

Define the function to extract text from PDF

From here on, extracting text from a text container is really


straightforward.
# Create a function to extract text

def text_extraction(element):
# Extracting the text from the in-line text
element
line_text = element.get_text()

# Find the formats of the text


# Initialize the list with all the formats that
appeared in the line of text
line_formats = []
for text_line in element:
if isinstance(text_line, LTTextContainer):
# Iterating through each character in the
line of text
for character in text_line:
if isinstance(character, LTChar):
# Append the font name of the
character

line_formats.append(character.fontname)
# Append the font size of the
character

line_formats.append(character.size)
# Find the unique font sizes and names in the line
format_per_line = list(set(line_formats))

# Return a tuple with the text in each line along


with its format
return (line_text, format_per_line)

So to extract text from a text container, we simply use the get_text()


method of the LTTextContainer element. This method retrieves all the
characters that make up the words within the specific corpus box,
storing the output in a list of text data. Each element in this list
represents the raw textual information contained in the container.

Now, to identify this text’s format, we iterate through the


LTTextContainer object to access each text line of this corpus
individually. In each iteration, a new LTTextLine object is created,
representing a line of text in this chunk of corpus. We then examine
whether the nested line element contains text. If it does, we access
each individual character element as LTChar, which contains all the
metadata for that character. From this metadata, we extract two types
of formats and store them in a separate list, positioned
correspondingly to the examined text:
The font family of the characters, including whether the character
is in bold or italic format
The font size for the character

Generally, characters within a specific chunk of text tend to have


consistent formatting unless some are highlighted in bold. To
facilitate further analysis, we capture the unique values of text
formatting for all characters within the text and store them in the
appropriate list.

Image by the author

Define the function to extract text from


Images

Here I believe it is a more tricky part.

How to handle text in images found in PDF?

Firstly, we need to establish here that image elements stored in PDFs


are not in a different format from the file, such as JPEG or PNG. That
way in order to apply OCR software on them we need first to separate
them from the file and then convert them into an image format.

# Create a function to crop the image elements from


PDFs
def crop_image(element, pageObj):
# Get the coordinates to crop the image from the
PDF
[image_left, image_top, image_right, image_bottom]
= [element.x0,element.y0,element.x1,element.y1]
# Crop the page using coordinates (left, bottom,
right, top)
pageObj.mediabox.lower_left = (image_left,
image_bottom)
pageObj.mediabox.upper_right = (image_right,
image_top)
# Save the cropped page to a new PDF
cropped_pdf_writer = PyPDF2.PdfWriter()
cropped_pdf_writer.add_page(pageObj)
# Save the cropped PDF to a new file
with open('cropped_image.pdf', 'wb') as
cropped_pdf_file:
cropped_pdf_writer.write(cropped_pdf_file)
# Create a function to convert the PDF to images
def convert_to_images(input_file,):
images = convert_from_path(input_file)
image = images[0]
output_file = "PDF_image.png"
image.save(output_file, "PNG")

# Create a function to read text from images


def image_to_text(image_path):
# Read the image
img = Image.open(image_path)
# Extract the text from the image
text = pytesseract.image_to_string(img)
return text

To achieve this, we follow the following process:

1. We use the metadata from the LTFigure object detected from


PDFMiner to crop the image box, utilising its coordinates in the
page layout. We then save it as a new PDF in our directory using
the PyPDF2 library.
2. Then we employ the convert_from_file() function from the
pdf2image library to convert all PDF files in the directory into a list
of images, saving them in PNG format.
3. Finally, now that we have our image files we read them in our script
using the Image package of the PIL module and implement the
image_to_string() function of pytesseract to extract text from the
images using the tesseract OCR engine.

As a result, this process returns the text from the images, which we
then save in a third list within the output dictionary. This list contains
the textual information extracted from the images on the examined
page.

Define the function to extract text from Tables

In this section, we will extract a more logically structured text from


tables on a PDF page. This is a slightly more complex task than
extracting text from a corpus because we need to take into account
the granularity of the information and the relationships formed
between data points presented in a table.

Although there are several libraries used to extract table data from
PDFs, with Tabula-py being one of the most well-known, we have
identified certain limitations in their functionality.

The most glaring one in our opinion comes from the way that the
library identifies the different rows of the table using the line-break
special character \n in the table’s text. This works pretty well in most
of the cases but it fails to capture correctly when the text in a cell is
wrapped into 2 or more rows, leading to the addition of unnecessary
empty rows and losing the context of the extracted cell.

You can see the example below when we tried to extract the data
from a table using tabula-py:

Image by the author

Then, the extracted information is outputted in a Pandas DataFrame


instead of a string. In most cases, this can be a desirable format but
in the case of transformers that take into account text, these results
need to be transformed before feeding into a model.

For this reason, to tackle this task we used the pdfplumber library for
various reasons. Firstly, it is built on pdfminer.six which we used for
our preliminary analysis, meaning that it contains similar objects. In
addition, its approach to table detection is based on line elements
along with their intersections that construct the cell that contains the
text and then the table itself. That way after we identify a cell of a
table, we can extract just the content inside the cell without carrying
how many rows needed to be rendered. Then when we have the
contents of a table, we will format it in a table-like string and store it in
the appropriate list.

# Extracting tables from the page

def extract_table(pdf_path, page_num, table_num):


# Open the pdf file
pdf = pdfplumber.open(pdf_path)
# Find the examined page
table_page = pdf.pages[page_num]
# Extract the appropriate table
table = table_page.extract_tables()[table_num]
return table

# Convert table into the appropriate format


def table_converter(table):
table_string = ''
# Iterate through each row of the table
for row_num in range(len(table)):
row = table[row_num]
# Remove the line breaker from the wrapped
texts
cleaned_row = [item.replace('\n', ' ') if item
is not None and '\n' in item else 'None' if item is
None else item for item in row]
# Convert the table into a string
table_string+=
('|'+'|'.join(cleaned_row)+'|'+'\n')
# Removing the last line break
table_string = table_string[:-1]
return table_string

To achieve that, we created two functions, extract_table() to extract


the contents of the table into a list of lists, and table_converter() to
join the contents of those lists in a table-like string.

In the extract_table() function:

1. We open the PDF file.


2. We navigate to the examined page of the PDF file.
3. From the list of tables found on the page by pdfplumber, we select
the desired one.
4. We extract the content of the table and output it in a list of nested
lists representing each row of the table.

In the table_converter() function:

1. We iterate in each nested list and clean its context from any
unwanted line breaks coming from any wrapped text.
2. We join each element of the row by separating them using the |
symbol to create the structure of a table’s cell.
3. Finally, we add a line break at the end to move to the next row.

This will result in a string of text that will present the content of the
table without losing the granularity of the data presented in it.

Adding all together

Now that we have all the components of the code ready let’s add
them all up to a fully functional code. You can copy the code from
here or you can find it along with the example PDF in my Github repo
here.

# Find the PDF path


pdf_path = 'OFFER 3.pdf'
# create a PDF file object
pdfFileObj = open(pdf_path, 'rb')
# create a PDF reader object
pdfReaded = PyPDF2.PdfReader(pdfFileObj)

# Create the dictionary to extract text from each


image
text_per_page = {}
# We extract the pages from the PDF
for pagenum, page in
enumerate(extract_pages(pdf_path)):

# Initialize the variables needed for the text


extraction from the page
pageObj = pdfReaded.pages[pagenum]
page_text = []
line_format = []
text_from_images = []
text_from_tables = []
page_content = []
# Initialize the number of the examined tables
table_num = 0
first_element= True
table_extraction_flag= False
# Open the pdf file
pdf = pdfplumber.open(pdf_path)
# Find the examined page
page_tables = pdf.pages[pagenum]
# Find the number of tables on the page
tables = page_tables.find_tables()

# Find all the elements


page_elements = [(element.y1, element) for element
in page._objs]
# Sort all the elements as they appear in the page
page_elements.sort(key=lambda a: a[0],
reverse=True)

# Find the elements that composed a page


for i,component in enumerate(page_elements):
# Extract the position of the top side of the
element in the PDF
pos= component[0]
# Extract the element of the page layout
element = component[1]

# Check if the element is a text element


if isinstance(element, LTTextContainer):
# Check if the text appeared in a table
if table_extraction_flag == False:
# Use the function to extract the text
and format for each text element
(line_text, format_per_line) =
text_extraction(element)
# Append the text of each line to the
page text
page_text.append(line_text)
# Append the format for each line
containing text
line_format.append(format_per_line)
page_content.append(line_text)
else:
# Omit the text that appeared in a
table
pass

# Check the elements for images


if isinstance(element, LTFigure):
# Crop the image from the PDF
crop_image(element, pageObj)
# Convert the cropped pdf to an image
convert_to_images('cropped_image.pdf')
# Extract the text from the image
image_text =
image_to_text('PDF_image.png')
text_from_images.append(image_text)
page_content.append(image_text)
# Add a placeholder in the text and format
lists
page_text.append('image')
line_format.append('image')

# Check the elements for tables


if isinstance(element, LTRect):
# If the first rectangular element
if first_element == True and (table_num+1)
<= len(tables):
# Find the bounding box of the table
lower_side = page.bbox[3] -
tables[table_num].bbox[3]
upper_side = element.y1
# Extract the information from the
table
table = extract_table(pdf_path,
pagenum, table_num)
# Convert the table information in
structured string format
table_string = table_converter(table)
# Append the table string into a list
text_from_tables.append(table_string)
page_content.append(table_string)
# Set the flag as True to avoid the
content again
table_extraction_flag = True
# Make it another element
first_element = False
# Add a placeholder in the text and
format lists
page_text.append('table')
line_format.append('table')

# Check if we already extracted the tables


from the page
if element.y0 >= lower_side and element.y1
<= upper_side:
pass
elif not isinstance(page_elements[i+1][1],
LTRect):
table_extraction_flag = False
first_element = True
table_num+=1

# Create the key of the dictionary


dctkey = 'Page_'+str(pagenum)
# Add the list of list as the value of the page
key
text_per_page[dctkey]= [page_text, line_format,
text_from_images,text_from_tables, page_content]

# Closing the pdf file object


pdfFileObj.close()

# Deleting the additional files created


os.remove('cropped_image.pdf')
os.remove('PDF_image.png')

# Display the content of the page


result = ''.join(text_per_page['Page_0'][4])
print(result)

The script above will:

Import the necessary libraries.

Open the PDF file using the pyPDF2 library.


Extract each page of the PDF and iterate the following steps.

Examine if there are any tables on the page and create a list of them
using pdfplumner.

Find all the elements nested in the page and sort them as they
appeared in its layout.

Then for each element:

Examine if it is a text container, and does not appear in a table


element. Then use the text_extraction() function to extract the text
along with its format, else pass this text.

Examine if it is an image, and use the crop_image() function to crop


the image component from the PDF, convert it into an image file using
the convert_to_images(), and extract text from it using OCR with the
image_to_text() function.

Examine if it is a rectangular element. In this case, we examine if the


first rect is part of a page’s table and if yes, we move to the following
steps:

1. Find the bounding box of the table in order not to extract its text
again with the text_extraction() function.
2. Extract the content of the table and convert it into a string.
3. Then add a boolean parameter to clarify that we extract text from
Table.
4. This process will finish after the last LTRect that falls into the
bounding box of the table and the next element in the layout is not
a rectangular object. (All the other objects that compose the table
will be passed)

The outputs of the process will be stored in 5 lists per iteration,


named:

1. page_text: contains the text coming from text containers in the


PDF (placeholder will be placed when the text was extracted from
another element)
2. line_format: contains the formats of the texts extracted above
(placeholder will be placed when the text was extracted from
another element)
3. text_from_images: contains the texts extracted from images on the
page
4. text_from_tables: contains the table-like string with the contents of
tables
5. page_content: contains all the text rendered on the page in a list of
elements

All the lists will be stored under the key in a dictionary that will
represent the number of the page examined each time.
Afterwards, we will close the PDF file.

Then we will delete all the additional files created during the process.

Lastly, we can display the content of the page by joining the elements
of the page_content list.

Conclusion

This was one approach that I believe uses the best characteristics of
many libraries and makes the process resilient to various types of
PDFs and elements that we can encounter, with PDFMiner however do
the most of the heavy lifting. Also, the information regarding the
format of the text can help us with the identification of potential titles
that can separate the text into distinct logical sections rather than just
content per page and can help us to identify the text of greater
importance.

However, there will always be more efficient ways to do this task and
even though I believe that this approach is more inclusive, I am really
looking forward to discussing with you new and better ways of
tackling this problem.

You might also like