Extracting text from PDF files with Python: A comprehensive guide
Introduction
Even though more and more machines nowadays come with OCR systems that identify the text in scanned documents, there are still documents that contain full pages in image format. You've probably noticed this when reading a great article and trying to select a sentence, only to select the whole page instead. This can be the result of a limitation in the specific OCR software, or of its complete absence. To avoid leaving that information undetected, in this article I tried to create a process that also considers these cases and takes the most out of our precious and information-rich PDFs.
With all these different types of PDF files in mind and the various
items that compose them, it’s important to perform an initial analysis
of the layout of the PDF to identify the proper tool needed for each
component. More specifically, based on the findings of this analysis,
we will apply the appropriate method for extracting text from the PDF,
whether it’s text rendered in a corpus block with its metadata, text
within images, or structured text within tables. In the scanned
document without OCR, the approach that identifies and extracts text
from images will perform all the heavy lifting. The output of this
process will be a Python dictionary containing information extracted
for each page of the PDF file. Each key in this dictionary will represent the page number of the document, and its corresponding value will be a list containing the following five nested lists:
The text extracted from each text block, line by line
The format of the text in each line, namely its font family and size
The text extracted from the images on the page
The text extracted from tables, converted into a structured string format
The full content of the page, combining all of the above in order of appearance
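To make the target structure concrete, here is a minimal sketch of what the output dictionary could look like (the key format and the placeholder values are hypothetical):
# Hypothetical illustration of the output structure; values are placeholders
text_per_page = {
    'Page_0': [
        ['A line of text extracted from the page\n'],  # text per line
        [['Helvetica', 11.0]],                         # font name and size per line
        ['Text recovered from an image via OCR'],      # text from images
        ['|Header 1|Header 2|\n|value 1|value 2|'],    # tables as strings
        ['A line of text extracted from the page\n'],  # full page content, in order
    ],
}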
To do this, we will use the following libraries, among others:
Pdfminer: to perform the layout analysis and extract text and format from the PDF (the .six version of the library is the one that supports Python 3).
Pdfplumber: to identify tables on a page and extract their content.
Pytesseract: a wrapper for the Tesseract OCR engine, to read text from images.
If you are a Mac user, you can install the Tesseract OCR engine on your machine through Brew from your terminal, and you are good to go. For Windows users, you can follow the steps to install it from the link. Then, when you download and install the software, you need to add its executable path to the Environment Variables on your computer.
Alternatively, you can set the path directly in your Python script with the following code:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
Lastly, we will import all the libraries at the beginning of our script.
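A plausible import block is sketched below. pdfminer.six, pdfplumber and pytesseract are named in this guide; pdf2image, PIL and os are my assumptions for the image-handling and clean-up steps described later:
# To perform the layout analysis and extract text from the PDF
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# To extract text from tables
import pdfplumber
# To convert PDF pages to images (assumed dependency for the OCR step)
from pdf2image import convert_from_path
from PIL import Image
# To perform OCR and extract text from images
import pytesseract
# To remove the additional files created during the process (assumed)
import os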
The script separates the individual pages of the PDF file using the high-level function extract_pages() and converts them into LTPage objects. Then, for each LTPage object, it iterates over each element from top to bottom and tries to identify the appropriate component as either:
LTFigure, which represents an area of the PDF that can contain figures or images embedded as another PDF object in the page.
LTTextContainer, which represents a group of text lines in a rectangular area and is analysed further into a list of LTTextLine objects. Each of these represents a list of LTChar objects, which store the single characters of text along with their metadata.
LTRect, which represents a 2-dimensional rectangle that can be used to frame images and figures, or to create tables in an LTPage object.
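As a minimal sketch of this layout pass (the file name is hypothetical and the branches are stubs to be filled in by the functions discussed next):
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTFigure, LTRect

for pagenum, page in enumerate(extract_pages('example.pdf')):
    # Iterate over the elements of each LTPage object
    for element in page:
        if isinstance(element, LTTextContainer):
            pass  # a block of text lines: extract text and format here
        elif isinstance(element, LTFigure):
            pass  # an embedded figure or image: route to the OCR path
        elif isinstance(element, LTRect):
            pass  # a rectangle: potentially part of a table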
def text_extraction(element):
    # Extract the text from the in-line text element
    line_text = element.get_text()

    # Collect all the formats that appear in the line of text
    line_formats = []
    for text_line in element:
        if isinstance(text_line, LTTextContainer):
            # Iterate through each character in the line of text
            for character in text_line:
                if isinstance(character, LTChar):
                    # Append the font name of the character
                    line_formats.append(character.fontname)
                    # Append the font size of the character
                    line_formats.append(character.size)
    # Find the unique font sizes and names in the line
    format_per_line = list(set(line_formats))

    # Return a tuple with the text of the line and its formats
    return (line_text, format_per_line)
As a result, this process returns the text from the images, which we
then save in a third list within the output dictionary. This list contains
the textual information extracted from the images on the examined
page.
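The functions implementing that process are not shown here, but a minimal sketch could look like the following, assuming pdf2image and pytesseract; the helper name is hypothetical and, for simplicity, it runs OCR on the whole page containing the figure rather than on the cropped figure itself:
from pdf2image import convert_from_path
import pytesseract

def image_to_text(pdf_path, page_num):
    # Render the examined page of the PDF to a PIL image
    images = convert_from_path(pdf_path, first_page=page_num + 1,
                               last_page=page_num + 1)
    # Run Tesseract OCR on the rendered page and return the recognised text
    return pytesseract.image_to_string(images[0])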
Although there are several libraries used to extract table data from
PDFs, with Tabula-py being one of the most well-known, we have
identified certain limitations in their functionality.
The most glaring one, in our opinion, comes from the way the library identifies the different rows of a table: it relies on the line-break special character \n in the table's text. This works pretty well in most cases, but it fails when the text of a cell is wrapped onto two or more lines, leading to the addition of unnecessary empty rows and to losing the context of the extracted cell.
We saw this behaviour when we tried to extract the data from a table using tabula-py.
For this reason, to tackle this task we used the pdfplumber library, for various reasons. Firstly, it is built on pdfminer.six, which we used for our preliminary analysis, meaning that it contains similar objects. In addition, its approach to table detection is based on line elements and their intersections, which construct the cells that contain the text and then the table itself. That way, after we identify a cell of a table, we can extract just the content inside it, without caring about how many lines it needed to be rendered in. Then, when we have the contents of a table, we format it into a table-like string and store it in the appropriate list, in three steps (sketched in code after the list):
1. We iterate through each nested list and clean its content from any unwanted line breaks coming from wrapped text.
2. We join each element of the row by separating them using the |
symbol to create the structure of a table’s cell.
3. Finally, we add a line break at the end to move to the next row.
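A minimal sketch of these two pieces, assuming pdfplumber's extract_tables() API; the function names are hypothetical:
import pdfplumber

def extract_table(pdf_path, page_num, table_num):
    # Open the PDF, locate the examined page and pull out the requested table;
    # extract_tables() returns each table as a list of rows of cell strings
    with pdfplumber.open(pdf_path) as pdf:
        table_page = pdf.pages[page_num]
        return table_page.extract_tables()[table_num]

def table_converter(table):
    # Convert the extracted table into a table-like string,
    # following the three steps above
    table_string = ''
    for row in table:
        # 1. Clean each cell from unwanted line breaks left by wrapped text
        cleaned_row = [item.replace('\n', ' ') if item is not None else 'None'
                       for item in row]
        # 2. Join the cells of the row with the | symbol
        # 3. Add a line break at the end to move to the next row
        table_string += '|' + '|'.join(cleaned_row) + '|\n'
    # Remove the trailing line break
    return table_string[:-1]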
This will result in a string of text that presents the content of the table without losing the granularity of the data in it.
Now that we have all the components of the code ready, let's combine them into a fully functional script. You can copy the code from here, or you can find it along with the example PDF in my Github repo here.
Examine whether there are any tables on the page and create a list of them using pdfplumber.
Find all the elements nested in the page and sort them as they appear in its layout.
1. Find the bounding box of the table, in order not to extract its text again with the text_extraction() function.
2. Extract the content of the table and convert it into a string.
3. Then add a boolean flag to indicate that the text was extracted from a table.
4. This process will finish after the last LTRect that falls into the bounding box of the table, when the next element in the layout is not a rectangular object (all the other objects that compose the table will be skipped).
All the lists will be stored under a dictionary key that represents the number of the page examined each time.
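Putting these steps together, a condensed sketch of the main loop might look like this; it relies on the hypothetical helpers sketched above (text_extraction, image_to_text, extract_table, table_converter) and simplifies the table bookkeeping to a single table per page:
import pdfplumber
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTFigure, LTRect

pdf_path = 'example.pdf'  # hypothetical file name
text_per_page = {}
pdf = pdfplumber.open(pdf_path)

for pagenum, page in enumerate(extract_pages(pdf_path)):
    # The five nested lists collected for this page
    page_text, line_format, text_from_images = [], [], []
    text_from_tables, page_content = [], []

    # Examine whether the page contains tables
    tables = pdf.pages[pagenum].find_tables()
    table_extracted = False

    # Sort the elements as they appear in the layout (top to bottom)
    page_elements = sorted(page, key=lambda element: element.y1, reverse=True)

    for element in page_elements:
        if isinstance(element, LTTextContainer):
            # Extract the text and the format of each line of the block
            line_text, format_per_line = text_extraction(element)
            page_text.append(line_text)
            line_format.append(format_per_line)
            page_content.append(line_text)
        elif isinstance(element, LTFigure):
            # Extract the text hidden in the figure with OCR
            image_text = image_to_text(pdf_path, pagenum)
            text_from_images.append(image_text)
            page_content.append(image_text)
        elif isinstance(element, LTRect) and tables and not table_extracted:
            # Extract the first table once; its remaining rectangles are skipped
            table_string = table_converter(extract_table(pdf_path, pagenum, 0))
            text_from_tables.append(table_string)
            page_content.append(table_string)
            table_extracted = True

    # Store the five lists under the key of the examined page
    text_per_page['Page_' + str(pagenum)] = [
        page_text, line_format, text_from_images, text_from_tables, page_content]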
Afterwards, we will close the PDF file.
Then we will delete all the additional files created during the process.
Lastly, we can display the content of the page by joining the elements
of the page_content list.
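For example, using the key format from the sketches above:
# Join all the ordered elements of the first page into a single string
result = ''.join(text_per_page['Page_0'][4])
print(result)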
Conclusion
This was one approach that, I believe, uses the best characteristics of many libraries and makes the process resilient to the various types of PDFs and elements we can encounter, with PDFMiner, however, doing most of the heavy lifting. Also, the information regarding the format of the text can help us identify potential titles that separate the text into distinct logical sections, rather than just content per page, and can help us spot the text of greater importance.
However, there will always be more efficient ways to do this task, and even though I believe this approach is more inclusive, I am really looking forward to discussing new and better ways of tackling this problem with you.