Report
We have chosen Pdfminer.six as the system we plan to analyse and reengineer. It is a
community-maintained fork, started around 8 years ago, of Pdfminer, which was
originally created by Yusuke Shinyama.
As the name suggests, Pdfminer was originally intended to provide an easy way to
extract, or ‘mine’, data from PDFs, including images, text and metadata.
It could also convert PDFs into other formats such as HTML and XML, which are often
more useful and much easier for users to manipulate.
Since 2020, the original pdfminer, ADD LINK HERE, has no longer been actively
maintained, which led to the birth of pdfminer.six. Pdfminer.six is a fork of pdfminer:
essentially, a copy of a project on which other developers begin their own development.
Forks are beneficial because they allow software to keep receiving updates, for
example security patches and new features, even though the original creator is no
longer working on the application.
The original goal of pdfminer.six was to add support for Python 3. This was done with the six package,
which helps write code compatible with both Python 2 and Python 3.
The name pdfminer.six thus stems from a combination of the original pdfminer package
and the six package.
We chose pdfminer.six because we found its history and lifecycle interesting and
because it is a functional tool. We also thought that, because it is community-maintained,
it is likely to contain different code styles and practices which we can analyse and learn
from, to help us both reengineer the system and improve our own skills as developers.
Pdfminer.six does not have a GUI. Instead, it can be used from the command line or
called as a library from Python code.
At a high level, the system works by reading, or ‘parsing’, the pdf, then breaking it down
into basic objects. It also constructs a hierarchical structure of the pdf, containing
elements such as pages, text and images.
Developers can customise the extraction process by modifying the layout analysis
parameters or by extending the tool's capabilities through custom parsing logic, making
it adaptable to a wide range of PDF parsing needs.
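As a minimal sketch of this kind of customisation, the snippet below adjusts two layout
analysis parameters through the library's LAParams class and passes them to
extract_text. The wrapper name extract_with_layout and its default values are our own
illustrative assumptions, and the imports are deferred so that the definition loads even
where pdfminer.six is not installed.

```python
def extract_with_layout(pdf_path, line_margin=0.5, char_margin=2.0):
    """Extract text from a PDF using custom layout analysis parameters.

    The wrapper and its defaults are illustrative; only extract_text and
    LAParams come from pdfminer.six itself.
    """
    # Deferred imports: the function can be defined without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # line_margin controls how close lines must be to be grouped into one text box;
    # char_margin controls how close characters must be to be grouped into one line.
    laparams = LAParams(line_margin=line_margin, char_margin=char_margin)
    return extract_text(str(pdf_path), laparams=laparams)
```

For example, extract_with_layout("example.pdf", line_margin=0.3) would apply a tighter
line margin, where example.pdf is a placeholder file name.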
We have chosen this commit id: 9cc4d1ddc615fddc5901ead63d11fdf3142f5499
Understanding pdfminer
Evolution of pdfminer
In recent development the pdfminer repository has stayed largely the same. Almost
all recent changes were confined to two files, cmapdb and pdffont, which were changed
often to fix minor bugs that caused incorrect fonts and character sizes when switching
between Unicode and ASCII character maps. Most of these bugs have been fixed, but
issues with the character mapping remain, and they are closely linked to the
LTPage -> LTTextBox -> LTTextLine -> LTChar pipeline that separates PDFs into
individual elements. The remaining bugs most commonly relate to how characters are
represented and to the byte buffer created by the cmapdb file. This suggests that both
files are closely linked to the most commonly found bugs in the code, which makes
them a good starting point for any reengineering work.
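A change-frequency analysis of this kind can be sketched as follows. The function counts
how many commits touched each file, given the text printed by
git log --name-only --pretty=format: for the repository. The function name and the
sample log are hypothetical and invented for illustration; they are not the data behind
the findings above.

```python
from collections import Counter


def count_file_changes(log_output):
    """Count how many commits touched each file.

    Expects the text produced by: git log --name-only --pretty=format:
    which lists the changed paths of each commit, separated by blank lines.
    """
    changes = Counter()
    for line in log_output.splitlines():
        path = line.strip()
        if path:  # skip the blank separators between commits
            changes[path] += 1
    return changes


# Hypothetical log excerpt, invented for illustration only.
sample_log = """pdfminer/cmapdb.py
pdfminer/pdffont.py

pdfminer/cmapdb.py

pdfminer/converter.py
pdfminer/cmapdb.py
"""

churn = count_file_changes(sample_log)
for path, commits in churn.most_common(2):
    print(path, commits)
```

Sorting files by this count gives a simple ranking of where development effort, and
therefore bug fixing, has been concentrated.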
In earlier iterations of the code, the converter, pdffont, encodingdb, high_level and
pdfparser files were the most heavily worked on. Many of these files have not been
updated for as long as 2 years, which could mean that they need refactoring or, in
more extreme cases, reengineering, because they may not follow current best
practices. This would be difficult in some cases, as many of these files appear bloated:
some contain thousands of lines of code and large numbers of functions that could
potentially be decoupled into more cohesive classes. They also contain some of the
most frequently called functions, so any changes would have to be made carefully, as
these functions are particularly important to the operation of the code. Weaknesses in
them may have a butterfly effect that makes bug fixes and new features more difficult
to deploy and introduces new bugs. These areas are most in need of redesign because
of how tightly they are coupled with other classes and how integral they are to the
function of the software.
Lastly, as it stands there are 21 known issues with the code, most of which relate
directly to character conversion or to errors in parsing and converting images and
words. The root cause of these bugs should be analysed more closely; however, this
repository analysis gives us a reasonable starting point.
Current system and code smells
To aid our understanding of the system’s domain and direct our search, we first
identified its most important concepts before analysing its design. We extracted these
automatically from Pdfminer’s documentation using the ‘Natural Language Toolkit’ and
visualised them as a word cloud, shown in Figure 1, using ‘matplotlib’.
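The idea behind this step can be sketched with the standard library alone. The snippet
below is a simplified stand-in for the NLTK-based analysis, not the script we ran: it
counts word frequencies after dropping a handful of stop words. The stop-word list, the
function name and the sample text are all illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; NLTK ships a far larger one.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "into", "is", "of", "the", "to"}


def top_terms(text, n=5):
    """Return the n most frequent non-stop-word terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(n)


# Hypothetical stand-in for the documentation text.
sample_doc = (
    "Layout analysis groups characters into lines. "
    "Lines are grouped into boxes. Characters in the layout are objects."
)

print(top_terms(sample_doc, n=3))
```

Frequency pairs of this kind are the sort of input from which a word cloud can be drawn,
with more frequent terms shown larger.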
The word cloud shows that text elements such as characters, fields and non-breaking
spaces (nbsp) are important. It also highlights the related concepts ‘LTChar’ and
‘LTAnno’. Searching for these keywords in the source code reveals that they are classes
representing letters in a PDF document and that ‘LT’ stands for layout. Pdfminer’s
documentation explains that layout analysis works hierarchically, grouping characters
into LTTextLines and then into an LTTextBox, as part of an LTPage [1].
[1] https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html
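To illustrate the hierarchy described above, the sketch below walks a nested layout
structure depth first. It relies on the assumption that container elements can be
iterated over to reach their children while leaf elements such as characters cannot.
The Leaf and Container classes are hypothetical stand-ins so that the example runs
without a PDF; they are not pdfminer.six classes.

```python
from collections.abc import Iterable


def walk(element, depth=0):
    """Yield (depth, element) for an element and everything nested inside it."""
    yield depth, element
    # Container elements are iterable over their children; leaf elements are not.
    if isinstance(element, Iterable):
        for child in element:
            yield from walk(child, depth + 1)


# Hypothetical stand-ins for layout classes, used so the sketch runs without a PDF.
class Leaf:
    def __init__(self, name):
        self.name = name


class Container(list):
    def __init__(self, name, children=()):
        super().__init__(children)
        self.name = name


page = Container("page", [
    Container("text box", [
        Container("text line", [Leaf("char H"), Leaf("char i")]),
    ]),
])

for depth, element in walk(page):
    print("  " * depth + element.name)
```

With pdfminer.six installed, the same function could in principle be applied to each
page returned by extract_pages from pdfminer.high_level, under the assumption stated
above.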
Figure 2: class diagram of the layout module
The class diagram of the layout module, generated automatically by PyCharm and
shown in Figure 2, reveals a complex inheritance tree. LTTextLine, for example, has a
high inheritance depth: it has six ancestor classes, and a change in behaviour in any of
them will propagate down to it.
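Inheritance depth can also be measured programmatically rather than read from a
diagram. The sketch below lists every class that a given class inherits from, using
Python's method resolution order. The ancestors helper and the four-class hierarchy are
hypothetical and invented to illustrate the measurement.

```python
import inspect


def ancestors(cls):
    """Return the classes cls inherits from, excluding cls itself and object."""
    return [base for base in inspect.getmro(cls) if base not in (cls, object)]


# Hypothetical hierarchy, invented to illustrate the measurement.
class Item:
    pass


class Component(Item):
    pass


class Container(Component):
    pass


class TextLine(Container):
    pass


print([base.__name__ for base in ancestors(TextLine)])
```

Applying the same helper to pdfminer.layout.LTTextLine is one way the count read from
the diagram could be checked.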
Figure 3: class diagram of the converter module
Elsewhere, the ‘converter’ module, which is also visualised as a class diagram in
Figure 3, has a number of code smells. For instance, the method
PDFLayoutAnalyzer.paint_path is very long and complex [2]. It is 116 LOC (lines of code)
long and has a high cyclomatic complexity score of 15. It also has long paragraph-like
comments, which imply that the code is not intuitive enough, likely because it is doing
too much (a violation of the single responsibility principle) and perhaps also because
its variables are poorly named. This makes the code hard to read and maintain.
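Cyclomatic complexity counts the independent paths through a piece of code, which is
roughly one plus the number of decision points. The sketch below approximates it with
Python's ast module. It is a simplified count that only recognises the node types listed,
so dedicated tools may report different scores, and the sample function is hypothetical
rather than taken from pdfminer.six.

```python
import ast

# Node types that each add one decision point in this simplified count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths, one per additional operand.
            score += len(node.values) - 1
    return score


# Hypothetical function, not taken from pdfminer.six.
sample = '''
def classify(points, closed):
    if not points:
        return "empty"
    for p in points:
        if p < 0 and closed:
            return "invalid"
    return "ok"
'''

print(approx_complexity(sample))
```

Each additional branch or boolean condition adds one to the score, which is why a long
method with many nested conditions quickly reaches double figures.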
We also used the Python package Vulture as part of our static analysis. This package
looks for dead code - code which is never executed by the program and is therefore
redundant. The removal of dead code has many benefits, including making code easier
to read, understand and maintain. Tests can also focus on code which will actually be
executed rather than testing redundant code.
[2] pdfminer.six/converter.py line 102
Vulture finds unused attributes, classes, methods, properties and variables. It also
provides a confidence score, with 100% indicating complete certainty that the code is
not executed. We used this score to improve our understanding of the code base,
checking by hand whether flagged code is in fact used whenever the confidence score
is less than 100%.
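The general idea of dead-code detection can be illustrated with a deliberately naive
sketch. It is not how Vulture is implemented: it only looks at a single source string and
reports functions whose names are never referenced in it. The function name and sample
source are hypothetical.

```python
import ast


def unused_functions(source):
    """Return names of functions that are defined but never referenced by name."""
    tree = ast.parse(source)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            used.add(node.id)      # plain references such as helper()
        elif isinstance(node, ast.Attribute):
            used.add(node.attr)    # attribute references such as obj.helper()
    return sorted(defined - used)


# Hypothetical module source, invented for illustration.
sample = '''
def used_helper():
    return 1

def forgotten_helper():
    return 2

print(used_helper())
'''

print(unused_functions(sample))
```

A name-based check like this cannot see functions that are called dynamically or from
other files, which is one reason why findings below 100% confidence need a manual check.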
It is notable that there are unused classes in the code, which we plan to remove.
Removing these along with unused methods will make it easier for future developers to
extend the application.