Python might be your best PDF data extractor
A step-by-step guide to getting the most out of lengthy data reports, within seconds
TLDR: Extracting data from PDFs is a necessary first step for quantitatively processing reporting documents in finance and sustainability. It can be challenging, however, due to complex document layouts, different encoding standards, and noisy data. We give an overview of the major PDF extraction tools in Python, with use cases covering text, tables, and images. We also demonstrate real-life examples on a couple of financial and sustainability-related reports. Finally, we discuss how Python stacks up against other programming languages and more manual approaches.
Portable Document Format files (PDFs) have been floating around in the digital world since their inception by Adobe in the early 1990s. Designed to preserve formatting across different devices, PDFs quickly became the go-to format for sharing everything from contracts to annual reports and complex financial documents.
In finance, legal services, and many (if not all) other sectors, PDFs have remained a mainstay to this day. Anyone can open a PDF, and it always displays the same way, no matter which reader is being used. This is an advantage for files that should not change, unlike, say, editable Word or PowerPoint files.
One disadvantage of PDFs is that they are meant for human eyes. In other words, if you want to process a 400-page report, you might first need to open it manually and scroll through to the relevant sections yourself. This is a problem when working with large volumes of data stored in PDFs.
Training chatbots on such large files remains challenging, not to mention energy-consuming. Even when you succeed, state-of-the-art chatbots give unreliable answers at best when queried about the contents. Fine-tuning such chatbots to the type of data in your PDFs only gets you so far, too. (We know because we have tried — at length.)
Python, on the other hand, comes with a whole Swiss army knife’s worth of libraries to deal with different PDFs. As we will see in this piece, it is not perfect all the time either. It does come pretty close, though. Compared with manual extraction, which we relied on in the early days of Wangari, we’re looking at time savings of roughly 90 to 95 percent.
Because no PDF is the same as another, it pays to figure out which of Python’s libraries suits which type of data. We therefore present a quick overview of the most popular libraries below. We then proceed to a couple of examples that illustrate how one can use some of these libraries to extract data in seconds once the code is written. Finally, we compare Python’s tools to those available in some other programming languages and to more manual approaches, before wrapping up with a conclusion.
A rundown of available PDF extraction tools in Python
Overall, the available tools can be classified as lightweight tools (e.g., Slate, PyPDF2), advanced extraction tools (e.g., pdfplumber, pdfminer.six), OCR-focused tools (e.g., pytesseract), and libraries for PDF manipulation (e.g., pikepdf, or PDFBox, which is a Java library from Apache that can be called from Python through wrapper packages). OCR is industry lingo for Optical Character Recognition, and will come up again in this article.
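To make the split between text-extraction tools and OCR tools a little more concrete, here is a minimal sketch of how one might decide which category a given PDF calls for. It reads each page's text layer with pdfplumber and then applies a simple heuristic. The function names `extract_page_texts` and `needs_ocr`, as well as the threshold of 20 characters per page, are our own assumptions for illustration, not part of any library.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the text layer of each page, read with pdfplumber."""
    # Third-party dependency, imported lazily: pip install pdfplumber
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can come back empty (or None) for pages
        # that have no text layer, so we normalise to "".
        return [page.extract_text() or "" for page in pdf.pages]


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is scanned and therefore needs an OCR tool.

    Heuristic (an assumption, not a rule): if more than half of the
    pages contain almost no extractable text, treat the file as scanned.
    """
    if not page_texts:
        return True
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return sparse_pages > len(page_texts) / 2
```

In practice, one could call `needs_ocr(extract_page_texts("report.pdf"))` on a file of one's own (the file name here is hypothetical) and route the document to an OCR-focused tool such as pytesseract only when the result is `True`. The threshold will likely need tuning for reports with many chart-only pages.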