Wangari Digest

Stop Copy-Pasting. Turn PDFs into Data in Seconds

Automate PDF extraction and get structured data instantly with Python's best tools

Ari Joury
Feb 21, 2025
Unstructured data can be a goldmine, but extracting its value can be hard without proper guidance. Image by Leonardo AI

If you’ve ever copy-pasted information from PDFs into another file, only to have it come out corrupted or full of errors, you know the pain. I’ve been facing this challenge daily because my company performs data analysis on public corporate data. This data is usually served in the form of PDFs. I’ve literally spent entire days copy-pasting columns and rows of relevant data tables to get what I need for my analyses.

After a few weeks in which I spent half my time manually copy-pasting, I got tired of this mind-numbing monotony. I decided to automate my processes.

The big advantage that I have over other people in my area of work—mostly consultants, financial analysts, and sustainability analysts—is that I’ve got solid coding skills. I’d been writing scientific code in Python and C++, among other languages, for almost a decade before founding my own company. So I decided to use my skills to get rid of the boring part of my work.

Despite my experience, I still faced challenges in automating my data extraction: PDFs come in different shapes and sizes, and a tool that works for one PDF won’t work for another. Luckily, there are tools for just about any PDF—but choosing the right one was trial-and-error for me in the beginning.

So, in addition to studying all the available tools, I also had to automate the choice of which tool to use. A hit-and-miss approach in my code was an improvement, but it still amounted to dull work.

You might be dealing with less varied data and PDF formats than I do. I therefore introduce each tool separately, since you might not need all of them. And in case your data is as varied as mine, I’ll conclude by showing you how to string these tools together and make your extraction seamless.

Choosing the Right Tool for the Job

So, before we get into the details of various Python packages for PDF extraction, let’s first discuss when to use which one. We’ll be covering five table extraction tools: Tabula, Camelot, pdfplumber, PyMuPDF, and Tesseract.

The first three—Tabula, Camelot, and pdfplumber—are good for structured tables in text-based PDFs. Text-based essentially means that you can select the text in a PDF viewer; this is not the case for scanned documents. The difference between the three is that Tabula is only good for simple tables. Camelot can handle more complex tables, but requires more fine-tuning. Of the three, pdfplumber can handle the most complex tables; however, it cannot detect tables automatically like the other two and does not always preserve the table structure well.

PyMuPDF is a more general-purpose tool that can not only extract tables but also raw text. It can be a good tool if you are also handling text-based data alongside tabular data, or if you want to quickly find out whether your document is text-based (it won’t work if it is a scanned document).

Tesseract is a pure Optical Character Recognition (OCR) tool. It is best for scanned documents. However, I’ve also been using it on text-based documents with atypical encodings that caused all other packages to fail. Because it is an OCR tool, it does not recognize tables and their structure automatically, and it can make many mistakes that later need to be cleaned up. That said, it’s a good backup option when the other packages don’t work for whatever reason.

The table below summarizes the key properties of each package. We’ll go through each of them and their usage in the next few sections of this piece.

Tool        Best for                            Detects tables automatically?  Notes
Tabula      Simple, text-based tables           Yes                            Requires Java
Camelot     More complex text-based tables      Yes                            Requires Ghostscript; needs fine-tuning
pdfplumber  The most complex text-based tables  No                             Also extracts text; manual table areas
PyMuPDF     Raw text (tables via cleanup)       No                             General-purpose; also images and metadata
Tesseract   Scanned documents (OCR)             No                             Output often needs cleanup

Tabula and Camelot for Structured, Text-Based Tables

Tabula

Tabula is the simplest tool on this list, and it only really works with simple tables. Tabula depends on Java, though, so make sure you have that installed before you try playing with it. After that, installing Tabula is as simple as typing in your command line:

pip install tabula-py

Here’s how to run it on an example PDF:

import tabula

pdf_path = "example.pdf"
dfs = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)

# Save all extracted tables to CSV
for i, df in enumerate(dfs):
    df.to_csv(f"table_tabula_{i}.csv", index=False)

print(dfs[0].head())  # Preview first table

The advantage of Tabula is that it is simple to execute and quick to run. It struggles, however, with tables that have merged cells or irregular formatting.

Camelot

Camelot is a viable alternative to Tabula because it doesn’t require Java to run and can handle somewhat more complex tables. On the other hand, Camelot requires Ghostscript to run. If it’s not installed yet, run:

brew install ghostscript  # Mac
sudo apt install ghostscript  # Linux

For Windows, one must download and install Ghostscript. To install Camelot, one then simply runs:

pip install camelot-py

Camelot supports two modes of running: Lattice mode is good for tables with visible borders. Stream mode is good when the tables lack visible borders.

Here’s how one would extract data using Camelot:

import camelot

pdf_path = "example.pdf"
tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")  # lattice mode
#tables = camelot.read_pdf(pdf_path, pages="all", flavor="stream")  # stream mode

# Save extracted tables to CSV
for i, table in enumerate(tables):
    table.df.to_csv(f"table_camelot_{i}.csv", index=False)

print(tables[0].df.head())  # Preview first table

Notice that the commented-out line above is for stream mode; the non-commented one is for lattice mode. You can adjust these as needed.

More Control Over Table Extraction With pdfplumber

If your tables are too complex for Tabula or Camelot, then pdfplumber is your friend. It extracts not only tables, but also text. You get more control over what is extracted because you can manually define table areas; that said, this requires more fine-tuning than a “just run it” approach.

One installs pdfplumber via:

pip install pdfplumber

Once this is done, one can proceed to extracting all tables from a PDF in the following fashion:

import pdfplumber
import pandas as pd

pdf_path = "example.pdf"

# Open the PDF
with pdfplumber.open(pdf_path) as pdf:
    all_tables = []
  
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            df = pd.DataFrame(table)
            all_tables.append(df)

# Save tables to CSV
for i, df in enumerate(all_tables):
    df.to_csv(f"table_pdfplumber_{i}.csv", index=False)

# Preview first table
print(all_tables[0].head())  

It’s worth noting that the function .extract_tables() in the snippet above does not automatically detect tables like Tabula or Camelot do. Instead, it extracts text based on how it is formatted on the page.

On the one hand, this means that you might have to help pdfplumber find the tables. On the other hand, it gives you more control over how complex or irregular tables are extracted. For example, you can manually specify a table area using bounding boxes with pdfplumber:

with pdfplumber.open("example.pdf") as pdf:
    first_page = pdf.pages[0]

    # Define a bounding box for the table (x0, top, x1, bottom)
    bbox = (50, 100, 500, 300)  # Adjust coordinates based on the PDF

    table = first_page.crop(bbox).extract_table()  # crop to the region, then extract
    df = pd.DataFrame(table)
    print(df)

If, in addition to tables, you would like to extract the accompanying text, this is easy too:

with pdfplumber.open("example.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)

In short, pdfplumber works well for more complex tables because it gives you more control over the extraction. It also extracts text, unlike Tabula and Camelot. On the other hand, it can be unnecessary work to use it with simple tables, given the customization that is often needed.
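Since pdfplumber returns each table as a plain list of rows, a small helper (the name `table_to_df` is my own) can promote the first row to a proper column header before saving, instead of calling pd.DataFrame(table) directly:

```python
import pandas as pd

def table_to_df(table):
    """Turn pdfplumber's list-of-rows table into a DataFrame,
    using the first row as the column header."""
    header, *rows = table
    return pd.DataFrame(rows, columns=header)
```

You would then call table_to_df(...) on each result of .extract_tables() in the snippets above.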

Extracting Text & Tables with PyMuPDF (fitz)

PyMuPDF cannot recognize tables as such. However, when other methods fail, it can be a way to extract the necessary data. One can then add the tabular structure afterwards during data cleaning. It can also extract images and metadata from PDFs, which makes it a good general-purpose PDF wrangler.

Its usage is quite simple. The installation is straightforward:

pip install pymupdf

Then, to extract all text from a PDF, here’s how you go about it:

import fitz  # PyMuPDF

pdf_path = "example.pdf"

# Open the PDF
doc = fitz.open(pdf_path)

# Extract text from all pages
full_text = "\n".join([page.get_text("text") for page in doc])

# Save text to a file
with open("extracted_text.txt", "w", encoding="utf-8") as f:
    f.write(full_text)

print(full_text[:500])  # Preview first 500 characters

This method extracts all selectable text from the PDF and preserves paragraph structure where possible. To preserve the spatial layout of the text—which is very useful for tables—one can replace get_text("text") with get_text("words"), which returns each word together with its coordinates on the page.

Extracting Tables from Scanned PDFs with OCR (Tesseract)

Sometimes, all the tools listed above fail. This can happen for two reasons: either the document is not text-based but scanned, or the document is text-based but uses an atypical encoding that Tabula & Co. cannot handle.

When this is the case, it is best to treat the PDF as an image and use OCR technology to read it. Tesseract is by far the most widespread tool for doing this. Installing it is easy; on Windows, it needs to be downloaded and installed manually.
