How to clean messy text data with Python's Regex

How to go from semi-structured text to usable data with Regex

Nov 01, 2024


TLDR: Extracting data from financial documents can be a cumbersome task (see previous post), and the results can be messy. In a second step, we must clean this output to make it usable for further algorithms. Python’s built-in re module, which implements regular expressions (Regex), is particularly well suited to this. We discuss the origins of Regex and its key functions in Python, and show you how to apply them to your own data cleaning tasks. Through selected code snippets, you'll learn practical Regex techniques to make your own data ready for analysis.

We can comb through messy data to find regular expressions using Python. Image generated with Leonardo AI

Consider this: You are tasked with analyzing numerical data from a lengthy PDF report consisting of text and tables. A colleague has already extracted the information using Optical Character Recognition (see last week’s post).

Unfortunately, instead of a structured dataset, you get a messy file: redundant headers, extraneous footnotes, and irregular line breaks. Numbers are inconsistently formatted, and data descriptors are scattered throughout, making any meaningful analysis nearly impossible without significant preprocessing. It looks like you will be facing hours of tedious data cleaning today.

Fortunately, you have stumbled upon Regex. Short for “regular expressions,” it is a powerful tool for pattern matching in text. It sounds simple, but the ability to define, search for, and manipulate specific patterns within text makes it an excellent tool for cutting through messy data.
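To make "define, search, and manipulate" concrete, here is a minimal sketch using Python's built-in re module. The sample line and the number pattern are made up for illustration; real OCR output will likely need a more careful pattern.

```python
import re

# Hypothetical line of OCR output; the label and figure are invented.
line = "Total  revenue:   1,234.50   EUR"

# Define: digits with optional thousands separators and an optional decimal part.
number_pattern = r"\d{1,3}(?:,\d{3})*(?:\.\d+)?"

# Search: find the first amount in the line and convert it to a float.
match = re.search(number_pattern, line)
amount = float(match.group().replace(",", ""))

# Manipulate: collapse runs of whitespace into single spaces.
tidy = re.sub(r"\s+", " ", line).strip()

print(amount)
print(tidy)
```

Note that re.search returns None when nothing matches, so production code should check the result before calling match.group().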

This piece provides a bit more background on Regex and how it is implemented in Python. It then digs deeper into the essential Regex features for data cleaning and walks through a hands-on example (one we faced very recently at Wangari) to illustrate how this works in practice. If you are facing similar challenges, this knowledge should spare you hours of cleaning work, especially if you work on such tasks repeatedly.

How Regex came into existence

The origins of Regex trace back to the concept of “regular events,” defined by mathematician Stephen Cole Kleene in the 1950s. Kleene developed it in the context of theoretical computer science, which at that time was emerging together with automata theory and early artificial neural networks (yes, by this we mean AI).

Regex gained traction a decade later when computer science legend Ken Thompson built Kleene’s notation into a program called QED in order to match patterns in text files. That is exactly what we will be using it for today, some 60 years later! Thompson is also known as a co-creator of UNIX, a classic operating system.

The elegance of Regex lies in the fact that complex pattern matching can be conducted with precision and minimal code. This is thanks to the abstract concepts that Kleene introduced.

Today, Regex is a mainstay in most programming languages. This includes but is not limited to Java, Perl, and Python.

Regex in Python

In Python, Regex can be accessed through the re module. Beyond its syntax — which can be hard to read at first sight but gets better as you get used to it — notable concepts are flags, functions, regular expression objects, and match objects. We will look into each here.
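As a first taste of these four concepts, the sketch below shows flags, functions, a regular expression object, and match objects working together. The sample text, labels, and group names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical OCR output; the labels and figures are invented.
text = """REVENUE: 1,200
Costs: 800
profit: 400"""

# Flags change how a pattern is interpreted: IGNORECASE matches any casing,
# MULTILINE lets ^ and $ anchor at every line instead of the whole string.
# re.compile() returns a regular expression object that can be reused.
row_pattern = re.compile(
    r"^(?P<label>[a-z]+):\s*(?P<value>[\d,]+)$",
    re.IGNORECASE | re.MULTILINE,
)

# Functions (here called as methods of the compiled object): search() returns
# the first match object, or None if nothing matches.
first = row_pattern.search(text)
first_label = first.group("label")  # text captured by the named group
first_span = first.span()           # (start, end) offsets of the whole match

# finditer() yields one match object per row; build a clean dictionary.
rows = {
    m.group("label").lower(): int(m.group("value").replace(",", ""))
    for m in row_pattern.finditer(text)
}

print(first_label, first_span)
print(rows)
```

Compiling the pattern once is a design choice that pays off when the same expression is applied to many lines or files, since the parsed pattern is reused rather than looked up on every call.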
