
Understanding Levenshtein Automata

The Magic Behind Spellcheck

Ever wondered how spellcheck works? Perhaps you have, if you've ever been angered by its incorrect rewording of whatever it was you wrote.

But consider for a moment how often it gets things right. How does it so often seem to know what word you meant to type, based on your typo?

My latest nerdy foray has been into the world of Levenshtein automata. Simply put, this is an algorithm for determining how similar a given string of text is to another string (probably from a list); strictly speaking, the automaton recognizes every string within a fixed edit distance of a target word. Like the spellcheck example: how similar is the typo stea to an actual word in the dictionary?


In this example, there are some things to consider. First, this "similarity" is measured in three distinct edit operations: deletions, insertions, and replacements. In this case, stea could be matched within 1 operation to the words tea (a deletion), stead (an insertion), and stew (a replacement).
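
To make that concrete, here's a minimal sketch of the classic dynamic-programming way to count those operations (my own toy implementation; a real Levenshtein automaton is a fancier structure built to answer the same question quickly against many words at once):

    def levenshtein(a, b):
        # prev[j] holds the edit distance between the first i-1
        # characters of a and the first j characters of b
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # replacement (free if chars match)
                ))
            prev = curr
        return prev[-1]

    print(levenshtein("stea", "tea"))    # 1 -- one deletion
    print(levenshtein("stea", "stead"))  # 1 -- one insertion
    print(levenshtein("stea", "stew"))   # 1 -- one replacement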

This is why spellcheck often offers more than one potential word as a typo solution, ranking the candidates by how probable each match is. It's also why some editors don't ask your permission and automatically correct certain words: when the probability of a match is high enough, the editor treats it as certain that the word was a typo.

Beyond Spellcheck

I've been working on a project where I've put this concept to use (hence my deep-dive). I have an input item (or items), and I need to find the best match from an internal database of items.

Problem #1 is that this input item may or may not contain a typo. Problem #2 is that the database being searched is pretty big (a few thousand entries).

Obviously, processing speed is of the utmost importance. I could iterate over every single entry in the database until a match is found, which I know would technically work. But if there are numerous input items, this will start to take a really long time:

Time per item      Time for 100 items
0.5 seconds        50 seconds
3 seconds          5 minutes

Once each lookup starts taking even a fraction of a second longer, the total processing time multiplies out by the number of items.

I proceeded to do a ton of research and came across a few good resources. I'd be remiss not to share some of them; specifically, this article on the logic behind a spelling corrector, and this article explaining how Levenshtein automata work (with some Python examples as well).

The solution I eventually came up with is to first try direct key matching by converting all database entries into key/value pairs. Then something like:

def input_handler(item):
    # Direct dictionary lookup: fast, but only catches exact spellings
    return database.get(item) is not None

This handles a pretty sizeable percentage (apparently people sometimes can spell).

But what about typos? This is where I use Levenshtein automata (AKA "fuzzy" matching). There's a really helpful Python library called fuzzywuzzy (it's now called thefuzz) that has a bunch of built-in functions for handling matching operations. It also returns a similarity score from 0 to 100, which I use to sort more likely matches from less likely ones.
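
For illustration, here's roughly how that looks in practice (the word list is just a made-up stand-in):

    from thefuzz import fuzz, process

    words = ["tea", "stead", "stew", "steam"]

    # Similarity score between two strings, from 0 to 100
    print(fuzz.ratio("stea", "tea"))

    # Best matches from the list as (word, score) pairs, ranked by score
    print(process.extract("stea", words, limit=3))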

Here's another helpful article explaining how each of these functions works:

https://www.datacamp.com/tutorial/fuzzy-string-python

So here's my flow of logic (a rough code sketch of the whole thing follows the list):

  • Split the database into categories, so we only have to iterate over database entries of the same type (I'll probably get into the details of how this part works in a future post).
  • After the direct-key matching attempt, iterate over the entire database list of the same category as the input item.
  • Check if the item shares any starting letters in common with the current database item; if not, skip it (no sense wasting time on fuzzy matching).
  • Add this potential entry into a list, and rank the list based on similarity (using Python's built-in set .intersection() method).
  • Finally, try to perform a fuzzy match on the top potentials.
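
Here's the sketch promised above. Treat everything in it as a stand-in: the shape of categorized_db, the first-letter prefix check, the top-20 cutoff, and the score threshold of 80 are placeholder choices for illustration, not the exact values from my project.

    from thefuzz import process

    def find_match(item, category, categorized_db, threshold=80):
        entries = categorized_db[category]  # maps name -> record for one category

        # 1. Direct key match handles correctly spelled inputs
        if item in entries:
            return entries[item]

        # 2. Skip entries that don't share a starting letter with the input
        candidates = [name for name in entries if name[:1] == item[:1]]

        # 3. Rank candidates by how many characters they share with the input
        candidates.sort(
            key=lambda name: len(set(name).intersection(item)),
            reverse=True,
        )

        # 4. Fuzzy-match only the top-ranked potentials
        best = process.extractOne(item, candidates[:20])
        if best and best[1] >= threshold:
            return entries[best[0]]
        return None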

The resulting system works accurately and runs pretty fast, averaging about 0.5 seconds (from what I can tell... it's difficult for me to get a totally accurate metric for other reasons).

There are tons of other fun applications for Levenshtein automata; this is just my own example. Hopefully I've explained this in a way that's a little easier to understand and more accessible.

A good tool to have in the programmer's arsenal!
