Tuesday, December 24, 2013

Writing a full-text search engine using Bloom filters

A post by Stavros Korokithakis (http://www.stavros.io/posts/bloom-filter-search-engine/)

Search engines and Bloom filters: A match made in heaven?
A few minutes ago I came across a Hacker News post that detailed a method of adding search to your static site. As you probably know, adding search to a static site is a bit tricky, because you can’t just send the query to a server and have the server process it and return the results. If you want full-text search, you have to implement something like an inverted index.

How an inverted index works

An inverted index is a data structure that maps every word in every document to the IDs of the documents it can be found in. For example, such an index might look like {"python": [1, 3, 6], "raspberry": [3, 7, 19]}. To find the documents that mention both “python” and “raspberry”, you look those terms up in the index and find the common document IDs (in our example, only the document with ID 3).
However, when you have very long documents with varied words, this can grow a lot. It’s a hefty data structure, and, when you want to implement a client-side search engine, every byte you transmit counts.
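To make the lookup concrete, here's a minimal sketch in Python, using the toy index from above (the lookup helper is illustrative, not something from the post):

# The toy inverted index from the example above.
index = {"python": [1, 3, 6], "raspberry": [3, 7, 19]}

def lookup(index, *terms):
    # Intersect the posting lists of all terms to find the documents
    # that contain every one of them.
    results = set(index.get(terms[0], []))
    for term in terms[1:]:
        results &= set(index.get(term, []))
    return sorted(results)

print(lookup(index, "python", "raspberry"))  # [3]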

Client-side search engine caveats

The problem with client-side search engines is that you (obviously) have to do all the searching on the client, so you have to transmit all the available information there. What static site generators do is generate every required file when building your site and make those files available for the client to download. Usually, search-engine plugins limit themselves to tags and titles to cut down on the amount of information that needs to be transmitted. How do we reduce the size? Easy: use a Bloom filter!

Bloom filters to the rescue

A Bloom filter is a very interesting data structure that can store elements in a fixed number of bits and, when queried, tell you whether it has seen those elements before. It sounds great for our use case, but let’s see if it will live up to the challenge.
A simple way to implement a full-text search engine that uses Bloom filters is the following:
  • Create one filter per document and add all the words in that document to the filter.
  • Serialize the (fixed-size) filter as some sort of string and send it to the client.
  • When the client needs to search, iterate through all the filters, looking for ones that match all the terms, and return the document names.
  • Profit!
Next, we’ll implement this very quickly in my favorite language, Python, using pybloom.

A quick implementation

To start with, let’s read some posts from this blog and create a list of all the words in each one:
from pybloom import BloomFilter
import os
import re

# Directory containing the posts (adjust to taste).
POST_DIR = "posts/"

# Read all my posts.
posts = {post_name: open(POST_DIR + post_name).read() for post_name in os.listdir(POST_DIR)}
# Create a dictionary of {"post name": "lowercase word set"}.
split_posts = {name: set(re.split(r"\W+", contents.lower())) for name, contents in posts.items()}
At this point, we have a dictionary of posts and a normalized set of the words in each. We could do more, like stemming or removing common words (a, the, etc.), but we’re going for naive, so let’s just create the filters for now:
filters = {}
for name, words in split_posts.items():
    filters[name] = BloomFilter(capacity=len(words), error_rate=0.1)
    for word in words:
        filters[name].add(word)
You can see above that the capacity of each filter is exactly the number of words in each post, to cut down on the number of bits needed to represent it. The error rate is tweakable; it is the probability of a false positive. The lower the probability, the more accurate the filter, but the larger it becomes.
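To get a feel for that tradeoff, we can use the standard Bloom filter sizing formula, m = -n·ln(p) / (ln 2)² bits for n elements at false-positive rate p (pybloom’s internal layout differs slightly but comes out at about the same size); the 500-word post below is a made-up example:

import math

def bloom_bits(n, p):
    # Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits
    # for n elements at false-positive rate p.
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

# A hypothetical 500-word post at various error rates.
for p in (0.1, 0.01, 0.001):
    print(p, bloom_bits(500, p) // 8, "bytes")
# 0.1 -> ~300 bytes, 0.01 -> ~600 bytes, 0.001 -> ~900 bytes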
Now that we have all the filters ready, we can transmit them to the client using whatever serialization method we like. Let’s write a very simple search algorithm to find posts based on some search terms:
def search(search_string):
    # Lowercase the query so terms match the lowercased post words.
    search_terms = re.split(r"\W+", search_string.lower())
    return [name for name, filter in filters.items() if all(term in filter for term in search_terms)]
All search() does is iterate through all the filters and return the ones that match every given term. Let’s try it out:
>>> search("android raspberry")
['2013-06-19 - how-remote-control-rf-devices-raspberry-pi.md',
 '2013-06-24 - writing-my-first-android-app-control-your-raspberr.md']
Judging by the titles, it found all the relevant posts, and without any false positives! Not bad at all, for a few minutes’ work! Let’s see what the average size of the filter is:
>>> sum(len(filter.bitarray.tobytes()) for filter in filters.values()) / len(filters)
298
298 bytes per post strikes me as a pretty reasonable size for something like this. We could decrease this further, if we didn’t mind more false positives, but, given the search results above, I think it’s pretty good for such a naive approach. For comparison, this paragraph is also 298 bytes long.
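As for getting the filters to the client, here is one hypothetical serialization scheme (the post doesn’t prescribe one): base64 the raw filter bits, along with the parameters the client needs to reconstruct each filter.

import base64
import json

# Illustrative serialization of the "filters" dict from above: ship
# the raw bits plus the parameters the client needs to rebuild each
# filter. (The scheme is just one possibility, not from the post.)
payload = json.dumps({
    name: {
        "bits": base64.b64encode(f.bitarray.tobytes()).decode("ascii"),
        "num_bits": f.num_bits,
        "num_slices": f.num_slices,
    }
    for name, f in filters.items()
})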

An alternate approach

Another approach would be to use a single filter, concatenate post IDs with the search words (e.g. “3:term”), and search for that. Unfortunately, that results in a filter that is exactly the same size as the sum of the small ones, and it requires more computation, because each term has to be hashed N times (where N is the number of posts) rather than just once.
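For reference, here’s a sketch of that single-filter variant, reusing split_posts from above (the naming and structure are mine, not from the post):

# One big filter keyed on "post_id:word" pairs.
names = list(split_posts)
big = BloomFilter(capacity=sum(len(words) for words in split_posts.values()), error_rate=0.1)
for post_id, name in enumerate(names):
    for word in split_posts[name]:
        big.add("%s:%s" % (post_id, word))

def search_single(search_string):
    terms = re.split(r"\W+", search_string.lower())
    # Each term must be checked against every post ID, so it gets
    # hashed N times rather than once.
    return [names[i] for i in range(len(names))
            if all("%s:%s" % (i, term) in big for term in terms)]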

Strengths and weaknesses of this approach

Using a Bloom filter has a few advantages that make it suitable for use in static sites:
  • The space it takes up is proportional to the number of pages, rather than the number of words, so it takes up roughly the same space for a very long page as for a very short one (this isn’t exactly true because we size the filters depending on the number of words, but it’s much more compact than an inverted index).
  • Search complexity is proportional to the number of pages, rather than to their length. This doesn’t matter much when you have, at most, a few thousand pages, but it’s still good if you only have a few long ones.
  • Since Bloom filters are probabilistic, this method may produce false positives, but it will not produce false negatives, which is what we want for a search engine (we don’t mind a few irrelevant pages in the results, but we do mind if relevant ones are missing).
Naturally, it also has weaknesses, which include:
  • You can’t weight pages by relevance, since you don’t know how many times a word appears in a page; all you know is whether it appears or not. You may or may not care about this.
  • You need to implement a Bloom filter algorithm on the client side. This will probably not be much longer than the inverted index search algorithm, but it’s still probably a bit more complicated (a rough sketch of what the lookup involves follows this list).
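Here is a generic k-hash membership check in Python, purely to illustrate the shape of that client-side code: the client’s hashing and bit layout must match the server’s exactly, and pybloom’s actual scheme is different from this one.

import hashlib

def bloom_contains(bits, num_bits, num_hashes, term):
    # Generic Bloom filter membership check over raw bytes. If any of
    # the k hashed positions is unset, the term was never added.
    for i in range(num_hashes):
        digest = hashlib.sha256(("%d:%s" % (i, term)).encode("utf-8")).digest()
        pos = int.from_bytes(digest[:8], "big") % num_bits
        if not (bits[pos // 8] >> (pos % 8)) & 1:
            return False
    return True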
Of course, the full-text index will still be large and probably not practical to load on every page view, even with this approach, but this method may be suitable for use on a dedicated “search” page of your static site.
I’m not actually advocating using this method on your static site, but, hey, it made for a fun hour of Saturday night hacking. Now, it’s time to go out for drinks. If you’re in Thessaloniki and want to join me, drop me an email or get me on Twitter.
