Harry Marr

Projects

About

Twitter

Recent Entries

Archive

Tag Cloud

RSS/Atom

Full text search with MongoDB

4 months, 1 week ago — 1 Comment — Permalink

  • mongoengine
  • mongodb
  • nosql
  • search

Here I’ll present a simple full text search engine, that uses MongoDB as its backend. It’s implemented using MongoEngine, and is intended as more of a proof-of-concept than a viable alternative to “real” search engines such as Solr, Sphinx, etc.

What will the search engine do?

The search engine will index documents in a certain MongoDB collection. Multiple fields may be indexed, and a weight may be assigned to each field. The words in each field will be split up and stemmed. A mapping between the words and the indexed documents will be stored in the database, and will be used when a query is run to determine the relevance of each document.

How should the index be stored?

Most search engines use inverted indexes, which map terms to documents. A simple way of storing an inverted index in MongoDB would be to have a collection where the _id is a term, and a documents field has a list of references to documents in the collection we are indexing. However, as MongoDB allows us to index on list fields, a more sensible approach is to store the reference to the indexed document as the _id in the index collection, and have a list field that contains the terms that are in the document. To allow terms from different fields to carry different weights, embedded documents containing the term and a weight will be stored in the terms field, rather than just the term. A nice side effect of this is that terms that appear multiple times within a document can be stored as one embedded document; the weight will be the sum of the weights of the individual occurrences of the term. Let’s see the code for that:

class SearchTerm(EmbeddedDocument):
    term = fields.StringField()
    weight = fields.FloatField()

class DocumentIndex(Document):
    doc_id = fields.StringField(primary_key=True)
    terms = fields.ListField(fields.EmbeddedDocumentField(SearchTerm))

Querying

To perform a search, a textual query will be split and stemmed in a similar manner to the indexed documents. Each document will be compared to the query to determine its relevance, then the documents will be returned with an associated “score”.

The ranking function I’ve opted to use is BM25. For efficiency, this will be executed on the server, rather than downloading the entire index and performing the ranking client-side in Python. To do this the ranking function will be written in Javascript and the exec_js method on a MongoEngine QuerySet will be used.

BM25 uses the inverse-document frequency (IDF) of each term to rank a document in a collection. The IDF effectively determines how important a term is across a collection by calculating its rarity (i.e. it will be low for common words, and high for words that only occur in a small number of documents). The IDF will be calculated before the main ranking occurs:

idfs = {}
# Get the total number of documents in the collection
num_docs = document_index.objects.count()
for term in query_terms:
    # Use the number of docs that contain the term to calculate the IDF
    term_docs = document_index.objects(terms__term=term).count()
    idfs[term] = log((num_docs - term_docs + 0.5) / (term_docs + 0.5))

Now we have a dictionary of the IDF for each term, we can define the ranking function:

function() {
    var results = {};
    // Iterate over each document to calculate the document's score
    db[collection].find(query).forEach(function(doc) {
        var score = 0;
        // Iterate over each term in the document, calculating the 
        // score for the term, which will be added to the doc's score
        doc.terms.forEach(function(term) {
            // Only look at the term if it is part of the query
            if (options.queryTerms.indexOf(term.term) != -1) {
                // term.weight is equivalent to the term's 
                // frequency in the document
                //
                // f(qi, D) * (k1 + 1)
                var dividend = term.weight * (options.k + 1);
                // |D| / avgdl
                var relDocSize = doc.length / options.avgDocLength;
                // (1 - b + b * |D| / avgdl)
                var divisor = 1.0 - options.b + options.b * relDocSize;
                // f(qi, D) + k1 * (1 - b + b * |D| / avgdl)
                divisor = term.weight + divisor * options.k
                // Divide the top half by the bottom half
                var termScore = dividend / divisor;
                // Then scale by the inverse document frequency
                termScore *= options.idfs[term.term];
                // The document's score is the sum of its terms scores
                score += termScore;
            }
        });
        results[doc._id] = score;
    });
    return results;
}

And that’s pretty much it, we get back a dictionary that has document ids as the keys and relevance scores as the values. In the future it would be nice to add sorting by relevance and a way of saying “only give me back the top n results”, but for the time being that can just be done in Python:

from operator import itemgetter
from heapq import nlargest
num_results = 10
top_matches = nlargest(num_results, results.iteritems(), itemgetter(1))

Optimisations

Firstly, as MongoDB is a schema-free database, it stores the field names along with each field on a document. As we are storing a large number of terms, renaming term and weight on the SearchTerm embedded document will save a fair bit of space. Secondly, rather than ranking all documents we could use a query that only includes documents that contain at least one of the search terms:

query = document_index.objects(terms__term__in=query_terms)

As I mentioned earlier, this will not perform nearly as well as the proper search servers, but it seems to produce reasonable results for the limited tests I’ve run. The full code for this is available on GitHub, along with an example of how to use it.

Notes from a production MongoDB deployment

5 months ago — 0 Comments — Permalink

  • mongodb
  • nosql
  • scale

Really interesting post about how Boxed Ice handled some of the issues that appeared when using MongoDB for storing massive datasets (17,810 collections, 43,175 indexes and 664,158,090 documents).

Notes on MongoDB

5 months ago — 0 Comments — Permalink

  • mongodb
  • nosql

Nice overview of MongoDB’s capabilities.

Introducing MongoEngine

5 months, 3 weeks ago — 9 Comments — Permalink

  • mongodb
  • mongoengine
  • nosql
  • python

MongoEngine is a Document-Object Mapper (think ORM, but for document databases) for working with MongoDB from Python. It uses a simple declarative API, similar to that of the Django ORM.

So what does it do?

Here’s a brief run-down of some of the main features of MongoEngine:

  • Document schema declaration and validation
  • An elegant querying syntax, similar to that of Django
  • Document inheritance, with support for “polymorphic querying”
  • Aggregation methods, such as sum and average
  • Advanced query condition combination using Q objects
  • Session and authentication backends for Django

Show me the code!

To define a document, just inherit from the Document class and add some fields:

class BlogPost(Document):
    title = StringField(required=True)
    slug = StringField(required=True, max_length=250)
    content = StringField(required=True)
    date = DateTimeField(default=datetime.now, required=True)
    tags = ListField(StringField())

To save documents to the database, just instantiate a Document object, fill in the fields, and call save:

post = BlogPost(title='Introducing MongoEngine', slug='introducing-mongoengine')
post.content = 'MongoEngine is a Document-Object Mapper...'
post.tags = ['mongodb', 'mongoengine']
post.save()

To find documents, use the objects attribute of a Document subclass:

latest_posts = BlogPost.objects.order_by('-date')[:25]
mongodb_posts = BlogPost.objects(tags='mongodb')

How about a tag cloud? Simple:

# Get a dictionary with tags as the keys and frequencies as the values
tag_freqs = BlogPost.objects.item_frequencies('tag')

Every blog need comments, right?

class Comment(EmbeddedDocument):
    author = StringField()
    content = StringField(required=True)
    date = DateTimeField()

# Modify the previously defined BlogPost document
class BlogPost(Document):
    ...
    comments = ListField(EmbeddedDocumentField(Comment))
    ...

# Let's add a comment, this is performed as an atomic operation
comment = Comment(author=form['author'], content=form['content'])
BlogPost.objects(id=post_id).update(push__comments=comment)

I could go on, but I’ll keep this post short and to the point. For more information, see the documentation. The source is available on GitHub, fork it and have a play!

« NewerOlder »

Log in

Mumblr is a basic Django tumblelog application that uses MongoDB with MongoEngine. Fork it on Github. Designed and developed by Harry Marr and Steve Challis.