New MongoEngine Website
1 month, 1 week ago — 0 Comments — Permalink
We’ve just launched a new website for MongoEngine over at mongoengine.org. Check it out and let us know what you think.
Here I’ll present a simple full-text search engine that uses MongoDB as its backend. It’s implemented using MongoEngine, and is intended as more of a proof of concept than a viable alternative to “real” search engines such as Solr or Sphinx.
The search engine will index documents in a certain MongoDB collection. Multiple fields may be indexed, and a weight may be assigned to each field. The words in each field will be split up and stemmed. A mapping between the words and the indexed documents will be stored in the database, and will be used when a query is run to determine the relevance of each document.
Most search engines use inverted indexes, which map terms to documents. A simple way of storing an inverted index in MongoDB would be to have a collection where the _id is a term, and a documents field has a list of references to documents in the collection we are indexing. However, as MongoDB allows us to index on list fields, a more sensible approach is to store the reference to the indexed document as the _id in the index collection, and have a list field that contains the terms that are in the document. To allow terms from different fields to carry different weights, embedded documents containing the term and a weight will be stored in the terms field, rather than just the term. A nice side effect of this is that terms that appear multiple times within a document can be stored as one embedded document; the weight will be the sum of the weights of the individual occurrences of the term. Let’s see the code for that:
    from mongoengine import Document, EmbeddedDocument, fields

    class SearchTerm(EmbeddedDocument):
        term = fields.StringField()
        weight = fields.FloatField()

    class DocumentIndex(Document):
        doc_id = fields.StringField(primary_key=True)
        terms = fields.ListField(fields.EmbeddedDocumentField(SearchTerm))
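To make the weight-summing idea concrete, here is a minimal pure-Python sketch of how the term list for one document might be built before it is stored. The `naive_stem` and `weighted_terms` helpers are hypothetical names made up for this illustration, and the suffix-stripping "stemmer" is only a stand-in for a real stemming algorithm.

```python
import re
from collections import defaultdict


def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def weighted_terms(doc, field_weights):
    """Return {term: summed weight} for the chosen fields of a dict-like document."""
    weights = defaultdict(float)
    for field, weight in field_weights.items():
        # Split the field into lowercase alphanumeric words
        for word in re.findall(r"[a-z0-9]+", doc.get(field, "").lower()):
            # Repeated terms collapse into one entry whose weight is the sum
            weights[naive_stem(word)] += weight
    return dict(weights)


doc = {"title": "Searching MongoDB", "body": "Search engines index documents"}
terms = weighted_terms(doc, {"title": 2.0, "body": 1.0})
# "search" appears once in the title (2.0) and once in the body (1.0),
# so terms["search"] is 3.0
```

Each key/value pair in `terms` would correspond to one `SearchTerm` embedded document in the index.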
To perform a search, a textual query will be split and stemmed in a similar manner to the indexed documents. Each document will be compared to the query to determine its relevance, then the documents will be returned with an associated “score”.
The ranking function I’ve opted to use is BM25. For efficiency, this will be executed on the server, rather than downloading the entire index and performing the ranking client-side in Python. To do this, the ranking function will be written in JavaScript and run using the exec_js method on a MongoEngine QuerySet.
BM25 uses the inverse document frequency (IDF) of each term to rank a document in a collection. The IDF effectively determines how important a term is across a collection by measuring its rarity (i.e. it will be low for common words, and high for words that only occur in a small number of documents). The IDF will be calculated before the main ranking occurs:
    from math import log

    idfs = {}
    # Get the total number of documents in the collection
    num_docs = document_index.objects.count()
    for term in query_terms:
        # Use the number of docs that contain the term to calculate the IDF
        term_docs = document_index.objects(terms__term=term).count()
        idfs[term] = log((num_docs - term_docs + 0.5) / (term_docs + 0.5))
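As a quick sanity check of that formula, the standalone sketch below evaluates it for a rare and a common term. The `bm25_idf` name and the document counts are made up for illustration. One thing it shows is that this form of the IDF goes negative once a term appears in more than half of the documents, which is worth keeping in mind when reading the scores.

```python
from math import log


def bm25_idf(num_docs, term_docs):
    """IDF as used above: log((N - n + 0.5) / (n + 0.5))."""
    return log((num_docs - term_docs + 0.5) / (term_docs + 0.5))


rare = bm25_idf(1000, 10)     # term in 1% of documents: large positive value
common = bm25_idf(1000, 900)  # term in 90% of documents: negative value
half = bm25_idf(10, 5)        # term in exactly half of the documents: zero

# One possible workaround (an assumption, not something the code above does)
# is to clamp negative values so very common terms simply contribute nothing
clamped = max(0.0, common)
```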
Now we have a dictionary of the IDF for each term, we can define the ranking function:
    function() {
        var results = {};
        // Iterate over each document to calculate the document's score
        db[collection].find(query).forEach(function(doc) {
            var score = 0;
            // Iterate over each term in the document, calculating the
            // score for the term, which will be added to the doc's score
            doc.terms.forEach(function(term) {
                // Only look at the term if it is part of the query
                if (options.queryTerms.indexOf(term.term) != -1) {
                    // term.weight is equivalent to the term's
                    // frequency in the document
                    //
                    // f(qi, D) * (k1 + 1)
                    var dividend = term.weight * (options.k + 1);
                    // |D| / avgdl
                    var relDocSize = doc.length / options.avgDocLength;
                    // (1 - b + b * |D| / avgdl)
                    var divisor = 1.0 - options.b + options.b * relDocSize;
                    // f(qi, D) + k1 * (1 - b + b * |D| / avgdl)
                    divisor = term.weight + divisor * options.k;
                    // Divide the top half by the bottom half
                    var termScore = dividend / divisor;
                    // Then scale by the inverse document frequency
                    termScore *= options.idfs[term.term];
                    // The document's score is the sum of its terms' scores
                    score += termScore;
                }
            });
            results[doc._id] = score;
        });
        return results;
    }
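For experimenting without a database, the per-document part of that function can be ported to plain Python. This is only a sketch of the same arithmetic: the `bm25_score` name is hypothetical, and the default values for `k` and `b` are commonly quoted BM25 parameters that I am assuming here, not values taken from the code above.

```python
def bm25_score(doc_terms, doc_length, query_terms, idfs, avg_doc_length,
               k=2.0, b=0.75):
    """Score one document; doc_terms maps each term to its summed weight."""
    score = 0.0
    for term, weight in doc_terms.items():
        # Only terms that are part of the query contribute to the score
        if term not in query_terms:
            continue
        # f(qi, D) * (k1 + 1)
        dividend = weight * (k + 1)
        # f(qi, D) + k1 * (1 - b + b * |D| / avgdl)
        divisor = weight + k * (1.0 - b + b * doc_length / avg_doc_length)
        # Scale by the inverse document frequency and accumulate
        score += idfs[term] * dividend / divisor
    return score


idfs = {"mongo": 1.5}
average_doc = bm25_score({"mongo": 1.0}, 10, {"mongo"}, idfs, avg_doc_length=10)
long_doc = bm25_score({"mongo": 1.0}, 20, {"mongo"}, idfs, avg_doc_length=10)
# The same term frequency counts for less in a longer-than-average document
```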
And that’s pretty much it: we get back a dictionary that has document ids as the keys and relevance scores as the values. In the future it would be nice to add sorting by relevance and a way of saying “only give me back the top n results”, but for the time being that can just be done in Python:
    from operator import itemgetter
    from heapq import nlargest

    num_results = 10
    # items() works on both Python 2 and Python 3 dictionaries
    top_matches = nlargest(num_results, results.items(), key=itemgetter(1))
There are a couple of easy improvements that could be made. Firstly, as MongoDB is a schema-free database, it stores the field names along with each field on a document. As we are storing a large number of terms, renaming term and weight on the SearchTerm embedded document to something shorter will save a fair bit of space. Secondly, rather than ranking all documents, we could use a query that only includes documents that contain at least one of the search terms:
    query = document_index.objects(terms__term__in=query_terms)
As I mentioned earlier, this will not perform nearly as well as the proper search servers, but it seems to produce reasonable results for the limited tests I’ve run. The full code for this is available on GitHub, along with an example of how to use it.
4 months, 1 week ago — 3 Comments — Permalink
Just released version 0.3 of MongoEngine, here’s a quick breakdown of some of the main changes.
Thanks to the great work by Matt Dennewitz, we now have support for MapReduce. Here’s an example to show how it works, in which we generate frequencies of tags over a collection of blog posts:
    class BlogPost(Document):
        title = StringField()
        tags = ListField(StringField())

    BlogPost(title="Post #1", tags=['music', 'film', 'print']).save()
    BlogPost(title="Post #2", tags=['music', 'film']).save()
    BlogPost(title="Post #3", tags=['film', 'photography']).save()

    map_f = """
        function() {
            this.tags.forEach(function(tag) {
                emit(tag, 1);
            });
        }
    """

    reduce_f = """
        function(key, values) {
            var total = 0;
            for(var i=0; i<values.length; i++) {
                total += values[i];
            }
            return total;
        }
    """

    # run a map/reduce operation spanning all posts
    for result in BlogPost.objects.map_reduce(map_f, reduce_f):
        print('%s: %s' % (result.key, result.value))

    # output:
    # film: 3.0
    # music: 2.0
    # photography: 1.0
    # print: 1.0
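If the two phases are unfamiliar, the same computation can be mimicked in plain Python. This is just an in-memory illustration of what the map and reduce functions do, not how MongoDB executes them, and the `map_reduce` helper here is a hypothetical name unrelated to the QuerySet method.

```python
from collections import defaultdict

posts = [
    {"title": "Post #1", "tags": ["music", "film", "print"]},
    {"title": "Post #2", "tags": ["music", "film"]},
    {"title": "Post #3", "tags": ["film", "photography"]},
]


def map_f(post):
    # Emit a (key, value) pair for every tag on the post
    for tag in post["tags"]:
        yield tag, 1


def reduce_f(key, values):
    # Combine all values emitted for one key into a single total
    return sum(values)


def map_reduce(docs, map_f, reduce_f):
    # Group the emitted values by key, then reduce each group
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_f(doc):
            grouped[key].append(value)
    return {key: reduce_f(key, values) for key, values in sorted(grouped.items())}


tag_counts = map_reduce(posts, map_f, reduce_f)
# tag_counts == {'film': 3, 'music': 2, 'photography': 1, 'print': 1}
```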
If the keys in the results correspond to _ids in the collection, you can access the relevant object by using result.object, which is lazily loaded.
MongoEngine 0.3 sees the introduction of five new field types:
- URLField: inherits from StringField, but validates URLs and optionally verifies their existence.
- DictField: as the name suggests, it allows you to store Python dictionaries. When the structure of the dictionary is known, EmbeddedDocuments are preferred, but DictFields are useful for storing data where the structure isn’t known in advance.
- GenericReferenceField: similar to the standard ReferenceField, but allows you to reference any type of Document.
- DecimalField: a field capable of storing Python Decimal objects.
- BinaryField: stores binary data.
There are also several new QuerySet methods:
- only(): pass in field names as positional arguments, and only these fields will be retrieved from the database. Note that trying to access fields that haven’t been retrieved will return None, as deferred fields have not yet been implemented.
- in_bulk(): given a list of document ids, this will load all the corresponding documents and return a dictionary mapping the ids to the documents.
- get(), get_or_create(): like first(), these methods retrieve one matching document, but if more than one document matches the query, a MultipleObjectsReturned exception will be thrown. If get_or_create() is used and no matching document is found, a document will be created from the query.
Six new query operators have been added: contains, startswith, endswith, and their case-insensitive variants, icontains, istartswith and iendswith. These are just shortcuts for regular expression queries.
- QuerySets now have a rewind() method, which is called automatically when the iterator is exhausted, allowing QuerySets to be reused.
- ReferenceFields may now reference the document they are defined on (recursive references) and documents that have not yet been defined.
- The name parameter on fields has been replaced by the more descriptive db_field.
…and much more. For full details, see the changelog.
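To illustrate what "shortcuts for regular expression queries" means for the six string operators, here is a rough pure-Python sketch of the kind of pattern each one corresponds to. The `operator_to_regex` helper is hypothetical and written for this example; it is not MongoEngine's internal implementation.

```python
import re

# Pattern templates for the three case-sensitive operators
PATTERNS = {"contains": "{}", "startswith": "^{}", "endswith": "{}$"}


def operator_to_regex(operator, value):
    """Build a compiled regex equivalent to one of the six string operators."""
    # The i-prefixed variants are the same patterns, matched case-insensitively
    case_insensitive = operator not in PATTERNS and operator.startswith("i")
    base = operator[1:] if case_insensitive else operator
    # Escape the value so characters like "." are matched literally
    pattern = PATTERNS[base].format(re.escape(value))
    return re.compile(pattern, re.IGNORECASE if case_insensitive else 0)


title = "MongoEngine"
matches_sensitive = bool(operator_to_regex("startswith", "mongo").search(title))
matches_insensitive = bool(operator_to_regex("istartswith", "mongo").search(title))
# The case-sensitive operator does not match; the case-insensitive one does
```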
5 months ago — 0 Comments — Permalink
Really interesting post about how Boxed Ice handled some of the issues that appeared when using MongoDB for storing massive datasets (17,810 collections, 43,175 indexes and 664,158,090 documents).
5 months, 2 weeks ago — 0 Comments — Permalink
Check out this great introduction to MongoEngine and Mumblr from Kevin Fricovsky.
5 months, 3 weeks ago — 9 Comments — Permalink
MongoEngine is a Document-Object Mapper (think ORM, but for document databases) for working with MongoDB from Python. It uses a simple declarative API, similar to that of the Django ORM.
Here’s a brief run-down of some of the main features of MongoEngine:
- sum and average
- Q objects
To define a document, just inherit from the Document class and add some fields:
    from datetime import datetime

    class BlogPost(Document):
        title = StringField(required=True)
        slug = StringField(required=True, max_length=250)
        content = StringField(required=True)
        date = DateTimeField(default=datetime.now, required=True)
        tags = ListField(StringField())
To save documents to the database, just instantiate a Document object, fill in the fields, and call save:
    post = BlogPost(title='Introducing MongoEngine', slug='introducing-mongoengine')
    post.content = 'MongoEngine is a Document-Object Mapper...'
    post.tags = ['mongodb', 'mongoengine']
    post.save()
To find documents, use the objects attribute of a Document subclass:
    latest_posts = BlogPost.objects.order_by('-date')[:25]
    mongodb_posts = BlogPost.objects(tags='mongodb')
How about a tag cloud? Simple:
    # Get a dictionary with tags as the keys and frequencies as the values
    # (the field on BlogPost is called tags)
    tag_freqs = BlogPost.objects.item_frequencies('tags')
Every blog needs comments, right?
    class Comment(EmbeddedDocument):
        author = StringField()
        content = StringField(required=True)
        date = DateTimeField()

    # Modify the previously defined BlogPost document
    class BlogPost(Document):
        ...
        comments = ListField(EmbeddedDocumentField(Comment))
        ...

    # Let's add a comment, this is performed as an atomic operation
    comment = Comment(author=form['author'], content=form['content'])
    BlogPost.objects(id=post_id).update(push__comments=comment)
I could go on, but I’ll keep this post short and to the point. For more information, see the documentation. The source is available on GitHub; fork it and have a play!
6 months ago — 0 Comments — Permalink
An interesting, albeit slightly old, video explanation of V8’s use of hidden classes from the VM wizard, Lars Bak.
6 months, 1 week ago — 0 Comments — Permalink
Great article describing a solid Git workflow. It suggests doing all development in a separate develop branch, keeping master only for production-ready code. The develop branch is merged back into master when it reaches a stable state, and anything that gets merged into master is tagged as a release.
Three other main types of branch are used:
- develop.
- release branch is used - from this point on, no major features will be added, and the develop branch will be used for development on the next release.
- hotfix branch will be created from the last tag on master. When the bug is fixed, this branch will be merged back into master, and the new release will be tagged.
6 months, 1 week ago — 5 Comments — Permalink
I like to do most of my Python development inside virtualenvs. I also create a Git repository for any project that matters or that will have any kind of continued development. Constantly switching between the different virtualenvs to work on different projects used to be tedious, but this issue was largely solved by the fantastic virtualenvwrapper.
Virtualenvwrapper has certainly improved the situation, but even so, I can’t help but worry that the cd project-x, workon project-x, (do some work), cd .., deactivate work-flow is going to lead me to an early grave caused by a severe case of RSI. So in order to retain my good health, I’ve hacked together a bash function that automatically activates a virtualenv when you cd into a Git repository, and deactivates it when you leave the repository.
By default, it assumes that the virtualenv’s name will be the same as the repository’s name, but this can be overridden by creating a file called .venv in the repository’s root directory with the name of another virtualenv in it.
    # Automatically activate Git projects' virtual environments based on the
    # directory name of the project. Virtual environment name can be overridden
    # by placing a .venv file in the project root with a virtualenv name in it
    function workon_cwd {
        # Check that this is a Git repo
        GIT_DIR=`git rev-parse --git-dir 2> /dev/null`
        if [ $? -eq 0 ]; then
            # Find the repo root and check for virtualenv name override
            GIT_DIR=`\cd "$GIT_DIR"; pwd`
            PROJECT_ROOT=`dirname "$GIT_DIR"`
            ENV_NAME=`basename "$PROJECT_ROOT"`
            if [ -f "$PROJECT_ROOT/.venv" ]; then
                ENV_NAME=`cat "$PROJECT_ROOT/.venv"`
            fi
            # Activate the environment only if it is not already active
            if [ "$VIRTUAL_ENV" != "$WORKON_HOME/$ENV_NAME" ]; then
                if [ -e "$WORKON_HOME/$ENV_NAME/bin/activate" ]; then
                    workon "$ENV_NAME" && export CD_VIRTUAL_ENV="$ENV_NAME"
                fi
            fi
        elif [ -n "$CD_VIRTUAL_ENV" ]; then
            # We've just left the repo, deactivate the environment
            # Note: this only happens if the virtualenv was activated automatically
            deactivate && unset CD_VIRTUAL_ENV
        fi
    }

    # New cd function that does the virtualenv magic
    function venv_cd {
        cd "$@" && workon_cwd
    }

    alias cd="venv_cd"
Note: for this to work you will need virtualenv and virtualenvwrapper installed. To use it, just stick it in your .bashrc somewhere below where your $WORKON_HOME is specified.
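For anyone who finds the bash hard to follow, the name-resolution rule on its own looks roughly like this in Python. The `resolve_env_name` function is a hypothetical helper written for this explanation; it only mirrors the "repository name unless a .venv file overrides it" logic and does not activate anything.

```python
from pathlib import Path


def resolve_env_name(project_root):
    """Return the virtualenv name for a repository root directory."""
    root = Path(project_root)
    override = root / ".venv"
    # A regular file called .venv overrides the default name
    # (same check as [ -f ... ] in the bash function)
    if override.is_file():
        # strip() removes the trailing newline, much as `cat` in backticks does
        return override.read_text().strip()
    # Otherwise fall back to the repository's directory name
    return root.name
```

So a repository at ~/code/project-x resolves to project-x, unless ~/code/project-x/.venv exists and contains a different name.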