Full-Text Search in MongoDB

By: Ashish Trivedi

MongoDB, one of the leading NoSQL databases, is well known for its fast performance, flexible schema, scalability and strong indexing capabilities. At the core of this fast performance lie MongoDB indexes, which support efficient execution of queries by avoiding full-collection scans and hence limiting the number of documents MongoDB searches.
In version 2.4, MongoDB introduced an experimental feature supporting full-text search using text indexes. This feature has since become an integral part of the product and is no longer experimental. In this article we are going to explore the full-text search functionality of MongoDB, starting from the fundamentals.
If you are new to MongoDB, I recommend that you read the following articles on Envato Tuts+ that will help you understand the basic concepts of MongoDB:
Getting Started with MongoDB - Part 1
Mapping Relational Databases and SQL to MongoDB
The Basics
Before we get into any details, let us look at some background. Full-text search refers to the technique of searching a full-text database against the search criteria specified by the user. It is similar to how we search for content on Google (or any other search application) by entering keywords or phrases and getting back the relevant results sorted by their ranking.
Here are some more scenarios where we would see a full-text search happening:
Consider searching for your favorite topic on Wiki. When you enter a search text on Wiki, the search engine brings up all the articles related to the keywords or phrase you searched for (even if those keywords were used deep inside the article). These search results are sorted by relevance based on their match score.
As another example, consider a social networking site where the user can search for all the posts which contain the keyword cats; or, to be more complex, all the posts which have comments containing the word cats.
Before we move on, there are certain general terms related to full-text search which you should know. These terms apply to any full-text search implementation and are not specific to MongoDB.
Stop Words

Stop words are common words that carry little meaning for a search and should be filtered out from a text. For example: a, an, the, is, at, which, etc.
Stemming

Stemming is the process of reducing words to their stem. For example: words like standing, stands, stood, etc. have a common base stand.
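To make these two terms concrete, here is a toy sketch in plain JavaScript. It is only an illustration under simplifying assumptions: the stop-word list is the short one given above, the helper names (tokenize, removeStopWords, stem) are made up for this example, and the suffix stripping is far cruder than the language-specific stemming a real search engine applies.

```javascript
// Toy illustration only; not how MongoDB implements these steps.
const STOP_WORDS = ["a", "an", "the", "is", "at", "which"];

// Lowercase the text and split it on anything that is not a letter.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter(function (t) {
    return t.length > 0;
  });
}

// Drop every token that appears in the stop-word list.
function removeStopWords(tokens) {
  return tokens.filter(function (t) {
    return STOP_WORDS.indexOf(t) === -1;
  });
}

// Naive suffix stripping: handles "standing" and "stands", but not
// irregular forms such as "stood".
function stem(token) {
  if (token.length > 5 && token.slice(-3) === "ing") return token.slice(0, -3);
  if (token.length > 3 && token.slice(-1) === "s") return token.slice(0, -1);
  return token;
}

const sentence = "The dog is standing at the door which stands open";
const terms = removeStopWords(tokenize(sentence)).map(stem);
// terms: ["dog", "stand", "door", "stand", "open"]
```

Note that both "standing" and "stands" end up as the same term, which is what lets a search for one form find documents containing the other.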
Scoring
A relative ranking to measure which of the search results is most relevant.
Alternatives to Full-Text Search in MongoDB
Before MongoDB came up with the concept of text indexes, we would either model our data to support keyword searches or use regular expressions to implement such search functionality. However, each of these approaches has its own limitations:
Firstly, neither approach supports functionality like stemming, stop words, ranking, etc.
Keyword searches require the creation of multi-key indexes, which fall short of what a full-text index provides.
Regular expressions are not efficient from the performance point of view, since most of them (anything other than a case-sensitive prefix match) cannot use indexes effectively.
In addition to that, neither technique can be used to perform phrase searches (like searching for ‘movies released in 2015’) or weighted searches.
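The limitations listed above can be seen in a small sketch. The keywords field and the "Religious dogma" string are hypothetical, chosen only for illustration. The in-memory checks stand in for what the equivalent queries would match, and the shell calls run only where a db handle exists (that is, inside the mongo shell).

```javascript
// A document modeled for keyword search (hypothetical `keywords` field).
const modeledDoc = {
  subject: "Joe owns a dog",
  content: "Dogs are man's best friend",
  keywords: ["joe", "dog", "friend"]
};

// Approach 1: exact keyword match, which a multi-key index can serve.
const keywordQuery = { keywords: "dogs" };

// Approach 2: a case-insensitive, unanchored regular expression.
const regexQuery = { content: /dog/i };

// In-memory equivalents of what each query would match.
// No stemming: the keyword "dogs" does not match the stored "dog".
const keywordHit = modeledDoc.keywords.indexOf(keywordQuery.keywords) !== -1;
// The regex finds the document...
const regexHit = regexQuery.content.test(modeledDoc.content);
// ...but it also matches unrelated text containing the same substring.
const falsePositive = regexQuery.content.test("Religious dogma");

// Inside the mongo shell, the same filters can be passed to find().
if (typeof db !== "undefined") {
  db.messages.find(keywordQuery).forEach(printjson);
  db.messages.find(regexQuery).forEach(printjson);
}
```

Here keywordHit is false while regexHit and falsePositive are both true, and neither approach gives us any ranking of the results.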
Apart from these approaches, for more advanced and complex search-centric applications, there are alternative solutions like Elasticsearch or Solr. But using any of these solutions increases the architectural complexity of the application, since MongoDB now has to work alongside an additional external database.
Note that MongoDB’s full-text search is not proposed as a complete replacement for search engine databases like Elasticsearch, Solr, etc. However, it can be effectively used for the majority of applications that are built with MongoDB today.
Introducing MongoDB Text Search
Using MongoDB full-text search, you can define a text index on any field in the document whose value is a string or an array of strings. When we create a text index on a field, MongoDB tokenizes and stems the indexed field’s text content, and sets up the indexes accordingly.
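As a minimal sketch of defining such an index, assume a collection named messages whose documents have a string field subject (both names are placeholders for this example). The index specification uses the special value "text" instead of the usual 1 or -1. The shell calls are guarded so that they run only inside the mongo shell, where db is defined.

```javascript
// Index specification: "text" asks MongoDB to build a text index
// on the `subject` field (field name assumed for illustration).
const subjectTextIndex = { subject: "text" };

if (typeof db !== "undefined") {
  // Create the text index on the assumed `messages` collection.
  db.messages.createIndex(subjectTextIndex);
  // List the collection's indexes to confirm that it was created.
  db.messages.getIndexes().forEach(printjson);
}
```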
To understand things further, let us now get practical. I want you to follow the tutorial with me by trying out the examples in the mongo shell. We will first create some sample data which we will be using throughout the article, and then we'll move on to discuss key concepts.
For the purpose of this article, consider a collection messages which stores documents of the following structure:
{
  "subject": "Joe owns a dog",
  "content": "Dogs are man's best friend",
  "likes": 60,
  "year": 2015,
  "language": "english"
}

Let us insert some sample documents using the insert command to create our test data:
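As a sketch, the one document shown above could be inserted as follows. Only that single document is used here; any further test data is up to you. The insert call runs only inside the mongo shell, where db is defined.

```javascript
// The sample document from the structure shown above.
const sampleMessage = {
  subject: "Joe owns a dog",
  content: "Dogs are man's best friend",
  likes: 60,
  year: 2015,
  language: "english"
};

if (typeof db !== "undefined") {
  // `insert` is the command named in the text; newer shells
  // favor insertOne/insertMany for the same purpose.
  db.messages.insert(sampleMessage);
}
```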
