One of the most important things to consider when using mongodb is that you must think in terms of your application's access patterns when deciding on database design. We will try to solve your problem by using an example and see how it works.
Let's consider that you have the following documents in your collection and let us see below how to make this simple data format work wonders:
> db.performant.find()
{ "_id" : ObjectId("522bf7166094a4e72db22827"), "name" : "abc", "tags" : [ "chest", "bicep", "tricep" ] }
{ "_id" : ObjectId("522bf7406094a4e72db22828"), "name" : "def", "tags" : [ "routine", "trufala", "tricep" ] }
{ "_id" : ObjectId("522bf75f6094a4e72db22829"), "name" : "xyz", "tags" : [ "routine", "myTag", "tricep", "myTag2" ] }
{ "_id" : ObjectId("522bf7876094a4e72db2282a"), "name" : "mno", "tags" : [ "exercise", "myTag", "tricep", "myTag2", "biceps" ] }
First and foremost you absolutely must create an index on tags. (you can create a compound index if desire)
> db.performant.ensureIndex({tags:1})
> db.performant.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "test.performant",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"tags" : 1
},
"ns" : "test.performant",
"name" : "tags_1"
}
]
To query the tags data from the above collection, one would normally use db.performant.find({tags:{$in:["bicep"]}}), but this is not a good idea. Let me show you why:
> db.performant.find({tags:{$in:["bicep","chest","trufala"]}}).explain()
{
"cursor" : "BtreeCursor tags_1 multi",
"isMultiKey" : true,
"n" : 2,
"nscannedObjects" : 3,
"nscanned" : 5,
"nscannedObjectsAllPlans" : 3,
"nscannedAllPlans" : 5,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"tags" : [
[
"bicep",
"bicep"
],
[
"chest",
"chest"
],
[
"trufala",
"trufala"
]
]
},
"server" : "none-6674b8f4f2:27017"
}
As you might have already noticed, this query is performing an entire collection scan. This may make you wonder, why the hack we added that index, if its not used and I am wondering that too. But unfortunately, its an issue that is yet to be resolved by mongoDB (at least in my knowledge)
Fortunately though, we can get around this problem and still use the index we created on tags collection. Here is how:
> db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).explain()
{
"clauses" : [
{
"cursor" : "BtreeCursor tags_1",
"isMultiKey" : true,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 10,
"indexBounds" : {
"tags" : [
[
"bicep",
"bicep"
]
]
}
},
{
"cursor" : "BtreeCursor tags_1",
"isMultiKey" : true,
"n" : 0,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"tags" : [
[
"chest",
"chest"
]
]
}
},
{
"cursor" : "BtreeCursor tags_1",
"isMultiKey" : true,
"n" : 1,
"nscannedObjects" : 1,
"nscanned" : 1,
"nscannedObjectsAllPlans" : 1,
"nscannedAllPlans" : 1,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"tags" : [
[
"trufala",
"trufala"
]
]
}
}
],
"n" : 2,
"nscannedObjects" : 3,
"nscanned" : 3,
"nscannedObjectsAllPlans" : 3,
"nscannedAllPlans" : 3,
"millis" : 10,
"server" : "none-6674b8f4f2:27017"
}
As you can see, n is very very close to nscanned. There are three records scanned 1 each corresponding to "bicep","chest","trufala". Since "bicep" and "chest" belong to the same document, only 1 result is returned corresponding to it. In general, both count() and find() will do limited scans and will be very efficient. Also, you never have to serve user with stale data. You can also completely avoid running any sorts of batch jobs whatsoever!!!
So, by using this approach we can derive the following conclusion: If you search by n tags, and each tag is present m times, then total documents scanned will be n * m. Now considering you have huge number of tags and huge number of documents, and you scan by a few tags (which in turn correspond to few documents - although not 1:1), the results will always be super fast, since 1 document scan occurs per tag and document combination.
Note: The index can never be covered with this approach, since there is an index on array i.e. "isMultiKey" : true. You can read more on covered indexes here
Limitations: Every approach has limitations, and this one has too!!
Sorting on results will yield an extremely poor performance as it will scan the entire collection the number of times equal to the tags passed to this query plus it scans additional records corresponding to each argument of $or.
> db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).sort({tags:1}).explain()
{
"cursor" : "BtreeCursor tags_1",
"isMultiKey" : true,
"n" : 2,
"nscannedObjects" : 15,
"nscanned" : 15,
"nscannedObjectsAllPlans" : 15,
"nscannedAllPlans" : 15,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"tags" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "none-6674b8f4f2:27017"
}
In this case, it scans 15 times which is equal to 3 full collection scans of 4 records each plus 3 records scanned by each parameter for $or
Final verdict: Use this approach for very efficient results if you are ok with unsorted results or are willing to make extra effort at your front end to sort them yourself.