dongwen2794 2013-09-04 10:37
浏览 40
已采纳

如何创建一个在大型动态数据集中计算标记的性能系统

Overview

I have an iOS app where people can search by tags some of the tags will be pre-defined and some will be user defined.

As the user write the tags that he/she wants to search for I would like to display a row that shows the number of results available with those tags (See example search picture).

Note: #Exercise or #Routine are parent tag meaning that the person is always going to use one of those first.

I am using PHP and MongoDB server side. I though of creating a file with the tags count every hour so that all clients can fetch it and minimize resource consumption.

Given that the manipulated information are tags which are user controlled, the list will expand considerably over time.

The challenge

  • I am puzzled on whats the best approach when taking into account performance & overhead for creating, manipulating & storing such list.

My first idea was to create a 2d array (see pic) that will store all my values. This then will be transformed into a JSON so that it can stored into MongoDB.

But this approach will make me to fetch all tags and load them in memory to do any +1 or -1. Thus, I do not think it might be the best.

All the manipulation would take place on insert,update and delete of each element. Thus there will be a considerable RAM overhead.

My second idea was to create a document where I store all my used tags and make the count queries every hour to produce the list used by the clients.

This would mean on every delete, update and insert check that the tag exist on this document and create or delete or do nothing depending on condition.

Every hour, fetch all tags, produce an array with all tags combinations. Query the DB against all tags combinations and count the number of results returned and create the file.

This approach, I think, could be the better one given that I use MongoDB and not MySQL. But I am still unsure of its performance.

Have anyone made a similar system and could advice on a better approach?

Example Picture of the search

2d Array

  • 写回答

3条回答 默认 最新

  • doucongmishang2385 2013-09-08 06:05
    关注

    One of the most important things to consider when using mongodb is that you must think in terms of your application's access patterns when deciding on database design. We will try to solve your problem by using an example and see how it works.

    Let's consider that you have the following documents in your collection and let us see below how to make this simple data format work wonders:

    > db.performant.find()
    { "_id" : ObjectId("522bf7166094a4e72db22827"), "name" : "abc", "tags" : [  "chest",  "bicep",  "tricep" ] }
    { "_id" : ObjectId("522bf7406094a4e72db22828"), "name" : "def", "tags" : [  "routine",  "trufala",  "tricep" ] }
    { "_id" : ObjectId("522bf75f6094a4e72db22829"), "name" : "xyz", "tags" : [  "routine",  "myTag",  "tricep",  "myTag2" ] }
    { "_id" : ObjectId("522bf7876094a4e72db2282a"), "name" : "mno", "tags" : [  "exercise",  "myTag",  "tricep",  "myTag2",  "biceps" ] }
    

    First and foremost you absolutely must create an index on tags. (you can create a compound index if desire)

    > db.performant.ensureIndex({tags:1})
    > db.performant.getIndexes()
    [
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.performant",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "tags" : 1
                },
                "ns" : "test.performant",
                "name" : "tags_1"
        }
    ]
    

    To query the tags data from the above collection, one would normally use db.performant.find({tags:{$in:["bicep"]}}), but this is not a good idea. Let me show you why:

    > db.performant.find({tags:{$in:["bicep","chest","trufala"]}}).explain()
    {
        "cursor" : "BtreeCursor tags_1 multi",
        "isMultiKey" : true,
        "n" : 2,
        "nscannedObjects" : 3,
        "nscanned" : 5,
        "nscannedObjectsAllPlans" : 3,
        "nscannedAllPlans" : 5,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
                "tags" : [
                        [
                                "bicep",
                                "bicep"
                        ],
                        [
                                "chest",
                                "chest"
                        ],
                        [
                                "trufala",
                                "trufala"
                        ]
                ]
        },
        "server" : "none-6674b8f4f2:27017"
    }
    

    As you might have already noticed, this query is performing an entire collection scan. This may make you wonder, why the hack we added that index, if its not used and I am wondering that too. But unfortunately, its an issue that is yet to be resolved by mongoDB (at least in my knowledge)

    Fortunately though, we can get around this problem and still use the index we created on tags collection. Here is how:

    > db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).explain()
    {
        "clauses" : [
                {
                        "cursor" : "BtreeCursor tags_1",
                        "isMultiKey" : true,
                        "n" : 1,
                        "nscannedObjects" : 1,
                        "nscanned" : 1,
                        "nscannedObjectsAllPlans" : 1,
                        "nscannedAllPlans" : 1,
                        "scanAndOrder" : false,
                        "indexOnly" : false,
                        "nYields" : 0,
                        "nChunkSkips" : 0,
                        "millis" : 10,
                        "indexBounds" : {
                                "tags" : [
                                        [
                                                "bicep",
                                                "bicep"
                                        ]
                                ]
                        }
                },
                {
                        "cursor" : "BtreeCursor tags_1",
                        "isMultiKey" : true,
                        "n" : 0,
                        "nscannedObjects" : 1,
                        "nscanned" : 1,
                        "nscannedObjectsAllPlans" : 1,
                        "nscannedAllPlans" : 1,
                        "scanAndOrder" : false,
                        "indexOnly" : false,
                        "nYields" : 0,
                        "nChunkSkips" : 0,
                        "millis" : 0,
                        "indexBounds" : {
                                "tags" : [
                                        [
                                                "chest",
                                                "chest"
                                        ]
                                ]
                        }
                },
                {
                        "cursor" : "BtreeCursor tags_1",
                        "isMultiKey" : true,
                        "n" : 1,
                        "nscannedObjects" : 1,
                        "nscanned" : 1,
                        "nscannedObjectsAllPlans" : 1,
                        "nscannedAllPlans" : 1,
                        "scanAndOrder" : false,
                        "indexOnly" : false,
                        "nYields" : 0,
                        "nChunkSkips" : 0,
                        "millis" : 0,
                        "indexBounds" : {
                                "tags" : [
                                        [
                                                "trufala",
                                                "trufala"
                                        ]
                                ]
                        }
                }
        ],
        "n" : 2,
        "nscannedObjects" : 3,
        "nscanned" : 3,
        "nscannedObjectsAllPlans" : 3,
        "nscannedAllPlans" : 3,
        "millis" : 10,
        "server" : "none-6674b8f4f2:27017"
    }
    

    As you can see, n is very very close to nscanned. There are three records scanned 1 each corresponding to "bicep","chest","trufala". Since "bicep" and "chest" belong to the same document, only 1 result is returned corresponding to it. In general, both count() and find() will do limited scans and will be very efficient. Also, you never have to serve user with stale data. You can also completely avoid running any sorts of batch jobs whatsoever!!!

    So, by using this approach we can derive the following conclusion: If you search by n tags, and each tag is present m times, then total documents scanned will be n * m. Now considering you have huge number of tags and huge number of documents, and you scan by a few tags (which in turn correspond to few documents - although not 1:1), the results will always be super fast, since 1 document scan occurs per tag and document combination.

    Note: The index can never be covered with this approach, since there is an index on array i.e. "isMultiKey" : true. You can read more on covered indexes here

    Limitations: Every approach has limitations, and this one has too!! Sorting on results will yield an extremely poor performance as it will scan the entire collection the number of times equal to the tags passed to this query plus it scans additional records corresponding to each argument of $or.

    > db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).sort({tags:1}).explain()
    {
        "cursor" : "BtreeCursor tags_1",
        "isMultiKey" : true,
        "n" : 2,
        "nscannedObjects" : 15,
        "nscanned" : 15,
        "nscannedObjectsAllPlans" : 15,
        "nscannedAllPlans" : 15,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
                "tags" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "none-6674b8f4f2:27017"
    } 
    

    In this case, it scans 15 times which is equal to 3 full collection scans of 4 records each plus 3 records scanned by each parameter for $or

    Final verdict: Use this approach for very efficient results if you are ok with unsorted results or are willing to make extra effort at your front end to sort them yourself.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(2条)

报告相同问题?

悬赏问题

  • ¥15 树莓派与pix飞控通信
  • ¥15 自动转发微信群信息到另外一个微信群
  • ¥15 outlook无法配置成功
  • ¥30 这是哪个作者做的宝宝起名网站
  • ¥60 版本过低apk如何修改可以兼容新的安卓系统
  • ¥25 由IPR导致的DRIVER_POWER_STATE_FAILURE蓝屏
  • ¥50 有数据,怎么建立模型求影响全要素生产率的因素
  • ¥50 有数据,怎么用matlab求全要素生产率
  • ¥15 TI的insta-spin例程
  • ¥15 完成下列问题完成下列问题