dongwen2794 2013-09-04 10:37
浏览 40
已采纳

如何创建一个在大型动态数据集中计算标记的性能系统

Overview

I have an iOS app where people can search by tags some of the tags will be pre-defined and some will be user defined.

As the user write the tags that he/she wants to search for I would like to display a row that shows the number of results available with those tags (See example search picture).

Note: #Exercise or #Routine are parent tag meaning that the person is always going to use one of those first.

I am using PHP and MongoDB server side. I though of creating a file with the tags count every hour so that all clients can fetch it and minimize resource consumption.

Given that the manipulated information are tags which are user controlled, the list will expand considerably over time.

The challenge

  • I am puzzled on whats the best approach when taking into account performance & overhead for creating, manipulating & storing such list.

My first idea was to create a 2d array (see pic) that will store all my values. This then will be transformed into a JSON so that it can stored into MongoDB.

But this approach will make me to fetch all tags and load them in memory to do any +1 or -1. Thus, I do not think it might be the best.

All the manipulation would take place on insert,update and delete of each element. Thus there will be a considerable RAM overhead.

My second idea was to create a document where I store all my used tags and make the count queries every hour to produce the list used by the clients.

This would mean on every delete, update and insert check that the tag exist on this document and create or delete or do nothing depending on condition.

Every hour, fetch all tags, produce an array with all tags combinations. Query the DB against all tags combinations and count the number of results returned and create the file.

This approach, I think, could be the better one given that I use MongoDB and not MySQL. But I am still unsure of its performance.

Have anyone made a similar system and could advice on a better approach?

Example Picture of the search

2d Array

  • 写回答

3条回答 默认 最新

  • doucongmishang2385 2013-09-08 06:05
    关注

    One of the most important things to consider when using mongodb is that you must think in terms of your application's access patterns when deciding on database design. We will try to solve your problem by using an example and see how it works.

    Let's consider that you have the following documents in your collection and let us see below how to make this simple data format work wonders:

    > db.performant.find()
    { "_id" : ObjectId("522bf7166094a4e72db22827"), "name" : "abc", "tags" : [  "chest",  "bicep",  "tricep" ] }
    { "_id" : ObjectId("522bf7406094a4e72db22828"), "name" : "def", "tags" : [  "routine",  "trufala",  "tricep" ] }
    { "_id" : ObjectId("522bf75f6094a4e72db22829"), "name" : "xyz", "tags" : [  "routine",  "myTag",  "tricep",  "myTag2" ] }
    { "_id" : ObjectId("522bf7876094a4e72db2282a"), "name" : "mno", "tags" : [  "exercise",  "myTag",  "tricep",  "myTag2",  "biceps" ] }
    

    First and foremost you absolutely must create an index on tags. (you can create a compound index if desire)

    > db.performant.ensureIndex({tags:1})
    > db.performant.getIndexes()
    [
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.performant",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "tags" : 1
                },
                "ns" : "test.performant",
                "name" : "tags_1"
        }
    ]
    

    To query the tags data from the above collection, one would normally use db.performant.find({tags:{$in:["bicep"]}}), but this is not a good idea. Let me show you why:

    > db.performant.find({tags:{$in:["bicep","chest","trufala"]}}).explain()
    {
        "cursor" : "BtreeCursor tags_1 multi",
        "isMultiKey" : true,
        "n" : 2,
        "nscannedObjects" : 3,
        "nscanned" : 5,
        "nscannedObjectsAllPlans" : 3,
        "nscannedAllPlans" : 5,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
                "tags" : [
                        [
                                "bicep",
                                "bicep"
                        ],
                        [
                                "chest",
                                "chest"
                        ],
                        [
                                "trufala",
                                "trufala"
                        ]
                ]
        },
        "server" : "none-6674b8f4f2:27017"
    }
    

    As you might have already noticed, this query is performing an entire collection scan. This may make you wonder, why the hack we added that index, if its not used and I am wondering that too. But unfortunately, its an issue that is yet to be resolved by mongoDB (at least in my knowledge)

    Fortunately though, we can get around this problem and still use the index we created on tags collection. Here is how:

    > db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).explain()
    {
        "clauses" : [
                {
                        "cursor" : "BtreeCursor tags_1",
                        "isMultiKey" : true,
                        "n" : 1,
                        "nscannedObjects" : 1,
                        "nscanned" : 1,
                        "nscannedObjectsAllPlans" : 1,
                        "nscannedAllPlans" : 1,
                        "scanAndOrder" : false,
                        "indexOnly" : false,
                        "nYields" : 0,
                        "nChunkSkips" : 0,
                        "millis" : 10,
                        "indexBounds" : {
                                "tags" : [
                                        [
                                                "bicep",
                                                "bicep"
                                        ]
                                ]
                        }
                },
                {
                        "cursor" : "BtreeCursor tags_1",
                        "isMultiKey" : true,
                        "n" : 0,
                        "nscannedObjects" : 1,
                        "nscanned" : 1,
                        "nscannedObjectsAllPlans" : 1,
                        "nscannedAllPlans" : 1,
                        "scanAndOrder" : false,
                        "indexOnly" : false,
                        "nYields" : 0,
                        "nChunkSkips" : 0,
                        "millis" : 0,
                        "indexBounds" : {
                                "tags" : [
                                        [
                                                "chest",
                                                "chest"
                                        ]
                                ]
                        }
                },
                {
                        "cursor" : "BtreeCursor tags_1",
                        "isMultiKey" : true,
                        "n" : 1,
                        "nscannedObjects" : 1,
                        "nscanned" : 1,
                        "nscannedObjectsAllPlans" : 1,
                        "nscannedAllPlans" : 1,
                        "scanAndOrder" : false,
                        "indexOnly" : false,
                        "nYields" : 0,
                        "nChunkSkips" : 0,
                        "millis" : 0,
                        "indexBounds" : {
                                "tags" : [
                                        [
                                                "trufala",
                                                "trufala"
                                        ]
                                ]
                        }
                }
        ],
        "n" : 2,
        "nscannedObjects" : 3,
        "nscanned" : 3,
        "nscannedObjectsAllPlans" : 3,
        "nscannedAllPlans" : 3,
        "millis" : 10,
        "server" : "none-6674b8f4f2:27017"
    }
    

    As you can see, n is very very close to nscanned. There are three records scanned 1 each corresponding to "bicep","chest","trufala". Since "bicep" and "chest" belong to the same document, only 1 result is returned corresponding to it. In general, both count() and find() will do limited scans and will be very efficient. Also, you never have to serve user with stale data. You can also completely avoid running any sorts of batch jobs whatsoever!!!

    So, by using this approach we can derive the following conclusion: If you search by n tags, and each tag is present m times, then total documents scanned will be n * m. Now considering you have huge number of tags and huge number of documents, and you scan by a few tags (which in turn correspond to few documents - although not 1:1), the results will always be super fast, since 1 document scan occurs per tag and document combination.

    Note: The index can never be covered with this approach, since there is an index on array i.e. "isMultiKey" : true. You can read more on covered indexes here

    Limitations: Every approach has limitations, and this one has too!! Sorting on results will yield an extremely poor performance as it will scan the entire collection the number of times equal to the tags passed to this query plus it scans additional records corresponding to each argument of $or.

    > db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).sort({tags:1}).explain()
    {
        "cursor" : "BtreeCursor tags_1",
        "isMultiKey" : true,
        "n" : 2,
        "nscannedObjects" : 15,
        "nscanned" : 15,
        "nscannedObjectsAllPlans" : 15,
        "nscannedAllPlans" : 15,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
                "tags" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "none-6674b8f4f2:27017"
    } 
    

    In this case, it scans 15 times which is equal to 3 full collection scans of 4 records each plus 3 records scanned by each parameter for $or

    Final verdict: Use this approach for very efficient results if you are ok with unsorted results or are willing to make extra effort at your front end to sort them yourself.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(2条)

报告相同问题?

悬赏问题

  • ¥15 微信会员卡接入微信支付商户号收款
  • ¥15 如何获取烟草零售终端数据
  • ¥15 数学建模招标中位数问题
  • ¥15 phython路径名过长报错 不知道什么问题
  • ¥15 深度学习中模型转换该怎么实现
  • ¥15 HLs设计手写数字识别程序编译通不过
  • ¥15 Stata外部命令安装问题求帮助!
  • ¥15 从键盘随机输入A-H中的一串字符串,用七段数码管方法进行绘制。提交代码及运行截图。
  • ¥15 TYPCE母转母,插入认方向
  • ¥15 如何用python向钉钉机器人发送可以放大的图片?