dongwen2794 2013-09-04 10:37

已采纳

如何创建一个在大型动态数据集中计算标记的性能系统

Overview

I have an iOS app where people can search by tags some of the tags will be pre-defined and some will be user defined.

As the user write the tags that he/she wants to search for I would like to display a row that shows the number of results available with those tags (See example search picture).

Note: #Exercise or #Routine are parent tag meaning that the person is always going to use one of those first.

I am using PHP and MongoDB server side. I though of creating a file with the tags count every hour so that all clients can fetch it and minimize resource consumption.

Given that the manipulated information are tags which are user controlled, the list will expand considerably over time.

The challenge

I am puzzled on whats the best approach when taking into account performance & overhead for creating, manipulating & storing such list.

My first idea was to create a 2d array (see pic) that will store all my values. This then will be transformed into a JSON so that it can stored into MongoDB.

But this approach will make me to fetch all tags and load them in memory to do any +1 or -1. Thus, I do not think it might be the best.

All the manipulation would take place on insert,update and delete of each element. Thus there will be a considerable RAM overhead.

My second idea was to create a document where I store all my used tags and make the count queries every hour to produce the list used by the clients.

This would mean on every delete, update and insert check that the tag exist on this document and create or delete or do nothing depending on condition.

Every hour, fetch all tags, produce an array with all tags combinations. Query the DB against all tags combinations and count the number of results returned and create the file.

This approach, I think, could be the better one given that I use MongoDB and not MySQL. But I am still unsure of its performance.

Have anyone made a similar system and could advice on a better approach?

Example Picture of the search

2d Array

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

3条回答默认最新

doucongmishang2385 2013-09-08 06:05

关注

One of the most important things to consider when using mongodb is that you must think in terms of your application's access patterns when deciding on database design. We will try to solve your problem by using an example and see how it works.

Let's consider that you have the following documents in your collection and let us see below how to make this simple data format work wonders:

> db.performant.find()
{ "_id" : ObjectId("522bf7166094a4e72db22827"), "name" : "abc", "tags" : [  "chest",  "bicep",  "tricep" ] }
{ "_id" : ObjectId("522bf7406094a4e72db22828"), "name" : "def", "tags" : [  "routine",  "trufala",  "tricep" ] }
{ "_id" : ObjectId("522bf75f6094a4e72db22829"), "name" : "xyz", "tags" : [  "routine",  "myTag",  "tricep",  "myTag2" ] }
{ "_id" : ObjectId("522bf7876094a4e72db2282a"), "name" : "mno", "tags" : [  "exercise",  "myTag",  "tricep",  "myTag2",  "biceps" ] }

First and foremost you absolutely must create an index on tags. (you can create a compound index if desire)

> db.performant.ensureIndex({tags:1})
> db.performant.getIndexes()
[
    {
            "v" : 1,
            "key" : {
                    "_id" : 1
            },
            "ns" : "test.performant",
            "name" : "_id_"
    },
    {
            "v" : 1,
            "key" : {
                    "tags" : 1
            },
            "ns" : "test.performant",
            "name" : "tags_1"
    }
]

To query the tags data from the above collection, one would normally use db.performant.find({tags:{$in:["bicep"]}}), but this is not a good idea. Let me show you why:

> db.performant.find({tags:{$in:["bicep","chest","trufala"]}}).explain()
{
    "cursor" : "BtreeCursor tags_1 multi",
    "isMultiKey" : true,
    "n" : 2,
    "nscannedObjects" : 3,
    "nscanned" : 5,
    "nscannedObjectsAllPlans" : 3,
    "nscannedAllPlans" : 5,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
            "tags" : [
                    [
                            "bicep",
                            "bicep"
                    ],
                    [
                            "chest",
                            "chest"
                    ],
                    [
                            "trufala",
                            "trufala"
                    ]
            ]
    },
    "server" : "none-6674b8f4f2:27017"
}

As you might have already noticed, this query is performing an entire collection scan. This may make you wonder, why the hack we added that index, if its not used and I am wondering that too. But unfortunately, its an issue that is yet to be resolved by mongoDB (at least in my knowledge)

Fortunately though, we can get around this problem and still use the index we created on tags collection. Here is how:

> db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).explain()
{
    "clauses" : [
            {
                    "cursor" : "BtreeCursor tags_1",
                    "isMultiKey" : true,
                    "n" : 1,
                    "nscannedObjects" : 1,
                    "nscanned" : 1,
                    "nscannedObjectsAllPlans" : 1,
                    "nscannedAllPlans" : 1,
                    "scanAndOrder" : false,
                    "indexOnly" : false,
                    "nYields" : 0,
                    "nChunkSkips" : 0,
                    "millis" : 10,
                    "indexBounds" : {
                            "tags" : [
                                    [
                                            "bicep",
                                            "bicep"
                                    ]
                            ]
                    }
            },
            {
                    "cursor" : "BtreeCursor tags_1",
                    "isMultiKey" : true,
                    "n" : 0,
                    "nscannedObjects" : 1,
                    "nscanned" : 1,
                    "nscannedObjectsAllPlans" : 1,
                    "nscannedAllPlans" : 1,
                    "scanAndOrder" : false,
                    "indexOnly" : false,
                    "nYields" : 0,
                    "nChunkSkips" : 0,
                    "millis" : 0,
                    "indexBounds" : {
                            "tags" : [
                                    [
                                            "chest",
                                            "chest"
                                    ]
                            ]
                    }
            },
            {
                    "cursor" : "BtreeCursor tags_1",
                    "isMultiKey" : true,
                    "n" : 1,
                    "nscannedObjects" : 1,
                    "nscanned" : 1,
                    "nscannedObjectsAllPlans" : 1,
                    "nscannedAllPlans" : 1,
                    "scanAndOrder" : false,
                    "indexOnly" : false,
                    "nYields" : 0,
                    "nChunkSkips" : 0,
                    "millis" : 0,
                    "indexBounds" : {
                            "tags" : [
                                    [
                                            "trufala",
                                            "trufala"
                                    ]
                            ]
                    }
            }
    ],
    "n" : 2,
    "nscannedObjects" : 3,
    "nscanned" : 3,
    "nscannedObjectsAllPlans" : 3,
    "nscannedAllPlans" : 3,
    "millis" : 10,
    "server" : "none-6674b8f4f2:27017"
}

As you can see, n is very very close to nscanned. There are three records scanned 1 each corresponding to "bicep","chest","trufala". Since "bicep" and "chest" belong to the same document, only 1 result is returned corresponding to it. In general, both count() and find() will do limited scans and will be very efficient. Also, you never have to serve user with stale data. You can also completely avoid running any sorts of batch jobs whatsoever!!!

So, by using this approach we can derive the following conclusion: If you search by n tags, and each tag is present m times, then total documents scanned will be n * m. Now considering you have huge number of tags and huge number of documents, and you scan by a few tags (which in turn correspond to few documents - although not 1:1), the results will always be super fast, since 1 document scan occurs per tag and document combination.

Note: The index can never be covered with this approach, since there is an index on array i.e. "isMultiKey" : true. You can read more on covered indexes here

Limitations: Every approach has limitations, and this one has too!! Sorting on results will yield an extremely poor performance as it will scan the entire collection the number of times equal to the tags passed to this query plus it scans additional records corresponding to each argument of $or.

> db.performant.find({$or:[{tags:"bicep"},{tags:"chest"},{tags:"trufala"}]}).sort({tags:1}).explain()
{
    "cursor" : "BtreeCursor tags_1",
    "isMultiKey" : true,
    "n" : 2,
    "nscannedObjects" : 15,
    "nscanned" : 15,
    "nscannedObjectsAllPlans" : 15,
    "nscannedAllPlans" : 15,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
            "tags" : [
                    [
                            {
                                    "$minElement" : 1
                            },
                            {
                                    "$maxElement" : 1
                            }
                    ]
            ]
    },
    "server" : "none-6674b8f4f2:27017"
}

In this case, it scans 15 times which is equal to 3 full collection scans of 4 records each plus 3 records scanned by each parameter for $or

Final verdict: Use this approach for very efficient results if you are ok with unsorted results or are willing to make extra effort at your front end to sort them yourself.

本回答被题主选为最佳回答 , 对您是否有帮助呢?

查看更多回答(2条)

报告相同问题？

关注问题

如何创建一个在大型动态数据集中计算标记的性能系统 mongodb php
2013-09-04 10:37

回答 3 已采纳 One of the most important things to consider when using mongodb is that you must think in terms of
在创建列表时 li标记符的结束标记符不可省略 html5
2022-12-20 16:56

回答 2 已采纳你可以自己用浏览器工具看下你写的代码，被浏览器解析后的样子，其实是被加上的了。相当于浏览器给你补全了。但是规范的写法还是要加上结束标记符，不加的话在比较严格的情况下肯定会报错。
要导入一个excel表并且对里面的数据进行分类标记要怎么做呢 matlab
2021-09-14 13:12

回答 1 已采纳你好，这里给你先导入表格，分别标记为1，2，3，4,...20类 T = readtable('yourexcel.xlsx', 'readvariablenames', false); %没有表头的
数据中台建设方案-基于大数据平台
2023-03-14 16:42

FRDATA1550333的博客通过对客户大数据应用平台服务需求的理解，根据建设目标、设计原则的多方面考虑，建议采用星环... 通过建立大数据集成平台、大数据计算平台、大数据开发平台及大数据运维平台来满足客户大数据应用平台服务建设的要求。
如何创建一个二叉树？有人想的出来嘛？ c++ c语言算法
2022-11-09 06:52

回答 1 已采纳 ```cpp #include using namespace std; typedef struct node { char val; struct node *left;
yolov5怎么在图像中标记漏检？ opencv python 计算机视觉
2023-02-24 20:57

回答 4 已采纳该回答内容部分引用GPT，GPT_Pro更好的解决问题使用yolov5进行图像检测，在确定目标检出时，采用的是iou（交并比）>0.45的方式。如果想要在图像中标记漏检的目标，我们可以通过计算i
如何用二叉链表创建一个二叉树 c++ c语言广度优先
2022-11-09 00:09

回答 1 已采纳帮你找了个相似的问题, 你可以看下: https://ask.csdn.net/questions/7739306这篇博客你也可以参考下：创建二叉树的二叉链表存储结构这篇博客也不错, 你可以看下创建二
分布式计算、云计算与大数据
2019-12-03 11:50

昵称小丁的博客分布式计算是研究把一个需要非常巨大的计算能力解决的问题分成许多小的部分，然后把这些部分分配给许多计算机进行处理，最后把各部分的计算结果合并起来得到的最终成果（分而治之）。分布式计算的优缺点：优点：...
请问怎么在@Autowired执行之前来执行一段代码，来给@Autowired标记的interface来动态生成一个实现类 java spring spring boot 有问必答
2021-12-30 15:05

回答 1 已采纳第一：你就算自己注入了接口的实现类，你觉得你用了Autowired以后，他就会跳过这个字段嘛？第二：InstantiationAwareBeanPostProcessor.postProcessAft
用C语言编写一个学生成绩管理系统 c语言有问必答
2021-07-07 22:34

回答 1 已采纳参考： #include<stdio.h> #include<string.h> #include<stdlib.h> #define N 100
关于c语言的一个系统 c++ c语言有问必答
2022-11-04 23:56

回答 3 已采纳反正就是输入输出函数使用的错误导致崩溃 #include <stdio.h> #include <stdlib.h> #include <string.h> #
大数据开发离线计算框架知识点总结
2022-10-11 18:17

我想去吃ya的博客 Hadoop是一个分布式系统架构，由Apache基金会所开发，其核心主要包括两个组件：HDFS和MapReduce，前者为海量存储提供了存储，而后者为海量的数据提供了计算。这里我们主要关注MapReduce。以下资料来源于Hadoop的官方...
是否可以在一个php文档中包含多个file（）标记 php
2019-02-26 20:13

回答 1 已采纳 In my example at least this code has to be above the file() tag or under another tag but not direc
大数据分布式计算系统 Spark 入门核心之 RDD
2022-03-22 10:29

恒生LIGHT云社区的博客 Apache Spark 是一个快速且通用的集群计算系统。提供 Java、Scala、Python 和 R 中的高级 API，以及支持通用执行图的优化引擎。它还支持一组丰富的高级工具，包括用于 SQL 和结构化数据处理的 Spark SQL、用于机器...
顶级赛事：第十届CCF大数据与计算智能大赛开赛！
2022-09-02 11:19

DataFountain数据科学的博客由中国计算机学会主办的第十届CCF 大数据与计算智能大赛，携20道赛题、百万奖金来袭！欢迎广大数据爱好者参赛PK！
没有解决我的问题, 去提问

悬赏问题

¥15 微信会员卡接入微信支付商户号收款
¥15 如何获取烟草零售终端数据
¥15 数学建模招标中位数问题
¥15 phython路径名过长报错不知道什么问题
¥15 深度学习中模型转换该怎么实现
¥15 HLs设计手写数字识别程序编译通不过
¥15 Stata外部命令安装问题求帮助！
¥15 从键盘随机输入A-H中的一串字符串，用七段数码管方法进行绘制。提交代码及运行截图。
¥15 TYPCE母转母，插入认方向
¥15 如何用python向钉钉机器人发送可以放大的图片？

码龄粉丝数原力等级 --

如何创建一个在大型动态数据集中计算标记的性能系统

3条回答默认最新

码龄粉丝数原力等级 --

悬赏问题

如何创建一个在大型动态数据集中计算标记的性能系统

3条回答 默认 最新

悬赏问题

3条回答默认最新