dougong7850 2015-06-14 05:29

MongoDB document count fluctuating

I am facing an issue with the document count in a collection being slightly erratic.

Here is my workflow:

Crawling is done first with scrapy. Scraped items are sent through a pipeline and prepared for writing to the collection using the pymongo library.

Next, I check whether the item already exists (using a key). If it does, I reuse its _id and call db.collection.save() to achieve an upsert. Before writing, I also check that all fields are present.

If the item does not exist, a new document is created in the collection.
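For reference, the check-then-write step above can be sketched roughly as follows. This is not my exact pipeline code: the key field name `post_id` and the `required_fields` argument are made up for illustration, and it uses pymongo's `replace_one(..., upsert=True)` in place of the separate existence check plus `save()`, since `save()` is deprecated in newer pymongo releases.

```python
def upsert_item(collection, item, key_field="post_id", required_fields=()):
    """Insert or replace one scraped item, matched on key_field.

    `collection` is expected to behave like a pymongo Collection.
    `key_field` and `required_fields` are hypothetical names used
    only for this sketch.
    """
    # Reject incomplete items, mirroring the "all fields exist" check.
    missing = [f for f in required_fields if f not in item]
    if missing:
        raise ValueError("item is missing fields: %s" % ", ".join(missing))

    # One call replaces the matching document (keeping its _id) or
    # inserts a new one when nothing matches the key.
    return collection.replace_one(
        {key_field: item[key_field]},
        dict(item),
        upsert=True,
    )
```

With a real connection this would be called once per scraped item, e.g. `upsert_item(db.posts, item, required_fields=("post_user", "post_content"))`, where `db.posts` is again only a placeholder name.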

Lastly, a frontend PHP webpage allows users to search for documents in the collection using the PHP MongoDB driver.

Issue

I started noticing on the webpage that some new documents would appear in one crawl, then disappear from view suddenly, and then mysteriously appear again after the next crawl. So I went into mongo shell and found that a specific query would return a fluctuating number of results if sent repeatedly. Something like up by one and then down by two and then back to a stable number.
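A rough way to reproduce the observation from Python rather than the shell is sketched below. It assumes a pymongo Collection object and rebuilds the same `$or` filter as the PHP query in this post; the function names, the repeat count, and the delay are arbitrary choices for illustration, and `re.escape` is only an approximation of whatever escaping produced `$safe`.

```python
import re
import time


def build_query(term):
    """Build a filter equivalent to the PHP $or query in this post."""
    safe = re.escape(term)  # assumption: plain-text search term
    return {"$or": [
        {"post_content": {"$regex": safe, "$options": "i"}},
        {"post_user": {"$regex": "^" + safe + "$", "$options": "i"}},
    ]}


def sample_counts(collection, query, repeats=10, delay_seconds=0.5):
    """Run the same count repeatedly and return every result."""
    counts = []
    for _ in range(repeats):
        counts.append(collection.count_documents(query))
        time.sleep(delay_seconds)
    return counts
```

If the returned list contains different values while a crawl is writing, that matches the fluctuation described above.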

The thing I don't get is that at no point in the code do I remove() any documents from the collection. My impression is that db.collection.save() will only result in an equal or increasing number of documents in the collection.

Is there some form of blocking whereby documents being written cannot be queried? Or does it have something to do with my crawling interval?

Notes:

  • No indexing is done on the collection
  • Each crawl+write run takes only about 5-10 s and is repeated at 30 s intervals.

Code snippet of the query:

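    // $safe is the user's search term (assumed to be regex-escaped already).
    // Match it anywhere in post_content, or as the whole post_user, case-insensitively.
    $cursor = $collection->find(array( '$or' => array(
            array('post_content' => new MongoRegex("/$safe/i")),
            array('post_user' => new MongoRegex("/^$safe$/i"))
    )));
    // Newest posts first.
    $cursor->sort(array('post_datetime' => -1));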