dougong7850 2015-06-14 05:29

MongoDB document count fluctuates

I am facing an issue with the document count in a collection being slightly erratic.

Here is my workflow:

Crawling is first done with scrapy. Scraped items are sent through a pipeline and prepared for writing to the collection using the pymongo library.

Next, the pipeline checks whether the item already exists (using a key). If it does, the item inherits the existing _id and db.collection.save() is used to achieve an upsert. A check is done to ensure that all fields exist before writing.

If the item does not exist, a new document is created in the collection.
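The write step described above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the key field name `post_id` and the function `write_item` are hypothetical (the post does not say which key is used), the other field names are taken from the query further down, and `collection` is assumed to be a pymongo `Collection`. It uses `replace_one()` and `insert_one()` in place of the older `save()`.

```python
# Fields that must be present before writing. "post_id" is a hypothetical
# key name; the other three appear in the PHP query in this post.
REQUIRED_FIELDS = ("post_id", "post_user", "post_content", "post_datetime")


def write_item(collection, item, key_field="post_id"):
    """Update the document matching the item's key, or insert a new one.

    `collection` is assumed to be a pymongo Collection (or anything with
    find_one / replace_one / insert_one). Returns "updated" or "inserted".
    """
    missing = [f for f in REQUIRED_FIELDS if f not in item]
    if missing:
        raise ValueError("item is missing fields: %s" % ", ".join(missing))

    doc = dict(item)
    # Look up only the _id of an existing document with the same key.
    existing = collection.find_one({key_field: doc[key_field]}, {"_id": 1})
    if existing is not None:
        # Inherit the _id so the existing document is replaced in place.
        doc["_id"] = existing["_id"]
        collection.replace_one({"_id": doc["_id"]}, doc)
        return "updated"

    collection.insert_one(doc)
    return "inserted"
```

Neither branch deletes anything, so a sequence of such writes should leave the number of documents equal or higher.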

Lastly, a frontend PHP webpage allows users to search for documents in the collection using the PHP MongoDB driver.

Issue

I started noticing on the webpage that some new documents would appear in one crawl, then suddenly disappear from view, and then mysteriously appear again after the next crawl. So I went into the mongo shell and found that a specific query would return a fluctuating number of results if sent repeatedly. For example, the count would go up by one, then down by two, and then settle back to a stable number.
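One way to capture the fluctuation is to run the same query several times in a row and record how many documents each run returns. The sketch below is only an illustration under assumptions: `build_query`, `sample_counts`, and their parameters are hypothetical names, `collection` is assumed to be a pymongo `Collection`, and the filter is a Python rendering of the PHP query shown in this post.

```python
import re
import time


def build_query(term):
    """Case-insensitive match on post_content, or exact match on post_user."""
    safe = re.escape(term)  # escape regex metacharacters in user input
    return {"$or": [
        {"post_content": {"$regex": safe, "$options": "i"}},
        {"post_user": {"$regex": "^" + safe + "$", "$options": "i"}},
    ]}


def sample_counts(collection, query, samples=10, delay=0.5):
    """Run the same query repeatedly and return the result count of each run."""
    counts = []
    for _ in range(samples):
        # Iterate the cursor so the count reflects what a reader would see.
        counts.append(sum(1 for _ in collection.find(query)))
        time.sleep(delay)
    return counts


def is_stable(counts):
    """True if every run returned the same number of documents."""
    return len(set(counts)) <= 1
```

Running `sample_counts` while a crawl is writing, and again while the crawler is idle, would show whether the variation only happens during writes.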

The thing I don't get is that at no point in the code do I remove() any documents from the collection. My impression is that db.collection.save() will only result in an equal or increasing number of documents in the collection.

Is there some form of blocking whereby documents being written cannot be queried? Or does it have something to do with my crawling interval?

Notes:

  • No indexing is done on the collection
  • Each crawl+write process takes only about 5-10s and is repeated at 30s intervals.

Code snippet of the query:

    $cursor = $collection->find(array( '$or' => array(
            array('post_content' => new MongoRegex("/$safe/i")),
            array('post_user' => new MongoRegex("/^$safe$/i"))
    )));
    $cursor->sort(array('post_datetime' => -1));