I am facing an issue with the document count in a collection being slightly erratic.
Here is my workflow:
Crawling is first done with scrapy. Scraped items are sent through a pipeline and prepared for writing to the collection using the pymongo library.
Next, a check is performed to see whether the item already exists (using a key). If it does, the item inherits the existing _id and db.collection.save() is used to achieve an upsert. A check is also done to ensure that all fields exist before writing.
If the item does not exist, a new document is created in the collection.
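The write step above can be sketched roughly as follows. This is a simplified illustration, not my exact pipeline code: the helper name `upsert_item`, the `key_field` argument and the `REQUIRED_FIELDS` list are placeholders, and it uses pymongo's `find_one`, `replace_one` and `insert_one` where my code calls `save()`. The `collection` argument is assumed to be a pymongo `Collection`.

```python
# Assumed field list, taken from the fields used in the search query.
REQUIRED_FIELDS = ("post_user", "post_content", "post_datetime")


def upsert_item(collection, item, key_field):
    """Write one scraped item; return "skipped", "updated" or "inserted"."""
    doc = dict(item)

    # Ensure that all fields exist before writing.
    if key_field not in doc or any(f not in doc for f in REQUIRED_FIELDS):
        return "skipped"

    # Check whether a document with the same key already exists.
    existing = collection.find_one({key_field: doc[key_field]})
    if existing is not None:
        # Inherit the existing _id and overwrite that document in place.
        doc["_id"] = existing["_id"]
        collection.replace_one({"_id": existing["_id"]}, doc)
        return "updated"

    # No match: create a new document.
    collection.insert_one(doc)
    return "inserted"
```

Nothing in this path deletes a document, which is why I expect the document count to stay the same or grow.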
Lastly, a frontend PHP webpage allows users to search for documents in the collection using the PHP MongoDB driver.
Issue
I started noticing on the webpage that some new documents would appear in one crawl, then disappear from view suddenly, and then mysteriously appear again after the next crawl. So I went into the mongo shell and found that a specific query would return a fluctuating number of results if sent repeatedly. For example, the count would go up by one, then down by two, and then settle back to a stable number.
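For anyone who wants to reproduce the fluctuation outside the shell, a minimal sketch along these lines should do. It is not what I actually ran: `sample_counts` is a hypothetical helper, `collection` is assumed to be a pymongo `Collection`, and `query` is whatever filter shows the problem. It walks the full cursor each time, since that is where the results seem to vary.

```python
import time


def sample_counts(collection, query, repeats=10, delay_s=1.0):
    """Run the same query several times and record how many documents come back."""
    counts = []
    for i in range(repeats):
        # Iterate the cursor fully and count the documents it yields.
        counts.append(sum(1 for _ in collection.find(query)))
        if i < repeats - 1:
            time.sleep(delay_s)
    return counts


def is_stable(counts):
    """True if every run returned the same number of documents."""
    return len(set(counts)) <= 1
```

Running this while a crawl is writing, and again while the crawler is idle, would show whether the varying counts only occur during writes.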
The thing I don't get is that at no point in the code do I remove() any documents from the collection. My impression is that db.collection.save() will only result in an equal or increasing number of documents in the collection.
Is there some form of blocking whereby documents being written cannot be queried? Or does it have something to do with my crawling interval?
Notes:
- No indexing is done on the collection
- Each crawl+write process takes only about 5-10s and is repeated at 30s intervals.
Code snippet of the query:
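// Match the search term anywhere in post_content, or as the whole post_user (case-insensitive).
$cursor = $collection->find(array(
    '$or' => array(
        array('post_content' => new MongoRegex("/$safe/i")),
        array('post_user' => new MongoRegex("/^$safe$/i")),
    ),
));
// Newest posts first.
$cursor->sort(array('post_datetime' => -1));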