We've seen occasional mysterious issues when resuming crawls from checkpoints multiple times. Close inspection of the behaviour when resuming the crawl indicates that checkpoints can only reliably be used once.
The code uses the DbBackup helper to manage backups. This depends on calling startBackup, the docs for which note:
"Following this method call, all new data will be written to other, new log files. In other words, the last file in the backup set will not be modified after this method returns."
In other words, BDB-JE makes sure the backup set is consistent, and after that those files should no longer be altered (BDB-JE is an append-only DB system). The backup documentation implies that this flushed, synced consistency is necessary for the backup to work.
(Note that an H3 checkpoint is not the same thing as a BDB-JE checkpoint - the former is a point-in-time backup, but the latter is a flush/sync operation.)
However, when resuming a crawl from a checkpoint, rather than copying the checkpoint files (as recommended by the documentation), Heritrix3 uses hard links (and cleans out any other state files not part of the checkpoint).
This causes an issue: having resumed a crawl, I noticed that the last .jdb file in the checkpoint was being changed! From the backup behaviour, we might expect that existing files would not be changed, but in fact when resuming a crawl, the system proceeds by appending data to the last .jdb file.
As this file is a hard link, this activity also changes the contents of the checkpoint. Furthermore, if we are resuming the crawl from one older checkpoint among many, all subsequent checkpoints are also modified.
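The effect can be reproduced without Heritrix or BDB-JE at all. The following is a minimal sketch using only `java.nio.file`; the class name, directory names, and file contents are made up for illustration, and the "resumed crawl" is simulated by a plain append to the restored file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class: not part of Heritrix3 or BDB-JE.
final class HardLinkAppendDemo {

    /**
     * Creates a fake checkpoint log, "restores" it into a state directory
     * either as a hard link or as a copy, appends to the restored file, and
     * returns the size of the file that is still sitting in the checkpoint.
     */
    static long checkpointSizeAfterResume(boolean useHardLink) {
        try {
            Path root = Files.createTempDirectory("h3-demo");
            Path checkpointDir = Files.createDirectory(root.resolve("checkpoint"));
            Path stateDir = Files.createDirectory(root.resolve("state"));

            // Stand-in for the last .jdb file captured by the checkpoint.
            Path checkpointLog = checkpointDir.resolve("00000001.jdb");
            Files.write(checkpointLog, "checkpointed".getBytes(StandardCharsets.UTF_8));

            // Restore into the live state directory.
            Path liveLog = stateDir.resolve("00000001.jdb");
            if (useHardLink) {
                Files.createLink(liveLog, checkpointLog); // same inode, two names
            } else {
                Files.copy(checkpointLog, liveLog);       // independent file
            }

            // Simulate the resumed crawl appending to the newest log file.
            Files.write(liveLog, "+post-resume".getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.APPEND);

            return Files.size(checkpointLog);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("checkpoint size via hard link: " + checkpointSizeAfterResume(true));
        System.out.println("checkpoint size via copy:      " + checkpointSizeAfterResume(false));
    }
}
```

With the hard link, the checkpoint's file grows along with the live one; with the copy, it stays at its original size.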
As an example, we recently attempted to resume from a checkpoint, hit some difficulties and had to force-kill the crawler. After this, we attempted to re-start from that checkpoint, and hit some very strange behaviour that hung the crawler. (Fortunately, we happen to have a backup of those checkpoints!)
To avoid this, we could actually copy the last log file rather than make a hard link back to the checkpoint. Alternatively, it may be possible to call start/stop backup immediately upon restoring the DB, which should prevent existing files being appended to (assuming no race conditions).
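As a rough sketch of the first option, the restore step could hard-link every log file except the last one, which is copied instead. This is not the Heritrix3 implementation: the class and method names are hypothetical, it assumes the checkpoint directory holds the .jdb files directly, and it assumes that sorting by file name puts the newest log last (which relies on the log files having zero-padded names).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: not the actual Heritrix3 restore code.
final class CheckpointRestoreSketch {

    /** Hard-links all .jdb files into stateDir, except the last, which is copied. */
    static void restore(Path checkpointDir, Path stateDir) throws IOException {
        Files.createDirectories(stateDir);

        List<Path> logs;
        try (Stream<Path> files = Files.list(checkpointDir)) {
            logs = files
                    .filter(p -> p.getFileName().toString().endsWith(".jdb"))
                    // Assumption: name order matches log order.
                    .sorted(Comparator.comparing(p -> p.getFileName().toString()))
                    .collect(Collectors.toList());
        }

        for (int i = 0; i < logs.size(); i++) {
            Path source = logs.get(i);
            Path target = stateDir.resolve(source.getFileName().toString());
            boolean isLast = (i == logs.size() - 1);
            if (isLast) {
                // The resumed environment may append here, so keep it separate.
                Files.copy(source, target);
            } else {
                // Older logs are expected to stay unmodified, so a link is cheap.
                Files.createLink(target, source);
            }
        }
    }

    /** Small self-check: append to the restored last log, then inspect the checkpoint. */
    static boolean checkpointSurvivesAppend() {
        try {
            Path root = Files.createTempDirectory("h3-restore");
            Path checkpoint = Files.createDirectory(root.resolve("checkpoint"));
            Files.write(checkpoint.resolve("00000000.jdb"), new byte[] {1, 2, 3});
            Files.write(checkpoint.resolve("00000001.jdb"), new byte[] {4, 5});

            Path state = root.resolve("state");
            restore(checkpoint, state);
            Files.write(state.resolve("00000001.jdb"), new byte[] {6},
                    StandardOpenOption.APPEND);

            boolean lastUntouched = Files.size(checkpoint.resolve("00000001.jdb")) == 2;
            boolean olderIsLinked = Files.isSameFile(
                    checkpoint.resolve("00000000.jdb"), state.resolve("00000000.jdb"));
            return lastUntouched && olderIsLinked;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("checkpoint intact after append: " + checkpointSurvivesAppend());
    }
}
```

Copying only the last file keeps the restore cheap for large crawls, but it only protects the checkpoint if older log files really are never rewritten after a resume, which would need to be confirmed against BDB-JE's behaviour (for example, its log cleaner).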
This question comes from the open-source project: internetarchive/heritrix3