weixin_39810441 2020-11-29 16:45

BdbFrontier thread safety

We're attempting to use Heritrix3 with an external module that populates the BdbFrontier via Kafka, and we're hitting problems interacting with the frontier safely. There are some more details in ukwa/ukwa-heritrix#16, but to summarise, ToeThreads are dying because peekItem is null when it should not be.

I believe this is because peekItem is marked as transient. Occasionally, between setting peekItem (this statement) and using it (this one), the WorkQueue gets updated by a separate thread in a way that forces it to get written out to disk and then read back in again. As peekItem is transient, flushing it out to the disk and back drops the value and we're left with a null.
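To illustrate the mechanism I suspect, here is a minimal sketch using plain Java serialization. `SketchQueue` is a hypothetical stand-in, not the real WorkQueue, and the frontier's actual persistence layer may use a different serializer, so this only shows the general behaviour of a transient field across a write-then-read round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientRoundTrip {
    // Hypothetical stand-in for a work queue; NOT the Heritrix WorkQueue class.
    static class SketchQueue implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;               // ordinary field: survives the round trip
        transient String peekItem; // transient field: skipped when written out

        SketchQueue(String name) {
            this.name = name;
        }
    }

    // Serialize the queue to bytes and read it back, imitating "flush to disk, reload".
    static SketchQueue roundTrip(SketchQueue q) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(q);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SketchQueue) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SketchQueue q = new SketchQueue("example.org");
        q.peekItem = "http://example.org/";
        SketchQueue restored = roundTrip(q);
        System.out.println(restored.name);     // prints "example.org"
        System.out.println(restored.peekItem); // prints "null"
    }
}
```

If another thread triggers such a round trip between the assignment and the later read, the reader would see the null from the reloaded copy, which matches the symptom described above.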

NetArchive Suite have also seen this issue when using a RabbitMQ-based URL receiver, and patched it by ignoring the null.

The simplest way to avoid this would be to remove the transient modifier from peekItem, but that worries me because someone deliberately chose to make it transient and I don't understand why.

Secondly, I don't understand why we are seeing this, when IA also use similar methods and are (presumably?) not seeing this. Moreover, this model appears not to be fundamentally different to the traditional ActionDirectory, so I don't understand why this wasn't seen a long time ago.

Finally, this issue also made it clear that I don't actually understand how best to interact with the BdbFrontier in a thread-safe manner. If I am right in assuming that every modification to a WorkQueue needs to be followed by a .makeDirty() that serialises the queue out to disk and reads it back in again, then surely every modification needs to edit-then-write within a synchronized(WorkQueue) block? But it's pretty easy to find examples where this appears to be deliberately not the case:

https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/engine/src/main/java/org/archive/crawler/frontier/WorkQueueFrontier.java#L390-L410
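To make the pattern I am asking about concrete, here is a minimal sketch of edit-then-write under a lock on the queue object. Everything in it is hypothetical: `SketchQueue`, `enqueue`, and this `makeDirty` are illustrative stand-ins rather than the Heritrix API, and whether the real frontier needs this discipline is exactly what I am unsure about.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SynchronizedQueueSketch {
    // Hypothetical stand-in for a work queue; NOT the Heritrix WorkQueue class.
    static class SketchQueue {
        private final Deque<String> items = new ArrayDeque<>();
        private int persistedCount = 0;

        void enqueue(String uri) {
            items.addLast(uri);
        }

        // Stand-in for makeDirty(): records the state that would be written out.
        void makeDirty() {
            persistedCount = items.size();
        }

        int persistedCount() {
            return persistedCount;
        }
    }

    // Edit and write happen inside one critical section, so no other thread
    // holding the same lock can observe or persist a half-updated queue.
    static void safeEnqueue(SketchQueue q, String uri) {
        synchronized (q) {
            q.enqueue(uri);
            q.makeDirty();
        }
    }

    // Several injector threads add URIs concurrently; returns the persisted count.
    static int runDemo(int threads, int perThread) throws InterruptedException {
        SketchQueue q = new SketchQueue();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    safeEnqueue(q, "http://example.org/" + id + "/" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join(); // join makes the workers' writes visible to this thread
        }
        return q.persistedCount();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo(4, 1000)); // prints 4000
    }
}
```

The sketch only protects callers that take the same lock; any code path that modifies the queue without it, like the linked example appears to, would bypass the guarantee.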

I'd appreciate any information anyone has on how best to inject URLs into Heritrix3, and on whether or not I've understood how the BdbFrontier works.

This question originates from the open-source project: internetarchive/heritrix3
