weixin_39810441 2020-11-29 16:45

Intermittent problems with Kryo serialisation for crawls resumed from checkpoints

I'm hitting problems when re-using crawl state (checkpoints). I get a lot of errors like:


WARNING: com.google.common.cache.LocalCache processPendingNotifications Exception thrown by removal listener [Tue Mar 19 12:07:00 GMT 2019]
java.lang.IllegalArgumentException: Can not set org.archive.modules.fetcher.FetchStats field org.archive.crawler.frontier.WorkQueue.substats to java.lang.Byte
        at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
        at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
        at sun.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:81)
        at java.lang.reflect.Field.set(Field.java:764)
        at com.esotericsoftware.kryo.serialize.FieldSerializer$CachedField.set(FieldSerializer.java:290)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:209)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
        at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:512)
        at com.esotericsoftware.kryo.ObjectBuffer.readObjectData(ObjectBuffer.java:212)
        at org.archive.bdb.KryoBinding.entryToObject(KryoBinding.java:84)
        at com.sleepycat.collections.DataView.makeValue(DataView.java:595)
        at com.sleepycat.collections.DataCursor.getCurrentValue(DataCursor.java:349)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:813)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:1016)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:569)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)

One possible cause is that the Kryo serialisers are not getting set up right.

As I understand it, the reflection-based auto-registration attempts to register the classes needed. According to the documentation, this saves storage space, but it relies on classes being registered in a consistent order, so that the same classes get the same IDs.
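To illustrate why registration order matters, here is a minimal sketch of an order-dependent class registry. `ToyRegistry` is a hypothetical stand-in written for this post, not the Kryo or AutoKryo API, and `Long` is only a placeholder for a class like FetchStats. It assumes IDs are handed out sequentially on first registration, which is my reading of the documentation rather than something I have verified in the Kryo source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for an ID-assigning class registry; NOT the Kryo API. */
final class ToyRegistry {
    private final List<Class<?>> byId = new ArrayList<>();
    private final Map<Class<?>, Integer> ids = new HashMap<>();

    /** Assigns the next free ID the first time a class is seen. */
    int register(Class<?> type) {
        Integer existing = ids.get(type);
        if (existing != null) {
            return existing;
        }
        int id = byId.size();
        byId.add(type);
        ids.put(type, id);
        return id;
    }

    /** ID a writer would store in place of the class name. */
    int idOf(Class<?> type) {
        return ids.get(type);
    }

    /** Class a reader would instantiate for a stored ID. */
    Class<?> classOf(int id) {
        return byId.get(id);
    }

    public static void main(String[] args) {
        // Registration order when the checkpoint was written (assumed).
        ToyRegistry writer = new ToyRegistry();
        writer.register(String.class);
        writer.register(Byte.class);
        writer.register(Long.class);

        // Same classes, different order, when the checkpoint is reloaded.
        ToyRegistry reader = new ToyRegistry();
        reader.register(String.class);
        reader.register(Long.class);
        reader.register(Byte.class);

        int storedId = writer.idOf(Long.class);
        System.out.println("written as " + Long.class.getName()
                + ", read back as " + reader.classOf(storedId).getName());
    }
}
```

In this toy, a value written as a `Long` is decoded as a `Byte` after the order changes, which is the same shape of mismatch as the "Can not set ... FetchStats field ... to java.lang.Byte" error above.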

However, this registration appears to happen on the Spring Lifecycle.start() event, e.g. in org.archive.modules.net.BdbServerCache.start() or org.archive.crawler.frontier.WorkQueueFrontier.start(), and AFAICT nothing explicitly enforces the order of these events.

It looks like the latter leads to

https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/CrawlURI.java#L1808-L1811

(i.e. there we see Byte getting registered) and the former leads to

https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/net/CrawlServer.java#L319-L321

(i.e. there's FetchStats), which seems suspicious. However, in both cases the auto-registered class is the second class to get registered, not the first, so it's not clear whether start order alone explains the error.

I'm having trouble understanding exactly what goes on with Kryo 1 and thread context, and therefore whether the reference IDs are global, ThreadLocal, or scoped to the AutoKryo instance.
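To make the scoping question concrete, here is a toy comparison of a registry shared by both components against one registry per component. Everything in it is hypothetical: the class lists are my guess at what each start() registers, and the sketch makes no claim about which scoping Kryo 1 actually uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of registry scoping; the class lists below are assumptions. */
final class ScopedRegistryDemo {
    // Assumed registration sequence of each component's start().
    static final List<String> SERVER_CACHE = Arrays.asList("CrawlServer", "FetchStats");
    static final List<String> FRONTIER = Arrays.asList("WorkQueue", "Byte");

    /** Gives each new name the next free ID in the given registry. */
    static void registerAll(Map<String, Integer> registry, List<String> names) {
        for (String name : names) {
            registry.putIfAbsent(name, registry.size());
        }
    }

    /** One registry shared by both components: IDs depend on start order. */
    static Map<String, Integer> shared(boolean frontierFirst) {
        Map<String, Integer> registry = new LinkedHashMap<>();
        if (frontierFirst) {
            registerAll(registry, FRONTIER);
            registerAll(registry, SERVER_CACHE);
        } else {
            registerAll(registry, SERVER_CACHE);
            registerAll(registry, FRONTIER);
        }
        return registry;
    }

    /** A private registry for one component: other components cannot shift its IDs. */
    static Map<String, Integer> perComponent(List<String> names) {
        Map<String, Integer> registry = new LinkedHashMap<>();
        registerAll(registry, names);
        return registry;
    }

    public static void main(String[] args) {
        System.out.println("shared, cache first:    " + shared(false));
        System.out.println("shared, frontier first: " + shared(true));
        System.out.println("per-component frontier: " + perComponent(FRONTIER));
    }
}
```

With a shared registry, ID 1 means FetchStats when the server cache starts first and Byte when the frontier starts first. With per-component registries, the IDs only change if that component's own registration sequence changes.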

I'm left to assume I must have missed something, otherwise this would never have worked reliably at all!

This question comes from the open-source project: internetarchive/heritrix3

8 answers

  • weixin_39810441 2020-11-29 16:45

    Hm, well, it seems each store gets its own ObjectIdentityBdbManualCache, and each of those has its own AutoKryo instance, so this seems reasonably safe. I.e. the code should be able to reload the checkpoint as long as the code that registers the classes for that part of the system has not been changed?
