I'm hitting problems when re-using crawl state (checkpoints). I get a lot of errors like:
```
WARNING: com.google.common.cache.LocalCache processPendingNotifications Exception thrown by removal listener [Tue Mar 19 12:07:00 GMT 2019]
java.lang.IllegalArgumentException: Can not set org.archive.modules.fetcher.FetchStats field org.archive.crawler.frontier.WorkQueue.substats to java.lang.Byte
at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
at sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
at sun.reflect.UnsafeObjectFieldAccessorImpl.set(UnsafeObjectFieldAccessorImpl.java:81)
at java.lang.reflect.Field.set(Field.java:764)
at com.esotericsoftware.kryo.serialize.FieldSerializer$CachedField.set(FieldSerializer.java:290)
at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:209)
at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:178)
at com.esotericsoftware.kryo.Kryo.readObjectData(Kryo.java:512)
at com.esotericsoftware.kryo.ObjectBuffer.readObjectData(ObjectBuffer.java:212)
at org.archive.bdb.KryoBinding.entryToObject(KryoBinding.java:84)
at com.sleepycat.collections.DataView.makeValue(DataView.java:595)
at com.sleepycat.collections.DataCursor.getCurrentValue(DataCursor.java:349)
at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:813)
at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:119)
at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:374)
at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:688)
at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:1016)
at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:569)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
```
One possible cause is that the Kryo serialisers are not getting set up right.
As I understand it, the reflection-based auto-registration (AutoKryo) registers the classes it needs on the fly; the documentation suggests this saves storage space, but it relies on classes being registered in a consistent order, so that the same classes always get the same IDs.
However, this registration appears to happen on the Spring Lifecycle.start() event (e.g. in org.archive.modules.net.BdbServerCache.start() or org.archive.crawler.frontier.WorkQueueFrontier.start()), and AFAICT nothing explicitly enforces the order of these events.
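To make the ordering hazard concrete, here is a minimal, Kryo-free simulation of sequential class registration (the `Registry` class is hypothetical; only the class names are real): each registration order yields a different ID-to-class mapping, so an ID written under one order decodes to the wrong class under the other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Kryo-style sequential registration:
// each registered class simply gets the next integer ID.
public class RegistrationOrderDemo {

    static class Registry {
        private final List<String> classes = new ArrayList<>();

        int register(String className) {
            classes.add(className);
            return classes.size() - 1; // ID == registration order
        }

        String classFor(int id) {
            return classes.get(id);
        }
    }

    /** What a run-1-ordered registry thinks a run-2 FetchStats ID means. */
    static String misdecodedClass() {
        // Run 1: suppose BdbServerCache.start() happens to fire first.
        Registry run1 = new Registry();
        run1.register("org.archive.modules.fetcher.FetchStats");
        run1.register("java.lang.Byte");

        // Run 2 (e.g. the JVM that wrote the checkpoint): frontier fired first.
        Registry run2 = new Registry();
        run2.register("java.lang.Byte");
        int statsId = run2.register("org.archive.modules.fetcher.FetchStats");

        // Data written under run 2's mapping, read under run 1's mapping:
        return run1.classFor(statsId);
    }

    public static void main(String[] args) {
        // Prints java.lang.Byte: the same shape of confusion as the
        // "Can not set ... FetchStats field ... to java.lang.Byte" error.
        System.out.println(misdecodedClass());
    }
}
```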
It looks like WorkQueueFrontier.start() leads to
https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/CrawlURI.java#L1808-L1811
(where we see Byte getting registered), while BdbServerCache.start() leads to
https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/net/CrawlServer.java#L319-L321
(where FetchStats gets registered), which seems suspicious given the exception above. However, in both cases the auto-registered class is the second class to get registered, not the first, so it's not clear why the IDs would collide.
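If the unordered start() events are indeed the problem, Spring itself offers one way to pin the order: a bean's depends-on attribute also constrains Lifecycle start order (dependencies are started first). A sketch only, assuming the bean ids serverCache and frontier from the stock crawler-beans.cxml profile; this merely fixes one order, it doesn't establish which order is correct:

```xml
<!-- Sketch: make the frontier start only after the server cache, so any
     Kryo auto-registration they trigger happens in a fixed order.
     Bean ids are assumed from the default crawler-beans.cxml. -->
<bean id="serverCache" class="org.archive.modules.net.BdbServerCache"/>
<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier"
      depends-on="serverCache">
  <!-- existing frontier properties unchanged -->
</bean>
```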
I'm having trouble understanding exactly how Kryo 1 interacts with thread context, and therefore whether the registration IDs are global, ThreadLocal, or scoped to each AutoKryo instance.
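I don't know Kryo 1's actual scoping, but as a thought experiment, here is a sketch (hypothetical code, not Heritrix's) of what thread-scoped registries would imply: each ToeThread could assign IDs independently, so the same ID could mean different classes on different threads even within one JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of ThreadLocal-scoped registration:
// each thread gets its own registry and its own ID sequence.
public class ThreadScopedRegistryDemo {

    static final ThreadLocal<List<String>> REGISTRY =
            ThreadLocal.withInitial(ArrayList::new);

    static int register(String className) {
        List<String> r = REGISTRY.get();
        r.add(className);
        return r.size() - 1; // per-thread ID sequence
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Thread A registers FetchStats first; thread B registers Byte first.
        Future<Integer> byteIdOnA = pool.submit(() -> {
            register("org.archive.modules.fetcher.FetchStats");
            return register("java.lang.Byte");
        });
        Future<Integer> statsIdOnB = pool.submit(() -> {
            register("java.lang.Byte");
            return register("org.archive.modules.fetcher.FetchStats");
        });
        // Both get ID 1, but ID 1 means Byte on thread A and FetchStats on
        // thread B: data serialized by different threads would disagree.
        System.out.println("Byte on A: " + byteIdOnA.get()
                + ", FetchStats on B: " + statsIdOnB.get());
        pool.shutdown();
    }
}
```

If registration really were thread-scoped, the ordering problem would be even worse than the Lifecycle-ordering one, which is why I'd like to pin down the actual scoping.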
I'm left to assume I must have missed something, otherwise this would never have worked reliably at all!
(This question comes from the open-source project internetarchive/heritrix3.)