Checkpoints 'spoiled' when used to resume crawls

We've seen occasional mysterious issues when resuming crawls from checkpoints multiple times. Close inspection of the behaviour when resuming the crawl indicates that checkpoints can only reliably be used once.

The code uses the DbBackup helper to manage backups. This depends on calling startBackup, the docs for which note:

Following this method call, all new data will be written to other, new log files. In other words, the last file in the backup set will not be modified after this method returns.

i.e. BDB-JE makes sure the backup is consistent, and after that those files should no longer be altered (BDB-JE is an append-only DB system). The backup documentation implies that this flushed, synced consistency is necessary for the backup to work.

(Note that an H3 checkpoint is not the same thing as a BDB-JE checkpoint: the former is a point-in-time backup, whereas the latter is a flush/sync operation.)

However, when resuming a crawl from a checkpoint, rather than copying the checkpoint files (as recommended by the documentation), Heritrix3 uses hard links (and cleans out any other state files that are not part of the checkpoint).

This causes an issue: having resumed a crawl, I noticed that the last .jdb file in the checkpoint was being changed! From the backup behaviour, we might expect that existing files would not be changed, but in fact, when resuming a crawl, the system proceeds by appending data to the last .jdb file.

As this file is a hard link, this activity also changes the contents of the checkpoint. Furthermore, if we are resuming the crawl from one older checkpoint among many, all subsequent checkpoints are also modified.
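
The aliasing effect can be reproduced outside Heritrix3 with a few lines of plain Java. This is only an illustrative sketch, not Heritrix3 code: the file names are hypothetical, the "append" is an ordinary file write standing in for BDB-JE, and it assumes the temporary directory sits on a file system that supports hard links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class HardLinkDemo {

    /**
     * "Restores" a one-file checkpoint by hard link or by copy, appends to the
     * restored file, and returns the resulting size of the checkpoint file.
     */
    static long checkpointSizeAfterAppend(boolean useHardLink) {
        try {
            Path dir = Files.createTempDirectory("hardlink-demo");
            Path checkpointFile = dir.resolve("checkpoint.jdb"); // hypothetical name
            Path restoredFile = dir.resolve("restored.jdb");     // hypothetical name
            // Ten bytes of stand-in checkpoint data.
            Files.write(checkpointFile, "checkpoint".getBytes(StandardCharsets.UTF_8));

            if (useHardLink) {
                // Both names now refer to the same data on disk.
                Files.createLink(restoredFile, checkpointFile);
            } else {
                // An independent copy with its own data.
                Files.copy(checkpointFile, restoredFile);
            }

            // Simulate the resumed crawl appending nine bytes to the last log file.
            Files.write(restoredFile, "-appended".getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.APPEND);

            return Files.size(checkpointFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("hard link: " + checkpointSizeAfterAppend(true));
        System.out.println("copy:      " + checkpointSizeAfterAppend(false));
    }
}
```

With the hard link, the checkpoint file grows along with the restored file; with the copy, the checkpoint file keeps its original size.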

As an example, we recently attempted to resume from a checkpoint, hit some difficulties and had to force-kill the crawler. After this, we attempted to re-start from that checkpoint, and hit some very strange behaviour that hung the crawler. (Fortunately, we happen to have a backup of those checkpoints!)

To avoid this, we could actually copy the last log file rather than make a hard link back to the checkpoint. Alternatively, it may be possible to call start/stop backup immediately upon restoring the DB, which should prevent existing files being appended to (assuming no race conditions).
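
A minimal sketch of the first option (copy the last log file, hard-link the rest) might look like the following. It is not the Heritrix3 implementation: the class and method names are hypothetical, it assumes the target state directory contains no .jdb files yet, and it assumes that sorting the .jdb files by name puts the most recent log file last.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CheckpointRestoreSketch {

    /** Hard-links every .jdb file into stateDir except the last, which is copied. */
    static void restore(Path checkpointDir, Path stateDir) {
        try {
            Files.createDirectories(stateDir);
            List<Path> logs;
            try (Stream<Path> files = Files.list(checkpointDir)) {
                logs = files
                        .filter(p -> p.getFileName().toString().endsWith(".jdb"))
                        .sorted() // assumption: name order matches log order
                        .collect(Collectors.toList());
            }
            for (int i = 0; i < logs.size(); i++) {
                Path source = logs.get(i);
                Path target = stateDir.resolve(source.getFileName());
                if (i == logs.size() - 1) {
                    Files.copy(source, target);       // last file: may be appended to
                } else {
                    Files.createLink(target, source); // earlier files: assumed unchanged
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Builds a two-file stand-in checkpoint, restores it, appends one byte to the
     * restored last file, and returns the size of the checkpoint's own last file.
     */
    static long checkpointLastFileSizeAfterResume() {
        try {
            Path root = Files.createTempDirectory("restore-sketch");
            Path checkpointDir = Files.createDirectory(root.resolve("checkpoint"));
            Path stateDir = root.resolve("state");
            Files.write(checkpointDir.resolve("00000000.jdb"), new byte[] {1, 2, 3});
            Path last = checkpointDir.resolve("00000001.jdb");
            Files.write(last, new byte[] {4, 5});

            restore(checkpointDir, stateDir);

            // Simulate the resumed crawl appending to the restored last log file.
            Files.write(stateDir.resolve("00000001.jdb"), new byte[] {6},
                    StandardOpenOption.APPEND);
            return Files.size(last);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("checkpoint last file size: "
                + checkpointLastFileSizeAfterResume());
    }
}
```

Copying only the last file keeps most of the space saving of hard links, since only that one file is duplicated.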

This question comes from the open-source project: internetarchive/heritrix3

weixin_39810441 2020-11-29 14:18
points out that enabling copy-on-write via cp --reflink=auto might present a safe and space-efficient alternative on file systems that support it.

But after discussion we agreed we should implement the simplest possible fix first, and check it works, before looking into that as an optimisation.
