我是小蔡呀~~~ 2024-10-17 16:50

OSError: [Errno 5] Input/output error

The full error output is as follows:


python train.py exp_dir="/home/featurize/work/auto_avsr/checkpoints" exp_name="exp_av" data.modality="audiovisual" data.dataset.root_dir="/home/featurize/work/data/preTLRS" data.dataset.train_file="lrs3_train_transcript_lengths_seg24s.csv"
Epoch 0:   0%|                                                         | 0/2054 [00:00<?, ?it/s]
/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [64, 1, 80], strides() = [80, 1, 1]
bucket_view.sizes() = [64, 1, 80], strides() = [80, 80, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [64, 1, 80], strides() = [80, 1, 1]
bucket_view.sizes() = [64, 1, 80], strides() = [80, 80, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [64, 1, 80], strides() = [80, 1, 1]
bucket_view.sizes() = [64, 1, 80], strides() = [80, 80, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [64, 1, 80], strides() = [80, 1, 1]
bucket_view.sizes() = [64, 1, 80], strides() = [80, 80, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [64, 1, 80], strides() = [80, 1, 1]
bucket_view.sizes() = [64, 1, 80], strides() = [80, 80, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Epoch 0: 100%|██████████████████████████████████████████████████████| 2054/2054 [37:50<00:00,  1.11s/it, loss=93.3, v_num=3
Error executing job with overrides: ['exp_dir=/home/featurize/work/auto_avsr/checkpoints', 'exp_name=exp_av', 'data.modality=audiovisual', 'data.dataset.root_dir=/home/featurize/work/data/preTLRS', 'data.dataset.train_file=lrs3_train_transcript_lengths_seg24s.csv']
Traceback (most recent call last):
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 298, in on_run_end
    self.trainer.call_hook("on_train_epoch_end")
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
    callback_fx(*args, **kwargs)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 93, in on_train_epoch_end
    callback.on_train_epoch_end(self, self.lightning_module)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 321, in on_train_epoch_end
    self.save_checkpoint(trainer)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 400, in save_checkpoint
    self._save_last_checkpoint(trainer, monitor_candidates)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 672, in _save_last_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1913, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 478, in save_checkpoint
    self.trainer.training_type_plugin.save_checkpoint(_checkpoint, filepath)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 294, in save_checkpoint
    return self.checkpoint_io.save_checkpoint(checkpoint, filepath)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/io/torch_plugin.py", line 37, in save_checkpoint
    atomic_save(checkpoint, path)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/utilities/cloud_io.py", line 70, in atomic_save
    f.write(bytesbuffer.getvalue())
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/core.py", line 134, in __exit__
    self.close()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/core.py", line 154, in close
    f.close()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/implementations/local.py", line 444, in close
    return self.f.close()
OSError: [Errno 5] Input/output error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 43, in main
    trainer.fit(model=modelmodule, datamodule=datamodule)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 698, in _call_and_handle_interrupt
    self.training_type_plugin.reconciliate_processes(traceback.format_exc())
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 533, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0 
 Traceback (most recent call last):
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 298, in on_run_end
    self.trainer.call_hook("on_train_epoch_end")
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
    callback_fx(*args, **kwargs)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 93, in on_train_epoch_end
    callback.on_train_epoch_end(self, self.lightning_module)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 321, in on_train_epoch_end
    self.save_checkpoint(trainer)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 400, in save_checkpoint
    self._save_last_checkpoint(trainer, monitor_candidates)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 672, in _save_last_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1913, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 478, in save_checkpoint
    self.trainer.training_type_plugin.save_checkpoint(_checkpoint, filepath)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 294, in save_checkpoint
    return self.checkpoint_io.save_checkpoint(checkpoint, filepath)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/io/torch_plugin.py", line 37, in save_checkpoint
    atomic_save(checkpoint, path)
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/utilities/cloud_io.py", line 70, in atomic_save
    f.write(bytesbuffer.getvalue())
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/core.py", line 134, in __exit__
    self.close()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/core.py", line 154, in close
    f.close()
  File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/implementations/local.py", line 444, in close
    return self.f.close()
OSError: [Errno 5] Input/output error

1 answer

  • 无疆76 2024-10-17 17:03

    The following reply draws on answers from free WeChat mini-programs such as 皆我百晓生 and 券券喵儿:

    This failure happens while PyTorch Lightning is saving a model checkpoint: the traceback ends in atomic_save, where the serialized checkpoint is flushed to disk, and the write fails with OSError: [Errno 5] Input/output error. Typical causes are insufficient disk space, a wrong or unwritable target path, or an unhealthy filesystem/mount. The DeadlockDetectedException that follows is raised while handling that OSError: rank 0 crashed during the checkpoint write, so the other DDP processes were left waiting and Lightning reported a deadlock; fixing the I/O error should therefore be the priority. Some possible solutions:

    1. Check disk space: make sure the disk holding the checkpoint directory has enough free space for a new checkpoint. If it is full, delete unneeded files or add storage.
    2. Check the target path: checkpoints are being written to "/home/featurize/work/auto_avsr/checkpoints"; confirm that this directory exists and that the training process has write permission on it (a quick check for both free space and writability is sketched after this list).
    3. Check the filesystem: if the underlying filesystem or mount is unhealthy, writes can fail with Errno 5 even when space and permissions are fine; repair or remount it if necessary.
    4. For the deadlock: try reducing the number of parallel processes/tasks, or check your code for synchronization issues that could cause a deadlock. This may require a closer look at your code and the libraries you use.
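    A minimal sketch for checks 1 and 2, assuming the checkpoint path from the training command above (adjust ckpt_dir if your setup differs):

    import os
    import shutil
    import tempfile

    # Path taken from the training command in the question; adjust if needed.
    ckpt_dir = "/home/featurize/work/auto_avsr/checkpoints"

    if not os.path.isdir(ckpt_dir):
        raise SystemExit(f"{ckpt_dir} does not exist")

    # 1) Report free space on the filesystem that holds the checkpoint directory.
    usage = shutil.disk_usage(ckpt_dir)
    print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")

    # 2) Write and fsync a small temporary file to confirm the directory is
    #    writable and that the underlying filesystem accepts writes at all;
    #    an unhealthy mount typically fails here with the same Errno 5.
    try:
        with tempfile.NamedTemporaryFile(dir=ckpt_dir) as f:
            f.write(b"\0" * 1024 * 1024)  # 1 MB test write
            f.flush()
            os.fsync(f.fileno())
        print("write test passed")
    except OSError as e:
        print(f"write test failed: {e}")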

    If none of the above solves the problem, you may need to provide more information for further diagnosis, such as your operating system, hardware configuration, and the versions of the libraries you use. You can also ask in the PyTorch Lightning community or forums, where others may have run into the same issue.
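    If the write test above fails, the checkpoint directory may sit on a cloud-synced or network-mounted volume that is currently flaky. A simple workaround (the /tmp path here is only an illustration; any directory on a healthy local disk will do, and the checkpoints can be copied off afterwards) is to point exp_dir at local storage instead:

    python train.py exp_dir="/tmp/auto_avsr_checkpoints" exp_name="exp_av" data.modality="audiovisual" data.dataset.root_dir="/home/featurize/work/data/preTLRS" data.dataset.train_file="lrs3_train_transcript_lengths_seg24s.csv"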

