The full error output is as follows:
python train.py exp_dir="/home/featurize/work/auto_avsr/checkpoints" exp_name="exp_av" data.modality="audiovisual" data.dataset.root_dir="/home/featurize/work/data/preTLRS" data.dataset.train_file="lrs3_train_transcript_lengths_seg24s.csv"
Epoch 0: 0%| | 0/2054 [00:00<?, ?it/s]
/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [64, 1, 80], strides() = [80, 1, 1]
bucket_view.sizes() = [64, 1, 80], strides() = [80, 80, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Epoch 0: 100%|██████████████████████████████████████████████████████| 2054/2054 [37:50<00:00, 1.11s/it, loss=93.3, v_num=3
Error executing job with overrides: ['exp_dir=/home/featurize/work/auto_avsr/checkpoints', 'exp_name=exp_av', 'data.modality=audiovisual', 'data.dataset.root_dir=/home/featurize/work/data/preTLRS', 'data.dataset.train_file=lrs3_train_transcript_lengths_seg24s.csv']
Traceback (most recent call last):
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
output = self.on_run_end()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 298, in on_run_end
self.trainer.call_hook("on_train_epoch_end")
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 93, in on_train_epoch_end
callback.on_train_epoch_end(self, self.lightning_module)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 321, in on_train_epoch_end
self.save_checkpoint(trainer)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 400, in save_checkpoint
self._save_last_checkpoint(trainer, monitor_candidates)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 672, in _save_last_checkpoint
trainer.save_checkpoint(filepath, self.save_weights_only)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1913, in save_checkpoint
self.checkpoint_connector.save_checkpoint(filepath, weights_only)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 478, in save_checkpoint
self.trainer.training_type_plugin.save_checkpoint(_checkpoint, filepath)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 294, in save_checkpoint
return self.checkpoint_io.save_checkpoint(checkpoint, filepath)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/io/torch_plugin.py", line 37, in save_checkpoint
atomic_save(checkpoint, path)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/utilities/cloud_io.py", line 70, in atomic_save
f.write(bytesbuffer.getvalue())
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/core.py", line 134, in __exit__
self.close()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/core.py", line 154, in close
f.close()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/implementations/local.py", line 444, in close
return self.f.close()
OSError: [Errno 5] Input/output error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 43, in main
trainer.fit(model=modelmodule, datamodule=datamodule)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 698, in _call_and_handle_interrupt
self.training_type_plugin.reconciliate_processes(traceback.format_exc())
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 533, in reconciliate_processes
raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
Traceback (most recent call last):
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
output = self.on_run_end()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 298, in on_run_end
self.trainer.call_hook("on_train_epoch_end")
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 93, in on_train_epoch_end
callback.on_train_epoch_end(self, self.lightning_module)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 321, in on_train_epoch_end
self.save_checkpoint(trainer)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 400, in save_checkpoint
self._save_last_checkpoint(trainer, monitor_candidates)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 672, in _save_last_checkpoint
trainer.save_checkpoint(filepath, self.save_weights_only)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1913, in save_checkpoint
self.checkpoint_connector.save_checkpoint(filepath, weights_only)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 478, in save_checkpoint
self.trainer.training_type_plugin.save_checkpoint(_checkpoint, filepath)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 294, in save_checkpoint
return self.checkpoint_io.save_checkpoint(checkpoint, filepath)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/plugins/io/torch_plugin.py", line 37, in save_checkpoint
atomic_save(checkpoint, path)
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/pytorch_lightning/utilities/cloud_io.py", line 70, in atomic_save
f.write(bytesbuffer.getvalue())
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/core.py", line 134, in __exit__
self.close()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/core.py", line 154, in close
f.close()
File "/environment/miniconda3/envs/auto_avsr/lib/python3.8/site-packages/fsspec/implementations/local.py", line 444, in close
return self.f.close()
OSError: [Errno 5] Input/output error
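
Both tracebacks end in `self.f.close()` during `atomic_save`, so the failure happens while the checkpoint file is being flushed to disk, not during the training step itself. One way to narrow this down might be to test whether the checkpoint directory accepts a plain write outside of training. The following is only a minimal sketch under assumptions: `probe_checkpoint_dir` and `probe_bytes` are hypothetical names that do not come from auto_avsr or PyTorch Lightning, and a successful probe does not rule out a failure on a much larger checkpoint file.

```python
import os
import shutil
import sys
import tempfile


def probe_checkpoint_dir(path, probe_bytes=1 << 20):
    """Write, fsync, and delete a temporary file inside `path`.

    Hypothetical helper for illustration only. Returns a dict with the
    free space reported for `path`, whether the write succeeded, and the
    errno of the OSError if it did not.
    """
    usage = shutil.disk_usage(path)
    report = {"free_bytes": usage.free, "ok": False, "errno": None}
    try:
        # The file is removed automatically when the `with` block exits.
        # An error raised on close (as in the traceback) is caught as well.
        with tempfile.NamedTemporaryFile(dir=path) as f:
            f.write(b"\0" * probe_bytes)
            f.flush()
            os.fsync(f.fileno())  # force the data out to the storage device
        report["ok"] = True
    except OSError as exc:
        report["errno"] = exc.errno  # 5 corresponds to errno.EIO
    return report


if __name__ == "__main__":
    # Pass the directory to test as the first argument; falls back to the
    # system temp directory so the script also runs without arguments.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print(probe_checkpoint_dir(target))
```

Running it against the `exp_dir` from the command above (`/home/featurize/work/auto_avsr/checkpoints`) would show the free space on that mount and whether a small synced write succeeds there. If the probe also reports `errno` 5, the storage behind that path is the more likely suspect than the training code.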