When training a single model on two or more datasets at the same time in LLaMA-Factory, the following error occurs. How can I fix it?
Converting format of dataset (num_proc=16): 100%|██████████| 786/786 [00:57<00:00, 13.66 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 2531/2531 [01:01<00:00, 41.12 examples/s]
Running tokenizer on dataset (num_proc=16): 0%| | 0/3317 [01:07<?, ? examples/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\multiprocess\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\datasets\utils\py_utils.py", line 586, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\datasets\arrow_dataset.py", line 3674, in _map_single
for i, batch in iter_outputs(shard_iterable):
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\datasets\arrow_dataset.py", line 3624, in iter_outputs
yield i, apply_function(example, i, offset=offset)
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\datasets\arrow_dataset.py", line 3547, in apply_function
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\data\processor\supervised.py", line 99, in preprocess_dataset
input_ids, labels = self._encode_data_example(
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\data\processor\supervised.py", line 43, in _encode_data_example
messages = self.template.mm_plugin.process_messages(prompt + response, images, videos, audios, self.processor)
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\data\mm_plugin.py", line 1521, in process_messages
self._validate_messages(messages, images, videos, audios)
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\data\mm_plugin.py", line 215, in _validate_messages
raise ValueError(
ValueError: The number of images does not match the number of <image> tokens in [{'content': '这是什么植物病虫害?', 'role': 'user'}, {'content': '西芹-瓢虫幼虫', 'role': 'assistant'}].
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\Scripts\llamafactory-cli.exe\__main__.py", line 6, in <module>
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\cli.py", line 24, in main
launcher.launch()
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\launcher.py", line 157, in launch
run_exp()
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\train\tuner.py", line 132, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\train\tuner.py", line 93, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\train\sft\workflow.py", line 51, in run_sft
dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\data\loader.py", line 314, in get_dataset
dataset = _get_preprocessed_dataset(
File "C:\Users\Administrator\Documents\llama_factory\LLaMA-Factory\src\llamafactory\data\loader.py", line 255, in _get_preprocessed_dataset
dataset = dataset.map(
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\datasets\arrow_dataset.py", line 560, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\datasets\arrow_dataset.py", line 3309, in map
for rank, done, content in iflatmap_unordered(
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\datasets\utils\py_utils.py", line 626, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\datasets\utils\py_utils.py", line 626, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
File "C:\Users\Administrator\Documents\conda\envs\llama_factory\lib\site-packages\multiprocess\pool.py", line 774, in get
raise self._value
ValueError: The number of images does not match the number of <image> tokens in [{'content': '这是什么植物病虫害?', 'role': 'user'}, {'content': '西芹-瓢虫幼虫', 'role': 'assistant'}].
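The traceback shows that tokenization fails because at least one example passed to the multimodal template has a different number of attached images than `<image>` placeholders in its messages: in the offending example (`'这是什么植物病虫害?'` / "What plant pest or disease is this?"), the user turn contains no `<image>` token at all, yet images are (or are not) attached. This typically happens when mixing a text-only dataset with an image dataset under a multimodal template, or when some entries in an image dataset are missing their `<image>` placeholder. As a pre-flight check, you can scan each dataset before training with a small script like the sketch below. Note the field names `messages`/`content`/`images` are assumptions for a sharegpt-style JSON file; your dataset may use `conversations`/`value` instead, so adjust accordingly.

```python
IMAGE_TOKEN = "<image>"

def find_mismatches(examples):
    """Return (index, token_count, image_count) for every example whose
    number of <image> tokens differs from its number of attached images.

    Assumes sharegpt-style entries: {"messages": [{"role", "content"}, ...],
    "images": [path, ...]} -- adjust the field names to your dataset format.
    """
    bad = []
    for idx, ex in enumerate(examples):
        n_tokens = sum(m.get("content", "").count(IMAGE_TOKEN)
                       for m in ex.get("messages", []))
        n_images = len(ex.get("images", []))
        if n_tokens != n_images:
            bad.append((idx, n_tokens, n_images))
    return bad

if __name__ == "__main__":
    # Inline sample mirroring the failing case: an image is attached
    # but the user turn has no <image> placeholder.
    sample = [
        {"messages": [{"role": "user", "content": "<image>这是什么植物病虫害?"},
                      {"role": "assistant", "content": "西芹-瓢虫幼虫"}],
         "images": ["pests/001.jpg"]},
        {"messages": [{"role": "user", "content": "这是什么植物病虫害?"},
                      {"role": "assistant", "content": "西芹-瓢虫幼虫"}],
         "images": ["pests/002.jpg"]},
    ]
    for idx, n_tok, n_img in find_mismatches(sample):
        print(f"example {idx}: {n_tok} <image> tokens vs {n_img} images")
```

Once the offending examples are identified, either add the missing `<image>` placeholder to each message, remove the stray image path, or split the text-only data into a separate run with a non-multimodal template.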