Transformers Trainer && datasets Dataset issue
Traceback (most recent call last):
File "main.py", line 72, in <module>
trainer.train()
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1411, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1623, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 721, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
return self.collate_fn(data)
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/transformers/data/data_collator.py", line 67, in default_data_collator
return torch_default_data_collator(features)
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/transformers/data/data_collator.py", line 131, in torch_default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 44 at dim 1 (got 40)
Problem description: the error above is raised when training with the following code. The underlying cause is that the sequences within a single input batch have different lengths, so they cannot be stacked into one tensor.
model = FineTuneT5Model()
# tokenizer = T5Tokenizer.from_pretrained("/data/yutian/DIUR/model_hub/my_t5")
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
    output_dir='./checkpoints',
    num_train_epochs=5,
    per_device_train_batch_size=2,      # batch size per device during training
    per_device_eval_batch_size=2,       # batch size per device during evaluation
    logging_dir='./logs/trainer_log',   # directory for storing logs
    learning_rate=1e-3,
    save_steps=500,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=valid_dataset,
    compute_metrics=get_metric_func,
)
trainer.train()
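For reference, the ValueError in the traceback is simply what torch.tensor raises on ragged input; a minimal sketch, independent of the Trainer setup above, reproduces the same message:

import torch

# Two sequences of different lengths in one batch cannot be stacked into a tensor:
torch.tensor([[1, 2, 3], [4, 5]])
# ValueError: expected sequence of length 3 at dim 1 (got 2)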
Background:
The data is built with Dataset from the datasets library. I want to feed the model two text features and one label. In other words, I currently have a dict whose first two keys map to lists of strings (a paired corpus, as in seq2seq), and whose third key maps to a list of ints intended as classification labels. The Dataset is constructed with the following code.
data_dict = {'src_text_field': self.src_text_field,
             'tgt_text_field': self.tgt_text_field,
             'label_field': self.label_field}
dataset = Dataset.from_dict(data_dict)
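For illustration, a toy version of this dict (hypothetical values, output shown as I would expect it) yields a three-column Dataset:

from datasets import Dataset

# Hypothetical toy data with the same structure as the real fields
toy_dict = {'src_text_field': ['rewrite: hello world', 'rewrite: good morning'],
            'tgt_text_field': ['hi there', 'morning'],
            'label_field': [0, 1]}
print(Dataset.from_dict(toy_dict))
# Dataset({
#     features: ['src_text_field', 'tgt_text_field', 'label_field'],
#     num_rows: 2
# })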
The src and tgt texts are tokenized with the code below. Note that both tokenizer calls return input_ids, so mapping the second return value directly onto the dataset would overwrite the result of the first. Therefore, when tokenizing tgt, the output is stored under a new key, which is added to the dataset.
tokenizer = T5Tokenizer.from_pretrained("/data/yutian/DIUR/model_hub/my_t5")

def src_preprocess_function(examples):
    text_token = tokenizer(examples['src_text_field'], padding=True, truncation=True,
                           max_length=256, return_token_type_ids=False)
    logging.info(text_token)
    return text_token

dataset = dataset.map(src_preprocess_function, batched=True, batch_size=8)

def tgt_preprocess_function(examples):
    text_token = tokenizer(examples['tgt_text_field'], padding=True, truncation=True,
                           max_length=256, return_token_type_ids=False, return_attention_mask=False)
    new_dict = {'tgt_ids': text_token['input_ids']}
    return new_dict

dataset = dataset.map(tgt_preprocess_function, batched=True, batch_size=8)
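As a quick sanity check (my assumption about the resulting columns, given that map keeps the existing columns and adds the new ones):

print(dataset.column_names)
# Expected to contain the original fields plus the tokenized ones, roughly:
# ['src_text_field', 'tgt_text_field', 'label_field', 'input_ids', 'attention_mask', 'tgt_ids']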
The processed dataset is then cached to disk:

with open(os.path.join('./cache', self.dataset_name, self.mode + '.pkl'), 'wb') as f:
    pickle.dump(dataset, f)
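In the main script the cached dataset is loaded back, presumably along these lines (a sketch; dataset_name and mode mirror the attributes used above):

with open(os.path.join('./cache', dataset_name, mode + '.pkl'), 'rb') as f:
    dataset = pickle.load(f)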
In the main function, printing the data inside the dataset shows that the corresponding keys have the same size within every group of 8 examples. However, when the data goes through the Trainer, it is handed to a collator, whose job is to turn a batch into tensors (or do other preprocessing). I defined my own collator and passed it to the Trainer, and found that the size of tgt_ids is abnormal: input_ids and attention_mask can be tensorized (because their shapes match), but the tgt_ids entries have different lengths. I suspect the order gets shuffled during loading, or that something else happens. I can guarantee that when printing the data in the dataset directly, tgt_ids has the same shape within every group of 8; but for some reason that is no longer the case after loading.
def DataCollator(features):
    # Debug collator: only inspects the per-example length of tgt_ids in each batch
    for i in features:
        print(len(i['tgt_ids']))
    return 0
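For reference, here is what I would expect a working collator to look like: a sketch of my own (not tested against the setup above) that pads input_ids, attention_mask and tgt_ids to the longest sequence in each batch before tensorizing; the 'labels' key name and the use of label_field are my assumptions about what FineTuneT5Model expects:

import torch

pad_id = tokenizer.pad_token_id  # 0 for the T5 tokenizer

def padding_collator(features):
    batch = {}
    for key in ('input_ids', 'attention_mask', 'tgt_ids'):
        seqs = [f[key] for f in features]
        max_len = max(len(s) for s in seqs)
        fill = 0 if key == 'attention_mask' else pad_id
        # Pad every sequence in this batch to the batch-local maximum length
        batch[key] = torch.tensor([s + [fill] * (max_len - len(s)) for s in seqs])
    # Assumed label key; FineTuneT5Model may expect a different name
    batch['labels'] = torch.tensor([f['label_field'] for f in features])
    return batch

Passing this via data_collator=padding_collator when constructing the Trainer would replace the default collator.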
Could anyone familiar with these tools offer some advice? The goal is simply to feed in two text fields and one numeric field. I would appreciate an explanation of the cause of the current problem, or any other suggestions on using the transformers Trainer and the datasets library!