CUDA error: device-side assert triggered when fine-tuning a model

Hi everyone, I am trying to fine-tune a BERT model, but I keep hitting the same error:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Here is my code:

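import pandas as pd
import torch
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")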

# Load the data
df = pd.read_csv('/root/manu_data.csv')
df = df[df['label'] != 'Noise']
# Shift the labels to be zero-based (assumes the 'label' column holds integers starting at 1)
df['label'] = df['label'].astype(int) - 1

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['sequence'].tolist(), 
    df['label'].astype(int).tolist(), 
    test_size=0.2
)

tokenizer = AutoTokenizer.from_pretrained('/root/Model/DNABERT_s')
model = BertForSequenceClassification.from_pretrained('/root/Model/DNABERT_s', num_labels=30)

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512)
train_dataset = Dataset.from_dict({'input_ids': train_encodings['input_ids'], 'attention_mask': train_encodings['attention_mask'], 'labels': train_labels})
val_dataset = Dataset.from_dict({'input_ids': val_encodings['input_ids'], 'attention_mask': val_encodings['attention_mask'], 'labels': val_labels})

# Define the training arguments
training_args = TrainingArguments(
    output_dir='/root/Model/finetuned_DNABERT',  
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='/root/Model/finetuned_DNABERT/logs', 
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=2,
    fp16=True,  
)

# Define a Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)

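I checked with print(f"Unique labels in the dataset: {df['label'].unique()}"), and the output shows 30 labels in total, 0 through 29. Could someone help me see where the problem is? Any pointers are appreciated, thank you.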

14 answers

吃不了席 2024-08-22 23:40
The following reply was written with reference to free WeChat mini-programs such as 皆我百晓生 and 券券喵儿:

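Based on the code and error message you posted, one possibility is a CUDA memory allocation problem. Note, though, that running out of GPU memory is normally reported as "CUDA out of memory" rather than as a device-side assert. A device-side assert more often means that an index went out of range inside a GPU kernel, for example a label outside 0 to num_labels - 1 or a token id larger than the model's embedding table. You can try the following: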

Reduce per_device_train_batch_size and per_device_eval_batch_size. This lowers the amount of GPU memory used in each training step. For example, set them to 2 or 1:

training_args = TrainingArguments(
    output_dir='/root/Model/finetuned_DNABERT',
    num_train_epochs=3,
    per_device_train_batch_size=2,  # changed to 2
    per_device_eval_batch_size=2,  # changed to 2
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='/root/Model/finetuned_DNABERT/logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=2,
    fp16=True,
)

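If GPU memory is still insufficient, you can try training on the CPU. Note that the script in the question hands the model to Trainer and never uses its device variable, so changing that variable alone has no effect there. Trainer has to be told to stay on the CPU through TrainingArguments (no_cuda=True in older Transformers releases, use_cpu=True in newer ones), and fp16=True has to be turned off as well. For code that moves tensors by hand, the device variable would become: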

device = torch.device("cpu")

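Keep in mind that training on the CPU can be much slower than on the GPU. On the other hand, an out-of-range index fails on the CPU with an ordinary Python exception that names the offending operation, so a short CPU run is also a useful way to locate the cause of a device-side assert.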
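
Separately from the CPU fallback, the hint in the error message itself (CUDA_LAUNCH_BLOCKING=1) can be applied from Python. The sketch below is only an illustration and is not part of the original answer; the helper name enable_blocking_cuda_launches is hypothetical. It assumes the variable is set at the very top of the script, before torch is imported, since the setting is read when CUDA is initialized.

```python
import os


def enable_blocking_cuda_launches() -> str:
    """Ask CUDA to run kernels synchronously so the traceback points at the failing call.

    Hypothetical helper for illustration. Call it at the very top of the
    training script, before `import torch`.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


if __name__ == "__main__":
    print(enable_blocking_cuda_launches())
    # import torch  # import torch only after the variable has been set
```

The same setting can be given on the command line when launching the script, for example CUDA_LAUNCH_BLOCKING=1 python train.py, where train.py stands in for whatever the training script is called.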

If the problem persists, try upgrading PyTorch and Transformers to their latest versions so that you have the most recent fixes and optimizations. You can upgrade with:

pip install --upgrade torch transformers
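
Before upgrading, it can help to record which versions are currently installed, so that any change in behavior can be traced back to a specific release. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the package names in the loop are simply the ones imported by the script in the question.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print one line per package; None means the package was not found.
    for name in ("torch", "transformers", "datasets"):
        print(name, installed_version(name))
```

Including this output when asking for help makes it easier for others to reproduce the problem.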

I hope these suggestions help. If the problem persists, feel free to ask again and we will do our best to help.
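
Since a device-side assert often comes from an index that falls outside a lookup table, one further check is to validate labels and token ids in plain Python before anything reaches the GPU. The sketch below is an illustration under stated assumptions, not the asker's or the answerer's code: find_out_of_range and check_batch are hypothetical helpers, and the toy values are made up. Whether the DNABERT_s tokenizer and checkpoint actually disagree on vocabulary size is not known from the post.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(values: Iterable[int], upper: int) -> List[Tuple[int, int]]:
    """Return (position, value) pairs for every value outside [0, upper)."""
    return [(i, v) for i, v in enumerate(values) if not 0 <= v < upper]


def check_batch(input_ids: List[List[int]], labels: List[int],
                vocab_size: int, num_labels: int) -> dict:
    """Collect out-of-range labels and token ids before anything reaches the GPU."""
    bad_tokens = []
    for row, ids in enumerate(input_ids):
        # Record (row, column, token id) for each token the embedding table cannot hold.
        bad_tokens += [(row, col, tok) for col, tok in find_out_of_range(ids, vocab_size)]
    return {
        "bad_labels": find_out_of_range(labels, num_labels),
        "bad_tokens": bad_tokens,
    }


if __name__ == "__main__":
    # Toy data, not taken from the question: a vocabulary of 10 tokens and 3 classes.
    report = check_batch(
        input_ids=[[1, 4, 9], [2, 10, 0]],  # 10 is outside [0, 10)
        labels=[0, 3],                       # 3 is outside [0, 3)
        vocab_size=10,
        num_labels=3,
    )
    print(report)
```

In the script from the question, the corresponding call would presumably be check_batch(train_encodings['input_ids'], train_labels, model.config.vocab_size, 30), repeated for the validation split. An empty report rules out these two causes; it would then be worth comparing the tokenized sequence length against model.config.max_position_embeddings as well.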

Question events

Question closed: August 23
Bounty of 15 yuan sponsored: August 22
Question created: August 22
