为什么用bio_clinicalbert模型再来训练病历文本数据时，总出现问题

用bio_clinicalbert模型再来训练我的病历文本数据时，总会出现一些问题，请帮忙调试一个能用的代码。
如下的代码来自chatGPT，我用这些代码调试时总有问题，我已经用chatGPT调试过多次了，请不要发来自GPT的答案。
我给的提问是：
I have text data of more than 1000 patients' medical records which are classified into two types, healthy and unhealthy. Please combine Bio_ClinicalBERT model to train these text data and debug an optimal model. Divide the data into training set, validation set and test set. During model training, train with the training set and test with the validation set to come up with the optimal hyperparameter combination, and show the accuracy and loss of each epoch with matplotlab.pyplot drawing with python code. Then the training and validation sets are merged into a new training set, and the new training set is used to train the Bio_ClinicalBERT model with the optimal hyperparameters obtained before, and then the trained model is saved. At the time of model testing, load the saved model and then perform accuracy testing on the test set. I have gpu in my computer, please use gpu for training. Please give specific executable code that explains each step and parameter.

chatGPT给的代码：

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.utils import shuffle

os.chdir('E:/CNSI_project/demo_and_data_3.0/aim2_tfModel_CliBertToken/')
# Set device to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load and preprocess the data
df = pd.read_csv('InputData_short_test_filtered.csv')
df = shuffle(df, random_state=2023)
texts = df['2']
labels = df['1']

# Split the data into train, validation, and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=2023)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.25, random_state=2023)

# Load the tokenizer and encode the data
tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
train_encodings = tokenizer(list(train_texts), padding=True, truncation=True, return_tensors='pt')
val_encodings = tokenizer(list(val_texts), padding=True, truncation=True, return_tensors='pt')
test_encodings = tokenizer(list(test_texts), padding=True, truncation=True, return_tensors='pt')

# Create PyTorch datasets
class MedicalRecordDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = MedicalRecordDataset(train_encodings, train_labels)
val_dataset = MedicalRecordDataset(val_encodings, val_labels)
test_dataset = MedicalRecordDataset(test_encodings, test_labels)

# Create PyTorch data loaders
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Define the model
num_labels = len(train_labels.unique())
model = AutoModelForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)

# Define the optimizer and learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Define the number of epochs and initialize lists to store losses and accuracies
num_epochs = 10
train_losses, val_losses, train_accs, val_accs = [], [], [], []

# Train the model，尤其是这里面，总是出现问题
for epoch in range(num_epochs):
    # Train the model
    model.train()
    train_loss = 0
    correct_train = 0
    total_train = 0
    for data, labels in train_loader:
        optimizer.zero_grad()
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)

        labels = torch.tensor(labels).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_train += (preds == labels).sum().item()
        total_train += data['input_ids'].size(0)
    train_losses.append(train_loss / total_train)
    train_accs.append(correct_train / total_train)

    # Evaluate the model on the validation set
    model.eval()
    val_loss = 0
    correct_val = 0
    total_val = 0
    with torch.no_grad():
        for data, labels in val_loader:
            data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
            labels = torch.tensor(labels.values).to(device)
            outputs = model(**data, labels=labels)
            loss = outputs.loss
            val_loss += loss.item() * data['input_ids'].size(0)
            preds = torch.argmax(outputs.logits, axis=1)
            correct_val += (preds == labels).sum().item()
            total_val += data['input_ids'].size(0)
    val_losses.append(val_loss / total_val)
    val_accs.append(correct_val / total_val)

    # Print the training and validation loss and accuracy for each epoch
    print(f'Epoch {epoch+1}/{num_epochs}:')
    print(f'Train loss: {train_losses[-1]:.4f}, Train acc: {train_accs[-1]*100:.2f}%')
    print(f'Val loss: {val_losses[-1]:.4f}, Val acc: {val_accs[-1]*100:.2f}%')

    # Plot the training and validation loss and accuracy for each epoch
    plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    plt.plot(train_losses, label='Training Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.title('Training and Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.subplot(1,2,2)
    plt.plot(train_accs, label='Training Accuracy')
    plt.plot(val_accs, label='Validation Accuracy')
    plt.title('Training and Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()

# Merge training and validation sets and retrain model with optimal hyperparameters
train_val_data = pd.concat([train_data, val_data], axis=0)
train_val_loader = create_data_loader(train_val_data, batch_size)

# Train model with optimal hyperparameters on merged training and validation set
model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)
optimizer = AdamW(model.parameters(), lr=lr, eps=eps)
total_steps = len(train_val_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=total_steps)

model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)
optimizer = AdamW(model.parameters(), lr=lr, eps=eps)
total_steps = len(train_val_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=total_steps)

for epoch in range(num_epochs):
    # Train the model
    model.train()
    train_loss = 0
    correct_train = 0
    total_train = 0
    for data, labels in train_val_loader:
        optimizer.zero_grad()
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
        labels = torch.tensor(labels.values).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_train += (preds == labels).sum().item()
        total_train += data['input_ids'].size(0)
        
        train_losses.append(train_loss / total_train)
        train_accs.append(correct_train / total_train)

# Save the trained model.
torch.save(model.state_dict(), 'clinical_bert_model.pth')

# Load the saved model and perform accuracy testing on the test set
model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)
model.load_state_dict(torch.load('clinical_bert_model.pth'))
model.eval()
test_loss = 0
correct_test = 0
total_test = 0
with torch.no_grad():
    for data, labels in test_loader:
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
        labels = torch.tensor(labels.values).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        test_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_test += (preds == labels).sum().item()
        total_test += data['input_ids'].size(0)

print(f'Test loss: {test_loss / total_test:.4f}, Test acc: {correct_test / total_test*100:.2f}%')

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除
收藏举报

3条回答默认最新

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
「已注销」 2023-03-07 13:37
关注
你让他给你调试

解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

报告相同问题？

关注问题

Unity使用bio ik设置机器手臂时发生问题 unity
2022-09-14 08:19

回答 1 已采纳你要不要换一个试试啊,我没用过你的那个插件,我之前用的是Final IK ,你可以试试,目前没有出现过问题
最近在编写块设备驱动，使用制造请求队列函数时，set_bit()和bio_endio()函数报错，请问这是什么原因？ c语言 linux 驱动开发
2023-03-04 10:00

回答 2 已采纳 “Devil组”引证GPT后的撰写： et_bit()和bio_endio()函数报错可能是由于缺少头文件或函数定义不正确引起的。这两个函数都是属于Linux内核中的函数，set_bit()用于设置某
紧急情况：运行时错误：无效的内存地址或nil指针取消引用|| 人体模型
2019-07-21 06:43

回答 1 已采纳 According to the package documentation for Body io.ReadCloser: For client requests, a nil bod
Java编程中的IO模型详解：BIO，NIO，AIO的区别与实际应用场景分析
2024-01-04 19:45

IO模型决定了数据的传输方式，Java支持BIO，NIO，AIO三种IO模型。BIO是同步阻塞模型，特点是一对一的客户端与处理线程关系，适用场景是连接数量较小并且固定的，优点是编程简单，但对服务器资源要求高。NIO是同步非...
在PHP中使用“openssl_pkcs7_decrypt”错误“BIO_new_file：没有这样的文件” php windows
2015-01-16 16:45

回答 3 已采纳 Finally, I've found a solution of this problem. Firstly, the file paths were wrong. I've resolved
openssl的BIO方式如何设置ssl连接超时？
2018-02-04 08:39

回答 1 已采纳 https://segmentfault.com/q/1010000002418731 一样的问题。
modules_install的时候老报错，请问这是为啥啊 linux ubuntu
2022-02-25 23:54

回答 1 已采纳 No such file or directory 没有对应的文件，相关的库文件是否已经安装好，对应的内核库是否已经 probe上去
人工智能-项目实践-预训练-使用预训练语言模型BERT做中文NER.zip
2024-01-06 21:12

使用方法从BERT-TF下载bert源代码，存放在路径下bert文件夹中从BERT-Base Chinese下载模型，存放在checkpoint文件夹下使用BIO数据标注模式，使用人民日报经典数据
python 两列合并时出现nan，怎么去掉呢 python 有问必答
2021-11-04 17:15

回答 2 已采纳先把nan值的用空值填充再合并呗
用户在输入/文本区域中使用单引号时未处理的数据 php sql
2013-09-26 19:06

回答 3 已采纳 This is a serious issue. What you face here is a wide open SQL INJECTION You concatenate a query
C++读取文件内容时最后一行出现两遍 c++
2022-12-30 17:57

回答 2 已采纳是不是文件最后一行是空行啊
【代码报错】OSError: Can‘t load tokenizer for ‘emilyalsentzer/Bio_ClinicalBERT‘.
2024-04-09 17:19

lzxjly的博客我们只能在本地下载模型上传到服务器后，再进行加载。，可以看到模型权重。下载自己需要的版本，我需要的是pytorch版本。2. 输入你想要的模型，找到对应的模型。6. 完成上述，假设出现以下错误：检查。4. 最好下载...
为什么使用Java缓冲流复制的图片打不开。 java
2020-06-15 21:47

回答 2 已采纳 byte[] buff = new byte[2048]; int len; while ((len = bis.read(buff)) != -1) { bos.write(buff, 0
使用预训练语言模型BERT做中文NER尝试，fine - tune BERT模型
2023-08-28 20:59

使用BIO数据标注模式，使用人民日报经典数据 train： python BERT_NER.py --data_dir=data/ --bert_config_file=checkpoint/bert_config.json --init_checkpoint=checkpoint/bert_model.ckpt --vocab_file=vocab....
bio_embeddings:从蛋白质序列中获取蛋白质嵌入
2021-04-27 14:52

表示深度（来自不同实验室的不同模型，针对不同的目的在不同的数据集上进行了训练）大量的示例为用户处理复杂性（例如CUDA OOM抽象），并提供有据可查的警告和错误消息。该项目包括：基于在生物学序列表示...
没有解决我的问题, 去提问

问题事件

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
已结题（查看结题原因） 3月14日
关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
修改了问题 3月7日
关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
修改了问题 3月7日
关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
创建了问题 3月7日

悬赏问题

¥15 怎么改成循环输入删除(语言-c语言)
¥15 安卓C读取/dev/fastpipe屏幕像素数据
¥15 pyqt5tools安装失败
¥15 mmdetection
¥15 nginx代理报502的错误
¥100 当AWR1843发送完设置的固定帧后，如何使其再发送第一次的帧
¥15 图示五个参数的模型校正是用什么方法做出来的。如何建立其他模型
¥100 描述一下元器件的基本功能，pcba板的基本原理
¥15 STM32无法向设备写入固件
¥15 使用ESP8266连接阿里云出现问题

为什么用bio_clinicalbert模型再来训练病历文本数据时，总出现问题

3条回答 默认 最新

问题事件

悬赏问题

3条回答默认最新