weixin_45099845 2023-03-07 13:34 采纳率: 42.9%
浏览 97
已结题

为什么用bio_clinicalbert模型再来训练病历文本数据时,总出现问题

用bio_clinicalbert模型再来训练我的病历文本数据时,总会出现一些问题,请帮忙调试一个能用的代码。
如下的代码来自chatGPT,我用这些代码调试时总有问题,我已经用chatGPT调试过多次了,请不要发来自GPT的答案。
我给的提问是:
I have text data of more than 1000 patients' medical records which are classified into two types, healthy and unhealthy. Please combine Bio_ClinicalBERT model to train these text data and debug an optimal model. Divide the data into training set, validation set and test set. During model training, train with the training set and test with the validation set to come up with the optimal hyperparameter combination, and show the accuracy and loss of each epoch with matplotlab.pyplot drawing with python code. Then the training and validation sets are merged into a new training set, and the new training set is used to train the Bio_ClinicalBERT model with the optimal hyperparameters obtained before, and then the trained model is saved. At the time of model testing, load the saved model and then perform accuracy testing on the test set. I have gpu in my computer, please use gpu for training. Please give specific executable code that explains each step and parameter.

chatGPT给的代码:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.utils import shuffle

os.chdir('E:/CNSI_project/demo_and_data_3.0/aim2_tfModel_CliBertToken/')
# Set device to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load and preprocess the data
df = pd.read_csv('InputData_short_test_filtered.csv')
df = shuffle(df, random_state=2023)
texts = df['2']
labels = df['1']

# Split the data into train, validation, and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=2023)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.25, random_state=2023)

# Load the tokenizer and encode the data
tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
train_encodings = tokenizer(list(train_texts), padding=True, truncation=True, return_tensors='pt')
val_encodings = tokenizer(list(val_texts), padding=True, truncation=True, return_tensors='pt')
test_encodings = tokenizer(list(test_texts), padding=True, truncation=True, return_tensors='pt')

# Create PyTorch datasets
class MedicalRecordDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = MedicalRecordDataset(train_encodings, train_labels)
val_dataset = MedicalRecordDataset(val_encodings, val_labels)
test_dataset = MedicalRecordDataset(test_encodings, test_labels)

# Create PyTorch data loaders
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Define the model
num_labels = len(train_labels.unique())
model = AutoModelForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)

# Define the optimizer and learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Define the number of epochs and initialize lists to store losses and accuracies
num_epochs = 10
train_losses, val_losses, train_accs, val_accs = [], [], [], []

# Train the model,尤其是这里面,总是出现问题
for epoch in range(num_epochs):
    # Train the model
    model.train()
    train_loss = 0
    correct_train = 0
    total_train = 0
    for data, labels in train_loader:
        optimizer.zero_grad()
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)

        labels = torch.tensor(labels).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_train += (preds == labels).sum().item()
        total_train += data['input_ids'].size(0)
    train_losses.append(train_loss / total_train)
    train_accs.append(correct_train / total_train)

    # Evaluate the model on the validation set
    model.eval()
    val_loss = 0
    correct_val = 0
    total_val = 0
    with torch.no_grad():
        for data, labels in val_loader:
            data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
            labels = torch.tensor(labels.values).to(device)
            outputs = model(**data, labels=labels)
            loss = outputs.loss
            val_loss += loss.item() * data['input_ids'].size(0)
            preds = torch.argmax(outputs.logits, axis=1)
            correct_val += (preds == labels).sum().item()
            total_val += data['input_ids'].size(0)
    val_losses.append(val_loss / total_val)
    val_accs.append(correct_val / total_val)

    # Print the training and validation loss and accuracy for each epoch
    print(f'Epoch {epoch+1}/{num_epochs}:')
    print(f'Train loss: {train_losses[-1]:.4f}, Train acc: {train_accs[-1]*100:.2f}%')
    print(f'Val loss: {val_losses[-1]:.4f}, Val acc: {val_accs[-1]*100:.2f}%')

    # Plot the training and validation loss and accuracy for each epoch
    plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    plt.plot(train_losses, label='Training Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.title('Training and Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.subplot(1,2,2)
    plt.plot(train_accs, label='Training Accuracy')
    plt.plot(val_accs, label='Validation Accuracy')
    plt.title('Training and Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()

# Merge training and validation sets and retrain model with optimal hyperparameters
train_val_data = pd.concat([train_data, val_data], axis=0)
train_val_loader = create_data_loader(train_val_data, batch_size)

# Train model with optimal hyperparameters on merged training and validation set
model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)
optimizer = AdamW(model.parameters(), lr=lr, eps=eps)
total_steps = len(train_val_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=total_steps)

model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)
optimizer = AdamW(model.parameters(), lr=lr, eps=eps)
total_steps = len(train_val_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=total_steps)

for epoch in range(num_epochs):
    # Train the model
    model.train()
    train_loss = 0
    correct_train = 0
    total_train = 0
    for data, labels in train_val_loader:
        optimizer.zero_grad()
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
        labels = torch.tensor(labels.values).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_train += (preds == labels).sum().item()
        total_train += data['input_ids'].size(0)
        
        train_losses.append(train_loss / total_train)
        train_accs.append(correct_train / total_train)

# Save the trained model.
torch.save(model.state_dict(), 'clinical_bert_model.pth')

# Load the saved model and perform accuracy testing on the test set
model = BertForSequenceClassification.from_pretrained('emilyalsentzer/Bio_ClinicalBERT', num_labels=num_labels).to(device)
model.load_state_dict(torch.load('clinical_bert_model.pth'))
model.eval()
test_loss = 0
correct_test = 0
total_test = 0
with torch.no_grad():
    for data, labels in test_loader:
        data = tokenizer(list(data), padding=True, truncation=True, return_tensors='pt').to(device)
        labels = torch.tensor(labels.values).to(device)
        outputs = model(**data, labels=labels)
        loss = outputs.loss
        test_loss += loss.item() * data['input_ids'].size(0)
        preds = torch.argmax(outputs.logits, axis=1)
        correct_test += (preds == labels).sum().item()
        total_test += data['input_ids'].size(0)

print(f'Test loss: {test_loss / total_test:.4f}, Test acc: {correct_test / total_test*100:.2f}%')
  • 写回答

3条回答 默认 最新

  • 「已注销」 2023-03-07 13:37
    关注

    你让他给你调试

    评论

报告相同问题?

问题事件

  • 已结题 (查看结题原因) 3月14日
  • 修改了问题 3月7日
  • 修改了问题 3月7日
  • 创建了问题 3月7日

悬赏问题

  • ¥15 怎么改成循环输入删除(语言-c语言)
  • ¥15 安卓C读取/dev/fastpipe屏幕像素数据
  • ¥15 pyqt5tools安装失败
  • ¥15 mmdetection
  • ¥15 nginx代理报502的错误
  • ¥100 当AWR1843发送完设置的固定帧后,如何使其再发送第一次的帧
  • ¥15 图示五个参数的模型校正是用什么方法做出来的。如何建立其他模型
  • ¥100 描述一下元器件的基本功能,pcba板的基本原理
  • ¥15 STM32无法向设备写入固件
  • ¥15 使用ESP8266连接阿里云出现问题