I'm working on multiclass machine-learning models: computing accuracy and other metrics, plotting multiclass ROC curves, and computing AUC values. The problem I've run into is that accuracy and AUC differ substantially. When accuracy is fairly low, the AUC still looks quite high; when accuracy is high, the AUC looks extremely high: several models come out above 0.99, which looks fake, and the resulting plots don't even look good. Could someone check whether my computation is wrong, or is this simply how AUC behaves?
Below is the code that computes model accuracy, plots the confusion matrix, plots the ROC curves, and computes the AUC values, using logistic regression as an example.
# Logistic regression
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression

regmodel = LogisticRegression()
regmodel.fit(X_train, y_train)  # train the model
# Accuracy score, confusion matrix, and classification report
regmodel_acc = accuracy_score(y_test, regmodel.predict(X_test))
print(f"Training Accuracy of LogisticRegression is {accuracy_score(y_train, regmodel.predict(X_train))}")
print(f"Test Accuracy of LogisticRegression is {regmodel_acc} \n")
print(f"Confusion Matrix :- \n{confusion_matrix(y_test, regmodel.predict(X_test))}\n")
print(f"Classification Report :- \n {classification_report(y_test, regmodel.predict(X_test),digits=3)}")
ConfusionMatrixDisplay.from_predictions(y_test, regmodel.predict(X_test), display_labels=["Grade 1", "Grade 2", "Grade 3", "Grade 4", "Grade 5", "Grade 6"], cmap=plt.cm.Blues, colorbar=True)
plt.title("LogisticRegression")
plt.grid(False)
# Plot the ROC curves and compute the AUC values
probability = regmodel.predict_proba(X_test)
from sklearn.preprocessing import OneHotEncoder
# Create the OneHotEncoder (sparse_output replaces the deprecated sparse
# argument in scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
# Reshape y_test into a 2-D array (one column)
y_test_array = [[label] for label in y_test]
# One-hot encode
y_test_encoded = encoder.fit_transform(y_test_array)
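As an aside, scikit-learn's `label_binarize` does the same one-hot conversion in a single call. A minimal sketch with made-up labels, assuming the six classes are coded 1 through 6 (adjust `classes=` to your actual label values):

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Made-up labels standing in for y_test; classes assumed to be coded 1..6
y_demo = np.array([1, 3, 2, 6, 5, 4, 1, 2])
y_demo_encoded = label_binarize(y_demo, classes=[1, 2, 3, 4, 5, 6])
print(y_demo_encoded.shape)  # one row per sample, one column per class
```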
import numpy as np  # np.interp below replaces scipy's interp, which has been removed from recent SciPy
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn.metrics import roc_curve, auc
y_label = y_test_encoded
y_score = probability
n_classes = 6
# Compute the ROC for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_label[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# micro-average (method 2)
fpr["micro"], tpr["micro"], _ = roc_curve(y_label.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# macro-average (method 1)
# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
# Finally average it and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
# Plot all ROC curves
lw = 2
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)
# Six distinct colors so the six per-class curves don't repeat
colors = cycle(['aqua', 'darkorange', 'cornflowerblue',
                'seagreen', 'crimson', 'mediumpurple'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('LogisticRegression ROC')
plt.legend(loc="lower right")
plt.show()
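One quick way to sanity-check the hand-rolled macro average above: `roc_auc_score` computes the multiclass one-vs-rest AUC directly from the integer labels and the `predict_proba` matrix, with no manual one-hot encoding, and the two should agree closely. A sketch on synthetic data (purely illustrative; with the real model you would pass `y_test` and `probability` instead):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, auc

rng = np.random.default_rng(0)
n, k = 300, 6
y_true = rng.integers(0, k, size=n)          # stand-in for y_test
scores = rng.random((n, k))
scores[np.arange(n), y_true] += 1.0          # make the true class score higher
scores /= scores.sum(axis=1, keepdims=True)  # rows sum to 1, like predict_proba

# One-call macro OvR AUC
macro_auc = roc_auc_score(y_true, scores, multi_class='ovr', average='macro')

# Hand-rolled: per-class binary AUCs, then an unweighted mean, as in the code above
per_class = [auc(*roc_curve((y_true == i).astype(int), scores[:, i])[:2])
             for i in range(k)]
print(macro_auc, np.mean(per_class))
```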
Here are the classification results:

[screenshot: classification report]

As you can see, the accuracy is only 0.598.
Here are the ROC curves and AUC:

[screenshot: ROC plot]

As you can see, the AUC is 0.90.
Isn't that a rather large gap between accuracy and AUC?
Now for XGBoost; the code is essentially the same as above.
Here are the classification results:

[screenshot: classification report]

The accuracy is 0.908.
Here are the ROC curves and AUC:

[screenshot: ROC plot]

The AUC jumps straight to 0.994.
Is the computation above correct? Is there a problem with the accuracy and AUC values?
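For what it's worth, a large accuracy/AUC gap is not by itself a bug: accuracy only scores the argmax of `predict_proba`, while ROC/AUC scores how well each class's probability ranks its positives above its negatives, so a model can rank well (high AUC) while its top-1 pick is often wrong, especially with six classes. A small synthetic sketch (made-up scores, not from any real model) where the two metrics diverge by design:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(42)
n, k = 2000, 6
y_true = rng.integers(0, k, size=n)
scores = rng.random((n, k))
scores[np.arange(n), y_true] += 0.3          # weak but consistent signal for the true class
scores /= scores.sum(axis=1, keepdims=True)  # rows sum to 1, like predict_proba

# Accuracy uses only the argmax; AUC uses the full ranking per class
acc = accuracy_score(y_true, scores.argmax(axis=1))
macro_auc = roc_auc_score(y_true, scores, multi_class='ovr', average='macro')
print(f"accuracy = {acc:.3f}, macro OvR AUC = {macro_auc:.3f}")
```

With a weak signal spread over six classes, the argmax is wrong more often than not, yet every per-class ranking is clearly better than chance, so the AUC sits well above the accuracy.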