Why does LogisticRegression.fit reject my data as 1D when I expect it to be 2D?

Error message:

ValueError: Expected 2D array, got 1D array instead

# Credit card transaction anomaly detection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

plt.ion()

data = pd.read_csv("../data/creditcard.csv")
print(data.head())
count_classes = pd.value_counts(data["Class"], sort=True).sort_index()
print("------------------------------------------------------------------")
print(count_classes)  # normal (Class 0): 284315  |  fraud (Class 1): 492
count_classes.plot(kind="bar")
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

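'''
    The classes are heavily imbalanced: one class is very large and the other very small.
    Undersampling: shrink the majority class so both classes are equally small.
    Oversampling: grow the minority class so both classes are equally large.
'''

# Standardize Amount (zero mean, unit variance)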
data["normAmount"] = StandardScaler().fit_transform(data["Amount"].values.reshape(-1, 1))
data = data.drop(["Time", "Amount"], axis=1)
print("------------------------------------------------------------------")
print(data.head())
print("------------------------------------------------------------------")
x = data.iloc[:, data.columns != "Class"]
y = data.iloc[:, data.columns == "Class"]
# Fraud samples (Class == 1)
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
# Normal samples (Class == 0)
normal_indices = data[data.Class == 0].index

# Undersampling: randomly pick as many normal samples as there are fraud samples
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Concatenate the two sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
under_sample_data = data.iloc[under_sample_indices, :]

x_undersample = under_sample_data.iloc[:, under_sample_data.columns != "Class"]
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == "Class"]

print("Percentage of normal transactions: ",
      len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print("Percentage of fraud transactions: ",
      len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

# test_size=0.3: 30% of the data is the test set and 70% the training set; random_state=0 makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
print("------------------------------------------------------------------")
print("Number transactions train dataset: ", len(x_train))
print("Number of transactions test dataset: ", len(x_test))
print("Total number of transactions: ", len(x_train) + len(x_test))
print("------------------------------------------------------------------")

x_train_undersample, x_test_undersample, y_train_undersample, y_test_undertrainsample = train_test_split(x_undersample,
                                                                                                         y_undersample,
                                                                                                         test_size=0.3,
                                                                                                         random_state=0)
print("Number transactions train dataset: ", len(x_train_undersample))
print("Number of transactions test dataset: ", len(x_test_undersample))
print("Total number of transactions: ", len(x_train_undersample) + len(x_test_undersample))
print("------------------------------------------------------------------")

# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression
# KFold splits the data into k folds for cross-validation; cross_val_score evaluates a model with cross-validation
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, classification_report

# Logistic regression model

def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)  # split the data into five folds
    c_param_range = [0.01, 0.1, 1, 10, 100]  # candidate values of C (inverse regularization strength)

    results_table = pd.DataFrame(index=range(len(c_param_range), 2), columns=["C_parameter", "Mean recall score"])
    results_table["C_parameter"] = c_param_range

    j = 0
    for c_param in c_param_range:
        print("------------------------------------------------------------------")
        print("C_parameter: ", c_param)
        print("------------------------------------------------------------------")
        print("")

        recall_accs = []
        for iteration, indices in fold.split(y_train_data):
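            # iteration: training indices; indices: validation indices
            # Logistic regression: C is the inverse of the regularization strength; penalty can be 'l1' (absolute value) or 'l2' (squared)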
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            print("test--------------------------------------------------------------")
            print(x_train_data.iloc[indices[0], :].values)
            print(y_train_data.iloc[indices[0], :].values.ravel())
            print("------------------------------------------------------------------")
            # Fit the model on the training data
            lr.fit(x_train_data.iloc[indices[0], :].values, y_train_data.iloc[indices[0], :].values.ravel())
            # The lr.fit call above (line 107 in the traceback) fails with:
            # ValueError: Expected 2D array, got 1D array instead:
            # array=[-1.86375555  3.44264398 -4.46825973  2.80533626 -2.11841248 -2.33228489
            # -4.2612372   1.70168184 -1.43939588 -6.99990663  6.31620968 -8.670818
            # 0.31602399 -7.41771206 -0.43653747 -3.65280196 -6.29314532 -1.24324829
            # 0.36481048  0.360924    0.66792657 -0.51624236 -0.01221781  0.0706137
            # 0.05850447  0.30488284  0.41801247  0.20885828 -0.34923131]. (this array is x_train_data.iloc[indices[0], :])
            # Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

            # With the model fitted, predict on the validation set (index 1)
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)

            # Compute recall
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print("Iteration ", iteration, " : recall score = ", recall_acc)

        results_table.ix[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']

    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c


best_c = printing_Kfold_scores(x_train_undersample, y_train_undersample)

   Time        V1        V2        V3  ...       V27       V28  Amount  Class
0   0.0 -1.359807 -0.072781  2.536347  ...  0.133558 -0.021053  149.62      0
1   0.0  1.191857  0.266151  0.166480  ... -0.008983  0.014724    2.69      0
2   1.0 -1.358354 -1.340163  1.773209  ... -0.055353 -0.059752  378.66      0
3   1.0 -0.966272 -0.185226  1.792993  ...  0.062723  0.061458  123.50      0
4   2.0 -1.158233  0.877737  1.548718  ...  0.219422  0.215153   69.99      0

[5 rows x 31 columns]
------------------------------------------------------------------
0    284315
1       492
Name: Class, dtype: int64
------------------------------------------------------------------
         V1        V2        V3  ...       V28  Class  normAmount
0 -1.359807 -0.072781  2.536347  ... -0.021053      0    0.244964
1  1.191857  0.266151  0.166480  ...  0.014724      0   -0.342475
2 -1.358354 -1.340163  1.773209  ... -0.059752      0    1.160686
3 -0.966272 -0.185226  1.792993  ...  0.061458      0    0.140534
4 -1.158233  0.877737  1.548718  ...  0.215153      0   -0.073403

[5 rows x 30 columns]
------------------------------------------------------------------
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  984
------------------------------------------------------------------
Number transactions train dataset:  199364
Number of transactions test dataset:  85443
Total number of transactions:  284807
------------------------------------------------------------------
Number transactions train dataset:  688
Number of transactions test dataset:  296
Total number of transactions:  984
------------------------------------------------------------------
------------------------------------------------------------------
C_parameter:  0.01
------------------------------------------------------------------

test--------------------------------------------------------------
[-1.86375555  3.44264398 -4.46825973  2.80533626 -2.11841248 -2.33228489
 -4.2612372   1.70168184 -1.43939588 -6.99990663  6.31620968 -8.670818
  0.31602399 -7.41771206 -0.43653747 -3.65280196 -6.29314532 -1.24324829
  0.36481048  0.360924    0.66792657 -0.51624236 -0.01221781  0.0706137
  0.05850447  0.30488284  0.41801247  0.20885828 -0.34923131]
[1]
------------------------------------------------------------------
Traceback (most recent call last):
  File "F:/code/deeplearning/DeepLearning/logistic_regression/learning12_transaction_data_anomaly_detection.py", line 140, in <module>
    best_c = printing_Kfold_scores(x_train_undersample, y_train_undersample)
  File "F:/code/deeplearning/DeepLearning/logistic_regression/learning12_transaction_data_anomaly_detection.py", line 107, in printing_Kfold_scores
    lr.fit(x_train_data.iloc[indices[0], :].values, y_train_data.iloc[indices[0], :].values.ravel())
  File "C:\Users\lhw\python\lib\site-packages\sklearn\linear_model\_logistic.py", line 1346, in fit
    accept_large_sparse=solver != 'liblinear')
  File "C:\Users\lhw\python\lib\site-packages\sklearn\base.py", line 433, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\lhw\python\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\lhw\python\lib\site-packages\sklearn\utils\validation.py", line 878, in check_X_y
    estimator=estimator)
  File "C:\Users\lhw\python\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\lhw\python\lib\site-packages\sklearn\utils\validation.py", line 698, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[-1.86375555  3.44264398 -4.46825973  2.80533626 -2.11841248 -2.33228489
 -4.2612372   1.70168184 -1.43939588 -6.99990663  6.31620968 -8.670818
  0.31602399 -7.41771206 -0.43653747 -3.65280196 -6.29314532 -1.24324829
  0.36481048  0.360924    0.66792657 -0.51624236 -0.01221781  0.0706137
  0.05850447  0.30488284  0.41801247  0.20885828 -0.34923131].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Process finished with exit code 1

Printing x_train_data.iloc[indices[0], :].values shows the following:

print(x_train_data.iloc[indices[0], :].values)
print(y_train_data.iloc[indices[0], :].values.ravel())
print(type(x_train_data.iloc[indices[0], :].values))
print(x_train_data.iloc[indices[0], :].values.shape)

[-1.86375555  3.44264398 -4.46825973  2.80533626 -2.11841248 -2.33228489
 -4.2612372   1.70168184 -1.43939588 -6.99990663  6.31620968 -8.670818
  0.31602399 -7.41771206 -0.43653747 -3.65280196 -6.29314532 -1.24324829
  0.36481048  0.360924    0.66792657 -0.51624236 -0.01221781  0.0706137
  0.05850447  0.30488284  0.41801247  0.20885828 -0.34923131]
[1]
<class 'numpy.ndarray'>
(29,)

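In principle this should satisfy the parameter requirements, but it keeps raising the error and I can't figure out why. Newbie asking for help!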
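
For anyone hitting the same wall: the (29,) shape printed above is the clue. A small sketch of what seems to be going on, using a made-up 4x3 DataFrame called `frame` as a stand-in for x_train_data (the names and data here are illustrative assumptions, not taken from the script). Selecting with a single integer position gives one row as a 1D array, while selecting with an array or list of positions keeps the result 2D:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for x_train_data: 4 samples, 3 features
frame = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["V1", "V2", "V3"])

# A single integer position selects one row as a Series, so .values is 1D
single_row = frame.iloc[0, :].values
print(single_row.shape)   # (3,)

# An array (or list) of positions keeps the result a DataFrame, so .values is 2D
row_block = frame.iloc[np.array([0, 2]), :].values
print(row_block.shape)    # (2, 3)

# Even one row stays 2D when the position is wrapped in a list
one_row_2d = frame.iloc[[0], :].values
print(one_row_2d.shape)   # (1, 3)
```

In the failing script, `indices` is already an array of row positions, so `indices[0]` is a single integer and the first case applies.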

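Thanks to the community experts for the pointers! Now I understand where I went wrong.

The code is finally fixed.

In fact, this step was already wrong (I copied it incorrectly and never noticed; careless through and through (*￣︿￣)):

for iteration, indices in fold.split(y_train_data):

It should be:

for iteration, indices in fold.split(x_train_data):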
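
A side note on this change, as a hedged sketch on made-up arrays (`x_demo` and `y_demo` are illustrative names, not from the script): as far as I can tell, KFold.split only uses the number of samples, so splitting on the features or on the labels yields the same index arrays. What each iteration returns is a (train_indices, validation_indices) pair of integer arrays, which is why indexing into it with [0] picks out a single row position.

```python
import numpy as np
from sklearn.model_selection import KFold

x_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (made-up data)
y_demo = np.arange(10) % 2             # 10 labels (made-up data)

fold = KFold(5, shuffle=False)

pairs_from_x = list(fold.split(x_demo))
pairs_from_y = list(fold.split(y_demo))

# Each element is a (train_indices, validation_indices) pair of integer arrays
train_idx, test_idx = pairs_from_x[0]
print(train_idx)  # [2 3 4 5 6 7 8 9]
print(test_idx)   # [0 1]

# Splitting on x or on y gives identical index arrays
same = all(
    np.array_equal(tx, ty) and np.array_equal(vx, vy)
    for (tx, vx), (ty, vy) in zip(pairs_from_x, pairs_from_y)
)
print(same)  # True
```

So passing x_train_data is the clearer choice, but the shape error itself comes from how the returned indices are used afterwards.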

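The arguments passed to fit also had to change. I had never really understood that part before; after reading this thread, the difference between the old and the new approach became clear to me: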

https://stackoverflow.com/questions/48641290/typeerror-kfold-object-is-not-iterable
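
My reading of that thread, offered as a sketch and not as the definitive explanation: older tutorials iterated over the KFold object itself with enumerate(fold, start=1), so `iteration` was a fold counter and `indices` was a (train, validation) pair, which is where indices[0] and indices[1] come from. With sklearn.model_selection.KFold, the same loop shape can be kept by enumerating fold.split(...). The data below is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

x_demo = np.arange(12).reshape(6, 2)  # made-up data: 6 samples, 2 features

fold = KFold(3, shuffle=False)

fold_sizes = []
# enumerate supplies the fold counter; each item from split is (train_idx, valid_idx)
for iteration, indices in enumerate(fold.split(x_demo), start=1):
    train_rows = x_demo[indices[0]]   # 2D: all training rows of this fold
    valid_rows = x_demo[indices[1]]   # 2D: all validation rows of this fold
    fold_sizes.append((iteration, train_rows.shape, valid_rows.shape))
    print(iteration, train_rows.shape, valid_rows.shape)
```

Without enumerate, the two loop variables are the training and validation index arrays directly, which is the form used in the revised code.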

Here is the revised code:

# Credit card transaction anomaly detection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

plt.ion()

data = pd.read_csv("../data/creditcard.csv")
print(data.head())
count_classes = data["Class"].value_counts(sort=True).sort_index()
print("------------------------------------------------------------------")
print(count_classes)  # normal (Class 0): 284315  |  fraud (Class 1): 492
count_classes.plot(kind="bar")
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()

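'''
    The classes are heavily imbalanced: one class is very large and the other very small.
    Undersampling: shrink the majority class so both classes are equally small.
    Oversampling: grow the minority class so both classes are equally large.
'''

# Standardize Amount (zero mean, unit variance)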
data["normAmount"] = StandardScaler().fit_transform(data["Amount"].values.reshape(-1, 1))
data = data.drop(["Time", "Amount"], axis=1)
print("------------------------------------------------------------------")
print(data.head())
print("------------------------------------------------------------------")
x = data.iloc[:, data.columns != "Class"]
y = data.iloc[:, data.columns == "Class"]
# Fraud samples (Class == 1)
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
# Normal samples (Class == 0)
normal_indices = data[data.Class == 0].index

# Undersampling: randomly pick as many normal samples as there are fraud samples
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Concatenate the two sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
under_sample_data = data.iloc[under_sample_indices, :]

x_undersample = under_sample_data.iloc[:, under_sample_data.columns != "Class"]
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == "Class"]

print("Percentage of normal transactions: ",
      len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print("Percentage of fraud transactions: ",
      len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

# test_size=0.3: 30% of the data is the test set and 70% the training set; random_state=0 makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
print("------------------------------------------------------------------")
print("Number transactions train dataset: ", len(x_train))
print("Number of transactions test dataset: ", len(x_test))
print("Total number of transactions: ", len(x_train) + len(x_test))
print("------------------------------------------------------------------")

x_train_undersample, x_test_undersample, y_train_undersample, y_test_undertrainsample = train_test_split(x_undersample,
                                                                                                         y_undersample,
                                                                                                         test_size=0.3,
                                                                                                         random_state=0)
print("Number transactions train dataset: ", len(x_train_undersample))
print("Number of transactions test dataset: ", len(x_test_undersample))
print("Total number of transactions: ", len(x_train_undersample) + len(x_test_undersample))
print("------------------------------------------------------------------")

# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression
# KFold splits the data into k folds for cross-validation; cross_val_score evaluates a model with cross-validation
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, classification_report

# Logistic regression model


def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)  # split the data into five folds
    c_param_range = [0.01, 0.1, 1, 10, 100]  # candidate values of C (inverse regularization strength)

    # One row per candidate C
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=["C_parameter", "Mean recall score"])
    results_table["C_parameter"] = c_param_range

    j = 0
    for c_param in c_param_range:
        print("------------------------------------------------------------------")
        print("C_parameter: ", c_param)
        print("------------------------------------------------------------------")
        print("")

        recall_accs = []
        for iteration, indices in fold.split(x_train_data):
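            # iteration: training indices; indices: validation indices
            # Logistic regression: C is the inverse of the regularization strength; penalty can be 'l1' (absolute value) or 'l2' (squared)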
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            y_shape_num = y_train_data.iloc[iteration, :].values.ravel().shape
            print(y_shape_num[0])
            # Fit the model on the training rows of this fold
            lr.fit(x_train_data.iloc[iteration, :].values.reshape(y_shape_num[0],-1), y_train_data.iloc[iteration, :].values.ravel())

            # With the model fitted, predict on the validation rows of this fold
            y_pred_undersample = lr.predict(x_train_data.iloc[indices, :])

            # Compute recall
            recall_acc = recall_score(y_train_data.iloc[indices, :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print("Iteration ", " : recall score = ", recall_acc)

        # print(type(results_table.ix[j]))
        # print(np.mean(recall_accs),type(np.mean(recall_accs)))
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)

        print()
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    print("+++++++++++++++++++++++++++++++ recall score list +++++++++++++++++++++++++++++++")
    print(results_table)
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']

    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c


best_c = printing_Kfold_scores(x_train_undersample, y_train_undersample)
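
The script imports cross_val_score but never uses it. As a hedged alternative sketch, not the method used above: the manual loop could be replaced by cross_val_score with scoring="recall". To stay self-contained, the example below uses synthetic data from make_classification in place of the undersampled credit card data; `x_demo`, `y_demo`, and `mean_recall` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, roughly balanced stand-in for the undersampled training set
x_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

fold = KFold(5, shuffle=False)
mean_recall = {}
for c_param in [0.01, 0.1, 1, 10, 100]:
    lr = LogisticRegression(C=c_param, penalty="l1", solver="liblinear")
    # One recall score per fold; cross_val_score handles the fit/predict loop
    scores = cross_val_score(lr, x_demo, y_demo, cv=fold, scoring="recall")
    mean_recall[c_param] = scores.mean()

# Pick the C with the highest mean recall across folds
best_c = max(mean_recall, key=mean_recall.get)
print(mean_recall)
print("best C:", best_c)
```

This hands the index bookkeeping to sklearn, so there is no place left to mix up training and validation indices.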


3 answers

CSDN专家-深度学习进阶 2021-06-21 20:52
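In recent versions of sklearn, the feature data passed to an estimator must be a 2D array, even if it is only a single row or a single column (for example, when predicting on just one sample). A single sample therefore has to be converted with .reshape(1, -1).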

For details, see the Jianshu article "使用sklearn报错ValueError: Expected 2D array, got 1D array instead" (jianshu.com).
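
A minimal sketch of the reshape advice, on made-up data (the arrays and variable names are assumptions for illustration only): predicting on one sample stored as a 1D array raises the ValueError, and reshaping it to one row makes it acceptable. The last lines show the other case from the error message, a single feature column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: 6 samples, 2 features
x_demo = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                   [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression().fit(x_demo, y_demo)

one_sample = np.array([0.9, 1.0])   # shape (2,): a 1D array is rejected
try:
    lr.predict(one_sample)
    raised = False
except ValueError:
    raised = True

as_row = one_sample.reshape(1, -1)  # shape (1, 2): one sample, two features
prediction = lr.predict(as_row)
print(raised, as_row.shape, prediction.shape)

one_feature = np.array([0.1, 0.5, 0.9])
as_column = one_feature.reshape(-1, 1)  # shape (3, 1): three samples, one feature
print(as_column.shape)
```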

The asker selected this answer as the best answer.

Why does LogisticRegression.fit reject my data as 1D when I expect it to be 2D? (tags: python, 有问必答, machine learning)
2021-06-21 20:51
