Bug when writing a dataset to an HDF5 file

A bug occurs while writing the Northeastern University (NEU) dataset to HDF5 format.

from config import gray_config as config
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from gaoimage.io import HDF5DatasetWriter
from imutils import paths
import numpy as np
import progressbar
# import json
import cv2
import os

imagePaths = list(paths.list_images(config.IMAGE_PATH))
imageLabels = [p.split(os.path.sep)[-2] for p in imagePaths]
le = LabelEncoder()
imageLabels = le.fit_transform(imageLabels)

# split the original paths to res and test, 240(1440) for res, 60(360) for test
(resPaths, testPaths, resLabels, testLabels) = train_test_split(
    imagePaths, imageLabels, test_size=0.2, random_state=42)

# split the res paths to train and validation, 180(1080) for train, 60(360) for validation
(trainPaths, valPaths, trainLabels, valLabels) = train_test_split(
    resPaths, resLabels, test_size=0.25, random_state=42)

# construct a list pairing the training, validation, and testing
# image paths along with their corresponding labels and output HDF5 files
datasets = [
    ("train", trainPaths, trainLabels, config.TRAIN_HDF5),
    ("val", valPaths, valLabels, config.VAL_HDF5),
    ("test", testPaths, testLabels, config.TEST_HDF5)
]

# initialize the image preprocessor and the list of RGB channel averages
# (R, G, B) = ([], [], [])

# loop over the dataset tuples
for (dType, paths, labels, outputPath) in datasets:
    # create HDF5 writer
    print("[INFO] building {}...".format(outputPath))
    writer = HDF5DatasetWriter((len(paths), 200, 200, 1), outputPath)

    # initialize the progress bar
    widgets = ["Building Dataset: ", progressbar.Percentage(), " ",
               progressbar.Bar(), " ", progressbar.ETA()]
    pbar = progressbar.ProgressBar(maxval=len(paths),
                                   widgets=widgets).start()

    # loop over the image paths
    for (i, (path, label)) in enumerate(zip(paths, labels)):
        # load the image and process it
        image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        image = np.expand_dims(image, axis=2)

        # if we are building the training dataset, then compute the
        # mean of each channel in the image, then update the respective lists
        # if dType == "train":
        #     (b, g, r) = cv2.mean(image)[:3]
        #     R.append(r)
        #     G.append(g)
        #     B.append(b)

        # add the image and label to the HDF5 dataset
        writer.add([image], [label])
        pbar.update(i)

    # close the HDF5 writer
    pbar.finish()
    writer.close()

# config/gray_config.py (the module imported above as `config`)
from os import path

IMAGE_PATH = "../zhai/dataset1/NEU-CLS/images"


TRAIN_HDF5 = "../zhai/dataset1/NEU-CLS/hdf5/train.hdf5"
VAL_HDF5 = "../zhai/dataset1/NEU-CLS/hdf5/val.hdf5"
TEST_HDF5 = "../zhai/dataset1/NEU-CLS/hdf5/test.hdf5"

OUTPUT_PATH = "gray_output"

figPath = path.sep.join([OUTPUT_PATH, "ms_test1.png"])
jsonPath = path.sep.join([OUTPUT_PATH, "ms_test1.json"])
DATASET_MEAN = "gray_output/NEU_DET_1_mean.json"

Error message: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.


Jackyin0720 2022-11-24 17:50

With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
n_samples=0 means train_test_split received no samples at all, i.e. imagePaths is an empty list because no images were found under IMAGE_PATH.
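Before changing anything, it can help to confirm what is actually on disk. The following is a minimal sketch, not part of the original script: the helper names `count_files_by_extension` and `check_image_dir` and the `IMAGE_EXTENSIONS` tuple are assumptions made for illustration. It uses only the standard library, reports whether the directory exists, and counts the files found per extension.

```python
import os
from collections import Counter

# Extensions treated as images in this sketch (an assumption, adjust as needed).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")


def count_files_by_extension(root):
    """Walk `root` recursively and count files per lowercase extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            counts[os.path.splitext(name)[1].lower()] += 1
    return counts


def check_image_dir(root):
    """Raise a descriptive error if `root` is missing or holds no image files."""
    if not os.path.isdir(root):
        raise FileNotFoundError(
            "image directory not found: {} (resolved to {})".format(
                root, os.path.abspath(root)))
    counts = count_files_by_extension(root)
    n_images = sum(counts[ext] for ext in IMAGE_EXTENSIONS)
    if n_images == 0:
        raise ValueError(
            "no image files under {}; extensions found: {}".format(
                root, dict(counts)))
    return counts
```

Calling `check_image_dir(config.IMAGE_PATH)` right before the first `train_test_split` would turn the n_samples=0 message into one that names the directory and the extensions present. Note that a relative path such as "../zhai/dataset1/NEU-CLS/images" is resolved against the current working directory, so the result depends on where the script is launched from.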
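Analysis: the training images of the original dataset were in PNG format, whereas the images in the actual dataset are JPG.
Solution:
Convert the training images from JPG to PNG.
Reference code example: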

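import os
import cv2


def transform(input_path, output_path):
    # make sure the output folder exists, otherwise cv2.imwrite fails silently
    if output_path:
        os.makedirs(output_path, exist_ok=True)
    # walk the input folder and convert every JPG image to PNG
    for root, dirs, files in os.walk(input_path):
        for name in files:
            stem, ext = os.path.splitext(name)
            # skip anything that is not a JPG file
            if ext.lower() not in ('.jpg', '.jpeg'):
                continue
            file = os.path.join(root, name)
            print('transform ' + name)
            im = cv2.imread(file)
            # cv2.imread returns None when the file cannot be decoded
            if im is None:
                print('skip unreadable file: ' + file)
                continue
            # write into output_path if given (subfolders are not recreated),
            # otherwise next to the source file
            target_dir = output_path if output_path else root
            cv2.imwrite(os.path.join(target_dir, stem + '.png'), im)


if __name__ == '__main__':
    input_path = input("Enter the source folder: ")

    output_path = input("Enter the output folder (press Enter to write next to the source files): ")
    if not os.path.exists(input_path):
        print("Folder does not exist!")
    else:
        print("Start to transform!")
        transform(input_path, output_path)
        print("Transform end!")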
