x2849209890 2025-05-07 17:38

How should I build a dataset when the images are given as URLs?

My images are passed in as URLs:

[ {
  "id" : "2500010_1",
  "conversations" : [ {
    "from" : "user",
    "value" : "FabricDesign Yes: <|vision_start|>https://deti.cn-sh2.ufileos.com/PC%2F9db094a2-2e8c-4fa1-b931-50f9e25c9c98.png?UCloudPublicKey=TOKEN_696b1902-38dd-411f-b9f6-18c454ec200c&Signature=x4SPImsAbXI3EctpIQwYKu2ZfnM%3D&Expires=1746687410<|vision_end|>"
  }, {
    "from" : "assistant",
    "value" : "女生-上衣,型号:2500010,颜色:深灰,质量等级:合格品,服务类型:打版+采购+生产,生产类型:包工包料"
  } ]
}, {
  "id" : "2500010_2",
  "conversations" : [ {
    "from" : "user",
    "value" : "FabricDesign Yes: <|vision_start|>https://deti.cn-sh2.ufileos.com/PC%2F5c21c566-d60d-4679-9d0b-9423d00f585f.png?UCloudPublicKey=TOKEN_696b1902-38dd-411f-b9f6-18c454ec200c&Signature=WxAMVLeCtL0UQKdQwRWU18V6wnI%3D&Expires=1746675952<|vision_end|>"
  }, {
    "from" : "assistant",
    "value" : "主料A:50单面 | 编号:50单面 | 成分:90%棉 | 颜色:精白 | 幅宽:20*20 | 单位:米 | 克重:20 | 供应商:(13632374222) | 含税价:17.80 | 特殊工艺:"
  } ]

Then training failed with this error:

D:\Anaconda\envs\wt\python.exe D:\xm\wt\train.py 
swanlab: Tracking run with swanlab version 0.5.7
swanlab: Run data will be saved locally in D:\xm\wt\swanlog\run-20250507_172653-2126b24a
swanlab: 👋 Hi xun, welcome to swanlab!
swanlab: Syncing run Qwen/Qwen2-VL-2B-Instruct to the cloud
swanlab: 🏠 View project at https://swanlab.cn/@xun/qwen-finetune
swanlab: 🚀 View run at https://swanlab.cn/@xun/qwen-finetune/runs/ybsxv8axwjtlelg87rmm0
2025-05-07 17:26:54,534 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  9.83it/s]
Generating train split: 819 examples [00:00, 34821.44 examples/s]
Map:   7%|▋         | 57/819 [00:09<02:11,  5.81 examples/s]
swanlab: Error happened while training
swanlab: 🏠 View project at https://swanlab.cn/@xun/qwen-finetune
swanlab: 🚀 View run at https://swanlab.cn/@xun/qwen-finetune/runs/ybsxv8axwjtlelg87rmm0
  File "D:\xm\wt\train.py", line 146, in <module>
    train_dataset = train_ds.map(process_func)
  File "D:\Anaconda\envs\wt\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "D:\Anaconda\envs\wt\lib\site-packages\datasets\arrow_dataset.py", line 3079, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "D:\Anaconda\envs\wt\lib\site-packages\datasets\arrow_dataset.py", line 3501, in _map_single
    for i, example in iter_outputs(shard_iterable):
  File "D:\Anaconda\envs\wt\lib\site-packages\datasets\arrow_dataset.py", line 3475, in iter_outputs
    yield i, apply_function(example, i, offset=offset)
  File "D:\Anaconda\envs\wt\lib\site-packages\datasets\arrow_dataset.py", line 3398, in apply_function
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "D:\xm\wt\train.py", line 56, in process_func
    image_inputs, video_inputs = process_vision_info(messages)  # 获取数据数据(预处理过)
  File "D:\Anaconda\envs\wt\lib\site-packages\qwen_vl_utils\vision_process.py", line 330, in process_vision_info
    image_inputs.append(fetch_image(vision_info))
  File "D:\Anaconda\envs\wt\lib\site-packages\qwen_vl_utils\vision_process.py", line 91, in fetch_image
    image_obj = Image.open(requests.get(image, stream=True).raw)
  File "D:\Anaconda\envs\wt\lib\site-packages\PIL\Image.py", line 3572, in open
    raise UnidentifiedImageError(msg)
cannot identify image file <_io.BytesIO object at 0x0000029BEBD236A0>



Is there something wrong with my dataset format? What does the standard format look like?

3 answers

  • 宝码香车 2025-05-07 17:56

    This answer was written with reference to DeepSeek.

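    Judging from the error message and the dataset you posted, the problem lies mainly in how the image URL is handled. A Qwen-VL multimodal dataset has to follow the structure that the training script parses, especially when the images are given as URLs. Analysis and fixes follow: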

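    1. Problem analysis

    The error means PIL could not recognize the data fetched from the URL as an image. The progress bar shows roughly 57 of the 819 examples mapped before the failure, which suggests the problem is tied to a specific record rather than to every record. Possible causes:

    • The URL string is malformed (for example, it still contains the special markers <|vision_start|> and <|vision_end|>)
    • The server returned something other than valid image data
    • The network request failed or was denied for permission reasons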

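    2. Standard dataset format

    For Qwen-VL models, a multimodal conversation dataset can be laid out as follows (the tag names must match whatever your process_func parses):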

    [
      {
        "id": "unique_id_1",
        "conversations": [
          {
            "from": "user",
            "value": "<image>https://example.com/image1.jpg</image> text describing the image"
          },
          {
            "from": "assistant",
            "value": "the reply about the image"
          }
        ]
      }
    ]
    

    3. Suggested fix

    Change your dataset to the following format:

    [
      {
        "id": "2500010_1",
        "conversations": [
          {
            "from": "user",
            "value": "<image>https://deti.cn-sh2.ufileos.com/PC%2F9db094a2-2e8c-4fa1-b931-50f9e25c9c98.png?UCloudPublicKey=TOKEN_696b1902-38dd-411f-b9f6-18c454ec200c&Signature=x4SPImsAbXI3EctpIQwYKu2ZfnM%3D&Expires=1746687410</image> FabricDesign Yes"
          },
          {
            "from": "assistant",
            "value": "女生-上衣,型号:2500010,颜色:深灰,质量等级:合格品,服务类型:打版+采购+生产,生产类型:包工包料"
          }
        ]
      }
    ]
    

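    4. Key changes

    1. Replace <|vision_start|> and <|vision_end|> with <image> and </image>, and update the parsing in process_func to match
    2. Make sure the URL is a valid image link that can be accessed directly
    3. Put the descriptive text after the image tag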
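
    The tag replacement in point 1 can be scripted instead of done by hand. The following is a minimal sketch, assuming every user turn has the shape shown in the question (prompt text, a colon, then the URL between the two markers). The names `convert_value` and `convert_dataset` are hypothetical, and the converted file only helps if the training script is changed to parse `<image>` tags.

```python
import json
import re

# Matches the marker pair used in the question's dataset and captures the URL.
VISION_SPAN = re.compile(r"<\|vision_start\|>(.*?)<\|vision_end\|>", re.DOTALL)


def convert_value(value: str) -> str:
    """Rewrite 'prompt: <|vision_start|>URL<|vision_end|>' as '<image>URL</image> prompt'."""
    match = VISION_SPAN.search(value)
    if match is None:
        return value  # no image in this turn; leave it unchanged
    url = match.group(1).strip()
    # Text around the markers becomes the prompt; drop the trailing colon.
    prompt = (value[: match.start()] + value[match.end():]).strip().rstrip(":").strip()
    return f"<image>{url}</image> {prompt}".strip()


def convert_dataset(records: list) -> list:
    """Return a copy of the records with every user turn converted."""
    converted = []
    for record in records:
        turns = []
        for turn in record["conversations"]:
            new_turn = dict(turn)
            if turn["from"] == "user":
                new_turn["value"] = convert_value(turn["value"])
            turns.append(new_turn)
        converted.append({**record, "conversations": turns})
    return converted


# Placeholder record; example.com stands in for a real image host.
sample = [{
    "id": "demo_1",
    "conversations": [
        {"from": "user", "value": "FabricDesign Yes: <|vision_start|>https://example.com/a.png<|vision_end|>"},
        {"from": "assistant", "value": "reply text"},
    ],
}]
print(json.dumps(convert_dataset(sample), ensure_ascii=False, indent=2))
```

    The URL itself is copied through untouched, which matters for signed links where any change to the path or query string may invalidate the signature.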

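    5. Verify that the URL works

    First test separately whether the URL actually returns an image:

    import requests
    from PIL import Image
    from io import BytesIO
    
    # Use the URL exactly as it appears in the dataset; altering it can invalidate the signature
    url = "https://deti.cn-sh2.ufileos.com/PC%2F9db094a2-2e8c-4fa1-b931-50f9e25c9c98.png?UCloudPublicKey=TOKEN_696b1902-38dd-411f-b9f6-18c454ec200c&Signature=x4SPImsAbXI3EctpIQwYKu2ZfnM%3D&Expires=1746687410"
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        # Raises UnidentifiedImageError if the body is not an image
        img = Image.open(BytesIO(response.content))
        img.show()  # if the image displays, the URL is valid
    else:
        print(f"URL request failed, status code: {response.status_code}")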
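
    Because the run in the question got through some examples before failing, it may be more useful to scan every URL in the dataset than to test one by hand. The sketch below uses only the standard library and a rough signature check on the first bytes, which is an assumption on my part and not what PIL does internally. `looks_like_image`, `fetch_bytes`, and `find_bad_urls` are hypothetical names.

```python
import urllib.request
from typing import Callable, Iterable, List, Tuple

# Leading bytes of a few common image formats (not exhaustive).
IMAGE_SIGNATURES = (b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"GIF87a", b"GIF89a", b"BM")


def looks_like_image(data: bytes) -> bool:
    """Cheap check: does the payload start with a known image signature?"""
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return True
    return data.startswith(IMAGE_SIGNATURES)


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download a URL with the standard library; raises on HTTP errors such as 403."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def find_bad_urls(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> List[Tuple[int, str, str]]:
    """Return (index, url, reason) for every URL that fails or is not an image."""
    bad = []
    for index, url in enumerate(urls):
        try:
            data = fetch(url)
        except Exception as exc:  # network error, HTTP error, timeout, ...
            bad.append((index, url, f"request failed: {exc}"))
            continue
        if not looks_like_image(data):
            # Show the first bytes: an XML or JSON error body is easy to spot here.
            bad.append((index, url, f"not an image, starts with {data[:40]!r}"))
    return bad
```

    To use it, collect the URLs from the dataset with the same extraction logic as the training script and call `find_bad_urls(urls)`. Records reported here can be fixed or dropped before `train_ds.map(process_func)` runs.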
    

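    6. Example processing function

    If you need a custom processing function, you can start from this template:

    def process_func(example):
        """Template: split each user turn into its image URL and its text."""
        open_tag, close_tag = "<image>", "</image>"
        new_messages = []
        img_url = ""  # stays empty when the example has no image
        for msg in example["conversations"]:
            value = msg["value"]
            start, end = value.find(open_tag), value.find(close_tag)
            if msg["from"] == "user" and 0 <= start < end:
                # Extract the image URL between the tags
                img_url = value[start + len(open_tag):end].strip()
                # Add your image-handling logic here (e.g. download img_url)
                # Drop the whole tagged span so the URL does not stay in the text
                new_value = (value[:start] + value[end + len(close_tag):]).strip()
                new_messages.append({"from": msg["from"], "value": new_value})
            else:
                new_messages.append(msg)
        return {"conversations": new_messages, "image_url": img_url}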
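
    The traceback in the question shows train.py passing a `messages` list to `process_vision_info` from qwen_vl_utils, which picks up image entries from a list-valued `content` field. A minimal sketch of turning one tagged user turn into that structure follows, assuming the `<image>` tag format from section 2. `build_messages` is a hypothetical helper, not part of any library.

```python
def build_messages(value: str) -> list:
    """Turn '<image>URL</image> prompt' into a chat message with separate image and text parts."""
    open_tag, close_tag = "<image>", "</image>"
    start = value.find(open_tag)
    end = value.find(close_tag)
    if start == -1 or end == -1 or end < start:
        # Text-only turn: no image part is emitted.
        return [{"role": "user", "content": [{"type": "text", "text": value.strip()}]}]
    url = value[start + len(open_tag):end].strip()
    prompt = (value[:start] + value[end + len(close_tag):]).strip()
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": url},   # URL or local path of the image
            {"type": "text", "text": prompt},  # the instruction text
        ],
    }]


# Placeholder URL; example.com stands in for a real image host.
messages = build_messages("<image>https://example.com/a.png</image> FabricDesign Yes")
print(messages)
# In the training script this list would then be passed on, e.g.:
#   image_inputs, video_inputs = process_vision_info(messages)
```

    Keeping the URL in its own `image` entry means the text part never contains the link, so the tokenizer only sees the prompt.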
    

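    If the problem persists, also check:

    1. Whether the URL is access-restricted (for example, the signature has expired)
    2. Whether the server rate-limits requests
    3. Whether the image format is supported (common formats such as jpg/png are recommended)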
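
    The first item in the checklist above can be tested without any network access. The sketch below assumes the `Expires` query parameter in the signed URLs is a Unix timestamp in seconds, which the values in the question resemble but which should be confirmed against the storage provider's documentation. `expires_at` and `is_expired` are hypothetical names.

```python
import time
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def expires_at(url: str) -> Optional[int]:
    """Return the value of the URL's Expires query parameter as an int, if present."""
    values = parse_qs(urlsplit(url).query).get("Expires")
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None


def is_expired(url: str, now: Optional[float] = None) -> bool:
    """True if the URL carries an Expires timestamp that is already in the past."""
    deadline = expires_at(url)
    if deadline is None:
        return False  # nothing to check; treat as not expired
    current = time.time() if now is None else now
    return current >= deadline


# Placeholder URL with the same parameter layout as the ones in the question.
demo_url = "https://example.com/a.png?Signature=abc%3D&Expires=1746687410"
print(expires_at(demo_url), is_expired(demo_url))
```

    Signed links like these stop working once they expire, so a dataset built from them has a limited lifetime. Downloading the images once and pointing the dataset at local paths avoids that.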