Gavin_Stargazer 2021-09-30 14:41 Acceptance rate: 62.5%

[python] [OCR] Turning the result returned by Baidu document recognition into a DataFrame table

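1. Background
A PDF document contains a table (see Appendix 1). I already have the result returned by Baidu document recognition (see Appendix 2, which excerpts the three rows marked by the red box).

2. Goal
Convert the returned result into a pandas DataFrame.

3. Request
How can this be done in Python? Thanks.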

Appendix 1: Original table

[image: original table]

Appendix 2: Raw recognition output (JSON)

a = [
    {
        "words_location": {"top": 382, "left": 124, "width": 52, "height": 12},
        "word": "豪华客房",
    },
    {
        "words_location": {"top": 383, "left": 280, "width": 59, "height": 11},
        "word": "14501530",
    },
    {
        "words_location": {"top": 383, "left": 425, "width": 23, "height": 10},
        "word": "450",
    },
    {
        "words_location": {"top": 383, "left": 553, "width": 28, "height": 11},
        "word": "510",
    },
    {
        "words_location": {"top": 383, "left": 689, "width": 25, "height": 10},
        "word": "NA ",
    },
    {
        "words_location": {"top": 412, "left": 113, "width": 76, "height": 13},
        "word": "高级豪华客房",
    },
    {
        "words_location": {"top": 414, "left": 277, "width": 61, "height": 11},
        "word": "5001580",
    },
    {
        "words_location": {"top": 413, "left": 424, "width": 23, "height": 11},
        "word": "500",
    },
    {
        "words_location": {"top": 413, "left": 554, "width": 26, "height": 11},
        "word": "560",
    },
    {
        "words_location": {"top": 413, "left": 690, "width": 22, "height": 10},
        "word": "NA ",
    },
    {
        "words_location": {"top": 442, "left": 111, "width": 76, "height": 12},
        "word": "行攻豪华客房",
    },
    {
        "words_location": {"top": 444, "left": 278, "width": 60, "height": 12},
        "word": "1700/1700",
    },
    {
        "words_location": {"top": 444, "left": 424, "width": 25, "height": 10},
        "word": "600",
    },
    {
        "words_location": {"top": 444, "left": 554, "width": 27, "height": 10},
        "word": "600",
    },
    {
        "words_location": {"top": 444, "left": 689, "width": 22, "height": 10},
        "word": "NA ",
    },
]
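
For inspection, a list shaped like the one above can be flattened so that the coordinates become ordinary columns. This is only a sketch: `sample` is a two-item excerpt of the data, and `flat` is a name chosen here for illustration. `pd.json_normalize` expands the nested `words_location` dict into dotted column names such as `words_location.top`.

```python
import pandas as pd

# Two items in the same shape as the list `a` above (excerpt only).
sample = [
    {"words_location": {"top": 382, "left": 124, "width": 52, "height": 12}, "word": "豪华客房"},
    {"words_location": {"top": 383, "left": 280, "width": 59, "height": 11}, "word": "14501530"},
]

# Flatten the nested "words_location" dict into one column per coordinate.
flat = pd.json_normalize(sample)
print(flat)
```

Items on the same table row have nearly equal `top` values, and `left` gives the column order within a row.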

3 answers

  • 此人真菜 2021-09-30 15:55
    
    import pprint
    import pandas as pd
    # Empty frame with four value columns; the room type becomes the row index.
    df = pd.DataFrame(columns=["B", "C", "D", "E"])
    dic = [
        {
            "words_location": {"top": 382, "left": 124, "width": 52, "height": 12},
            "word": "豪华客房",
        },
        {
            "words_location": {"top": 383, "left": 280, "width": 59, "height": 11},
            "word": "14501530",
        },
        {
            "words_location": {"top": 383, "left": 425, "width": 23, "height": 10},
            "word": "450",
        },
        {
            "words_location": {"top": 383, "left": 553, "width": 28, "height": 11},
            "word": "510",
        },
        {
            "words_location": {"top": 383, "left": 689, "width": 25, "height": 10},
            "word": "NA ",
        },
        {
            "words_location": {"top": 412, "left": 113, "width": 76, "height": 13},
            "word": "高级豪华客房",
        },
        {
            "words_location": {"top": 414, "left": 277, "width": 61, "height": 11},
            "word": "5001580",
        },
        {
            "words_location": {"top": 413, "left": 424, "width": 23, "height": 11},
            "word": "500",
        },
        {
            "words_location": {"top": 413, "left": 554, "width": 26, "height": 11},
            "word": "560",
        },
        {
            "words_location": {"top": 413, "left": 690, "width": 22, "height": 10},
            "word": "NA ",
        },
        {
            "words_location": {"top": 442, "left": 111, "width": 76, "height": 12},
            "word": "行攻豪华客房",
        },
        {
            "words_location": {"top": 444, "left": 278, "width": 60, "height": 12},
            "word": "1700/1700",
        },
        {
            "words_location": {"top": 444, "left": 424, "width": 25, "height": 10},
            "word": "600",
        },
        {
            "words_location": {"top": 444, "left": 554, "width": 27, "height": 10},
            "word": "600",
        },
        {
            "words_location": {"top": 444, "left": 689, "width": 22, "height": 10},
            "word": "NA ",
        },
    
    ]
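    # The OCR output is a flat list in reading order: every 5 consecutive items
    # form one table row (row label first, then the four cell values).
    for i in range(len(dic) // 5):
        # strip() drops the trailing space that OCR left in values such as "NA ".
        row = [item["word"].strip() for item in dic[5 * i : 5 * i + 5]]
        df.loc[row[0]] = row[1:]
    print(df)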
    
    
    Accepted by the asker as the best answer.
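
The accepted answer relies on the items arriving in reading order with exactly five per row. A minimal sketch of an alternative, assuming only that items on one table row have nearly equal `top` values, is to group by `top` and order each group by `left`. The function name `ocr_items_to_frame` and the `row_tolerance` parameter are hypothetical names introduced here, not part of the Baidu API, and `items` is an abbreviated copy of the question's data.

```python
import pandas as pd

# Abbreviated sample in the same shape as the OCR output (two table rows).
items = [
    {"words_location": {"top": 382, "left": 124, "width": 52, "height": 12}, "word": "豪华客房"},
    {"words_location": {"top": 383, "left": 280, "width": 59, "height": 11}, "word": "14501530"},
    {"words_location": {"top": 383, "left": 425, "width": 23, "height": 10}, "word": "450"},
    {"words_location": {"top": 383, "left": 553, "width": 28, "height": 11}, "word": "510"},
    {"words_location": {"top": 383, "left": 689, "width": 25, "height": 10}, "word": "NA "},
    {"words_location": {"top": 412, "left": 113, "width": 76, "height": 13}, "word": "高级豪华客房"},
    {"words_location": {"top": 414, "left": 277, "width": 61, "height": 11}, "word": "5001580"},
    {"words_location": {"top": 413, "left": 424, "width": 23, "height": 11}, "word": "500"},
    {"words_location": {"top": 413, "left": 554, "width": 26, "height": 11}, "word": "560"},
    {"words_location": {"top": 413, "left": 690, "width": 22, "height": 10}, "word": "NA "},
]


def ocr_items_to_frame(items, row_tolerance=8):
    """Group OCR items into rows by `top`, then order each row by `left`.

    `row_tolerance` (in pixels) is a hypothetical tuning knob: an item starts
    a new row when its `top` exceeds the current row's first `top` by more
    than this amount.
    """
    rows = []       # each entry is the list of items on one table row
    row_top = None  # `top` of the first item in the current row
    for item in sorted(items, key=lambda it: it["words_location"]["top"]):
        top = item["words_location"]["top"]
        if row_top is None or top - row_top > row_tolerance:
            rows.append([])
            row_top = top
        rows[-1].append(item)

    # Within each row, order cells left to right and clean up the text.
    table = [
        [it["word"].strip() for it in sorted(row, key=lambda it: it["words_location"]["left"])]
        for row in rows
    ]
    # The first cell of each row is used as the row label.
    df = pd.DataFrame(table)
    return df.set_index(0).rename_axis(None)


df = ocr_items_to_frame(items)
print(df)
```

Because the grouping uses coordinates rather than list position, the result does not depend on the order of `items`. A suitable tolerance depends on the row spacing of the scanned page (about 30 pixels in this sample), so the default of 8 is only a guess.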
