doujupa7567 2013-05-07 16:53
Accepted

Getting data from a PDF into PHP / HTML / JavaScript

I want to ask one thing about PDFs.

I want to extract some data from a PDF, but only specific data. Is it possible to choose what gets extracted from a PDF?

For example, this image shows which data I want to pull out of the PDF: http://shrani.si/f/1k/AA/Ph2cBYG/informativna-ponudba-gre.png

Thanks.

1 answer

  • douxian1939 2013-05-07 18:43

    This question touches on two major processes: OCR and Data Capture (or parsing).

    OCR stands for Optical Character Recognition. This process converts images to text. You will need this category of software if your PDFs are image-only (no text layer, as with scanned, faxed, or rasterized documents). If your PDF already contains electronic text, you may be able to skip this step.
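One way to decide whether the OCR step can be skipped is to run an ordinary PDF text extractor first and check whether it returned any real text. The sketch below assumes the extracted text is already available as a string; `needsOcr` and its `minChars` threshold are hypothetical names and values chosen for illustration, not part of any library.

```javascript
// Hypothetical helper: decide whether a page needs OCR, given whatever text a
// PDF text extractor returned for it (possibly empty or only whitespace).
// minChars is an assumed threshold, not a standard value; tune it per document set.
function needsOcr(extractedText, minChars = 20) {
  // Count only letters and digits, so stray whitespace or control characters
  // left behind by an empty text layer do not count as content.
  const meaningful = (extractedText || "").match(/[\p{L}\p{N}]/gu) || [];
  return meaningful.length < minChars;
}

// Made-up sample strings, for illustration only.
console.log(needsOcr(""));                                              // true
console.log(needsOcr("Informativna ponudba 2013, skupaj 1250,00 EUR")); // false
```

A page that comes back `true` would go through OCR; a page that comes back `false` can usually go straight to the data capture step.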

    Data Capture stands for intelligent data location and extraction, such as finding specific fields among all the other text. There are specialized software packages and parsing processes for that (see my previous post here).
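For simple, consistently labeled documents, the parsing side of data capture can be as small as a set of regular expressions run over the extracted text. The following is a minimal sketch under that assumption; the field names, labels, and sample text are all made up for illustration and would have to be replaced with the labels that actually appear in your PDFs.

```javascript
// Hypothetical field patterns: every label and format here is an assumption.
const FIELD_PATTERNS = {
  offerNumber: /Offer\s*No\.?\s*:?\s*(\S+)/i,
  date: /Date\s*:?\s*(\d{1,2}\.\d{1,2}\.\d{4})/i,
  total: /Total\s*:?\s*([\d.,]+)/i,
};

// Return one value per field, taken from the first capture group of its pattern.
function captureFields(text, patterns = FIELD_PATTERNS) {
  const result = {};
  for (const [name, pattern] of Object.entries(patterns)) {
    const match = text.match(pattern);
    // Keep null for fields that were not found, so callers can flag them.
    result[name] = match ? match[1] : null;
  }
  return result;
}

// Made-up text standing in for what a text extractor or OCR engine returned.
const sampleText = [
  "Offer No.: 2013-0457",
  "Date: 07.05.2013",
  "Some other text that should be ignored",
  "Total: 1.250,00 EUR",
].join("\n");

console.log(captureFields(sampleText));
```

Regular expressions like these break down quickly when layouts vary between documents, which is where the specialized data capture packages come in.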

    If all your documents have the text in the same area, you can crop the images and pass only those smaller zones to OCR, which in turn simplifies your text extraction logic (because there is less text to deal with).
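The cropping idea can be sketched without any imaging library if the page is already available as raw pixels. The example below assumes a row-major grayscale image with one byte per pixel, stored in a `Uint8Array`, and a zone given as `{x, y, width, height}` in pixels; both the layout and the `cropZone` name are assumptions for illustration. In practice an image library would normally do this step.

```javascript
// Crop a rectangular zone out of a row-major, one-byte-per-pixel grayscale image.
// pixels must be a Uint8Array of length imageWidth * imageHeight (assumed layout).
function cropZone(pixels, imageWidth, imageHeight, zone) {
  const { x, y, width, height } = zone;
  if (x < 0 || y < 0 || width <= 0 || height <= 0 ||
      x + width > imageWidth || y + height > imageHeight) {
    throw new RangeError("zone lies outside the image");
  }
  const out = new Uint8Array(width * height);
  for (let row = 0; row < height; row++) {
    // Copy one row of the zone from the source image into the output buffer.
    const start = (y + row) * imageWidth + x;
    out.set(pixels.subarray(start, start + width), row * width);
  }
  return out;
}

// 4x3 test image whose pixel values equal their index.
const page = Uint8Array.from({ length: 12 }, (_, i) => i);
const zonePixels = cropZone(page, 4, 3, { x: 1, y: 1, width: 2, height: 2 });
console.log(Array.from(zonePixels)); // [5, 6, 9, 10]
```

Only the cropped buffer would then be handed to the OCR engine, so the recognized text contains just the field of interest.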

    ilya

    This answer was selected as the best answer by the asker.
