douying6206 2015-02-16 15:42

How to get precise data from a PDF file through PHP

Is it possible to extract only some of the data from a PDF document? In detail, I want to export data from an advertising flyer (PDF format) and get all the products and their prices automatically through PHP.

Would this be possible?

I have already tried reading the PDF files directly through PHP, but the data returned is completely messed up.

I have also tried converting the PDF to HTML, but the generated HTML contains a lot more than just the product names and prices. The styling and size of the text are not consistent at all, so it is very difficult to tell whether a piece of text is descriptive text or a product name. Here is an example of some HTML generated from a PDF file:

// Product name 1
<div style="position:absolute;top:250;left:39"><span class="ft2">MGP CD’en 2015 </span></div>
// Product name 1 end

// Price 1
<div style="position:absolute;top:260;left:39"><span class="ft5">139,-</span></div>
// Price 1 end

<div style="position:absolute;top:71;left:124"><span class="ft8">NYHED</span></div>

// Product name 2
<div style="position:absolute;top:375;left:614"><span class="ft9"> vores </span></div>
<div style="position:absolute;top:397;left:614"><span class="ft9">kyllingeinderfilet </span></div>
// Product name 2 end

<div style="position:absolute;top:422;left:614"><span class="ft3">650 g.</span></div>
<div style="position:absolute;top:437;left:614"><span class="ft7">Pr. kg 69,23</span></div>
<div style="position:absolute;top:447;left:614"><span class="ft10">Frit valg</span></div>

// Price 2
<div style="position:absolute;top:464;left:614"><span class="ft11">4</span></div>
<div style="position:absolute;top:464;left:679"><span class="ft11">5</span></div>
<div style="position:absolute;top:464;left:743"><span class="ft11">,-</span></div>
// Price 2 end

<div style="position:absolute;top:250;left:274"><span class="ft12">ÅBENT ALLE DAGE 8.21</span></div>

The PDF can be seen online at the following address (the HTML above comes from the right-hand page): http://www.foetex.dk/ugenstilbud/Pages/Aktuel-tilbudsavis.aspx
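
To make the difficulty concrete, here is a minimal PHP sketch of how the positioned fragments in the HTML sample could be read back and merged. It is only a sketch under assumptions taken from that one sample: every fragment looks like `<div style="position:absolute;top:…;left:…"><span class="ftN">…</span>`, pieces that share a font class and a `top` value belong to one token (the price split into `4`, `5`, `,-`), pieces that share a font class and a `left` value are wrapped lines of one text, and prices look like `139,-` or `69,23`. The function names `extractFragments`, `mergeFragments` and `isPrice` are made up for illustration. A regular expression is used only because the converter output is so regular; other converters will need different rules.

```php
<?php
// Sketch only: the markup shape and the merge rules are assumptions
// derived from the HTML sample, not a general PDF-to-HTML parser.

/** Read each positioned <div><span> fragment into an array. */
function extractFragments(string $html): array
{
    $pattern = '~<div style="position:absolute;top:(\d+);left:(\d+)">'
             . '<span class="(ft\d+)">(.*?)</span>~su';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $fragments = [];
    foreach ($matches as $m) {
        $fragments[] = [
            'top'  => (int) $m[1],
            'left' => (int) $m[2],
            'font' => $m[3],
            'text' => trim(html_entity_decode($m[4], ENT_QUOTES | ENT_HTML5, 'UTF-8')),
        ];
    }
    return $fragments;
}

/** Merge consecutive fragments that appear to belong together. */
function mergeFragments(array $fragments): array
{
    $runs = [];
    foreach ($fragments as $f) {
        $last = count($runs) - 1;
        if ($last >= 0 && $runs[$last]['font'] === $f['font']) {
            if ($runs[$last]['top'] === $f['top']) {
                // Same font and same line: pieces of one token, e.g. "4" "5" ",-".
                $runs[$last]['text'] .= $f['text'];
                continue;
            }
            if ($runs[$last]['left'] === $f['left']) {
                // Same font and same column: a text wrapped onto the next line.
                $runs[$last]['text'] .= ' ' . $f['text'];
                $runs[$last]['top'] = $f['top'];
                continue;
            }
        }
        $runs[] = $f;
    }
    return $runs;
}

/** Assumed price formats: "139,-" or "69,23". */
function isPrice(string $text): bool
{
    return preg_match('/^\d+(?:,\d{2}|,-)$/', $text) === 1;
}

// Usage with fragments copied from the sample.
$sample = <<<'HTML'
<div style="position:absolute;top:375;left:614"><span class="ft9"> vores </span></div>
<div style="position:absolute;top:397;left:614"><span class="ft9">kyllingeinderfilet </span></div>
<div style="position:absolute;top:464;left:614"><span class="ft11">4</span></div>
<div style="position:absolute;top:464;left:679"><span class="ft11">5</span></div>
<div style="position:absolute;top:464;left:743"><span class="ft11">,-</span></div>
HTML;

foreach (mergeFragments(extractFragments($sample)) as $run) {
    echo (isPrice($run['text']) ? 'PRICE ' : 'TEXT  ') . $run['text'] . PHP_EOL;
}
// Prints:
// TEXT  vores kyllingeinderfilet
// PRICE 45,-
```

This only reassembles text runs and flags the ones that look like prices. It does not decide which run is a product name and which is descriptive text such as "650 g." or "Frit valg", and that is the part I am stuck on.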

I hope someone can suggest a good way around this problem. Also, does anyone know a good PDF-to-HTML converter? The HTML sample above was generated with a free online tool.

Any help will be greatly appreciated.
