Is it possible to extract only some of the data from a PDF document? Specifically, I want to take an advertising flyer (in PDF format) and automatically extract all the products and their prices using PHP.
Would this be possible?
I have already tried reading the PDF files directly through PHP, but the data returned is completely garbled.
I have also tried converting the PDF to HTML, but the generated HTML contains a lot more than just the product names and prices. The styling and size of the text are also not consistent at all, so it is very difficult to tell whether a piece of text is descriptive copy or a product name. Here is an example of some HTML generated from a PDF file:
// Product name 1
<div style="position:absolute;top:250;left:39"><span class="ft2">MGP CD’en 2015 </span></div>
// Product name 1 end
// Price 1
<div style="position:absolute;top:260;left:39"><span class="ft5">139,-</span></div>
// Price 1 end
<div style="position:absolute;top:71;left:124"><span class="ft8">NYHED</span></div>
// Product name 2
<div style="position:absolute;top:375;left:614"><span class="ft9"> vores </span></div>
<div style="position:absolute;top:397;left:614"><span class="ft9">kyllingeinderfilet </span></div>
// Product name 2 end
<div style="position:absolute;top:422;left:614"><span class="ft3">650 g.</span></div>
<div style="position:absolute;top:437;left:614"><span class="ft7">Pr. kg 69,23</span></div>
<div style="position:absolute;top:447;left:614"><span class="ft10">Frit valg</span></div>
// Price 2
<div style="position:absolute;top:464;left:614"><span class="ft11">4</span></div>
<div style="position:absolute;top:464;left:679"><span class="ft11">5</span></div>
<div style="position:absolute;top:464;left:743"><span class="ft11">,-</span></div>
// Price 2 end
<div style="position:absolute;top:250;left:274"><span class="ft12">ÅBENT ALLE DAGE 8.21</span></div>
The PDF can be viewed online (the code above comes from the right-hand page): http://www.foetex.dk/ugenstilbud/Pages/Aktuel-tilbudsavis.aspx
I hope someone can suggest a good way around this problem. Also, does anyone know a good PDF-to-HTML converter? The HTML above was generated with a free online tool.
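For illustration, here is a minimal sketch of one possible way to work with converter output like the snippet above. It assumes PHP 7.4+ and that every text fragment has exactly the `<div style="position:absolute;top:…;left:…"><span class="ftN">` shape shown. The function names (`parseFragments`, `mergeRows`, `groupOffers`) are hypothetical, and the rules it uses (same `top` and font class means one line, same `left` means one column) are guesses based on this one sample, not something the converter guarantees. It re-joins prices that were split into separate fragments and collects the text above each price in the same column as name candidates. It does not decide which candidate is the actual product name.

```php
<?php
// Hypothetical helper: pull top/left/font class/text out of each positioned fragment.
function parseFragments(string $html): array
{
    $pattern = '/<div style="position:absolute;top:(\d+);left:(\d+)">'
             . '<span class="(ft\d+)">(.*?)<\/span>/u';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $fragments = [];
    foreach ($matches as $m) {
        $fragments[] = [
            'top'   => (int) $m[1],
            'left'  => (int) $m[2],
            'class' => $m[3],
            'text'  => trim(html_entity_decode($m[4], ENT_QUOTES, 'UTF-8')),
        ];
    }
    return $fragments;
}

// Assumption: fragments with the same top and the same font class belong to
// one visual line, so "4", "5" and ",-" at top:464 become "45,-".
function mergeRows(array $fragments): array
{
    // Sort top-to-bottom, then left-to-right, so text is joined in reading order.
    usort($fragments, fn($a, $b) => [$a['top'], $a['left']] <=> [$b['top'], $b['left']]);

    $rows = [];
    foreach ($fragments as $f) {
        $key = $f['top'] . '|' . $f['class'];
        if (isset($rows[$key])) {
            $rows[$key]['text'] .= $f['text'];
        } else {
            $rows[$key] = $f; // keeps the position of the leftmost fragment
        }
    }
    return array_values($rows);
}

// Assumption: a price looks like "139,-" or "69,95", and the text belonging to
// it sits above it in the same column (same left value).
function groupOffers(array $rows): array
{
    // Walk each column from top to bottom.
    usort($rows, fn($a, $b) => [$a['left'], $a['top']] <=> [$b['left'], $b['top']]);

    $offers = [];
    $buffer = [];
    $column = null;
    foreach ($rows as $row) {
        if ($row['left'] !== $column) { // new column: start a fresh buffer
            $column = $row['left'];
            $buffer = [];
        }
        if (preg_match('/^\d+,(?:-|\d{2})$/', $row['text'])) {
            $offers[] = ['price' => $row['text'], 'candidates' => $buffer];
            $buffer = [];
        } else {
            $buffer[] = $row['text'];
        }
    }
    return $offers;
}
```

Calling `groupOffers(mergeRows(parseFragments($html)))` on the snippet above should give two offers: `139,-` with the single candidate `MGP CD’en 2015`, and `45,-` with five candidates (`vores`, `kyllingeinderfilet`, `650 g.`, `Pr. kg 69,23`, `Frit valg`). Picking the real product name out of those candidates is still the open part of the problem.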
Any help will be greatly appreciated.