Example 1:
I have a PDF document and used the online PDF Parser (www.pdfparser.org) to extract all of its content as text. I saved that content to a TXT file manually and tried to filter some data with regular expressions. Everything worked as expected.
Example 2:
To automate the process, I downloaded the PDF Parser API and wrote a script that does the following:
1) Converts the PDF to text using the parseFile() API method.
2) Saves the content to a TXT file.
3) Tries to filter this TXT file using regular expressions.
Example 1 -> It worked and returned:
array (size=2)
  'mora_dia' =>
    array (size=1)
      0 => string 'R$ 3.44' (length=7)
  'multa' =>
    array (size=1)
      0 => string 'R$ 17,21' (length=8)
Example 2 -> It did not work.
array (size=2)
  'mora_dia' =>
    array (size=0)
      empty
  'multa' =>
    array (size=0)
      empty
- The data in the two TXT files look identical, so why does the second example not work? (I also tried it without saving to a TXT file, but that did not work either.)
Below is the code for my two examples:
Example 1:
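// Read the text that was saved manually from the online parser.
$data = file_get_contents('exemplo_01.txt');

// One pattern per field: 'mora_dia' uses a dot, 'multa' uses a comma.
$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa' => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

$matches = [];
foreach ($regex as $title => $ex) {
    // Store the first match for each field under its title.
    preg_match($ex, $data, $matches[$title]);
}

var_dump($matches);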
Example 2:
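// $PDFFile holds the path to the PDF file (defined elsewhere).
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($PDFFile);
$pages = $pdf->getPages();

// Append the text of every page; a plain assignment would keep only the last page.
$PDFParse = '';
foreach ($pages as $page) {
    $PDFParse .= $page->getText();
}

// Save the extracted text to a uniquely named TXT file.
$txtName = __DIR__ . '/files/Txt/' . md5(uniqid(rand(), true)) . '.txt';
$file = fopen($txtName, 'w+');
fwrite($file, $PDFParse);
fclose($file);

// Read the file back and apply the same patterns as in Example 1.
$dataTxt = file_get_contents($txtName);
$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa' => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

$matches = [];
foreach ($regex as $title => $ex) {
    preg_match($ex, $dataTxt, $matches[$title]);
}

var_dump($matches);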
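
Since the two TXT files look the same on screen, one way to check whether they really are identical is to compare their bytes. The sketch below is only an illustration and rests on an assumption the post does not confirm: that the extracted text contains a non-breaking space (U+00A0) after "R$" instead of a plain space. The function normalizeSpaces() and the sample string are hypothetical and are not part of the PDF Parser API.

```php
<?php
// Hypothetical helper: replace every Unicode space separator
// (for example U+00A0, the non-breaking space) with a plain ASCII space.
function normalizeSpaces(string $text): string
{
    // \p{Zs} matches any Unicode space separator; the /u flag treats input as UTF-8.
    $normalized = preg_replace('/\p{Zs}/u', ' ', $text);

    // preg_replace() returns null on failure (e.g. invalid UTF-8); keep the original then.
    return $normalized ?? $text;
}

// Assumed sample: "R$" followed by a non-breaking space, which looks like a normal space.
$sample = "Mora: R\$\u{00A0}3.44";
$pattern = '/R\$ [0-9]{1,}\.[0-9]{1,}/i';

// The hex dump shows c2a0 where a plain space would show 20.
echo bin2hex($sample), PHP_EOL;

var_dump(preg_match($pattern, $sample));                  // int(0)
var_dump(preg_match($pattern, normalizeSpaces($sample))); // int(1)
```

Running bin2hex() on a short slice of both $data and $dataTxt around the "R$" value would show whether such a difference exists in the real files. If it does, passing $dataTxt through a normalization step like this before preg_match() may be enough.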