duanmu5039 2014-12-21 22:10

preg_match() + regex not working on a TXT file

Example 1:

I have a PDF document and used the online PDF Parser (www.pdfparser.org) to extract all of its content as text. I saved that content to a TXT file manually and filtered some data out of it with regular expressions; everything worked normally.


Example 2:

To automate the process, I downloaded the PDF Parser API and wrote a script that does the following:

1) Converts the PDF to text using the parseFile() API method.
2) Saves the content to a TXT file.
3) Tries to filter this TXT using regular expressions.


Example 1 -> It worked and returned:

array (size = 2)
   'mora_dia' =>
     array (size = 1)
       0 => string 'R$ 3.44' (length = 7)
   'multa' =>
     array (size = 1)
       0 => string 'R$ 17,21' (length = 8)

Example 2 -> It did not work.

array (size = 2)
   'mora_dia' =>
     array (size = 0)
       empty
   'multa' =>
     array (size = 0)
       empty

The data in the two TXT files look identical, so why does the second example not work? (I have also tried doing this without saving to a TXT file, but that did not work either.)

Below is the code for my two examples:

Example 1:

$data = file_get_contents('exemplo_01.txt');

$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

foreach($regex as $title => $ex)
{
    preg_match($ex, $data, $matches[$title]);
}

var_dump($matches);

Example 2:

$parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($PDFFile);
    $pages = $pdf->getPages();

    foreach ($pages as $page) {
        $PDFParse = $page->getText();
    }

    $txtName = __DIR__ . '/files/Txt/' . md5(uniqid(rand(), true)) . '.txt';
    $file  = fopen($txtName, 'w+');
    fwrite($file, $PDFParse);
    fclose($file);

    $dataTxt = file_get_contents($txtName);

    $regex = [
        'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
        'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
    ];

    foreach($regex as $title => $ex)
    {
        preg_match($ex, $dataTxt, $matches[$title]);
    }

2 answers

  • dtk31564 2014-12-21 23:07

    Copying and pasting the output text manually seems to have changed its contents. Based on the pastebin output, the direct-to-file version contains non-breaking space characters rather than regular spaces. The non-breaking space is code point U+00A0 (hex 0xA0, decimal 160), as opposed to a regular space, which is hex 0x20, ASCII 32.

    In fact, it looks as though all the space characters in the direct-to-file example are non-breaking 0xA0 spaces.

    To adapt your regular expression to accept either type of space, put the hex code into a [] character class along with the regular space character ' ', as in [ \xA0]. You will also need the /u flag so that the pattern and the subject are treated as UTF-8.

    $regex = [
        'mora_dia' => '/R\$[ \xA0][0-9]{1,}\.[0-9]{1,}/iu',
        'multa'    => '/R\$[ \xA0][0-9]{1,},[0-9]{1,}/iu'
    ];
    

    (Note: the comma does not require backslash-escaping.)

    This works correctly, using your raw pastebin as input:

    $str = file_get_contents('http://pastebin.com/raw.php?i=H7D5xJBH');
    preg_match('/R\$[ \xa0][0-9]{1,}\.[0-9]{1,}/ui', $str, $matches);
    var_dump($matches);
    
    // Prints:
    array(1) {
      [0] =>
      string(8) "R$ 3.44"
    }
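
    To see why the /u flag matters, here is a minimal sketch. It assumes the extracted text is UTF-8, so the non-breaking space is stored as the two bytes C2 A0; the sample string is made up for illustration and is not taken from the pastebin.

```php
<?php
// Assumption: the text is UTF-8, so U+00A0 is stored as the bytes C2 A0.
$sample = "Mora/dia: R$\xC2\xA03.44";

// Without /u the subject is matched byte by byte: the class [ \xA0] has to
// match the single byte after "$", which is 0xC2, so nothing matches.
$withoutU = preg_match('/R\$[ \xA0][0-9]+\.[0-9]+/i', $sample, $m1);

// With /u the pattern and subject are treated as UTF-8, so \xA0 means the
// code point U+00A0 and matches the two-byte sequence C2 A0.
$withU = preg_match('/R\$[ \xA0][0-9]+\.[0-9]+/iu', $sample, $m2);

var_dump($withoutU);       // int(0)
var_dump($withU);          // int(1)
var_dump(strlen($m2[0]));  // int(8): 7 visible characters, the NBSP takes 2 bytes
```

    This also explains the string(8) reported for a match that looks seven characters long.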
    

    A different solution might be to replace the non-breaking spaces with regular spaces in the entire text before applying your original regular expression:

    // Replace all non-breaking spaces with regular spaces in the
    // text string read from the file...
    // In UTF-8 the non-breaking space (U+00A0) is encoded as the two
    // bytes C2 A0, and both are needed to replace it successfully.
    $dataTxt = str_replace("\xC2\xA0", " ", $dataTxt);
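
    As an end-to-end sketch of this replacement approach: the helper name normalizeSpaces() and the sample text below are hypothetical, and the code assumes the extracted text is UTF-8 (non-breaking space stored as the bytes C2 A0). After normalizing, the original patterns with a plain space should match again.

```php
<?php
// Hypothetical helper (not part of PDF Parser): replace UTF-8 non-breaking
// spaces with regular spaces. Assumes $text is UTF-8 encoded.
function normalizeSpaces(string $text): string
{
    return str_replace("\xC2\xA0", ' ', $text);
}

// Made-up sample standing in for the text extracted from the PDF.
$dataTxt = "Mora/dia: R$\xC2\xA03.44 Multa: R$\xC2\xA017,21";

// The patterns from the question, with a plain space after "R$".
$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,},[0-9]{1,}/i',
];

$clean = normalizeSpaces($dataTxt);
$matches = [];
foreach ($regex as $title => $ex) {
    preg_match($ex, $clean, $matches[$title]);
}

var_dump($matches);
```

    Normalizing once up front keeps the patterns simple, at the cost of losing the distinction between the two kinds of space in the stored text.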
    

    Whenever you have input that you expect to be identical and that looks identical, be sure to inspect it with a tool that can display each character's hex code. In this case, I copied your samples from pastebin into files and inspected them with Vim, where I have set up hex and ASCII display for the character under the cursor.
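
    The same kind of inspection can be done from PHP itself. The sketch below is only an illustration: hexDump() is a hypothetical helper, and the two sample strings are made up (one with a regular space, one with a UTF-8 non-breaking space).

```php
<?php
// Hypothetical helper: list every byte of a string as offset, hex value and
// printable character, so look-alike characters can be told apart.
function hexDump(string $text): string
{
    $lines = [];
    foreach (str_split($text) as $offset => $byte) {
        $code = ord($byte);
        // Show printable ASCII as-is, everything else as a dot.
        $shown = ($code >= 0x20 && $code < 0x7F) ? $byte : '.';
        $lines[] = sprintf('%04d  0x%02X  %s', $offset, $code, $shown);
    }
    return implode("\n", $lines);
}

// bin2hex() gives a quick one-line comparison of the two variants.
echo bin2hex("R$ 3.44"), "\n";         // 522420332e3434
echo bin2hex("R$\xC2\xA03.44"), "\n";  // 5224c2a0332e3434

// The per-byte listing shows 0xC2 0xA0 where a 0x20 was expected.
echo hexDump("R$\xC2\xA03"), "\n";
```

    Running something like this on a short slice of $dataTxt around the expected match is usually enough to spot an unexpected byte sequence.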
