dongping2023 2018-05-07 15:24
Views: 144
Accepted

Extracting a JavaScript object from HTML using regular expressions and PHP

I am trying to extract a specific JavaScript object from a page containing the usual HTML markup.

I have tried to use a regex, but I can't seem to get it to match correctly when the HTML contains a line break.

An example can be seen here: https://regex101.com/r/b8zN8u/2

The HTML I am trying to extract looks like this:

<script>
  DATA.tracking.user = { 
      age: "19", 
      name: "John doe" 
  }
</script>

Using the following regex: DATA.tracking.user=(.*?)}

<?php
$re = '/DATA.tracking.user = (.*?)\}/m';
$str = '<script>
           DATA.tracking.user = { age: "19", name: "John doe" }
        </script>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

If I parse DATA.tracking.user = { age: "19", name: "John doe" } without any line breaks, it works fine, but if I try to parse:

DATA.tracking.user = { 
      age: "19", 
      name: "John doe" 
  }

It does not like dealing with the line breaks.

Any help would be greatly appreciated.

Thanks.

4 answers

  • douliang1369 2018-05-07 23:48

    The simple solution to your problem is to use the s pattern modifier, which makes the . (any character) also match newline characters; it does not by default.

    And you should:

    • escape your literal dots.
    • write the \{ outside of your capture group.
    • omit the m pattern modifier because you aren't using anchors.
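The points above can be put together in a minimal sketch. It assumes the object literal contains no nested braces and no } inside a string value, because the lazy (.*?) stops at the first closing brace. The variable names ($html, $pattern, $body) are illustrative only.

```php
<?php
// Sample input taken from the question: the object spans several lines.
$html = '<script>
  DATA.tracking.user = {
      age: "19",
      name: "John doe"
  }
</script>';

// - s modifier: lets . match newline characters as well
// - literal dots are escaped
// - \{ and \} sit outside the capture group
// - no m modifier, because the pattern uses no ^ or $ anchors
$pattern = '~DATA\.tracking\.user = \{(.*?)\}~s';

if (preg_match($pattern, $html, $match)) {
    // $match[1] holds everything between the braces.
    $body = trim($match[1]);
    echo $body, PHP_EOL;
}
```

The captured text still has to be split into keys and values, which is what the \G approach further down does during extraction.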

    ...BUT...

    If this were my task and I were going to process the data from the extracted string, I would probably start breaking up the components at extraction time with \G, which anchors a match at the position where the previous match ended.

    Code:

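    $htmls[] = <<<HTML
    DATA.tracking.user = { age: "19", name: "John doe", int: 55 } // This works
    HTML;

    $htmls[] = <<<HTML
    DATA.tracking.user = { 
        age: "20", 
        name: "Jane Doe",
        int: 49
    } // This does not work
    HTML;

    // First branch: \G(?!^), continues right after the previous match (never at the string start).
    // Second branch: starts a new chain at the object literal.
    $pattern = '~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"[^"]*")~';

    foreach ($htmls as $html) {
        // Each match set holds the key in [1] and the value in [2].
        var_export(preg_match_all($pattern, $html, $out, PREG_SET_ORDER) ? $out : []);
        echo "\n --- \n";
    }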
    

    Output:

    array (
      0 => 
      array (
        0 => 'DATA.tracking.user = { age: "19"',
        1 => 'age',
        2 => '"19"',
      ),
      1 => 
      array (
        0 => ', name: "John doe"',
        1 => 'name',
        2 => '"John doe"',
      ),
      2 => 
      array (
        0 => ', int: 55',
        1 => 'int',
        2 => '55',
      ),
    )
     --- 
    array (
      0 => 
      array (
        0 => 'DATA.tracking.user = { 
        age: "20"',
        1 => 'age',
        2 => '"20"',
      ),
      1 => 
      array (
        0 => ', 
        name: "Jane Doe"',
        1 => 'name',
        2 => '"Jane Doe"',
      ),
      2 => 
      array (
        0 => ',
        int: 49',
        1 => 'int',
        2 => '49',
      ),
    )
     --- 
    

    Now you can simply iterate the matches and work with [1] (the keys) and [2] (the values). This is a basic solution that can be further tailored to suit your project data. Admittedly, it doesn't account for values that contain an escaped double quote, though adding that would be no trouble. Accounting for more complex value types may be more of a challenge.
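As a rough sketch of both remarks, the snippet below iterates the matches into an associative array and widens the quoted-value branch so that a backslash-escaped character (such as \") does not end the value. The modified pattern, the sample data, and the $user array are assumptions for illustration, not part of the original answer. stripcslashes() is used as a simple way to drop the backslashes and does not cover every JavaScript escape (for example \uXXXX).

```php
<?php
// Nowdoc, so the backslashes in the sample reach the regex untouched.
$html = <<<'HTML'
DATA.tracking.user = {
    age: "20",
    name: "Jane \"JD\" Doe",
    int: 49
}
HTML;

// Same structure as the pattern above; only the quoted-value branch changed:
// "(?:[^"\\]|\\.)*" accepts either a non-quote, non-backslash character
// or a backslash followed by any character.
$pattern = '~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"(?:[^"\\\\]|\\\\.)*")~';

$user = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $value = $m[2];
        if ($value[0] === '"') {
            // Quoted value: strip the surrounding quotes, then unescape.
            $value = stripcslashes(substr($value, 1, -1));
        } else {
            // Unquoted value: the pattern only allows digits here.
            $value = (int) $value;
        }
        $user[$m[1]] = $value;
    }
}

var_export($user);
```

Keeping quoted digits such as "20" as strings mirrors the source data; cast them afterwards if the project needs integers.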

    This answer was selected by the asker as the best answer.