Extracting a JavaScript object from HTML using regular expressions and PHP

I am trying to extract a specific JavaScript object from a page containing the usual HTML markup.

I have tried to use regex, but I can't get it to match correctly when the object spans multiple lines.

An example can be seen here: https://regex101.com/r/b8zN8u/2

The HTML I am trying to extract from looks like this:

<script>
  DATA.tracking.user = { 
      age: "19", 
      name: "John doe" 
  }
</script>

Using the following regex: DATA.tracking.user=(.*?)}

<?php
$re = '/DATA.tracking.user = (.*?)\}/m';
$str = '<script>
           DATA.tracking.user = { age: "19", name: "John doe" }
        </script>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

If I parse DATA.tracking.user = { age: "19", name: "John doe" } without any line breaks, it works fine, but if I try to parse:

DATA.tracking.user = { 
      age: "19", 
      name: "John doe" 
  }

It does not like dealing with the line breaks.

Any help would be greatly appreciated.

Thanks.

douliang1369 2018-05-07 23:48
The simple solution to your problem is to use the s pattern modifier to command the . (any character) to also match newline characters -- which it does not by default.

And you should:

escape your literal dots.

write the \{ outside of your capture group.

omit the m pattern modifier because you aren't using anchors.
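
As a minimal sketch of those adjustments applied to the pattern from the question (the variable names $html, $pattern and $body are only illustrative), something like this should capture the object body across line breaks:

```php
<?php
// Sample input taken from the question; a nowdoc keeps the text literal.
$html = <<<'HTML'
<script>
  DATA.tracking.user = {
      age: "19",
      name: "John doe"
  }
</script>
HTML;

// Literal dots are escaped, the braces sit outside the capture group,
// and the "s" modifier lets "." match newline characters as well.
$pattern = '/DATA\.tracking\.user = \{(.*?)\}/s';

$body = null;
if (preg_match($pattern, $html, $match)) {
    // $match[1] holds everything between the braces, including line breaks.
    $body = trim($match[1]);
    echo $body, PHP_EOL;
}
```

Because the lazy (.*?) stops at the first }, this only suits flat objects with no nested braces and no } inside a string value.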

...BUT...

If this was my task and I was going to be processing the data from the extracted string, I would probably start breaking up the components at extraction-time with the power of \G.

Code:

$htmls[] = <<<HTML
DATA.tracking.user = { age: "19", name: "John doe", int: 55 } // This works
HTML;
$htmls[] = <<<HTML
DATA.tracking.user = {
      age: "20",
      name: "Jane Doe",
      int: 49
  } // This does not works
HTML;

foreach ($htmls as $html) {
    var_export(preg_match_all('~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"[^"]*")~', $html, $out, PREG_SET_ORDER) ? $out : []);
    echo "\n---\n";
}

Output:

array (
  0 =>
  array (
    0 => 'DATA.tracking.user = { age: "19"',
    1 => 'age',
    2 => '"19"',
  ),
  1 =>
  array (
    0 => ', name: "John doe"',
    1 => 'name',
    2 => '"John doe"',
  ),
  2 =>
  array (
    0 => ', int: 55',
    1 => 'int',
    2 => '55',
  ),
)
---
array (
  0 =>
  array (
    0 => 'DATA.tracking.user = {
      age: "20"',
    1 => 'age',
    2 => '"20"',
  ),
  1 =>
  array (
    0 => ',
      name: "Jane Doe"',
    1 => 'name',
    2 => '"Jane Doe"',
  ),
  2 =>
  array (
    0 => ',
      int: 49',
    1 => 'int',
    2 => '49',
  ),
)
---

Now you can simply iterate the matches and work with [1] (the keys) and [2] (the values). This is a basic solution that can be further tailored to suit your project data. Admittedly, it doesn't account for values that contain an escaped double quote, though adding that would be no trouble. Accounting for more complex value types may be more of a challenge.
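
To illustrate that last step, here is a rough sketch that folds the matches into an associative array and lets the quoted-string branch accept backslash-escaped characters. The $user array, the sample name value, and the use of stripslashes() as a stand-in for JavaScript unescaping are assumptions made for illustration, not part of the original answer:

```php
<?php
// Hypothetical input: same shape as the question, plus an escaped quote.
$html = <<<'HTML'
DATA.tracking.user = {
      age: "19",
      name: "John \"JD\" doe",
      int: 55
  }
HTML;

// Same \G-based pattern as in the answer, except the string branch is now
// "(?:[^"\\]|\\.)*" so that a backslash followed by any character is allowed.
$pattern = '~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"(?:[^"\\\\]|\\\\.)*")~';

$user = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $key = $m[1]; // capture group 1: the property name
        $raw = $m[2]; // capture group 2: digits or a quoted string
        if ($raw[0] === '"') {
            // Drop the surrounding quotes and remove the escaping backslashes.
            // stripslashes() only approximates JavaScript string unescaping.
            $user[$key] = stripslashes(substr($raw, 1, -1));
        } else {
            $user[$key] = (int) $raw;
        }
    }
}

var_export($user);
```

Keeping the type decision (string versus integer) next to the extraction means later code can use $user directly instead of re-parsing the raw matches.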

This answer was selected by the asker as the best answer.
