dpw63348 2013-04-28 06:19
Views: 66
Accepted

How to read JavaScript with PHP Ganon

I am scraping some HTML pages with the PHP Ganon DOM parser, but I am stuck at the point where I need to read some JavaScript from the page source. The JavaScript looks like this:

<script type="text/javascript">
    Event.observe(window, 'load', function() {
        ig_lightbox_main_img=0;
        ig_lightbox_img_sequence.push('http://someimageurl.com/image.jpg');
        ig_lightbox_img_labels.push("Some text");
        ig_lightbox_img_sequence.push('http://someimageurl.com/image2.jpg');
        ig_lightbox_img_labels.push("Some text 2");
    });
</script>

I want to read the URLs from the script above, which arrives as part of the page's HTML. This is the code I am using for now:

$html = str_get_dom('some page html here');
foreach ($html('.product-img-box script[type=text/javascript]') as $script) {
    echo $script->html();
}

But this is not working. Any idea how to read the script?

1 answer

  • doufan9377 2013-04-28 07:30

    Try putting quotes around the attribute value in the selector string you pass to the $html object, i.e. type="text/javascript" instead of type=text/javascript.

    I had a look here and they have an example:

    foreach($html('a[href ^= "http://"]') as $element) {
        $element->wrap('center');
    }
    

    I think the unquoted / in the selector may have made it return the wrong result.

    EDIT

    I was confused by the question at first: I thought the issue was that you couldn't get the data inside the script because of your selector. Anyway, after a bit of thinking: if you have a string copy of the script tag with the data inside, just run a regular expression over it.

    Here is an example that I tested:

    $string = "<script type=\"text/javascript\">
        Event.observe(window, 'load', function() {
            ig_lightbox_main_img=0;
    ig_lightbox_img_sequence.push('http://someimageurl.com/image.jpg');
    ig_lightbox_img_labels.push(\"Some text\");
    ig_lightbox_img_sequence.push('http://someimageurl.com/image2.jpg');
    ig_lightbox_img_labels.push(\"Some text 2\");
        });
    </script>";
    
    // Single-quoted so PHP leaves the backslashes and $ characters in the pattern untouched.
    $regex = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Za-z0-9+&@#\/%=~_|$?!:,.]*[A-Za-z0-9+&@#\/%=~_|$]/';
    
    $results = array();
    
    preg_match_all($regex,$string,$results);
    
    var_dump($results);
    //Result: array(1) { [0]=> array(2) { [0]=> string(33) "http://someimageurl.com/image.jpg" [1]=> string(34) "http://someimageurl.com/image2.jpg" } } 
    

    $results[0] holds the matched URLs, as returned by preg_match_all (Documentation).
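
If only the lightbox images are of interest, a narrower pattern can target the `push()` calls directly and pair each URL with its label. The following is a minimal sketch, not part of the original answer: `extractLightboxImages` is a hypothetical helper name, and it assumes that every URL and label is a plain quoted string literal with no escaped quotes inside.

```php
<?php
// Hypothetical helper: collects the quoted arguments of the two push() calls.
function extractLightboxImages(string $script): array
{
    // Group 1 is the opening quote, group 2 the argument, \1 the matching closing quote.
    preg_match_all(
        '/ig_lightbox_img_sequence\.push\(\s*([\'"])(.*?)\1\s*\)/',
        $script,
        $urlMatches
    );
    preg_match_all(
        '/ig_lightbox_img_labels\.push\(\s*([\'"])(.*?)\1\s*\)/',
        $script,
        $labelMatches
    );

    // Pair the n-th URL with the n-th label; label is null if it is missing.
    $images = [];
    foreach ($urlMatches[2] as $i => $url) {
        $images[] = [
            'url'   => $url,
            'label' => $labelMatches[2][$i] ?? null,
        ];
    }
    return $images;
}

$script = <<<'JS'
Event.observe(window, 'load', function() {
    ig_lightbox_main_img=0;
    ig_lightbox_img_sequence.push('http://someimageurl.com/image.jpg');
    ig_lightbox_img_labels.push("Some text");
    ig_lightbox_img_sequence.push('http://someimageurl.com/image2.jpg');
    ig_lightbox_img_labels.push("Some text 2");
});
JS;

$images = extractLightboxImages($script);
print_r($images);
```

Pairing by position only makes sense if the page always pushes a label right after each URL, as in the sample script.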

    If it helps, once you have a URL, you can use parse_url (Documentation) in PHP, which splits the URL string into components (scheme, host, path and so on) that are easier to work with.
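
As a small illustration of that step, the sketch below runs `parse_url` on one of the sample URLs from the question and then derives the file name from the path. The variable names are only for illustration.

```php
<?php
$url = 'http://someimageurl.com/image2.jpg';

// parse_url returns an associative array with the components present in the URL.
$parts = parse_url($url);

// The path component can be handed to the usual path helpers.
$fileName  = basename($parts['path']);
$extension = pathinfo($parts['path'], PATHINFO_EXTENSION);

echo $parts['scheme'], "\n";
echo $parts['host'], "\n";
echo $fileName, "\n";
echo $extension, "\n";
```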

    Note: the regular expression used here is fairly simple and won't cover every case. As stated here and here, it is very difficult to write a perfect regular expression for matching URLs.
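
One way to soften that limitation is to treat the regex matches as candidates and pass each one through PHP's built-in `filter_var` with `FILTER_VALIDATE_URL`. This is a sketch of that idea rather than something from the original answer; the input string and the `www.example.com` address are made up for the demonstration.

```php
<?php
// Made-up input: one full URL and one scheme-less "www." address.
$text = "ig_lightbox_img_sequence.push('http://someimageurl.com/image.jpg'); see www.example.com/a.png";

$regex = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Za-z0-9+&@#\/%=~_|$?!:,.]*[A-Za-z0-9+&@#\/%=~_|$]/';
preg_match_all($regex, $text, $matches);

// Keep only the candidates that filter_var accepts as URLs.
// FILTER_VALIDATE_URL requires a scheme, so the bare "www." match is dropped.
$valid = array_values(array_filter($matches[0], function ($candidate) {
    return filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}));

var_dump($matches[0]);
var_dump($valid);
```

Whether dropping scheme-less matches is desirable depends on the pages being scraped; prepending `http://` before validating is an alternative.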

    This answer was selected as the best answer by the asker.
