douwei1174 2017-12-19 19:43
Views: 100

How to get a string from an HTML page using PHP and XPath (POST request?)

I'm trying to scrape this web page ...

https://www.aslteramo.it/SISWebOnLine/ProntoSoccorso.aspx

... using PHP and XPath to get the numbers shown under the red, yellow, green and white circles.

(Note: you may see different values on that page when you open it. That doesn't matter; they change dynamically.)

I'm trying to use this PHP code sample to print the value ...

<?php
    ini_set('display_errors', 'On');
    error_reporting(E_ALL);

    $url = 'http://www.aslteramo.it/SISWebOnLine/ProntoSoccorso.aspx';

    $xpath_for_parsing = '/html/body/div/form/div[3]/div[2]/div[3]/div/div/div[2]/table/tbody/tr[2]/td[4]/table/tbody/tr[1]/td';


    //#Set CURL parameters: pay attention to the PROXY config !!!!
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, '');
    $data = curl_exec($ch);
    curl_close($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($data);

    $xpath = new DOMXPath($dom);

    $colorWaitingNumber = $xpath->query($xpath_for_parsing);
    $theValue =  'N.D.';
    foreach( $colorWaitingNumber as $node )
    {
      $theValue = $node->nodeValue;
    }

    print $theValue;
?>

Note that to get an element's XPath you have to disable JavaScript in your browser, because right-click is disabled on the page.

I've seen that the page makes a POST request ...

... but I don't know how to modify my code to make that request, or how to extract my values from the result.

Any help will be appreciated.

Thank you in advance

1 answer

  • dtml3340 2017-12-20 08:19

    I've seen that in the page there is a POST request ...

    The data isn't in the HTML you download because the page fetches it with that POST request on load. You need to send the same POST request:

    $curl = curl_init();
    
    curl_setopt_array($curl, array(
      CURLOPT_URL => "https://www.aslteramo.it/SISWebOnLine/ProntoSoccorso.aspx",
      CURLOPT_RETURNTRANSFER => true,
      CURLOPT_ENCODING => "",
      CURLOPT_MAXREDIRS => 10,
      CURLOPT_TIMEOUT => 30,
      CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
      CURLOPT_CUSTOMREQUEST => "POST",
      // this is to emulate the page behavior
      CURLOPT_POSTFIELDS => "ctl00%24ScriptManager1=ctl00%24MainContent%24UpdatePanel1%7Cctl00%24MainContent%24Timer1&__EVENTTARGET=ctl00%24MainContent%24Timer1&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTYxOTg2MDY2NA9kFgJmD2QWAgIDD2QWBgIDDzwrAA0CAA8WAh4LXyFEYXRhQm91bmRnZAwUKwAGBRMwOjAsMDoxLDA6MiwwOjMsMDo0FCsAAhYQHgRUZXh0BQ1Ib21lIHBhZ2UgQVNMHgVWYWx1ZQUNSG9tZSBwYWdlIEFTTB4LTmF2aWdhdGVVcmwFF2h0dHA6Ly93d3cuYXNsdGVyYW1vLml0HgdUb29sVGlwBRxQYWdpbmEgaW5pemlhbGUgZGVsIHNpdG8gQVNMHgdFbmFibGVkZx4KU2VsZWN0YWJsZWceCERhdGFQYXRoBRdodHRwOi8vd3d3LmFzbHRlcmFtby5pdB4JRGF0YUJvdW5kZ2QUKwACFhIfBWcfBmcfCGcfBwUhL3Npc3dlYm9ubGluZS9wcm9udG9zb2Njb3Jzby5hc3B4HwEFD1Byb250byBTb2Njb3Jzbx8CBQ9Qcm9udG8gU29jY29yc28fBAUeVGVtcGkgZCdhdHRlc2EgUHJvbnRvIFNvY2NvcnNvHghTZWxlY3RlZGcfAwUhL1NJU1dlYk9uTGluZS9Qcm9udG9Tb2Njb3Jzby5hc3B4ZBQrAAIWEB8BBQ5UZW1waSBkJ2F0dGVzYR8CBQ5UZW1waSBkJ2F0dGVzYR8DBSAvU0lTV2ViT25MaW5lL1RlbXBpRGlhdHRlc2EuYXNweB8EBShUZW1waSBkJ2F0dGVzYSBwcmVzdGF6aW9uaSBhbWJ1bGF0b3JpYWxpHwVnHwZnHwcFIC9zaXN3ZWJvbmxpbmUvdGVtcGlkaWF0dGVzYS5hc3B4HwhnZBQrAAIWEB8BBRZMaXN0YSBkJ0F0dGVzYSBFeC1Qb3N0HwIFFkxpc3RhIGQnQXR0ZXNhIEV4LVBvc3QfAwUpamF2YXNjcmlwdDpvcGVuV2ViRm9ybSgnV2ViRXhQb3N0LmFzcHgnKTsfBAUnTW9uaXRvcmFnZ2lvIExpc3RhIGQnQXR0ZXNhIC0gKEV4LVBvc3QpHwVnHwZnHwcFKWphdmFzY3JpcHQ6b3BlbndlYmZvcm0oJ3dlYmV4cG9zdC5hc3B4Jyk7HwhnZBQrAAIWEB8BBR5BdHRpdml0w6AgbGliZXJvLXByb2Zlc3Npb25hbGUfAgUeQXR0aXZpdMOgIGxpYmVyby1wcm9mZXNzaW9uYWxlHwMFHy9TSVNXZWJPbkxpbmUvQXR0aXZpdGFBbHBpLmFzcHgfBAUeQXR0aXZpdMOgIGxpYmVyby1wcm9mZXNzaW9uYWxlHwVnHwZnHwcFHy9zaXN3ZWJvbmxpbmUvYXR0aXZpdGFhbHBpLmFzcHgfCGdkZAIJDw8WAh8BBQ9Qcm9udG8gU29jY29yc29kZAILD2QWAgIBD2QWAmYPZBYGAgEPFgIfBWdkAgsPPCsADQBkAg0PFgIfBWdkGAMFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBSBjdGwwMCRNYWluQ29udGVudCRJbWdCdG5BZ2dpb3JuYQUVY3RsMDAkTWFpbkNvbnRlbnQkd3d3D2dkBRBjdGwwMCRuYXZpZ2F0aW9uDw9kBQ9Qcm9udG8gU29jY29yc29kTUucCs6%2BZyLbulTAFPNo569%2B%2BDE%3D&__VIEWSTATEGENERATOR=1A2B14D6&__EVENTVALIDATION=%2FwEWAgK27duvDwKDm%2B%2FCCycw%2FWHLOR5AmzLF035J86RYL0wa&__ASYNCPOST=true",
      CURLOPT_HTTPHEADER => array(
        "cache-control: no-cache",
        "content-type: application/x-www-form-urlencoded"
      ),
    ));
    
    $response = curl_exec($curl);
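
The __VIEWSTATE and __EVENTVALIDATION values hard-coded in the request above were captured from one page load, so they may stop being accepted if the page changes. One possible alternative, sketched here on the assumption that the server accepts values read from a fresh GET of the same page, is to collect the hidden inputs first and build the POST body from them. The function names extractHiddenFields and buildTimerPostBody are hypothetical; the control names are copied from the request above.

```php
<?php
// Sketch only: builds the postback body from hidden fields found in the page
// instead of a hard-coded string. Function names are made up for illustration.

/**
 * Returns name => value for every named <input type="hidden"> in $html.
 */
function extractHiddenFields(string $html): array
{
    if ($html === '') {
        return []; // nothing to parse
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

/**
 * Builds the URL-encoded body for the timer postback.
 * Assumption: the server only needs the fields listed here.
 */
function buildTimerPostBody(array $hidden): string
{
    $fields = [
        'ctl00$ScriptManager1' => 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$Timer1',
        '__EVENTTARGET'        => 'ctl00$MainContent$Timer1',
        '__EVENTARGUMENT'      => '',
        '__VIEWSTATE'          => $hidden['__VIEWSTATE'] ?? '',
        '__VIEWSTATEGENERATOR' => $hidden['__VIEWSTATEGENERATOR'] ?? '',
        '__EVENTVALIDATION'    => $hidden['__EVENTVALIDATION'] ?? '',
        '__ASYNCPOST'          => 'true',
    ];
    // http_build_query() percent-encodes names and values ("$" becomes %24, "|" becomes %7C).
    return http_build_query($fields, '', '&');
}
```

The string returned by buildTimerPostBody() would be passed as CURLOPT_POSTFIELDS in place of the literal one, after a plain GET of the page has been fed to extractHiddenFields().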
    

    Then run your XPath query against the response:

    $dom = new DOMDocument();
    @$dom->loadHTML($response); // parse the body returned by curl_exec(), not the undefined $data
    
    $xpath = new DOMXPath($dom);
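
The snippet above stops once the DOMXPath object exists, so here is a rough sketch of the remaining step: run the expression and read the text of the first match. It also covers one common pitfall. An XPath copied from the browser usually contains tbody steps, because browsers add that element to the DOM, while the markup parsed by DOMDocument may not have it. The helper name firstMatchText is hypothetical, and unlike the loop in the question it returns the first match rather than the last.

```php
<?php
// Sketch only: evaluates an XPath expression against an HTML string and returns
// the trimmed text of the first match, or a fallback when nothing matches.

function firstMatchText(string $html, string $expression, string $fallback = 'N.D.'): string
{
    if ($html === '') {
        return $fallback; // nothing to parse
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Try the expression as given, then again with the "/tbody" steps removed.
    $candidates = [
        $expression,
        preg_replace('#/tbody(\[\d+\])?#', '', $expression),
    ];

    foreach ($candidates as $candidate) {
        $nodes = $xpath->query($candidate);
        if ($nodes !== false && $nodes->length > 0) {
            return trim($nodes->item(0)->textContent);
        }
    }
    return $fallback;
}
```

With the variables used in this thread, the call would look like firstMatchText($response, $xpath_for_parsing). If the server replies with only a fragment of the page rather than a full document, an absolute path starting at /html/body may not match, and a relative expression such as one starting with // is likely the safer choice.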
    

    Hope that helps.
