duansha8115 2018-07-12 23:06
Views: 112
Accepted

CURLOPT_RETURNTRANSFER returns HTML as a string

I'm trying to parse HTML fetched with cURL using DOMDocument or XPath, but with CURLOPT_RETURNTRANSFER set the request always returns the URL's HTML as a string, which seems to make it invalid HTML for parsing.

Returned output:

string(102736) "<!DOCTYPE html>


    <html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">

    <head>

        <title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
        <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
        <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"

PHP snippet used to see the output:

$cc = $http->get($url);
var_dump($cc);

CURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php

When I remove CURLOPT_RETURNTRANSFER I see the HTML without the string(102736) prefix, but then the page is echoed even though I didn't ask for it (reference: "curl_exec printing results when I don't want to").
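A note on the `string(102736)` prefix described above: it is most likely an annotation printed by `var_dump()` (type and byte length) rather than part of the returned data. The sketch below illustrates this with a short inline string; `$html` is a hypothetical stand-in for the real response, and the exact `var_dump()` formatting can differ when an extension such as Xdebug overrides it.

```php
<?php
// Hypothetical stand-in for the HTML string returned by the HTTP client.
$html = '<!DOCTYPE html><html><head><title>Demo</title></head><body><p>Hi</p></body></html>';

// Capture what var_dump() prints so it can be compared with the raw value.
ob_start();
var_dump($html);
$dumped = ob_get_clean();

// var_dump() reports the type and byte length, e.g. string(<length>) "<value>".
$annotation = 'string(' . strlen($html) . ')';
$dumpHasAnnotation = strpos($dumped, $annotation) !== false;

// The variable itself begins with the doctype, not with "string(".
$valueStartsWithDoctype = strncmp($html, '<!DOCTYPE', 9) === 0;
$valueHasAnnotation = strpos($html, 'string(') !== false;

echo 'dump contains annotation:  ', var_export($dumpHasAnnotation, true), "\n";
echo 'value starts with doctype: ', var_export($valueStartsWithDoctype, true), "\n";
echo 'value contains annotation: ', var_export($valueHasAnnotation, true), "\n";
```

If that holds for the real response as well, the string can be passed to `loadHTML()` unchanged; only the debugging output carries the wrapper.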

Here is the PHP snippet I used to parse the HTML:

  $cc = $http->get($url);
  $doc = new \DOMDocument();
  $doc->loadHTML($cc);

  // all links in document
  $links = [];
  $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
  foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }

Any idea?

1 answer

  • duananyantan04633 2018-07-12 23:13

    Check the return value -

    print_r($cc);
    

    You will probably find that the output is an array (if the code ran successfully). From the library source, get() returns:

    return [
        'header' => $headers,
        'body'   => substr($response, $size),
    ];
    

    So you will need to change the load line to be...

    $doc->loadHTML($cc['body']);
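If it is unclear which shape a given version of the client returns, a small guard can accept both. This is only a sketch: `responseBody()` is a hypothetical helper, not part of the seikan/HTTP library, and the two sample values are stand-ins for whatever `$http->get($url)` actually returns.

```php
<?php
/**
 * Hypothetical helper: return the HTML body whether the client handed back
 * a plain string or an array with a 'body' key.
 */
function responseBody($response): string
{
    if (is_array($response) && isset($response['body'])) {
        return (string) $response['body'];
    }
    if (is_string($response)) {
        return $response;
    }
    throw new InvalidArgumentException('Unexpected response type: ' . gettype($response));
}

// Stand-in values for the two possible return shapes.
$asString = '<html><body><a href="/x">X</a></body></html>';
$asArray  = ['header' => [], 'body' => $asString];

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML(responseBody($asArray));
libxml_clear_errors();

// Number of <a> elements found in the parsed body.
$anchorCount = $doc->getElementsByTagName('a')->length;
echo $anchorCount, "\n";
```

Throwing on any other type makes a failed request (for example a `false` return) surface immediately instead of producing an empty document.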
    

    Update:

    As an example of the above, using this question as the page to work with:

    $cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
    $doc = new \DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($cc['body']);
    
    // all links in document
    $links = [];
    $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
    foreach($arr as $item) { // DOMElement Object
        $href =  $item->getAttribute("href");
        $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
        $links[] = [
            'href' => $href,
            'text' => $text
        ];
    }
    
    print_r($links);
    

    Outputs...

    Array
    (
        [0] => Array
            (
                [href] => #
                [text] => 
            )
    
        [1] => Array
            (
                [href] => https://stackoverflow.com
                [text] => Stack Overflow
            )
    
        [2] => Array
            (
                [href] => #
                [text] => 
            )
    
        [3] => Array
            (
                [href] => https://stackexchange.com/users/?tab=inbox
    ...
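Since the question also mentions XPath, the same extraction can be sketched with `DOMXPath`. The inline `$html` below is an assumed stand-in for `$cc['body']`, and the `//a[@href]` query is one possible choice that skips anchors without an `href` attribute.

```php
<?php
// Inline stand-in for $cc['body']; a real page would come from the HTTP client.
$html = <<<HTML
<!DOCTYPE html>
<html><body>
  <a href="https://stackoverflow.com">Stack
     Overflow</a>
  <a href="#"></a>
  <a>no href here</a>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML often triggers parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Select only anchors that actually carry an href attribute.
foreach ($xpath->query('//a[@href]') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        // Collapse any run of whitespace (including line breaks) to one space.
        'text' => trim(preg_replace('/\s+/', ' ', $a->textContent)),
    ];
}

print_r($links);
```

Using `\s+` instead of a character class of line breaks also collapses the indentation that follows a wrapped line, so link text split across lines comes out with single spaces.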
    

    This answer was selected by the asker as the best answer.