duansha8115 2018-07-13 07:06

CURLOPT_RETURNTRANSFER returns HTML as a string

I'm trying to parse HTML fetched with cURL using DOMDocument or XPath, but with CURLOPT_RETURNTRANSFER set, the URL's HTML always comes back wrapped in string(...), which makes it invalid HTML to parse.

Returned output:

string(102736) "<!DOCTYPE html>


    <html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">

    <head>

        <title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
        <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
        <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"

PHP snippet that produces the output:

$cc = $http->get($url);
var_dump($cc);
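
For context, the `string(102736) "..."` wrapper is what var_dump() prints for any string value (type, byte length, quoted contents); it is not stored in the variable. A minimal sketch of that distinction, assuming a small made-up page in `$html` rather than the real response:

```php
<?php
// Hypothetical stand-in for the fetched page; not the real response.
$html = '<!DOCTYPE html><html><head><title>Demo page</title></head><body></body></html>';

// Capture what var_dump() prints instead of sending it to the browser.
ob_start();
var_dump($html);
$dump = ob_get_clean();

// The dump carries the type/length annotation; the variable itself does not.
$dumpHasAnnotation  = strpos($dump, 'string(' . strlen($html) . ')') !== false;
$valueHasAnnotation = strpos($html, 'string(') !== false;

// The plain string loads into DOMDocument without any extra conversion.
$doc = new DOMDocument();
$doc->loadHTML($html);
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

echo $title, PHP_EOL;
```

If the annotation only shows up in the var_dump() output, the string itself can be passed to loadHTML() as-is.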

CURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php

When I remove CURLOPT_RETURNTRANSFER, I see the HTML without the string(102736) prefix, but curl_exec then echoes the page even though I didn't ask it to (reference: "curl_exec printing results when I don't want to").
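
The behaviour described here matches how plain cURL works: with CURLOPT_RETURNTRANSFER enabled, curl_exec() returns the body as a string; without it, curl_exec() prints the body and returns true. A minimal sketch using the cURL extension directly, where `fetchHtml()` is a hypothetical helper and not part of the linked library:

```php
<?php
/**
 * Fetch a URL and return the body as a string instead of printing it.
 * fetchHtml() is a hypothetical helper, not part of the seikan/HTTP class.
 */
function fetchHtml(string $url): string
{
    $ch = curl_init($url);

    // Without this option curl_exec() echoes the body and returns true.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow redirects so the final page is what gets returned.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // The handle is released automatically when $ch goes out of scope.
    return $body;
}
```

Calling it would look like `$html = fetchHtml($url);`, after which `$html` can be handed to DOMDocument.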

Here is the PHP snippet I used to parse the HTML:

  $cc = $http->get($url);
  $doc = new \DOMDocument();
  $doc->loadHTML($cc);

  // all links in document
  $links = [];
  $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
  foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }

Any idea?


1 answer

  • duananyantan04633 2018-07-13 07:13

    Check the return value -

    print_r($cc);
    

    You will probably find that the output is an array (if the code ran successfully). From the library source, get() returns...

    return [
        'header' => $headers,
        'body'   => substr($response, $size),
    ];
    

    So you will need to change the load line to be...

    $doc->loadHTML($cc['body']);
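
A defensive variant of the same idea, in case the response may be either a plain string or the header/body array quoted above. This is only a sketch: `extractBody()` is a hypothetical helper and the sample `$cc` contents are made up.

```php
<?php
/**
 * Return the HTML body from either a plain string or a
 * ['header' => ..., 'body' => ...] array.
 * extractBody() is a hypothetical helper, not part of the library.
 */
function extractBody($response): string
{
    if (is_string($response)) {
        return $response;
    }
    if (is_array($response) && isset($response['body']) && is_string($response['body'])) {
        return $response['body'];
    }
    throw new InvalidArgumentException('Response has no usable body.');
}

// Sample value shaped like the array quoted above; the contents are made up.
$cc = [
    'header' => ['content-type' => 'text/html'],
    'body'   => '<html><body><a href="/demo">Demo</a></body></html>',
];

$doc = new DOMDocument();
$doc->loadHTML(extractBody($cc));
$href = $doc->getElementsByTagName('a')->item(0)->getAttribute('href');

echo $href, PHP_EOL;
```

Passing the whole array to loadHTML() is what fails; picking out the body first gives the parser an ordinary string.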
    

    Update:

    As an example of the above, using this question as the page to work with...

    $cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
    $doc = new \DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($cc['body']);
    
    // all links in document
    $links = [];
    $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
    foreach($arr as $item) { // DOMElement Object
        $href =  $item->getAttribute("href");
        $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
        $links[] = [
            'href' => $href,
            'text' => $text
        ];
    }
    
    print_r($links);
    

    Outputs...

    Array
    (
        [0] => Array
            (
                [href] => #
                [text] => 
            )
    
        [1] => Array
            (
                [href] => https://stackoverflow.com
                [text] => Stack Overflow
            )
    
        [2] => Array
            (
                [href] => #
                [text] => 
            )
    
        [3] => Array
            (
                [href] => https://stackexchange.com/users/?tab=inbox
    ...
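
Since the question also mentions XPath, the same loop can be sketched as a reusable function that works on any HTML string. `extractLinks()` and the sample markup are hypothetical, and the `//a[@href]` query is one possible choice that skips anchors without an href:

```php
<?php
/**
 * Collect href/text pairs for every <a> element that has an href.
 * extractLinks() is a hypothetical helper built on the loop shown above.
 */
function extractLinks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            // Collapse runs of whitespace, including line breaks, to one space.
            'text' => trim(preg_replace('/\s+/', ' ', $anchor->textContent)),
        ];
    }
    return $links;
}

// Made-up markup standing in for a downloaded page.
$sample = '<html><body><div><a href="#"></a><a href="https://stackoverflow.com">Stack
    Overflow</a><a name="top">no href</a></div></body></html>';

$links = extractLinks($sample);
print_r($links);
```

Restoring the previous libxml error setting keeps the helper from changing error handling for the rest of the script.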
    
    This answer was accepted by the asker.
