douan5151 2016-03-23 20:22

PHP crawler exception

The code below is meant to output the content under the Plot section of a Wikipedia page. I am using getElementById, and it throws the exception pasted below. Can someone modify it so that it works? Thanks in advance.

<?php
/**
 * Downloads a web page from $url, selects the element by $id
 * and returns its XML string representation.
 */
//Taking input
 if(isset($_POST['submit'])) /* i.e. the PHP code is executed only when someone presses Submit button in the below given HTML Form */
{
$var = $_POST['var'];   // Here $var is the input taken from user.
} 
function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url);

    if(!$doc) {
        throw new Exception("Failed to load $url");
    }

    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

// call it:
echo getElementByIdAsString('https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story', 'Plot');
?>

Exception is:

Fatal error: Uncaught exception 'Exception' with message 'An element with id Plot was not found' in C:\xampp\htdocs\example2.php:23 Stack trace: #0 C:\xampp\htdocs\example2.php(35): getElementByIdAsString() #1 {main} thrown in C:\xampp\htdocs\example2.php on line 23

1 answer

  • dsqe46004 2016-03-23 20:57

    I tried your code and it works: it returns <span class="mw-headline" id="Plot">Plot</span>. I think your problem is the use of DOMDocument::loadHTMLFile with @:

    @$doc->loadHTMLFile($url);
    

    Because this method returns

    bool true on success or false on failure

    Sometimes it returns false (for example, when Wikipedia answers with a 403 after too many requests), and the DOM document stays empty. In that case $element = $doc->getElementById($id); cannot find the element.

    Try changing your code to:

    <?php
    /**
     * Downloads a web page from $url, selects the element by $id
     * and returns its XML string representation.
     */
    //Taking input
    if(isset($_POST['submit'])) /* i.e. the PHP code is executed only when someone presses Submit button in the below given HTML Form */
    {
        $var = $_POST['var'];   // Here $var is the input taken from user.
    }
    function getElementByIdAsString($url, $id, $pretty = true) {
        $doc = new DOMDocument();
        $loadResult = @$doc->loadHTMLFile($url);
    
        if(!$doc || !$loadResult) {
            throw new Exception("Failed to load $url");
        }
    
        // Obtain the element
        $element = $doc->getElementById($id);
    
        if(!$element) {
            throw new Exception("An element with id $id was not found");
        }
    
        if($pretty) {
            $doc->formatOutput = true;
        }
    
        // Return the string representation of the element
        return $doc->saveXML($element);
    }
    
    // call it:
    echo getElementByIdAsString('https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story', 'Plot');
    ?>
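
The @ operator hides every warning, which makes a failed load hard to diagnose. As a rough sketch of an alternative, the parsing step can be separated from the download and libxml's own error collection used instead of @. The function name elementByIdFromHtml is hypothetical and not part of the original code; it assumes the HTML has already been fetched into a string.

```php
<?php
/**
 * Sketch: parse an HTML string and return the element with the given id
 * as an XML string. elementByIdFromHtml is a hypothetical helper name.
 */
function elementByIdFromHtml(string $html, string $id): string
{
    // An empty body usually means the download failed upstream.
    if ($html === '') {
        throw new RuntimeException('Empty HTML input');
    }

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html); // returns true on success, false on failure

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        throw new RuntimeException('Failed to parse HTML');
    }

    $element = $doc->getElementById($id);
    if ($element === null) {
        throw new RuntimeException("An element with id $id was not found");
    }

    // Serialize only the selected element.
    return $doc->saveXML($element);
}
```

With this split, a "not found" exception can only mean the id is missing from markup that did load, which is easier to reason about than an empty document.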
    

    Wikipedia may also be unavailable to your script (some sites block scraping scripts). Try using curl to get the status code of the response:

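    // Include the scheme; without it curl assumes plain HTTP.
    $url = 'https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $html = curl_exec($ch); // false if the request itself failed
    $status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    echo $status_code; // 200 means OK; 403 means the request was refused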
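
Building on the status check above, here is a minimal sketch that refuses to continue unless the download succeeded. The helper names checkResponse and fetchHtml are hypothetical. Sending a User-Agent header is an assumption about why a request might be refused; the original answer does not confirm it as the cause of the 403.

```php
<?php
/**
 * Sketch: validate a curl result before handing the body to DOMDocument.
 * checkResponse and fetchHtml are hypothetical helper names.
 */
function checkResponse($body, int $statusCode, string $url): string
{
    // curl_exec() returns false when the transfer itself failed.
    if ($body === false || $body === '') {
        throw new RuntimeException("No response body from $url");
    }
    // Anything other than 200 (for example 403) is treated as a failure.
    if ($statusCode !== 200) {
        throw new RuntimeException("Unexpected HTTP status $statusCode from $url");
    }
    return $body;
}

function fetchHtml(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // assumption: identify the client
    $body = curl_exec($ch);
    $statusCode = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return checkResponse($body, $statusCode, $url);
}
```

A call might look like fetchHtml('https://en.wikipedia.org/wiki/I_Too_Had_a_Love_Story', 'ExampleFetcher/0.1'), where the User-Agent string is a placeholder. The returned string can then be passed to DOMDocument::loadHTML instead of letting loadHTMLFile do the download.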
    