douweng1935 2013-07-18 20:29

PHP - How to get the main HTML content, like Reader Mode in Firefox

In the Firefox app on Android and in Safari on the iPad, "Reader Mode" shows only the main content of a page. How can I recognize only the main content of an HTML page with PHP?

I need to detect the main news article with PHP, the way Firefox or Safari does.

For example, I fetch a news page from bbcsite.com/news/123 with this code:

<?php
    $html = file_get_contents('http://bbcsite.com/news/123');
?>

Then I want to show only the main article, without ads and other clutter, as Firefox and Safari do.

I found fivefilters.org. That site is able to extract the content.

Thank you.

5 answers

  • douyun8674 2013-07-18 22:48

    Hooray!!!

    I found this source code (PHP Readability):

    1) Create Readability.php

    2) Create JSLikeHTMLElement.php

    3) Create index.php with this code:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html>
        <head>
            <title>!</title>
            <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        </head>
    <body dir="rtl">
    <?php
    include_once 'Readability.php';
    
    
    // URL of the page to extract
    // (change this URL to whatever you'd like to test)
    $url = 'http://';
    $html = file_get_contents($url);
    
    // Note: PHP Readability expects UTF-8 encoded content.
    // If your content is not UTF-8 encoded, convert it 
    // first before passing it to PHP Readability. 
    // Both iconv() and mb_convert_encoding() can do this.
    
    // If we've got Tidy, let's clean up input.
    // This step is highly recommended - PHP's default HTML parser
    // often doesn't do a great job and results in strange output.
    if (function_exists('tidy_parse_string')) {
        $tidy = tidy_parse_string($html, array(), 'UTF8');
        $tidy->cleanRepair();
        $html = $tidy->value;
    }
    
    // give it to Readability
    $readability = new Readability($html, $url);
    // print debug output? 
    // useful to compare against Arc90's original JS version - 
    // simply click the bookmarklet with FireBug's console window open
    $readability->debug = false;
    // convert links to footnotes?
    $readability->convertLinksToFootnotes = true;
    // process it
    $result = $readability->init();
    // does it look like we found what we wanted?
    if ($result) {
        echo "== Title =====================================\n";
        echo $readability->getTitle()->textContent, "\n\n";
        echo "== Body ======================================\n";
        $content = $readability->getContent()->innerHTML;
        // if we've got Tidy, let's clean it up for output
        if (function_exists('tidy_parse_string')) {
            $tidy = tidy_parse_string($content, array('indent'=>true, 'show-body-only' => true), 'UTF8');
            $tidy->cleanRepair();
            $content = $tidy->value;
        }
        echo $content;
    } else {
        echo 'Looks like we couldn\'t find the content. :(';
    }
    ?>
    </body>
    </html>
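
    The comments in index.php say that PHP Readability expects UTF-8 input and that mb_convert_encoding() can do the conversion. A minimal sketch of that step might look like the following. The helper name toUtf8 and the ISO-8859-1 fallback are assumptions for illustration, not part of PHP Readability, and the mbstring extension is assumed to be loaded.

    ```php
    <?php
    // Hypothetical helper (not part of PHP Readability): make sure the fetched
    // HTML is UTF-8 before it is passed to the Readability constructor.
    function toUtf8(string $html, string $sourceCharset = 'ISO-8859-1'): string
    {
        // Input that is already valid UTF-8 is returned unchanged.
        if (mb_check_encoding($html, 'UTF-8')) {
            return $html;
        }
        // Otherwise convert from the charset the caller believes the page uses.
        // Assumption: ISO-8859-1 is only a fallback guess; pass the real charset
        // (from the HTTP header or the page's meta tag) when you know it.
        return mb_convert_encoding($html, 'UTF-8', $sourceCharset);
    }
    ```

    In index.php this would be called on $html right after file_get_contents() and before the Tidy step, passing the page's actual charset as the second argument if it is known.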
    

    In the line $url = 'http://'; set the URL of your site.
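
    Once a real URL is set, note that file_get_contents() returns false when the request fails, and index.php passes that value straight on. A hedged sketch of a more defensive fetch is shown here. The function name fetchHtml, the timeout, and the user agent string are all assumptions chosen for illustration.

    ```php
    <?php
    // Hypothetical helper: fetch a page and fail loudly instead of handing
    // `false` (or an empty string) to Readability.
    function fetchHtml(string $url, int $timeoutSeconds = 10): string
    {
        // Options under the 'http' key apply to http:// and https:// URLs.
        $context = stream_context_create([
            'http' => [
                'timeout'    => $timeoutSeconds,
                // Assumed value; some sites reject requests with no user agent.
                'user_agent' => 'Mozilla/5.0 (compatible; ExampleReader/1.0)',
            ],
        ]);

        // The @ operator suppresses the warning; the return value is checked instead.
        $html = @file_get_contents($url, false, $context);
        if ($html === false || $html === '') {
            throw new RuntimeException("Could not fetch $url");
        }
        return $html;
    }
    ```

    With this in place, $html = fetchHtml($url); could replace the plain file_get_contents($url) call, wrapped in a try/catch that prints an error message.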

    Thank you;)

    This answer was selected by the asker as the best answer.