douzongmu2543 2016-04-29 12:55

Scraping a web page returns encrypted characters

I have tried quite a few methods of downloading the page at $url = 'https://kat.cr/usearch/life%20of%20pi/'; using PHP. However, I always receive a page full of what look like encrypted characters.

I searched for possible solutions before posting and tried a few of them, but I haven't been able to get any to work yet.

Please see the methods I have tried below and suggest a solution. I am looking for a PHP solution.

Approach 1 - using file_get_contents - returns encrypted characters

<?php
//$contents = file_get_contents($url, $use_include_path, $context, $offset);

include('simple_html_dom.php');

$url = 'https://kat.cr/usearch/life%20of%20pi/';
$html = str_get_html(utf8_encode(file_get_contents($url)));

echo $html;


?>

Approach 2 - using file_get_html - returns encrypted characters

<?php
include('simple_html_dom.php');

$url = 'https://kat.cr/usearch/life%20of%20pi/';

$encoded = htmlentities(utf8_encode(file_get_html($url)));
echo $encoded;

?>

Approach 3 - using gzread - returns blank page

<?php

include('simple_html_dom.php');

$url = 'https://kat.cr/usearch/life%20of%20pi/';

$fp = gzopen($url,'r');

$contents = '';

while($html = gzread($fp , 256000))
{
    $contents .= $html;
}

gzclose($fp);

?>

Approach 4 - using gzinflate - returns empty page

<?php

include('simple_html_dom.php');
//function gzdecode($data)
//{
//    return gzinflate(substr($data,10,-8));
//}

//$contents = file_get_contents($url, $use_include_path, $context, $offset);



$url = 'https://kat.cr/usearch/life%20of%20pi/';
$html = str_get_html(utf8_encode(file_get_contents($url)));

echo gzinflate(substr($html,10,-8));


?>

Approach 5 - using fopen and fgets - returns encrypted characters

<?php
$url='https://kat.cr/usearch/life%20of%20pi/';
$handle = fopen($url, "r");

if ($handle)
{
    while (($line = fgets($handle)) !== false)
    {
        echo $line;
    }
    // Close the handle only when it was opened successfully.
    fclose($handle);
}
else
{
    // Error opening the URL.
    echo "could not open the URL!";
}
?>

Approach 6 - adding ob_start at the beginning of script - page does not load

<?php
ob_start("ob_gzhandler");

$url = 'https://kat.cr/usearch/life%20of%20pi/';
$handle = fopen($url, "r");

if ($handle)
{
    while (($line = fgets($handle)) !== false)
    {
        echo $line;
    }
    // Close the handle only when it was opened successfully.
    fclose($handle);
}
else
{
    // Error opening the URL.
    echo "could not open the URL!";
}
?>

Approach 7 - using curl - returns empty page

<?php
include('simple_html_dom.php'); // provides str_get_html()

$url = 'https://kat.cr/usearch/life%20of%20pi/';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2');
curl_setopt($ch, CURLOPT_ENCODING, "gzip"); // Ask for gzip and let cURL decode it
curl_setopt($ch, CURLOPT_TIMEOUT,5);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects

$return = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

$html = str_get_html("$return");
echo $html;

?>

Approach 8 - using R - returns encrypted characters

> thepage = readLines('https://kat.cr/usearch/life%20of%20pi/')
There were 29 warnings (use warnings() to see them)
> thepage[1:5]
[1] "\037‹\b"                                                                                                                                                                                                                                                                                                         
[2] "+SC®\037\035ÕpšÐ\032«F°{¼…àßá$\030±ª\022ù˜ú×Gµ."                                                                                                                                                                                                                                                                
[3] "\023\022&ÒÅdDjÈÉÎŽj\t¹Iꬩ\003ä\fp\024“ä(M<©U«ß×Ðy2\tÈÂæœ8ž­\036â!9ª]ûd<¢QR*>öÝdpä’kß!\022?ÙG~è'>\016¤ØÁ\0019Re¥†\0264æ’؉üQâÓ°Ô^—\016\t¡‹\\:\016\003Š]4¤aLiˆ†8ìS\022Ão€'ðÿ\020a;¦Aš`‚<\032!/\"DF=\034'EåX^ÔˆÚ4‰KDCê‡.¹©¡ˆ\004Gµ4&8r\006EÍÄO\002r|šóóZðóú\026?\0274Š ½\030!\týâ;W8Ž‹k‡õ¬™¬ÉÀ\017¯2b1ÓA< \004„š€&J"
[4] "@ƒˆxGµz\035\032Jpâ;²C‡u\034\004’Ñôp«e^*Wz-Óz!ê\022\001èÌI\023ä;LÖ\v›õ‡¸O⺇¯Y!\031þ\024-mÍ·‡G#°›„¦Î@º¿ÉùÒò(ìó¶³f\177¤?}\017½<Cæ_eÎ\0276\t\035®ûÄœ\025À}rÌ\005òß$t}ï/IºM»µ*íÖšh\006\t#kåd³¡€âȹE÷CÌG·!\017ý°èø‡x†ä\a|³&jLJõìè>\016ú\t™aᾞ[\017—z¹«K¸çeØ¿=/"                                                    
[5] "\035æ\034vÎ÷Gûx?Ú'ûÝý`ßßwö¯v‹bÿFç\177F\177\035±?ÿýß\177þupþ'ƒ\035ösT´°ûï¢<+(Òx°Ó‰\"<‘G\021M(ãEŽ\003pa2¸¬`\aGýtÈFíî.úÏîAQÙ?\032ÉNDpBÎ\002Â"  
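
A note on the output above: the first string R printed, "\037‹\b", matches the bytes 0x1f 0x8b 0x08 if the "‹" is byte 0x8b rendered in a Windows-1252 locale (that rendering is an assumption). 0x1f 0x8b is the magic number that starts every gzip stream, and 0x08 is the deflate method byte, so the body may be gzip-compressed rather than encrypted. A minimal sketch of that check, where the helper name looks_like_gzip is hypothetical:

```python
import gzip

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# The first string from the R session, written as raw bytes.
# Assumption: the printed "‹" stands for byte 0x8b.
first_bytes = b"\x1f\x8b\x08"

print(looks_like_gzip(first_bytes))         # True
print(looks_like_gzip(b"<!DOCTYPE html>"))  # False
```

If the check passes on a real response, the remaining step would be to decompress the body before parsing it, instead of passing it through utf8_encode or htmlentities first.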

Approach 9 - using BeautifulSoup (python) - returns encrypted characters

import urllib

htmltext = urllib.urlopen("https://kat.cr/usearch/life%20of%20pi/").read()
print htmltext
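
For comparison, here is a hedged Python 3 sketch of the same fetch that gunzips the body when it appears to be compressed. It assumes the "encrypted" characters are really a gzip stream, which the question does not confirm; the function names decode_body and fetch, the User-Agent string and the timeout are all illustrative choices, not part of the original attempt.

```python
import gzip
import urllib.request


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Gunzip raw if the header or the leading magic bytes indicate gzip."""
    is_gzip_header = "gzip" in content_encoding.lower()
    has_gzip_magic = raw[:2] == b"\x1f\x8b"  # gzip magic number
    if is_gzip_header or has_gzip_magic:
        return gzip.decompress(raw)
    return raw


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download url and return the body, decompressed when it is gzip."""
    request = urllib.request.Request(
        url,
        headers={
            "User-Agent": "Mozilla/5.0",  # assumed; any browser-like value
            "Accept-Encoding": "gzip",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding") or ""
    return decode_body(raw, encoding)


# Usage (not executed here, since it needs network access):
# html = fetch("https://kat.cr/usearch/life%20of%20pi/")
# print(html.decode("utf-8", errors="replace")[:500])
```

urllib does not decompress responses on its own, which would explain raw compressed bytes being printed by the original snippet if the server sends gzip regardless of the request headers.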

Approach 10 - using wget on the linux terminal - gets a page with encrypted characters

wget -O page https://kat.cr/usearch/Monsoon%20Mangoes%20malayalam/

Approach 11 - pasting the URL manually into the service below - works

https://www.hurl.it/

Approach 12 - pasting the URL manually into the service below - works

https://www.import.io/
