I have tried quite a few methods of downloading the page below$url = 'https://kat.cr/usearch/life%20of%20pi/';
using PHP. However, I always receive a page with encrypted characters.
I've tried searching for possible solutions prior to posting, and have tried out a few, however, I haven't been able to get any to work yet.
Please see the methods I have tried below and suggest a solution. I am looking for a PHP solution for the same.
Approach 1 - using file_get_contents - returns encrypted characters
<?php
//$contents = file_get_contents($url, $use_include_path, $context, $offset);
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$html = str_get_html(utf8_encode(file_get_contents($url)));
echo $html;
?>
Approach 2 - using file_get_html - returns encrypted characters
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$encoded = htmlentities(utf8_encode(file_get_html($url)));
echo $encoded;
?>
Approach 3 - using gzread - returns blank page
<?php
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$fp = gzopen($url,'r');
$contents = '';
while($html = gzread($fp , 256000))
{
$contents .= $html;
}
gzclose($fp);
?>
Approach 4 - using gzinflate - returns empty page
<?php
include('simple_html_dom.php');
//function gzdecode($data)
//{
// return gzinflate(substr($data,10,-8));
//}
//$contents = file_get_contents($url, $use_include_path, $context, $offset);
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$html = str_get_html(utf8_encode(file_get_contents($url)));
echo gzinflate(substr($html,10,-8));
?>
Approach 5 - using fopen and fgets - returns encrypted characters
<?php
$url='https://kat.cr/usearch/life%20of%20pi/';
$handle = fopen($url, "r");
if ($handle)
{
while (($line = fgets($handle)) !== false)
{
echo $line;
}
}
else
{
// error opening the file.
echo "could not open the wikipedia URL!";
}
fclose($handle);
?>
Approach 6 - adding ob_start at the beginning of script - page does not load
<?php
ob_start("ob_gzhandler");
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$handle = fopen($url, "r");
if ($handle)
{
while (($line = fgets($handle)) !== false)
{
echo $line;
}
}
else
{
// error opening the file.
echo "could not open the wikipedia URL!";
}
fclose($handle);
?>
Approach 7 - using curl - returns empty page
<?php
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($cr, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2');
curl_setopt($ch, CURLOPT_ENCODING , "gzip");
curl_setopt($ch, CURLOPT_TIMEOUT,5);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
$return = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
$html = str_get_html("$return");
echo $html;
?>
Approach 8 - using R - returns encrypted characters
> thepage = readLines('https://kat.cr/usearch/life%20of%20pi/')
There were 29 warnings (use warnings() to see them)
> thepage[1:5]
[1] "\037‹\b"
[2] "+SC®\037\035ÕpšÐ\032«F°{¼…àßá$\030±ª\022ù˜ú×Gµ."
[3] "\023\022&ÒÅdDjÈÉÎŽj\t¹Iꬩ\003ä\fp\024“ä(M<©U«ß×Ðy2\tÈÂæœ8ž\036â!9ª]ûd<¢QR*>öÝdpä’kß!\022?ÙG~è'>\016¤ØÁ\0019Re¥†\0264æ’؉üQâÓ°Ô^—\016\t¡‹\\:\016\003Š]4¤aLiˆ†8ìS\022Ão€'ðÿ\020a;¦Aš`‚<\032!/\"DF=\034'EåX^ÔˆÚ4‰KDCê‡.¹©¡ˆ\004Gµ4&8r\006EÍÄO\002r|šóóZðóú\026?\0274Š ½\030!\týâ;W8Ž‹k‡õ¬™¬ÉÀ\017¯2b1ÓA< \004„š€&J"
[4] "@ƒˆxGµz\035\032Jpâ;²C‡u\034\004’Ñôp«e^*Wz-Óz!ê\022\001èÌI\023ä;LÖ\v›õ‡¸O⺇¯Y!\031þ\024-mÍ·‡G#°›„¦Î@º¿ÉùÒò(ìó¶³f\177¤?}\017½<Cæ_eÎ\0276\t\035®ûÄœ\025À}rÌ\005òß$t}ï/IºM»µ*íÖšh\006\t#kåd³¡€âȹE÷CÌG·!\017ý°èø‡x†ä\a|³&jLJõìè>\016ú\t™aᾞ[\017—z¹«K¸çeØ¿=/"
[5] "\035æ\034vÎ÷Gûx?Ú'ûÝý`ßßwö¯v‹bÿFç\177F\177\035±?ÿýß\177þupþ'ƒ\035ösT´°ûï¢<+(Òx°Ó‰\"<‘G\021M(ãEŽ\003pa2¸¬`\aGýtÈFíî.úÏîAQÙ?\032ÉNDpBÎ\002Â"
Approach 9 - using BeautifulSoup (python) - returns encrypted characters
import urllib
htmltext = urllib.urlopen("https://kat.cr/usearch/life%20of%20pi/").read()
print htmltext
Approach 10 - using wget on the linux terminal - gets a page with encrypted characters
wget -O page https://kat.cr/usearch/Monsoon%20Mangoes%20malayalam/
Approach 11 -
tried manually by pasting the url to the below service - works
Approach 12 -
tried manually by pasting the url to the below service - works