douci1918 2015-08-08 17:23
Views: 296

How do I crawl a URL and display its details?

I am having trouble displaying the image and price for a Flipkart product URL. I want to show the product image and price for a starting URL, then keep crawling every URL it links to and show the same details for each of them on one page. Instead, I get an error saying "redirection limit reached, failed to open stream".

Here is my code:

<?php 
ini_set('max_execution_time', 4000);
$to_crawl = "http://www.flipkart.com/apple-iphone-6/p/itme5rf6ewg7trwz?pid=MOBEYGPZAHZQMCKZ&otracker=from-search&srno=t_4&query=apple&al=hplRX0gsd%2BUs3897GU7MA33GdyuXyA9x5heu%2FXnCd8gCFiEqsIXwVoaLq2lx4bRfFLwHQxVDMNU%3D&ref=cfd05202-e814-4bb4-bbe2-422b4ecc6df9";
$c = array();
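function getPriceFromFlipkart($url) {
    // Fetch the page with cURL, following redirects.
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10; labnol;) ctrlq.org");
    curl_setopt($curl, CURLOPT_FAILONERROR, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($curl);
    curl_close($curl);

    // curl_exec() returns false on failure; there is nothing to parse then.
    if ($html === false) {
        return null;
    }

    $regex = '/<meta itemprop="price" content="([^"]*)"/';
    preg_match($regex, $html, $price);

    $regex = '/<h1[^>]*>([^<]*)<\/h1>/';
    preg_match($regex, $html, $title);

    $regex = '/data-src="([^"]*)"/i';
    preg_match($regex, $html, $image);

    // Return the captured values so the caller has something to json_encode().
    return array(
        'title' => isset($title[1]) ? trim($title[1]) : null,
        'price' => isset($price[1]) ? $price[1] : null,
        'image' => isset($image[1]) ? $image[1] : null,
    );
}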
function get_links($url){
    global $c;
    $input = file_get_contents($url);
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $input, $matches);
    $base_url = parse_url($url, PHP_URL_HOST);
    $l = $matches[2];
    foreach($l as $link) {
        if(strpos($link, "#")) {
            $link = substr($link,0, strpos($link, "#"));
        }
        if(substr($link,0,1) == ".") {
            $link = substr($link, 1);
        }
        if(substr($link,0,7)=="http://") {
            $link = $link;
        }
        else if(substr($link,0,8) =="https://") {
            $link = $link;
        }
        else if(substr($link,0,2) =="//") {
            $link = substr($link, 2);
        }
        else if(substr($link,0,1) =="#") {
            $link = $url;
        }
        else if(substr($link,0,7) =="mailto:") {
            $link = "[".$link."]";
        }
        else {
            if(substr($link,0,1) != "/") {
                $link = $base_url."/".$link;
            }
            else {
                $link = $base_url.$link;
            }
        }
        // Links that still lack a scheme (and are not bracketed mailto: entries) get the page's scheme.
        if(substr($link, 0, 7)!="http://" && substr($link, 0, 8)!="https://" && substr($link, 0, 1)!="[") {
            if(substr($url, 0, 8) == "https://") {
                $link = "https://".$link;
            }
            else {
                $link = "http://".$link;
            }
        }
        //echo $link."<br />";
        if(!in_array($link,$c)) {
            array_push($c,$link);
        }
    }
}
get_links($to_crawl);
foreach ($c as $page) {
    get_links($page);
}
foreach ($c as $page) {
    $response = getPriceFromFlipkart($page);
    echo json_encode($response);
    echo $page."<br />";
}
?>
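
For reference, the link normalization that get_links() attempts can be sketched as a single function. This is only a minimal sketch under stated assumptions: resolve_link() is a hypothetical helper that is not part of the script above, it ignores ports and "../" segments, and it treats every non-root-relative path as relative to the site root, as the original loop does. A link that comes out without a scheme is passed to file_get_contents() as a local path, which is one possible source of the "failed to open stream" message.

```php
<?php
// Hypothetical helper (not in the original script): turn an href found on
// $pageUrl into an absolute URL, or return null when it cannot be crawled.
function resolve_link($link, $pageUrl) {
    // Drop the fragment, if any. Compare with false: strpos() returns 0 for "#top".
    $hashPos = strpos($link, '#');
    if ($hashPos !== false) {
        $link = substr($link, 0, $hashPos);
    }
    if ($link === '') {
        return $pageUrl; // empty or fragment-only href points at the page itself
    }

    $parts  = parse_url($pageUrl);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';

    // Already absolute.
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }
    // Protocol-relative ("//cdn.example.com/x"): reuse the page's scheme.
    if (substr($link, 0, 2) === '//') {
        return $scheme . ':' . $link;
    }
    // Any other scheme (mailto:, javascript:, tel:) is not fetchable over HTTP.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return null;
    }
    // Root-relative path.
    if ($link[0] === '/') {
        return $scheme . '://' . $host . $link;
    }
    // Simplification: treat anything else as relative to the site root,
    // after dropping a leading "./".
    if (substr($link, 0, 2) === './') {
        $link = substr($link, 2);
    }
    return $scheme . '://' . $host . '/' . $link;
}
```

With a helper like this, the loop in get_links() would skip null results and only store links that start with http:// or https://, so every entry in $c can be handed to the fetch functions directly.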