dongxin8709 2016-09-08 23:38

Laravel: fetching an RSS feed with Guzzle from a site that requires JavaScript

I am trying to fetch an RSS feed using the code below.

<?php

$client  = new \GuzzleHttp\Client(['User-Agent' => 'idap']);
$content = $client->request('GET', 'alarabiya.net/.mrss/ar.xml');

dd($content->getBody()->getContents());

and it returns the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<meta http-equiv="Content-Script-Type" content="text/javascript">

<script type="text/javascript">

function getCookie(c_name) { // Local function for getting a cookie value

    if (document.cookie.length > 0) {

        c_start = document.cookie.indexOf(c_name + "=");

        if (c_start!=-1) {

        c_start=c_start + c_name.length + 1;

        c_end=document.cookie.indexOf(";", c_start);



        if (c_end==-1) 

            c_end = document.cookie.length;



        return unescape(document.cookie.substring(c_start,c_end));

        }

    }

    return "";

}

function setCookie(c_name, value, expiredays) { // Local function for setting a value of a cookie

    var exdate = new Date();

    exdate.setDate(exdate.getDate()+expiredays);

    document.cookie = c_name + "=" + escape(value) + ((expiredays==null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";

}

function getHostUri() {

    var loc = document.location;

    return loc.toString();

}

setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '46.252.205.139', 10);

try {  

    location.reload(true);  

} catch (err1) {  

    try {  

        location.reload();  

    } catch (err2) {  

        location.href = getHostUri();  

    }  

}

</script>

</head>

<body>

<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript>

</body>

</html>

How can I get the RSS feed from https://www.alarabiya.net/.mrss/ar.xml? Also, many sites do not include the full description in their RSS. How can I fetch the complete description in code, as fivefilters.org does? And what if the RSS file is large and takes a long time to load?

Thanks,

1 answer

  • dongti7838 2016-09-09 08:21
    I have updated my answer to use GuzzleHttp\Client. I tested this code myself and it works with Guzzle ^6.2; install that specific version with Composer to be safe. I assume you know how to get the code below up and running with Composer.

    Description

    When we request the RSS feed at http://www.alarabiya.net/.mrss/ar.xml, the server first looks for a cookie tied to the IP address the request comes from. If it does not find one, it returns a page whose JavaScript sets a cookie in the form Cookie_Hash:IP. The code that sets the cookie is:

    setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '49.49.242.64', 10);
    

    Once the cookie is set, the JavaScript reloads the page. Because the cookie is now present for that IP, the second request succeeds and the complete RSS feed is sent to the browser. Guzzle does not execute JavaScript, so we have to read the cookie out of the first response and send it ourselves.

    You can read the full JavaScript source (shown in the question) to see where all of this happens. The request headers to send with the Guzzle request can be copied from the headers a browser sends, using the developer tools in Chrome or Firefox.

    Let us know if anything is unclear.

    <?php
    
    require_once 'vendor/autoload.php';
    
    $client = new \GuzzleHttp\Client([
        'base_uri' => 'http://www.alarabiya.net/',
        'cookies' => true,
    ]);
    
    $res = $client->request('GET', '/.mrss/ar.xml');
    
    // Cast the body stream to a string so it can be searched with a regex.
    $firstResponse = (string) $res->getBody();
    
    // Search for the following call in the returned JavaScript:
    // setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '49.49.242.64', 10);
    $pattern = "/setCookie\(\s*'([^']+)'\s*,\s*'([^']+)'/";
    
    if (!preg_match($pattern, $firstResponse, $matches)) {
        // No challenge page was returned, so the body may already be the feed.
        exit($firstResponse);
    }
    
    $cookieName  = $matches[1]; // YPF8827340282Jdskjhfiw_928937459182JAX666
    $cookieValue = $matches[2]; // 49.49.242.64
    
    // Set cookie value, Cookie: $cookieName=$cookieValue
    
    $res = $client->request('GET', '/.mrss/ar.xml', [
        'headers' => [
            'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' .
                '(KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36',
            'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,' .
                'image/webp,*/*;q=0.8',
            'Accept-Encoding' => 'gzip, deflate, sdch',
            'Cookie' => ["$cookieName=$cookieValue"],
            'Referer' => 'http://www.alarabiya.net/.mrss/ar.xml',
            'Upgrade-Insecure-Requests' => 1,
            'Connection' => 'keep-alive',
        ],
        // 'debug' => false, // Set to true for debugging
    ]);
    
    echo $res->getBody();
    

    Note: I have tested this code with "guzzlehttp/guzzle": "^6.2".
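
The two-step flow described above (read the cookie out of the challenge page, then repeat the request with it) could also be wrapped in a small reusable helper. This is only a sketch under assumptions: the function names `extractChallengeCookie` and `fetchThroughChallenge` and the `$fetch` callable are made up for illustration and are not part of Guzzle, the nullable return type needs PHP 7.1 or later, and the site may change its challenge script at any time.

```php
<?php

/**
 * Pull the cookie name and value out of a setCookie('name', 'value', days) call.
 * Returns null when the page contains no such call (hypothetical helper).
 */
function extractChallengeCookie(string $html): ?array
{
    // Matches the call with quoted arguments, not the function definition.
    $pattern = "/setCookie\(\s*'([^']+)'\s*,\s*'([^']+)'/";
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    return ['name' => $m[1], 'value' => $m[2]];
}

/**
 * Fetch $url; if the response is the JavaScript challenge page, repeat the
 * request with the cookie that the script would have set in a browser.
 *
 * $fetch is an assumed callable: function (string $url, array $headers): string
 */
function fetchThroughChallenge(callable $fetch, string $url): string
{
    $body   = $fetch($url, []);
    $cookie = extractChallengeCookie($body);
    if ($cookie === null) {
        return $body; // No challenge found, so this is already the real content.
    }
    return $fetch($url, ['Cookie' => $cookie['name'] . '=' . $cookie['value']]);
}

// Possible usage with Guzzle (not executed here, requires vendor/autoload.php):
//
// $client = new \GuzzleHttp\Client();
// $fetch  = function (string $url, array $headers) use ($client): string {
//     return (string) $client->request('GET', $url, ['headers' => $headers])->getBody();
// };
// echo fetchThroughChallenge($fetch, 'http://www.alarabiya.net/.mrss/ar.xml');
```

Passing the HTTP call in as a callable keeps the parsing logic independent of the Guzzle version and makes it possible to try the helper against a saved copy of the challenge page without touching the network.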

    This answer was accepted by the asker as the best answer.
