dongxin8709 2016-09-08 23:38

Laravel: fetching an RSS feed with Guzzle when the site requires JavaScript

I am trying to fetch an RSS feed using the code below.

<?php

$client  = new \GuzzleHttp\Client(['User-Agent' => 'idap']);
$content = $client->request('GET', 'alarabiya.net/.mrss/ar.xml');

dd($content->getBody()->getContents());

and it returns the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<meta http-equiv="Content-Script-Type" content="text/javascript">

<script type="text/javascript">

function getCookie(c_name) { // Local function for getting a cookie value

    if (document.cookie.length > 0) {

        c_start = document.cookie.indexOf(c_name + "=");

        if (c_start!=-1) {

        c_start=c_start + c_name.length + 1;

        c_end=document.cookie.indexOf(";", c_start);



        if (c_end==-1) 

            c_end = document.cookie.length;



        return unescape(document.cookie.substring(c_start,c_end));

        }

    }

    return "";

}

function setCookie(c_name, value, expiredays) { // Local function for setting a value of a cookie

    var exdate = new Date();

    exdate.setDate(exdate.getDate()+expiredays);

    document.cookie = c_name + "=" + escape(value) + ((expiredays==null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";

}

function getHostUri() {

    var loc = document.location;

    return loc.toString();

}

setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '46.252.205.139', 10);

try {  

    location.reload(true);  

} catch (err1) {  

    try {  

        location.reload();  

    } catch (err2) {  

        location.href = getHostUri();

    }  

}

</script>

</head>

<body>

<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript>

</body>

</html>

How can I get the RSS feed from https://www.alarabiya.net/.mrss/ar.xml? Also, many sites do not include the full description in their RSS. How can I retrieve the complete description in code, as fivefilters.org does? And what should I do if the RSS file is large and takes a long time to load?

Thanks,


1 answer

  • dongti7838 2016-09-09 08:21

    I have updated my answer to use GuzzleHttp\Client. I tested this code myself, and it works with Guzzle ^6.2. To be safe, install that specific version with Composer. I assume you know how to get the code below up and running with Composer.

    Description

    When we request the RSS feed at http://www.alarabiya.net/.mrss/ar.xml, the server first looks for a cookie tied to the IP address the request comes from. If it does not find one, it returns a page whose JavaScript sets a cookie of the form Cookie_Hash=IP. The part of the code that sets the cookie is:

    setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '49.49.242.64', 10);
    

    Once the cookie is set, the JavaScript reloads the page. Because the cookie now exists for that IP, the second request succeeds and the complete RSS feed is sent to the browser.

    You can read the full JavaScript source in the response shown in the question to see where all of this happens. The headers to send with the Guzzle request can be copied from the request headers your browser sends, using the developer tools in Chrome or Firefox.

    Let me know if anything is unclear.

    <?php
    
    require_once 'vendor/autoload.php';
    
    $client = new \GuzzleHttp\Client([
        'base_uri' => 'http://www.alarabiya.net/',
        'cookies' => true,
    ]);
    
    $res = $client->request('GET', '/.mrss/ar.xml');
    
    // Cast the PSR-7 stream to a string so it can be searched
    $firstResponse = (string) $res->getBody();
    
    // Search for the following call in the returned script:
    // setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '49.49.242.64', 10);
    // The pattern captures the first two quoted arguments (cookie name and value)
    $pattern = "/setCookie\(\s*'([^']+)'\s*,\s*'([^']+)'/";
    
    if (!preg_match($pattern, $firstResponse, $matches)) {
        exit('No setCookie() call found; the response may already be the feed.');
    }
    
    $cookieName  = $matches[1]; // YPF8827340282Jdskjhfiw_928937459182JAX666
    $cookieValue = $matches[2]; // 49.49.242.64
    
    // Set cookie value, Cookie: $cookieName=$cookieValue
    
    $res = $client->request('GET', '/.mrss/ar.xml', [
        'headers' => [
            'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' .
                '(KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36',
            'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,' .
                'image/webp,*/*;q=0.8',
            'Accept-Encoding' => 'gzip, deflate', // sdch dropped: cURL cannot decode it
            'Cookie' => ["$cookieName=$cookieValue"],
            'Referer' => 'http://www.alarabiya.net/.mrss/ar.xml',
            'Upgrade-Insecure-Requests' => 1,
            'Connection' => 'keep-alive',
        ],
        // 'debug' => false, // Set to true for debugging
    ]);
    
    echo $res->getBody();
    

    Note: I have tested this code with "guzzlehttp/guzzle": "^6.2".
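
The two-step flow described above (request the feed, read the cookie from the returned script, request again) can also be wrapped in small helpers that hand the cookie to Guzzle's cookie jar instead of building the Cookie header by hand. This is only a sketch under a few assumptions: the function names `extractChallengeCookie` and `fetchFeed` are hypothetical, PHP 7.1 or later is assumed for the type hints, and the challenge page is assumed to keep calling `setCookie('name', 'value', days)` with single-quoted arguments as in the response shown in the question.

```php
<?php

/**
 * Hypothetical helper: pull the cookie name and value out of the
 * setCookie('name', 'value', days) call in the challenge page.
 * Returns null when no such call is present (for example, when the
 * response is already the RSS feed).
 */
function extractChallengeCookie(string $html): ?array
{
    // Requires a quote right after "(", so the function definition
    // "function setCookie(c_name, ..." is not matched.
    $pattern = "/setCookie\(\s*'([^']+)'\s*,\s*'([^']+)'/";

    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }

    return ['name' => $m[1], 'value' => $m[2]];
}

/**
 * Hypothetical helper: fetch $path, and if the server answers with the
 * JavaScript challenge, retry once with the cookie it asked for.
 * $cookieDomain must match the host being requested.
 */
function fetchFeed(\GuzzleHttp\Client $client, string $path, string $cookieDomain): string
{
    $body   = (string) $client->request('GET', $path)->getBody();
    $cookie = extractChallengeCookie($body);

    if ($cookie === null) {
        return $body; // No challenge page, so this is the real response
    }

    // Build a cookie jar holding the single cookie for the given domain
    $jar = \GuzzleHttp\Cookie\CookieJar::fromArray(
        [$cookie['name'] => $cookie['value']],
        $cookieDomain
    );

    return (string) $client->request('GET', $path, ['cookies' => $jar])->getBody();
}
```

With Composer's autoloader loaded, calling `fetchFeed(new \GuzzleHttp\Client(['base_uri' => 'http://www.alarabiya.net/']), '/.mrss/ar.xml', 'www.alarabiya.net')` should return the feed body if the site still behaves as described. Whether the server also insists on browser-like headers is not something this sketch verifies; those can be added through the `headers` request option as in the code above.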

    This answer was accepted by the asker as the best answer.
