dongpiao8821 2019-01-15 04:03
Views: 664
Accepted

PHP cURL script gets 502/503 server errors after the first few requests

I have been working on a client's WordPress site that lists deals from Groupon. I use Groupon's official XML feed and import it via WP All Import, which works without much hassle. The issue is that Groupon doesn't update the feed frequently, while some of its deals often sell out or are taken off the market. To resolve this, I am trying to use a cURL script that crawls the deal links once a day, checks whether each deal is still available, and turns the unavailable deals into draft posts.

The custom script works almost perfectly, but after the first 14/24 requests the server starts responding with 502/503 HTTP status codes. To overcome the issue I have taken the following precautions:

  1. Using the proper header (captured from the requests made by the browser)
  2. Parsing cookies from response header and sending back.
  3. Using proper referrer and user agent.
  4. Using proxies.
  5. Sending the requests at a set interval, using PHP's sleep(5);

Unfortunately, none of this solved the problem. I am attaching my code and would appreciate your expert insights on the issue.

Thanks in advance for your time. Shahriar

PHP SCRIPT - https://pastebin.com/FF2cNm5q

<?php

// Suppress errors and extend the maximum execution time
error_reporting(0);
ini_set('max_execution_time', 50000);

// Sitemap URL List
$all_activity_urls = array();
$sitemap_url = array(
     'https://www.groupon.de/sitemaps/deals-local0.xml.gz'
);
$cookies = Array();

// looping through sitemap url for scraping activity urls
for ($u = 0; $u < count($sitemap_url); $u++)
{
     $ch1 = curl_init();
     curl_setopt($ch1, CURLOPT_RETURNTRANSFER, TRUE);
     curl_setopt($ch1, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:38.0) Gecko/20100101 Firefox/38.0');
     curl_setopt($ch1, CURLOPT_REFERER, "https://www.groupon.de/");
     curl_setopt($ch1, CURLOPT_TIMEOUT, 40);
//    curl_setopt($ch1, CURLOPT_COOKIEFILE, "cookie.txt");
     curl_setopt($ch1, CURLOPT_URL, $sitemap_url[$u]);
     curl_setopt($ch1, CURLOPT_SSL_VERIFYPEER, FALSE);
     // Parsing Cookie from the response header
     curl_setopt($ch1, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
     $activity_url_source = curl_exec($ch1);
     $status_code = curl_getinfo($ch1, CURLINFO_HTTP_CODE);
     curl_close($ch1);

     if ($status_code === 200)
     {
          // Parsing XML sitemap for activity urls
          $activity_url_list = json_decode(json_encode(simplexml_load_string($activity_url_source)));
          for ($a = 0; $a < count($activity_url_list->url); $a++)
          {
               array_push($all_activity_urls, $activity_url_list->url[$a]->loc);
          }
     }
}


if (count($all_activity_urls) > 0)
{
// URL loop range: process at most the first 100 URLs
     $loop_from = 0;
     $loop_to = min(100, count($all_activity_urls));
//    $loop_to = count($all_activity_urls);

     $final_data = array();
     echo 'script start - ' . date('h:i:s') . "<br>";

     for ($u = $loop_from; $u < $loop_to; $u++)
     {
          //Pull source from webpage
          $headers = array(
               'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'accept-language: en-US,en;q=0.9,bn-BD;q=0.8,bn;q=0.7,it;q=0.6',
               'cache-control: max-age=0',
               'cookie: ' . implode('; ', $cookies),
               'upgrade-insecure-requests: 1',
               'user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
          );

          $site = $all_activity_urls[$u];
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
          curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
          curl_setopt($ch, CURLOPT_REFERER, "https://www.groupon.de/");
          curl_setopt($ch, CURLOPT_TIMEOUT, 40);
          curl_setopt($ch, CURLOPT_URL, $site);
          curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
          // Parsing Cookie from the response header
          curl_setopt($ch, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
          $data = curl_exec($ch);
          $status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
          curl_close($ch);

          if ($status_code === 200)
          {
               // Ready data for parsing
               $document = new DOMDocument();
               $document->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $data);
               $xpath = new DOMXpath($document);

               $title = '';     
               $availability = '';
               $price = '';
               $base_price = '';
               $link = '';
               $image = '';

               $link = $all_activity_urls[$u];

               // Scraping Availability (XPath positions are 1-based, so div[0] never matches)
               $raw_availability = $xpath->query('//div[@data-bhw="DealHighlights"]/div[1]/div/div');
               $availability = ($raw_availability->length > 0) ? $raw_availability->item(0)->nodeValue : '';

               // Scraping Title
               $raw_title = $xpath->query('//h1[@id="deal-title"]');
               $title = ($raw_title->length > 0) ? $raw_title->item(0)->nodeValue : '';

               // Scraping Price ("\xC2\xA0" is the decoded non-breaking space found in nodeValue)
               $raw_price = $xpath->query('//div[@class="price-discount-wrapper"]');
               $price = ($raw_price->length > 0) ? trim(str_replace(array("$", "€", "US", "&nbsp;", "\xC2\xA0"), "", $raw_price->item(0)->nodeValue)) : '';

               // Scraping Old Price
               $raw_base_price = $xpath->query('//div[contains(@class, "value-source-wrapper")]');
               $base_price = ($raw_base_price->length > 0) ? trim(str_replace(array("$", "€", "US", "&nbsp;", "\xC2\xA0"), "", $raw_base_price->item(0)->nodeValue)) : '';

               // Creating Final Data Array
               array_push($final_data, array(
                    'link' => $link,
                    'availability' => $availability,
                    'name' => $title,
                    'price' => $price,
                    'baseprice' => $base_price,
                    'img' => $image,
               ));
          }
          else
          {
               $link = $all_activity_urls[$u];
               if ($status_code === 429)
               {
                    $status_msg = ' - Too Many Requests';
               }
               else
               {
                    $status_msg = '';
               }

               array_push($final_data, array(
                    'link' => $link,
                    'status' => $status_code . $status_msg,
               ));
          }
          echo 'before break - ' . date('h:i:s') . "<br>";
          sleep(5);
          echo 'after break - ' . date('h:i:s') . "<br>";
          flush();
     }
     echo 'script end - ' . date('h:i:s') . "<br>";
     // Converting data to XML
     $activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
     array_to_xml($final_data, $activities);
     $xml_file = $activities->asXML('activities.xml');
     if ($xml_file)
     {
          echo 'XML file has been generated successfully.';
     }
     else
     {
          echo 'XML file generation error.';
     }
}
else
{
     $activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
     $activities->addChild("error", htmlspecialchars("No URL scraped from sitemap. Stopping script."));
     $xml_file = $activities->asXML('activities.xml');
     if ($xml_file)
     {
          echo 'XML file has been generated successfully.';
     }
     else
     {
          echo 'XML file generation error.';
     }
}

// Recursive Function for creating XML Nodes
function array_to_xml($array, &$activities)
{
     foreach ($array as $key => $value)
     {
          if (is_array($value))
          {
               if (!is_numeric($key))
               {
                    $subnode = $activities->addChild("$key");
                    array_to_xml($value, $subnode);
               }
               else
               {
                    $subnode = $activities->addChild("activity");
                    array_to_xml($value, $subnode);
               }
          }
          else
          {
               $activities->addChild("$key", htmlspecialchars("$value"));
          }
     }
}

// Cookie Parsing Function
function curlResponseHeaderCallback($ch, $headerLine)
{
     global $cookies;
     if (preg_match('/^Set-Cookie:\s*([^;]*)/mi', $headerLine, $cookie) == 1)
     {
          $cookies[] = $cookie[1];
     }
     return strlen($headerLine); // Needed by curl
}

1 answer

  • dsf487787 2019-01-15 07:17

    The cookie handling in your snippet is a mess. The callback function just appends cookies to the array regardless of whether they already exist. Here is a new version that at least seems to work in this case, since there are no multiple cookie definitions separated by semicolons. Normally the cookie string should be parsed properly. If you have the http extension installed, you can use http_parse_cookie.

    // Cookie Parsing Function
    function curlResponseHeaderCallback($ch, $headerLine)
    {
      global $cookies;
    
      if (preg_match('/^Set-Cookie:\s*([^;]+)/mi', $headerLine, $match) == 1)
      {
    
        if(false !== ($p = strpos($match[1], '=')))
        {
          $replaced = false;
          $cname    = substr($match[1], 0, $p+1);
    
          foreach ($cookies as &$cookie)
          {
            if (0 === strpos($cookie, $cname))
            {
              // Same cookie name: overwrite the stored value
              $cookie = $match[1];
              $replaced = true;
              break;
            }
          }
          unset($cookie); // drop the reference left behind by foreach

          if (!$replaced)
          {
            $cookies[] = $match[1];
          }
        }
      }
      return strlen($headerLine); // Needed by curl
    }
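
The same replace-instead-of-append idea can also be sketched with a cookie jar keyed by cookie name, which avoids scanning the array on every header. This is only a minimal sketch under stated assumptions: the function names `parseSetCookieLine`, `storeCookie` and `buildCookieHeader` are hypothetical and not part of the original script, the header lines in the usage part are made up, and cookie attributes such as Path or Expires are ignored, so it is not a substitute for a full parser like http_parse_cookie.

```php
<?php
// Hypothetical helpers (not from the original script): a cookie jar keyed by name.

/**
 * Extract the leading "name=value" pair from a Set-Cookie header line.
 * Returns [name, value], or null if the line is not a Set-Cookie header.
 * Attributes (Path, Expires, ...) are deliberately ignored in this sketch.
 */
function parseSetCookieLine(string $headerLine): ?array
{
    if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $headerLine, $m) !== 1) {
        return null;
    }
    return [$m[1], trim($m[2])];
}

/** Store a cookie; a later cookie with the same name overwrites the earlier one. */
function storeCookie(array &$jar, string $headerLine): void
{
    $pair = parseSetCookieLine($headerLine);
    if ($pair !== null) {
        $jar[$pair[0]] = $pair[1];
    }
}

/** Build the value of a "Cookie:" request header from the jar. */
function buildCookieHeader(array $jar): string
{
    $parts = [];
    foreach ($jar as $name => $value) {
        $parts[] = $name . '=' . $value;
    }
    return implode('; ', $parts);
}

// Usage with made-up header lines
$jar = [];
storeCookie($jar, "Set-Cookie: session=abc; Path=/; HttpOnly\r\n");
storeCookie($jar, "Set-Cookie: lang=de; Path=/\r\n");
storeCookie($jar, "Set-Cookie: session=xyz; Path=/\r\n"); // replaces session=abc
storeCookie($jar, "Content-Type: text/html\r\n");           // not a cookie, ignored

echo buildCookieHeader($jar), PHP_EOL; // prints "session=xyz; lang=de"
```

Inside a CURLOPT_HEADERFUNCTION callback, the body would then reduce to a single `storeCookie($cookies, $headerLine);` call followed by `return strlen($headerLine);`, and the request header would be built as `'cookie: ' . buildCookieHeader($cookies)`.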
    
    This answer was selected by the asker as the best answer.
