dongpiao8821 2019-01-15 04:03

PHP cURL script gets 502/503 server errors after the first few requests

I have been working on a client's WP site that lists deals from Groupon. I am using Groupon's official XML feed and importing it via WP All Import, which works without much hassle. The issue is that Groupon doesn't update that feed frequently, while some of its deals sell out or are taken off the market often. To handle this, I am trying to use a cURL script that crawls the deal links once a day, checks whether each deal is still available, and turns the unavailable deals into draft posts.

The custom script works almost perfectly, but after the first 14/24 requests the server starts responding with 502/503 HTTP status codes. To overcome the issue I have taken the following precautions:

  1. Using the proper headers (captured from the requests made by the browser).
  2. Parsing cookies from the response headers and sending them back.
  3. Using a proper referrer and user agent.
  4. Using proxies.
  5. Sending requests at a set interval, using PHP's sleep(5);

Unfortunately, none of these measures solved the problem. My code is attached below, and I would appreciate your expert insight on the issue.

Thanks in advance for your time. Shahriar

PHP SCRIPT - https://pastebin.com/FF2cNm5q

<?php

// Suppress errors and extend the maximum execution time
error_reporting(0);
ini_set('max_execution_time', 50000);

// Sitemap URL List
$all_activity_urls = array();
$sitemap_url = array(
     'https://www.groupon.de/sitemaps/deals-local0.xml.gz'
);
$cookies = Array();

// looping through sitemap url for scraping activity urls
for ($u = 0; $u < count($sitemap_url); $u++)
{
     $ch1 = curl_init();
     curl_setopt($ch1, CURLOPT_RETURNTRANSFER, TRUE);
     curl_setopt($ch1, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:38.0) Gecko/20100101 Firefox/38.0');
     curl_setopt($ch1, CURLOPT_REFERER, "https://www.groupon.de/");
     curl_setopt($ch1, CURLOPT_TIMEOUT, 40);
//    curl_setopt($ch1, CURLOPT_COOKIEFILE, "cookie.txt");
     curl_setopt($ch1, CURLOPT_URL, $sitemap_url[$u]);
     curl_setopt($ch1, CURLOPT_SSL_VERIFYPEER, FALSE);
     // Parsing Cookie from the response header
     curl_setopt($ch1, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
     $activity_url_source = curl_exec($ch1);
     $status_code = curl_getinfo($ch1, CURLINFO_HTTP_CODE);
     curl_close($ch1);

     if ($status_code === 200)
     {
          // Parsing XML sitemap for activity urls
          $activity_url_list = json_decode(json_encode(simplexml_load_string($activity_url_source)));
          for ($a = 0; $a < count($activity_url_list->url); $a++)
          {
               array_push($all_activity_urls, $activity_url_list->url[$a]->loc);
          }
     }
}


if (count($all_activity_urls) > 0)
{
// URL Loop count
     $loop_from = 0;
     $loop_to = min(100, count($all_activity_urls)); // cap at 100 URLs without running past the end of the list
//    $loop_to = count($all_activity_urls);

     $final_data = array();
     echo 'script start - ' . date('h:i:s') . "<br>";

     for ($u = $loop_from; $u < $loop_to; $u++)
     {
          //Pull source from webpage
          $headers = array(
               'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'accept-language: en-US,en;q=0.9,bn-BD;q=0.8,bn;q=0.7,it;q=0.6',
               'cache-control: max-age=0',
               'cookie: ' . implode('; ', $cookies),
               'upgrade-insecure-requests: 1',
               'user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
          );

          $site = $all_activity_urls[$u];
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
          curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
          curl_setopt($ch, CURLOPT_REFERER, "https://www.groupon.de/");
          curl_setopt($ch, CURLOPT_TIMEOUT, 40);
          curl_setopt($ch, CURLOPT_URL, $site);
          curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
          // Parsing Cookie from the response header
          curl_setopt($ch, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
          $data = curl_exec($ch);
          $status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
          curl_close($ch);

          if ($status_code === 200)
          {
               // Ready data for parsing
               $document = new DOMDocument();
               $document->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $data);
               $xpath = new DOMXpath($document);

               $title = '';     
               $availability = '';
               $price = '';
               $base_price = '';
               $link = '';
               $image = '';

               $link = $all_activity_urls[$u];

               // Scraping Availability (XPath positions are 1-based, so div[0] never matches)
               $raw_availability = $xpath->query('//div[@data-bhw="DealHighlights"]/div[1]/div/div');
               $availability = $raw_availability->length > 0 ? $raw_availability->item(0)->nodeValue : '';

               // Scraping Title     
               $raw_title = $xpath->query('//h1[@id="deal-title"]');
               $title = $raw_title->item(0)->nodeValue;

               // Scraping Price
               $raw_price = $xpath->query('//div[@class="price-discount-wrapper"]');
               $price = trim(str_replace(array("$", "€", "US", "&nbsp;"), array("", "", "", ""), $raw_price->item(0)->nodeValue));

               // Scraping Old Price
               $raw_base_price = $xpath->query('//div[contains(@class, "value-source-wrapper")]');
               $base_price = trim(str_replace(array("$", "€", "US", "&nbsp;"), array("", "", "", ""), $raw_base_price->item(0)->nodeValue));

               // Creating Final Data Array
               array_push($final_data, array(
                    'link' => $link,
                    'availability' => $availability,
                    'name' => $title,
                    'price' => $price,
                    'baseprice' => $base_price,
                    'img' => $image,
               ));
          }
          else
          {
               $link = $all_activity_urls[$u];
               if ($status_code === 429)
               {
                    $status_msg = ' - Too Many Requests';
               }
               else
               {
                    $status_msg = '';
               }

               array_push($final_data, array(
                    'link' => $link,
                    'status' => $status_code . $status_msg,
               ));
          }
          echo 'before break - ' . date('h:i:s') . "<br>";
          sleep(5);
          echo 'after break - ' . date('h:i:s') . "<br>";
          flush();
     }
     echo 'script end - ' . date('h:i:s') . "<br>";
     // Converting data to XML
     $activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
     array_to_xml($final_data, $activities);
     $xml_file = $activities->asXML('activities.xml');
     if ($xml_file)
     {
          echo 'XML file has been generated successfully.';
     }
     else
     {
          echo 'XML file generation error.';
     }
}
else
{
     $activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
     $activities->addChild("error", htmlspecialchars("No URL scraped from sitemap. Stopping script."));
     $xml_file = $activities->asXML('activities.xml');
     if ($xml_file)
     {
          echo 'XML file has been generated successfully.';
     }
     else
     {
          echo 'XML file generation error.';
     }
}

// Recursive Function for creating XML Nodes
function array_to_xml($array, &$activities)
{
     foreach ($array as $key => $value)
     {
          if (is_array($value))
          {
               if (!is_numeric($key))
               {
                    $subnode = $activities->addChild("$key");
                    array_to_xml($value, $subnode);
               }
               else
               {
                    $subnode = $activities->addChild("activity");
                    array_to_xml($value, $subnode);
               }
          }
          else
          {
               $activities->addChild("$key", htmlspecialchars("$value"));
          }
     }
}

// Cookie Parsing Function
function curlResponseHeaderCallback($ch, $headerLine)
{
     global $cookies;
     if (preg_match('/^Set-Cookie:\s*([^;]*)/mi', $headerLine, $cookie) == 1)
     {
          $cookies[] = $cookie[1];
     }
     return strlen($headerLine); // Needed by curl
}

1 answer

  • dsf487787 2019-01-15 07:17

    The cookie handling in your snippet is a mess. The callback function just appends cookies to the array regardless of whether they already exist, so the cookie header sent with each request keeps growing. Here is a new version that at least seems to work in this case, since there are no semicolon-separated multiple cookie definitions. Normally the cookie string should be parsed properly. If you have the http extension installed, you can use http_parse_cookie.

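    // Cookie Parsing Function
    function curlResponseHeaderCallback($ch, $headerLine)
    {
      global $cookies;

      if (preg_match('/^Set-Cookie:\s*([^;]+)/mi', $headerLine, $match) == 1)
      {
        if (false !== ($p = strpos($match[1], '=')))
        {
          $replaced = false;
          // Cookie name including the trailing "=", e.g. "session="
          $cname    = substr($match[1], 0, $p + 1);

          // Replace an existing cookie of the same name instead of appending a duplicate
          foreach ($cookies as &$cookie)
          {
            if (0 === strpos($cookie, $cname))
            {
              $cookie = $match[1];
              $replaced = true;
              break;
            }
          }
          unset($cookie); // drop the reference left over from the foreach

          if (!$replaced)
          {
            $cookies[] = $match[1];
          }
        }
        var_dump($cookies); // debug output; remove once the cookies look right
      }
      return strlen($headerLine); // Needed by curl
    }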
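
The same "replace instead of append" idea can also be sketched with an associative array keyed by cookie name, which removes the need for the search loop. This is only a minimal sketch: the helper names parseSetCookieLine, rememberCookie and cookieHeaderValue are hypothetical and not part of the original script, and cookie attributes such as Path, Domain and Expires are deliberately ignored, so expired or deleted cookies are not handled.

```php
<?php
// Hypothetical helpers for illustration; they are not part of the original script.

/**
 * Extract the cookie name and value from one raw response header line.
 * Returns [name, value], or null if the line is not a Set-Cookie header.
 */
function parseSetCookieLine(string $headerLine): ?array
{
    // Only the first name=value pair is the cookie itself; attributes
    // such as Path or Expires follow after the first semicolon.
    if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $headerLine, $m) !== 1) {
        return null;
    }
    return [$m[1], trim($m[2])];
}

/** Store the cookie under its name so a newer value replaces the older one. */
function rememberCookie(array &$jar, string $headerLine): void
{
    $pair = parseSetCookieLine($headerLine);
    if ($pair !== null) {
        $jar[$pair[0]] = $pair[1];
    }
}

/** Build the value for a "cookie:" request header from the jar. */
function cookieHeaderValue(array $jar): string
{
    $parts = [];
    foreach ($jar as $name => $value) {
        $parts[] = $name . '=' . $value;
    }
    return implode('; ', $parts);
}

$jar = [];
rememberCookie($jar, "Set-Cookie: session=abc; Path=/; HttpOnly\r\n");
rememberCookie($jar, "Set-Cookie: lang=de; Path=/\r\n");
rememberCookie($jar, "Set-Cookie: session=xyz; Path=/\r\n"); // replaces session=abc
rememberCookie($jar, "Content-Type: text/html\r\n");          // not a cookie, ignored
echo cookieHeaderValue($jar), PHP_EOL; // session=xyz; lang=de
```

Inside the cURL header callback, the body would then reduce to a call to rememberCookie on the shared jar followed by the usual return strlen($headerLine), and the request header would be built with 'cookie: ' . cookieHeaderValue($jar).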
    
    This answer was accepted by the asker as the best answer.
