dongpiao8821
2019-01-15 04:03
603 views
Accepted

PHP cURL script gets 502/503 server errors after the first few requests

I have been working on a client's WordPress site that lists deals from Groupon. I use Groupon's official XML feed and import it via WP All Import, which works without much hassle. The issue is that Groupon doesn't update that feed frequently, while deals often sell out or are taken off the market. To handle this, I am trying to use a cURL script that crawls the deal links, checks whether each deal is still available, and turns the unavailable deals into draft posts (once a day only).

The custom script works almost perfectly, except that after the first 14/24 requests the server starts responding with 502/503 HTTP status codes. To work around this I have taken the following precautions:

  1. Sending the proper headers (captured from the requests made by the browser).
  2. Parsing cookies from the response headers and sending them back.
  3. Using a proper referrer and user agent.
  4. Using proxies.
  5. Sending requests at a set interval, using PHP's sleep(5).
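
As an aside on precaution 2: cURL can also track cookies itself. The following is only a minimal sketch and not what the script below does. The helper names createCrawlerHandle and fetchPage are hypothetical, and it assumes the PHP cURL extension is loaded. Passing an empty string to CURLOPT_COOKIEFILE switches on cURL's in-memory cookie engine, so cookies set by one response are sent with the next request made on the same handle.

```php
<?php
// Sketch only: one reusable handle with cURL's in-memory cookie engine.
// createCrawlerHandle() and fetchPage() are hypothetical helper names.

/**
 * Create a handle whose cookie engine is switched on.
 * An empty CURLOPT_COOKIEFILE enables the engine without reading a file.
 */
function createCrawlerHandle()
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, ''); // in-memory cookie engine
    curl_setopt($ch, CURLOPT_TIMEOUT, 40);
    return $ch;
}

/**
 * Fetch one URL on the shared handle. Cookies received on earlier
 * requests with this handle are sent back automatically.
 */
function fetchPage($ch, string $url): array
{
    curl_setopt($ch, CURLOPT_URL, $url);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ['status' => $status, 'body' => $body];
}
```

The handle would be created once before the URL loop and passed to fetchPage on each iteration, instead of calling curl_init() and curl_close() per request. Whether this changes the 502/503 behaviour is not something the sketch can show.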

Unfortunately, none of this solved the problem. I am attaching my code and would appreciate your expert insight on the issue.

Thanks in advance for your time. Shahriar

PHP SCRIPT - https://pastebin.com/FF2cNm5q

<?php

// Suppress errors and extend the maximum execution time
error_reporting(0);
ini_set('max_execution_time', 50000);

// Sitemap URL List
$all_activity_urls = array();
$sitemap_url = array(
     'https://www.groupon.de/sitemaps/deals-local0.xml.gz'
);
$cookies = Array();

// looping through sitemap url for scraping activity urls
for ($u = 0; $u < count($sitemap_url); $u++)
{
     $ch1 = curl_init();
     curl_setopt($ch1, CURLOPT_RETURNTRANSFER, TRUE);
     curl_setopt($ch1, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:38.0) Gecko/20100101 Firefox/38.0');
     curl_setopt($ch1, CURLOPT_REFERER, "https://www.groupon.de/");
     curl_setopt($ch1, CURLOPT_TIMEOUT, 40);
//    curl_setopt($ch1, CURLOPT_COOKIEFILE, "cookie.txt");
     curl_setopt($ch1, CURLOPT_URL, $sitemap_url[$u]);
     curl_setopt($ch1, CURLOPT_SSL_VERIFYPEER, FALSE);
     // Parsing Cookie from the response header
     curl_setopt($ch1, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
     $activity_url_source = curl_exec($ch1);
     $status_code = curl_getinfo($ch1, CURLINFO_HTTP_CODE);
     curl_close($ch1);

     if ($status_code === 200)
     {
          // Parsing XML sitemap for activity urls
          $activity_url_list = json_decode(json_encode(simplexml_load_string($activity_url_source)));
          for ($a = 0; $a < count($activity_url_list->url); $a++)
          {
               array_push($all_activity_urls, $activity_url_list->url[$a]->loc);
          }
     }
}


if (count($all_activity_urls) > 0)
{
// URL Loop count
     $loop_from = 0;
     $loop_to = min(100, count($all_activity_urls)); // at most 100 URLs, never past the end of the list
//    $loop_to = count($all_activity_urls);

     $final_data = array();
     echo 'script start - ' . date('h:i:s') . "<br>";

     for ($u = $loop_from; $u < $loop_to; $u++)
     {
          //Pull source from webpage
          $headers = array(
               'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'accept-language: en-US,en;q=0.9,bn-BD;q=0.8,bn;q=0.7,it;q=0.6',
               'cache-control: max-age=0',
               'cookie: ' . implode('; ', $cookies),
               'upgrade-insecure-requests: 1',
               'user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
          );

          $site = $all_activity_urls[$u];
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
          curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
          curl_setopt($ch, CURLOPT_REFERER, "https://www.groupon.de/");
          curl_setopt($ch, CURLOPT_TIMEOUT, 40);
          curl_setopt($ch, CURLOPT_URL, $site);
          curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
          // Parsing Cookie from the response header
          curl_setopt($ch, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
          $data = curl_exec($ch);
          $status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
          curl_close($ch);

          if ($status_code === 200)
          {
               // Ready data for parsing
               $document = new DOMDocument();
               $document->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $data);
               $xpath = new DOMXpath($document);

               $title = '';     
               $availability = '';
               $price = '';
               $base_price = '';
               $link = '';
               $image = '';

               $link = $all_activity_urls[$u];

               // Scraping Availability (XPath positions are 1-based, so div[0] never matches)
               $raw_availability = $xpath->query('//div[@data-bhw="DealHighlights"]/div[1]/div/div');
               $availability = $raw_availability->item(0)->nodeValue;

               // Scraping Title     
               $raw_title = $xpath->query('//h1[@id="deal-title"]');
               $title = $raw_title->item(0)->nodeValue;

               // Scraping Price
               $raw_price = $xpath->query('//div[@class="price-discount-wrapper"]');
               $price = trim(str_replace(array("$", "€", "US", "&nbsp;"), array("", "", "", ""), $raw_price->item(0)->nodeValue));

               // Scraping Old Price
               $raw_base_price = $xpath->query('//div[contains(@class, "value-source-wrapper")]');
               $base_price = trim(str_replace(array("$", "€", "US", "&nbsp;"), array("", "", "", ""), $raw_base_price->item(0)->nodeValue));

               // Creating Final Data Array
               array_push($final_data, array(
                    'link' => $link,
                    'availability' => $availability,
                    'name' => $title,
                    'price' => $price,
                    'baseprice' => $base_price,
                    'img' => $image,
               ));
          }
          else
          {
               $link = $all_activity_urls[$u];
               if ($status_code === 429)
               {
                    $status_msg = ' - Too Many Requests';
               }
               else
               {
                    $status_msg = '';
               }

               array_push($final_data, array(
                    'link' => $link,
                    'status' => $status_code . $status_msg,
               ));
          }
          echo 'before break - ' . date('h:i:s') . "<br>";
          sleep(5);
          echo 'after break - ' . date('h:i:s') . "<br>";
          flush();
     }
     echo 'script end - ' . date('h:i:s') . "<br>";
     // Converting data to XML
     $activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
     array_to_xml($final_data, $activities);
     $xml_file = $activities->asXML('activities.xml');
     if ($xml_file)
     {
          echo 'XML file has been generated successfully.';
     }
     else
     {
          echo 'XML file generation error.';
     }
}
else
{
     $activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
     $activities->addChild("error", htmlspecialchars("No URL scraped from sitemap. Stopping script."));
     $xml_file = $activities->asXML('activities.xml');
     if ($xml_file)
     {
          echo 'XML file has been generated successfully.';
     }
     else
     {
          echo 'XML file generation error.';
     }
}

// Recursive Function for creating XML Nodes
function array_to_xml($array, &$activities)
{
     foreach ($array as $key => $value)
     {
          if (is_array($value))
          {
               if (!is_numeric($key))
               {
                    $subnode = $activities->addChild("$key");
                    array_to_xml($value, $subnode);
               }
               else
               {
                    $subnode = $activities->addChild("activity");
                    array_to_xml($value, $subnode);
               }
          }
          else
          {
               $activities->addChild("$key", htmlspecialchars("$value"));
          }
     }
}

// Cookie Parsing Function
function curlResponseHeaderCallback($ch, $headerLine)
{
     global $cookies;
     if (preg_match('/^Set-Cookie:\s*([^;]*)/mi', $headerLine, $cookie) == 1)
     {
          $cookies[] = $cookie[1];
     }
     return strlen($headerLine); // Needed by curl
}

1 answer

  • dsf487787 2019-01-15 07:17
    Accepted

    The cookie handling in your snippet is a mess. The callback function just appends cookies to the array regardless of whether they already exist. Here is a new version which at least seems to work in this case, since there are no semicolon-separated multiple cookie definitions. Normally the cookie string should be parsed properly. If you have the http extension installed, you can use http_parse_cookie.

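    // Cookie Parsing Function
    function curlResponseHeaderCallback($ch, $headerLine)
    {
      global $cookies;

      // Capture "name=value" up to the first ';' or the end of the header line
      if (preg_match('/^Set-Cookie:\s*([^;\r\n]+)/mi', $headerLine, $match) == 1)
      {
        if (false !== ($p = strpos($match[1], '=')))
        {
          $replaced = false;
          $cname    = substr($match[1], 0, $p + 1); // cookie name including '='

          // Replace the stored cookie that has the same name, if any
          foreach ($cookies as &$cookie)
          {
            if (0 === strpos($cookie, $cname))
            {
              $cookie   = $match[1];
              $replaced = true;
              break;
            }
          }
          unset($cookie); // drop the reference left over from the loop

          if (!$replaced)
          {
            $cookies[] = $match[1];
          }
        }
        var_dump($cookies); // debug output; remove once the cookies look right
      }
      return strlen($headerLine); // Needed by curl
    }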
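
The replace-instead-of-append idea can also be sketched with an associative array keyed by cookie name, which avoids the linear search and the global variable. This is an illustrative variant under stated assumptions, not the code from the answer: mergeSetCookie and buildCookieHeader are hypothetical names, and cookie attributes such as Path or Expires are simply ignored.

```php
<?php
// Sketch only: a cookie jar keyed by cookie name.
// mergeSetCookie() and buildCookieHeader() are hypothetical helper names.

/**
 * Merge one response header line into the jar.
 * Lines that are not Set-Cookie headers leave the jar unchanged.
 */
function mergeSetCookie(array $jar, string $headerLine): array
{
    // Capture name and value up to the first ';' (attributes are ignored).
    if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/i', $headerLine, $m) === 1) {
        $jar[$m[1]] = rtrim($m[2]); // same name overwrites the earlier value
    }
    return $jar;
}

/** Build the value for a "Cookie:" request header from the jar. */
function buildCookieHeader(array $jar): string
{
    $pairs = [];
    foreach ($jar as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

$jar = [];
$jar = mergeSetCookie($jar, "Set-Cookie: session=abc; Path=/; HttpOnly\r\n");
$jar = mergeSetCookie($jar, "Content-Type: text/html\r\n");
$jar = mergeSetCookie($jar, "Set-Cookie: session=xyz; Path=/\r\n");
$jar = mergeSetCookie($jar, "Set-Cookie: lang=de\r\n");

echo buildCookieHeader($jar), "\n"; // prints "session=xyz; lang=de"
```

In the header callback, the jar would still have to live somewhere the callback can reach (a global, a static, or a closure variable), and the request header would become 'cookie: ' . buildCookieHeader($jar).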
    