dongpiao8821 2019-01-15 04:03
Views: 667
Accepted

PHP cURL script gets 502/503 server errors after the first requests

I have been working on a client's WP site that lists deals from Groupon. I am using Groupon's official XML feed, imported via WP All Import, and this works without much hassle. The issue is that Groupon doesn't update the feed frequently, while deals often sell out or are taken off the market. To resolve this, I am trying to use a cURL script that crawls the deal links, checks whether each deal is still available, and turns unavailable deals into draft posts (once a day only).

The custom script works almost perfectly, but after the first 14/24 requests the server starts responding with 502/503 HTTP status codes. To work around the issue I have taken the following precautions:

  1. Using the proper headers (captured from the requests made by the browser).
  2. Parsing cookies from the response headers and sending them back.
  3. Using a proper referrer and user agent.
  4. Using proxies.
  5. Sending requests at a set interval (PHP: sleep(5);).

Unfortunately, none of these solved the problem. I am attaching my code and would appreciate your expert insights on the issue.

Thanks in advance for your time. Shahriar

PHP SCRIPT - https://pastebin.com/FF2cNm5q

<?php

// Suppress error output and extend the maximum execution time
error_reporting(0);
ini_set('max_execution_time', 50000);

// Sitemap URL List
$all_activity_urls = array();
$sitemap_url = array(
     'https://www.groupon.de/sitemaps/deals-local0.xml.gz'
);
$cookies = array();

// looping through sitemap url for scraping activity urls
for ($u = 0; $u < count($sitemap_url); $u++)
{
     $ch1 = curl_init();
     curl_setopt($ch1, CURLOPT_URL, $sitemap_url[$u]);
     curl_setopt($ch1, CURLOPT_RETURNTRANSFER, true);
     curl_setopt($ch1, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:38.0) Gecko/20100101 Firefox/38.0');
     curl_setopt($ch1, CURLOPT_REFERER, "https://www.groupon.de/");
     curl_setopt($ch1, CURLOPT_TIMEOUT, 40);
//    curl_setopt($ch1, CURLOPT_COOKIEFILE, "cookie.txt");
     // Note: this disables TLS certificate verification
     curl_setopt($ch1, CURLOPT_SSL_VERIFYPEER, false);
     // Parsing Cookie from the response header
     curl_setopt($ch1, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
     $activity_url_source = curl_exec($ch1);
     $status_code = curl_getinfo($ch1, CURLINFO_HTTP_CODE);
     curl_close($ch1);

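     if ($status_code === 200)
     {
          // The sitemap URL ends in .gz: decode the body if it is still gzip-compressed
          if (strncmp($activity_url_source, "\x1f\x8b", 2) === 0)
          {
               $activity_url_source = gzdecode($activity_url_source);
          }
          // Parsing XML sitemap for activity urls
          $activity_url_list = simplexml_load_string($activity_url_source);
          if ($activity_url_list !== false)
          {
               foreach ($activity_url_list->url as $url_node)
               {
                    $all_activity_urls[] = (string) $url_node->loc;
               }
          }
     }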
}


if (count($all_activity_urls) > 0)
{
// URL loop bounds: process at most the first 100 URLs
     $loop_from = 0;
     $loop_to = min(100, count($all_activity_urls));
//    $loop_to = count($all_activity_urls);

     $final_data = array();
     echo 'script start - ' . date('h:i:s') . "<br>";

     for ($u = $loop_from; $u < $loop_to; $u++)
     {
          //Pull source from webpage
          $headers = array(
               'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'accept-language: en-US,en;q=0.9,bn-BD;q=0.8,bn;q=0.7,it;q=0.6',
               'cache-control: max-age=0',
               'cookie: ' . implode('; ', $cookies),
               'upgrade-insecure-requests: 1',
               'user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
          );

          $site = $all_activity_urls[$u];
          $ch = curl_init();
          curl_setopt($ch, CURLOPT_URL, $site);
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
          curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
          curl_setopt($ch, CURLOPT_REFERER, "https://www.groupon.de/");
          curl_setopt($ch, CURLOPT_TIMEOUT, 40);
          // Note: this disables TLS certificate verification
          curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
          // Parsing Cookie from the response header
          curl_setopt($ch, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
          $data = curl_exec($ch);
          $status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
          curl_close($ch);

          if ($status_code === 200)
          {
               // Ready data for parsing
               $document = new DOMDocument();
               $document->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $data);
               $xpath = new DOMXpath($document);

               $title = '';     
               $availability = '';
               $price = '';
               $base_price = '';
               $link = '';
               $image = '';

               $link = $all_activity_urls[$u];

               // Scraping Availability (XPath positions are 1-based, so the first child is div[1])
               $raw_availability = $xpath->query('//div[@data-bhw="DealHighlights"]/div[1]/div/div');
               $availability = $raw_availability->length > 0 ? $raw_availability->item(0)->nodeValue : '';

               // Scraping Title
               $raw_title = $xpath->query('//h1[@id="deal-title"]');
               $title = $raw_title->length > 0 ? $raw_title->item(0)->nodeValue : '';

               // Strings to strip from prices; "\xC2\xA0" is the UTF-8 non-breaking space that &nbsp; decodes to
               $strip = array('$', '€', 'US', '&nbsp;', "\xC2\xA0");

               // Scraping Price
               $raw_price = $xpath->query('//div[@class="price-discount-wrapper"]');
               $price = $raw_price->length > 0 ? trim(str_replace($strip, '', $raw_price->item(0)->nodeValue)) : '';

               // Scraping Old Price
               $raw_base_price = $xpath->query('//div[contains(@class, "value-source-wrapper")]');
               $base_price = $raw_base_price->length > 0 ? trim(str_replace($strip, '', $raw_base_price->item(0)->nodeValue)) : '';

               // Creating Final Data Array
               array_push($final_data, array(
                    'link' => $link,
                    'availability' => $availability,
                    'name' => $title,
                    'price' => $price,
                    'baseprice' => $base_price,
                    'img' => $image,
               ));
          }
          else
          {
               $link = $all_activity_urls[$u];
               if ($status_code === 429)
               {
                    $status_msg = ' - Too Many Requests';
               }
               else
               {
                    $status_msg = '';
               }

               array_push($final_data, array(
                    'link' => $link,
                    'status' => $status_code . $status_msg,
               ));
          }
          echo 'before break - ' . date('h:i:s') . "<br>";
          sleep(5);
          echo 'after break - ' . date('h:i:s') . "<br>";
          flush();
     }
     echo 'script end - ' . date('h:i:s') . "<br>";
     // Converting data to XML
     $activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
     array_to_xml($final_data, $activities);
     $xml_file = $activities->asXML('activities.xml');
     if ($xml_file)
     {
          echo 'XML file has been generated successfully.';
     }
     else
     {
          echo 'XML file generation error.';
     }
}
else
{
     $activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
     $activities->addChild("error", htmlspecialchars("No URL scraped from sitemap. Stopping script."));
     $xml_file = $activities->asXML('activities.xml');
     if ($xml_file)
     {
          echo 'XML file has been generated successfully.';
     }
     else
     {
          echo 'XML file generation error.';
     }
}

// Recursive Function for creating XML Nodes
function array_to_xml($array, &$activities)
{
     foreach ($array as $key => $value)
     {
          if (is_array($value))
          {
               if (!is_numeric($key))
               {
                    $subnode = $activities->addChild("$key");
                    array_to_xml($value, $subnode);
               }
               else
               {
                    $subnode = $activities->addChild("activity");
                    array_to_xml($value, $subnode);
               }
          }
          else
          {
               $activities->addChild("$key", htmlspecialchars("$value"));
          }
     }
}

// Cookie Parsing Function
function curlResponseHeaderCallback($ch, $headerLine)
{
     global $cookies;
     if (preg_match('/^Set-Cookie:\s*([^;]*)/mi', $headerLine, $cookie) == 1)
     {
          $cookies[] = $cookie[1];
     }
     return strlen($headerLine); // Needed by curl
}
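
The fixed `sleep(5)` in the request loop could be replaced by a delay that grows after each 429/502/503 response. The following is only a minimal sketch under assumptions: the helper names `isRetryableStatus` and `backoffDelay`, the 5-second base delay, and the 300-second cap are all hypothetical, and nothing here is confirmed to stop the errors from this particular server.

```php
<?php

// Hypothetical helper: status codes that are usually worth retrying later.
function isRetryableStatus(int $statusCode): bool
{
    return in_array($statusCode, array(429, 502, 503), true);
}

// Hypothetical helper: exponential backoff that doubles per failed attempt.
// $attempt is 0 for the first retry; $base and $cap are assumed values in seconds.
function backoffDelay(int $attempt, int $base = 5, int $cap = 300): int
{
    $attempt = max(0, $attempt);
    $delay = $base * (2 ** $attempt);
    return (int) min($delay, $cap);
}
```

Inside the loop, `sleep(backoffDelay($failures))` could be called after a retryable status, with a hypothetical `$failures` counter reset to 0 after each 200 response.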

1 answer

  • dsf487787 2019-01-15 07:17

    There is a mess of cookies in your snippet. The callback function just appends cookies to the array regardless of whether they already exist. Here is a new version that at least seems to work in this case, since there are no semicolon-separated multiple cookie definitions. Normally the cookie string should be parsed properly. If you have the http extension installed, you can use http_parse_cookie.

    // Cookie Parsing Function
    function curlResponseHeaderCallback($ch, $headerLine)
    {
      global $cookies;

      if (preg_match('/^Set-Cookie:\s*([^;]+)/mi', $headerLine, $match) == 1)
      {
        if (false !== ($p = strpos($match[1], '=')))
        {
          $replaced = false;
          // Cookie name including the trailing "=", e.g. "session="
          $cname    = substr($match[1], 0, $p + 1);

          // Replace an existing cookie that has the same name
          foreach ($cookies as &$cookie)
          {
            if (0 === strpos($cookie, $cname))
            {
              $cookie = $match[1];
              $replaced = true;
              break;
            }
          }
          unset($cookie); // drop the reference left over from the loop

          // Otherwise append it as a new cookie
          if (!$replaced)
          {
            $cookies[] = $match[1];
          }
        }
      }
      return strlen($headerLine); // Needed by curl
    }
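
As an alternative to scanning the list for a matching prefix, the cookies could be kept in an associative array keyed by cookie name, so that a newer value simply overwrites the older one. This is a minimal sketch, not the answerer's code: `storeCookie` and `buildCookieHeader` are hypothetical helper names, and the sample header lines are made up for illustration. Like the version above, it ignores attributes such as `Path` and `Expires`.

```php
<?php

// Hypothetical helper: store the name=value pair of one Set-Cookie header line
// in an array keyed by cookie name, so a later value replaces an earlier one.
function storeCookie(array &$jar, string $headerLine): void
{
    if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/i', $headerLine, $m) === 1) {
        $jar[$m[1]] = trim($m[2]);
    }
}

// Hypothetical helper: build the value for a "cookie:" request header.
function buildCookieHeader(array $jar): string
{
    $pairs = array();
    foreach ($jar as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Made-up header lines, as the header callback would receive them.
$jar = array();
storeCookie($jar, "Set-Cookie: session=abc; Path=/; HttpOnly\r\n");
storeCookie($jar, "Set-Cookie: lang=de; Path=/\r\n");
storeCookie($jar, "Set-Cookie: session=xyz; Path=/\r\n"); // replaces session=abc
echo buildCookieHeader($jar), "\n"; // prints "session=xyz; lang=de"
```

In the header callback, `storeCookie($cookies, $headerLine)` would take the place of the matching loop, and the request header would be built with `'cookie: ' . buildCookieHeader($cookies)`.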
    
    This answer was selected by the asker as the best answer.
