dsilhx5830 2019-03-04 19:43
Accepted

Scraping more than one page

I am trying to scrape data (name, varietal, format and price) from this site: https://aabalat.com/wine/country/france. I build an array named $urls and push every page link into it. Each new cURL session returns 20 more wines. I want to capture the format first and push it into an array, as you can see in my code below. Printing $french_wines_formats_matches works as expected, but printing $french_wines_format_array does not work properly.

I am new to scraping and do not have much experience with it.

// Array contains 197 links
$urls = array();
array_push($urls, "https://aabalat.com/wine/country/france");


// This for loop makes others links
for($i = 1; $i < 5; $i++)
{
  $urls[] = "https://aabalat.com/wine/country/france?page=".$i;
}

// echo "<pre>";
// print_r($urls);
// echo "</pre>";

$french_wines_array = array();
$french_wines_title_array = array();
$french_wines_varietal_array = array();
$french_wines_format_array = array();
$french_wines_price_array = array();

// Repeat curl session until url exists.
foreach($urls as $url)
{
  $curl = curl_init();
  curl_setopt($curl, CURLOPT_URL, $url);

  curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($curl, CURLOPT_VERBOSE, true);

  $output = curl_exec($curl);
  $info = curl_getinfo($curl);
  $err = curl_error($curl);
  $ern = curl_errno($curl);

  $french_wine_formats_pattern = '!<span class="wine-list-item-format">(.*)</span>!mi';
  preg_match_all($french_wine_formats_pattern, $output, $french_wines_formats_matches);

  foreach($french_wines_formats_matches[0] as $french_wines_formats_match)
  {
    $french_wines_format_array[] = $french_wines_formats_match;
  }

  echo "<pre>";
  print_r($french_wines_format_array);
  echo "</pre>";

  curl_close($curl);
  sleep(rand(2, 5));

}
1 answer

  • dongsaolian8786 2019-03-04 19:58

    Your code and regex seem to work (I tried them), but I was unable to replicate your cURL call. Instead of a bare $output = curl_exec($curl), try the following and see whether it catches any cURL errors:

        // Use the same handle ($curl) that was passed to curl_exec()
        if (!$output = curl_exec($curl)) {
            if (curl_error($curl)) {
                die(curl_error($curl));
            }
        }
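
    If it helps, the same check can be wrapped in a small function so every URL in the loop gets it. This is only a sketch: fetch_page() is a hypothetical helper name, and the 30-second timeout is an assumed value, not something from the question. It compares against false explicitly, because with CURLOPT_RETURNTRANSFER an empty response body is the empty string, which a plain !$output test would also treat as a failure.

    ```php
    <?php
    // Hypothetical helper (not part of the original code): fetch one page
    // and raise an exception when cURL reports a transport error.
    function fetch_page(string $url): string
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_TIMEOUT, 30); // assumed value

        $output = curl_exec($curl);

        // curl_exec() returns false on failure; an empty body is "" and is valid.
        if ($output === false) {
            throw new RuntimeException(
                'cURL error ' . curl_errno($curl) . ': ' . curl_error($curl)
            );
        }

        // The handle is released when $curl goes out of scope.
        return $output;
    }
    ```

    Inside the foreach loop this would replace the curl_init() ... curl_exec() lines with $output = fetch_page($url); wrapped in try/catch if you want to skip a failing page and continue.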
    

    Finally, I tried a simple file_get_contents() and that seemed to work:

        $url = "https://aabalat.com/wine/country/france";
        $output= file_get_contents($url);
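
    Whichever way $output is fetched, note that with preg_match_all() index 0 of the matches holds the whole matched tag, while index 1 holds only the text captured by the parentheses. The sketch below assumes you want just the text inside the span; extract_formats() is a hypothetical name, the sample HTML is made up for illustration, and the pattern is made non-greedy so that two spans on one line are not merged into a single match.

    ```php
    <?php
    // Hypothetical helper: return the text inside every
    // <span class="wine-list-item-format"> element of $html.
    function extract_formats(string $html): array
    {
        // (.*?) is non-greedy; the s flag lets . span line breaks.
        $pattern = '!<span class="wine-list-item-format">(.*?)</span>!si';
        preg_match_all($pattern, $html, $matches);

        // $matches[0] = full matches with tags, $matches[1] = captured text.
        return array_map('trim', $matches[1]);
    }

    // Made-up sample markup, not copied from the real site.
    $sample = '<span class="wine-list-item-format">750 ml</span>'
            . '<span class="wine-list-item-format"> 1.5 L </span>';

    print_r(extract_formats($sample));
    ```

    Collecting the results of all pages first and calling print_r() once after the loop also keeps the output from repeating the earlier pages on every iteration.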
    