douwei7203 2014-01-14 19:30
Views: 50

Regular expression to match a meta tag

Hi, I want to extract the og:image content from a page's source. How can I extract the content of the og:image meta tag?

This is the meta tag:

<meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" />

How can I identify the meta tag using a regular expression?

This is my current function, which grabs image URLs from img tags. What modifications does it need to work with og:image meta tags?

function feeds_imagegrabber_scrape_images($content, $base_url, array $options = array(), &$error_log = array()) {

// Merge the default options.
$options += array(
  'expression' => '//img',
  'getsize' => TRUE,
  'max_imagesize' => 512000,
  'timeout' => 10,
  'max_redirects' => 3,
  'feeling_lucky' => 0,
);

$doc = new DOMDocument();
if (@$doc->loadXML($content) === FALSE && @$doc->loadHTML($content) === FALSE) {
  $error_log['code'] = -5;
  $error_log['error'] = "unable to parse the XML/HTML content";
  return FALSE;
}

$xpath = new DOMXPath($doc);
$hrefs = @$xpath->evaluate($options['expression']);

if ($options['getsize']) {
  timer_start(__FUNCTION__);
}

$images = array();
$imagesize = 0;
for ($i = 0; $i < $hrefs->length; $i++) {
  $url = $hrefs->item($i)->getAttribute('src');
  if (empty($url)) {
    continue;
  }
  if(function_exists('encode_url')) {
    $url = encode_url($url);
  }
  $url = url_to_absolute($base_url, $url);

  if ($url == FALSE) {
    continue;
  }

  if ($options['getsize']) {
    if (($imagesize = feeds_imagegrabber_validate_download_size($url, $options['max_imagesize'], ($options['timeout'] - timer_read(__FUNCTION__) / 1000))) != -1)   {
      $images[$url] = $imagesize;
      if ($options['feeling_lucky']) {
        break;
      }
    }
    if (($options['timeout'] - timer_read(__FUNCTION__) / 1000) <= 0) {
      $error_log['code'] = FIG_HTTP_REQUEST_TIMEOUT;
      $error_log['error'] = "timeout occurred while scraping the content";
      break;
    }
  }
  else {
    $images[$url] = $imagesize;
    if ($options['feeling_lucky']) {
      break;
    }
  }
}
return $images;
}

2 answers

  • dongxun3424 2014-01-14 19:34

    Use the DOMDocument class:

    <?php
    $html='<meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" />';
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('meta') as $tag) {
        if ($tag->getAttribute('property') === 'og:image') {
            echo $tag->getAttribute('content');
        }
    }
    

    Output:

    http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg
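
Since the function in the question already works through DOMXPath, the same lookup can probably be written as an XPath query rather than a regular expression. The following is only a minimal sketch: `extract_og_images` is a hypothetical helper name, not part of the Feeds Image Grabber module, and it assumes the page declares the image with `property="og:image"` exactly as in the question.

```php
<?php
// Hypothetical helper (not from the original module): returns every
// og:image URL found in an HTML string, in document order.
function extract_og_images(string $html): array {
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($loaded === false) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    $urls = [];
    // Select the content attribute of each <meta property="og:image">.
    foreach ($xpath->query('//meta[@property="og:image"]/@content') as $attr) {
        $url = trim($attr->nodeValue);
        if ($url !== '') {
            $urls[] = $url;
        }
    }
    return $urls;
}

$html = '<html><head><meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" /></head><body><img src="a.png"></body></html>';
print_r(extract_og_images($html));
```

Applied to the function in the question, this would roughly mean setting the `expression` option to `//meta[@property="og:image"]` and reading the `content` attribute instead of `src` inside the loop.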
    