dongtan2603 2014-08-17 06:29

Getting images from an RSS feed that has no image URLs

I would like to know how other developers reliably extract the first image from the main content of a blog post, given the post URL from an RSS feed. This is the approach I came up with, since the RSS feed doesn't seem to include an image URL for each post/blog item. I do keep seeing

<img src="http://feeds.feedburner.com/~r/CookingLight/EatingSmart/~4/sIG3nePOu-c" />

but it's only a 1px image. Does it have any relevant value for the feed item, or can I somehow convert it to the actual image? Here's the RSS feed: http://feeds.cookinglight.com/CookingLight/EatingSmart?format=xml

Anyway, here's my attempt to extract the image using the post URL from the feed:

function extact_first_image( $url ) {  
  $content = file_get_contents($url);

  // Narrow the html to get the main div with the blog content only.
  // source: http://stackoverflow.com/questions/15643710/php-get-a-div-from-page-x
  $PreMain = explode('<div id="main-content"', $content);
  $main = explode("</div>" , $PreMain[1] );

  // Regex that finds matches with img tags.
  $output = preg_match_all('/<img[^>]+src=[\'"]([^\'"]+)[\'"][^>]*>/i', $main[12], $matches);  

  // Return the img in html format.
  return $matches[0][0];  
}

$url = 'http://www.cookinglight.com/eating-smart/nutrition-101/foods-that-fight-fat'; //Sample URL from the feed.
echo extact_first_image($url);

Obvious downside of this function: it only works if <div id="main-content" is found in the HTML. Any other feed whose pages have a different structure will need its own explode logic as well, so the approach is very static.

It's also worth mentioning the load time: fetching a single page is already slow, and looping through all the items in the feed takes even longer.

I hope I've made my points clear. Feel free to share any ideas that could help optimize the solution.

1 answer

  • dongliao1949 2014-08-18 18:07

    The image URLs are in the RSS file, so you can get them just by parsing the XML. Each <item> element contains a <media:group> element that contains a <media:content> element. The URL of the image for that item is in the "url" attribute of the <media:content> element. Here is some basic PHP code for extracting the image URLs into an array:

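    // Load and parse the feed; simplexml_load_file() returns false on failure.
    $xml = simplexml_load_file("http://feeds.cookinglight.com/CookingLight/EatingSmart?format=xml");
    if ($xml === false) {
        die("Could not load the feed.");
    }

    $imageUrls = array();

    foreach ($xml->channel->item as $item)
    {
        // <media:group> and <media:content> are in the "media" namespace; "url" is a plain attribute.
        $imageUrls[] = (string)$item->children('media', true)->group->content->attributes()->url;
    }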
    

    Keep in mind, though, that the media doesn't necessarily have to be an image. It can be a video or an audio recording. There might even be more than one <media:group>. You can check the "type" attribute of the <media:content> element to see what it is.
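The "type" check described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the function name collect_image_urls and the inline sample feed are hypothetical, the "media" prefix is assumed to map to the usual Media RSS namespace URI http://search.yahoo.com/mrss/ (check the xmlns:media declaration in the real feed), and the "medium" attribute is treated as an optional fallback when "type" is missing.

```php
<?php
// Hypothetical helper: collect only image URLs from every <media:content>
// in each <item>, whether or not it is wrapped in one or more <media:group>.
function collect_image_urls(SimpleXMLElement $xml): array
{
    $imageUrls = array();

    foreach ($xml->channel->item as $item) {
        // Assumption: the feed uses the standard Media RSS namespace URI.
        $item->registerXPathNamespace('media', 'http://search.yahoo.com/mrss/');

        // ".//" matches <media:content> at any depth below this item.
        foreach ($item->xpath('.//media:content') as $content) {
            $attrs  = $content->attributes();   // non-namespaced attributes
            $type   = (string) $attrs->type;    // e.g. "image/jpeg", "video/mp4"
            $medium = (string) $attrs->medium;  // e.g. "image" (optional)

            // Keep the entry only if it is declared as an image.
            if (strpos($type, 'image/') === 0 || $medium === 'image') {
                $imageUrls[] = (string) $attrs->url;
            }
        }
    }

    return $imageUrls;
}

// Made-up sample feed, used here instead of the live URL.
$sample = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Item with a group</title>
      <media:group>
        <media:content url="http://example.com/a.jpg" type="image/jpeg" />
        <media:content url="http://example.com/a.mp4" type="video/mp4" />
      </media:group>
    </item>
    <item>
      <title>Item without a group</title>
      <media:content url="http://example.com/b.png" medium="image" />
    </item>
  </channel>
</rss>
XML;

$urls = collect_image_urls(simplexml_load_string($sample));
print_r($urls);
```

With the sample above, the video entry is skipped and the two image URLs are kept. For the real feed, the result of simplexml_load_file() could be passed to the same function instead of the sample string.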
