dpq39825 2010-08-19 05:30

Scraping unique image URLs from HTML

I'm using PHP to fetch a web page with cURL (the URL is entered by the user; let's assume it's valid). Example: http://www.youtube.com/watch?v=Hovbx6rvBaA

I need to parse the HTML and extract all de-duplicated URLs that look like images. Not just the ones in img src="" but any URL on that page ending in jpe?g|bmp|gif|png, etc. (In other words, I don't want to parse the DOM; I want to use a regex.)

I then plan to fetch each URL with cURL to get its width and height and to confirm that it really is an image, so don't worry about security-related issues.


2 answers

  • doraemon0769 2010-08-19 06:07

    Collect all image URLs into an array, then use array_unique() to remove duplicates.

    $my_image_links = array_unique( $my_image_links );
    // No more duplicates
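
One detail worth seeing in action: array_unique() keeps the first occurrence of each value together with its original key, so the result can have gaps in its numeric keys. A minimal sketch, assuming some made-up sample URLs (the example.com links are placeholders, not from the question):

```php
<?php
// Hypothetical sample data; in practice this array would be filled from the scraped page.
$my_image_links = array(
    'http://example.com/a.png',
    'http://example.com/b.gif',
    'http://example.com/a.png', // exact duplicate of the first entry
);

// array_unique() keeps the first occurrence of each value and preserves its key.
$unique = array_unique($my_image_links);

// array_values() reindexes the result from 0, which is handy before looping by index.
$unique = array_values($unique);

print_r($unique);
```

Note that the comparison is a case-sensitive string comparison, so two URLs that differ only in letter case are treated as distinct entries.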
    

    If you really want to do this with a regex, we can assume each image URL is delimited by ', ", whitespace (spaces, tabs, line breaks), the beginning of the text, >, <, or whatever else you can think of. So we can do:

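    // Leading delimiter (or start of text), then the URL, then a trailing delimiter.
    // Note: ^ must sit outside the character class; inside [...] it is a literal caret.
    $pattern = '/(?:^|[\'" >\t\r\n])([^\'" \t\r\n]+\.(jpe?g|bmp|gif|png))[\'" <\t\r\n]/i';
    preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
    // $matches[1] holds the captured URLs; $matches[2] holds only the extensions.
    $imgs = array_unique($matches[1]);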
    

    The above will capture the image link in stuff like:

    <p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>
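
To see the whole approach end to end, here is a rough sketch run against that sample line plus two img tags. The extra img tags and the pattern itself are assumptions for illustration: the pattern is a variant that uses a lookahead for the trailing delimiter, so the delimiter is not consumed and two URLs separated by a single space can both match.

```php
<?php
// Sample input: the line from the example above plus two hypothetical <img> tags.
// A real page would come from cURL instead.
$html = '<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>'
      . '<img src="/img/logo.png"> <img src=\'/img/logo.png\'>';

// Variant pattern (an assumption, not the exact one above): a leading delimiter or
// start of text, then the URL, then a lookahead for a trailing delimiter or end of text.
$pattern = '/(?:^|[\'" >\t\r\n])([^\'" <>\t\r\n]+\.(?:jpe?g|bmp|gif|png))(?=[\'" <\t\r\n]|$)/i';

preg_match_all($pattern, html_entity_decode($html), $matches);

// $matches[1] holds the captured URLs; drop duplicates and reindex from 0.
$imgs = array_values(array_unique($matches[1]));

print_r($imgs);
```

Relative paths such as /img/logo.png come back as written, so they would still need to be resolved against the page URL before fetching. URLs where a query string follows the extension (for example pic.jpg?size=large) are not picked up by this pattern.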
    

