dongping9475 2013-10-25 22:43
Views: 66
Accepted

PHP scraper for IMDb Top 250 details [closed]

I'm trying to build a personal movie database and I want the data to be fetched from IMDb. Yes, I know there are plenty of APIs and grabbers out there, but none of them does what I need.

So far I haven't been able to come up with a way to parse the list at http://www.imdb.com/chart/top and get my data from it.

I tried doing it with a cURL script, but no luck!

For example:

I want to know whether The Godfather: Part II is in the Top 250 and, if so, what its rank is.


1 answer

  • dph6308 2013-10-25 23:33

    API

    I would first check whether IMDb has an API available. If it does, this will likely be as simple as querying a URL and parsing the returned data with json_decode.
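
As a rough illustration of the json_decode step, here is a minimal sketch. The payload and its field names ("rank", "title") are made up for the example and are not a documented IMDb response format; decodeTopList is a hypothetical helper name.

```php
<?php
// Hypothetical payload: the field names "rank" and "title" are assumptions,
// not a documented IMDb response format.
$json = '[{"rank": 1, "title": "The Shawshank Redemption"},
          {"rank": 2, "title": "The Godfather"}]';

/**
 * Decode a JSON list of films into a RANK => TITLE array.
 */
function decodeTopList(string $json): array
{
    $rows = json_decode($json, true); // true => decode objects as associative arrays
    if (!is_array($rows)) {
        throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
    }

    $list = [];
    foreach ($rows as $row) {
        $list[(int) $row['rank']] = $row['title'];
    }
    return $list;
}

$top = decodeTopList($json);
echo $top[2], PHP_EOL;
```

With a real API you would pass the downloaded response body to the helper instead of the inline sample string.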

    No API available?

    Get the webpage

    There's no need to use cURL; a simple file_get_contents will do the trick.
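
A slightly more defensive way to do the download might look like the sketch below. Note that file_get_contents can only open URLs when allow_url_fopen is enabled in php.ini. The helper name fetchPage, the User-Agent string, and the 10-second timeout are illustrative assumptions, not requirements of the site.

```php
<?php
/**
 * Download a page and fail loudly instead of silently returning false.
 * The User-Agent string and timeout are illustrative assumptions.
 */
function fetchPage(string $url): string
{
    $context = stream_context_create([
        'http' => [
            'user_agent' => 'Mozilla/5.0 (personal movie database script)',
            'timeout'    => 10, // seconds
        ],
    ]);

    // @ suppresses the PHP warning; the return value is checked instead.
    $page = @file_get_contents($url, false, $context);
    if ($page === false) {
        throw new RuntimeException("Could not download $url");
    }
    return $page;
}
```

Usage would then be `$page = fetchPage("http://www.imdb.com/chart/top");`, wrapped in a try/catch if you want to fall back to previously stored data.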

    Extract the list

    Now that you have the web page, you have two options:

    1. Parse the web page with a DOM parser (long-winded, not necessary)
    2. Use a regex to extract the info you're after (simple, short)
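
For comparison, option 1 could be sketched with PHP's bundled DOM extension as follows. The sample markup copies the cell format shown in the Regex section of this answer; the live page may be structured differently, and parseTopListDom is a hypothetical helper name.

```php
<?php
// Sample markup in the cell format quoted in this answer; the live page may differ.
$html = '<table>'
      . '<tr><td class="titleColumn">1. <a href="/link/to/film" title="Director/Leads" >The Shawshank Redemption</a></td></tr>'
      . '<tr><td class="titleColumn">2. <a href="/link/to/film" title="Director/Leads" >The Godfather</a></td></tr>'
      . '</table>';

/**
 * Build a RANK => TITLE array from the title cells using DOMDocument and XPath.
 */
function parseTopListDom(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $list  = [];
    foreach ($xpath->query('//td[@class="titleColumn"]') as $cell) {
        $link = $xpath->query('.//a', $cell)->item(0);
        // The rank is the leading number in the cell text, e.g. "1. "
        if ($link !== null && preg_match('/^\s*(\d+)\./', $cell->textContent, $m)) {
            $list[(int) $m[1]] = trim($link->textContent);
        }
    }
    return $list;
}

print_r(parseTopListDom($html));
```

This is longer than the regex route, but it does not depend on the exact spacing or attribute order inside the tags.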

    Regex

    A quick look at the source code of the list shows the list is in the format:

    <td class="titleColumn">RANK. <a href="/link/to/film" title="Director/Leads" >FILM TITLE</a>
    

    The parts in CAPS are the information we need.

    Converting this into a regex is simple: remove the noise and replace it with (non-greedy) wildcards:

    <td class="titleColumn">RANK. <a.*?>FILM TITLE</a>
    

    Add your capture groups:

    <td class="titleColumn">(RANK). <a.*?>(FILM TITLE)</a>
    

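    Then swap the placeholders for patterns (\d+ for the rank, a non-greedy .*? for the title), escape the literal dot, wrap everything in # delimiters, and that's it: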

    #<td class="titleColumn">(\d+)\. <a.*?>(.*?)</a>#
    

    Example

    Using this in practice:

    $page = file_get_contents("http://www.imdb.com/chart/top"); // Download the page
    preg_match_all('#<td class="titleColumn">(\d+)\. <a.*?>(.*?)</a>#', $page, $matches); // Match ranks and titles
    $top250 = array_combine($matches[1], $matches[2]); // Final array in the format RANK => TITLE

    Then you can do something like:

    echo $top250[1];
    // Output: The Shawshank Redemption

    echo array_search("The Godfather", $top250);
    // Output: 2

    You can then use standard PHP array functions to do things like search for films.
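
To answer the original question (is a given film in the list, and at what rank), a small lookup helper could look like this sketch. The array is stand-in data in the RANK => TITLE format built above, with illustrative ranks, and findRank is a hypothetical helper name. One thing to keep in mind with plain array_search is that it returns false when nothing matches, so its result should be compared with `!== false`.

```php
<?php
// Stand-in data in the RANK => TITLE format; the ranks here are illustrative only.
$top250 = [
    1 => 'The Shawshank Redemption',
    2 => 'The Godfather',
    3 => 'The Godfather: Part II',
];

/**
 * Return the rank of $title (case-insensitive), or null if it is not in the list.
 */
function findRank(array $list, string $title): ?int
{
    foreach ($list as $rank => $name) {
        if (strcasecmp(trim($name), trim($title)) === 0) {
            return (int) $rank;
        }
    }
    return null;
}

$rank = findRank($top250, 'the godfather: part ii');
echo $rank === null ? "Not in the list\n" : "Rank: $rank\n";
```

Returning null rather than false keeps "not found" clearly distinct from any valid rank.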

    http://php.net/file_get_contents
    http://php.net/preg_match_all
    http://php.net/array_combine
    http://php.net/array_search


    Side note

    Especially if you use the No API method above, you might like to think about storing the results locally and only updating them every X hours/days/weeks to save load time, etc. I assume you are already planning to do this (as you said you wanted a personal movie database), but I thought I'd mention it anyway!
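
One possible shape for that local storage is a JSON file that is refreshed once it gets too old. This is only a sketch under assumptions: cachedTopList is a hypothetical helper, and the cache file path and lifetime are whatever you choose to pass in.

```php
<?php
/**
 * Return the cached list if the cache file is younger than $maxAge seconds;
 * otherwise call $fetch, store its result as JSON, and return it.
 */
function cachedTopList(string $cacheFile, int $maxAge, callable $fetch): array
{
    clearstatcache(true, $cacheFile); // make sure the file's age is read fresh

    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached; // cache hit
        }
    }

    $fresh = $fetch(); // cache miss or unreadable cache: rebuild the list
    file_put_contents($cacheFile, json_encode($fresh), LOCK_EX);
    return $fresh;
}
```

The callback would contain the download and preg_match_all steps from the example above, so the page is only requested when the stored copy has expired, for instance once a day with a lifetime of 86400 seconds.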
