dongyangben6144 2017-04-03 15:16

Web scraping: how do you detect new items in a list?

I'm working on some PHP code that grabs a music playlist from a remote radio page, which means the list is continuously updated. I would like to store the track history in my database.

My problem is that I need to detect when new entries have been added to the remote tracklist, knowing that:

  • I don't know how often the remote page will be updated.
  • I don't know how many tracks are displayed on the remote page. Sometimes it will be a single track, sometimes a few dozen.
  • The same track can show up several times.

For example, I get this data when grabbing the page for the first time:

  1. Dead Combo — Esse Olhar Que Era Só Teu
  2. Myron & E — If I Gave You My Love
  3. Hooverphonic — Badaboum
  4. Alain Chamfort — Bambou - Pilooski / Jayvich Reprise
  5. William Onyeabor — Atomic Bomb
  6. Curtis Mayfield — Move on up - Extended version
  7. Mos Def — Ms. Fat Booty
  8. Nicki Minaj — Feeling Myself
  9. Disclosure — You & Me (Flume remix)
  10. Otis Redding — My Girl - Remastered Mono

Then, the second time, I get:

  1. Charles Aznavour — Emmenez moi
  2. Mos Def — Ms. Fat Booty
  3. Rag'n'Bone Man — Human
  4. Bernard Lavilliers — Idées noires
  5. Julien Clerc — Ma préférence
  6. The Rolling Stones — Just Your Fool
  7. Dead Combo — Esse Olhar Que Era Só Teu
  8. Myron & E — If I Gave You My Love
  9. Hooverphonic — Badaboum
  10. Alain Chamfort — Bambou - Pilooski / Jayvich Reprise

As you can see, entries 7->10 of the second list appear to be the same as entries 1->4 of the first, so entries 1->6 are the new ones. Track #2 was already in the first list but seems to have been replayed since.

The new entries here would be:

  1. Charles Aznavour — Emmenez moi
  2. Mos Def — Ms. Fat Booty
  3. Rag'n'Bone Man — Human
  4. Bernard Lavilliers — Idées noires
  5. Julien Clerc — Ma préférence
  6. The Rolling Stones — Just Your Fool

I store track entries in one table and the track history in another.

Structure of the tracks table

| ID | artist  | title         | album |
|----|---------|---------------|-------|
| 12 | Mos Def | Ms. Fat Booty |       |

Structure of the tracks history table

| ID | track ID | time                |
|----|----------|---------------------|
| 24 | 12       | 2016-07-03 13:40:26 |

Do you have any ideas on how I could handle this?

Thanks!

1 answer

  • dongxian7489 2017-04-03 17:40

    I think you're trying to find the items at the end of the second list that match those at the beginning of the first?

    If you can store each list in an array (the old list in $previous and the new list in $current), this function should help:

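    function find_old_tracks($previous, $current)
    {
        for ($i = 0; $i < count($current); $i++)
        {
            // Guard the index in case $current is longer than $previous.
            if (isset($previous[$i]) && $previous[$i] == $current[$i]) continue;
            // Mismatch: drop only the first entry and retry, so an overlap that
            // starts inside a partial match (repeated tracks) is not skipped.
            return find_old_tracks($previous, array_slice($current, 1));
        }
        return array_slice($previous, 0, $i);
    }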
    

    It scans through $current for a contiguous run that matches the start of $previous, recursing on the remainder every time it finds a mismatch. When I run this:

    $previous = array(
        'Dead Combo — Esse Olhar Que Era Só Teu',
        'Myron & E — If I Gave You My Love',
        'Hooverphonic — Badaboum',
        'Alain Chamfort — Bambou - Pilooski / Jayvich Reprise',
        'William Onyeabor — Atomic Bomb',
        'Curtis Mayfield — Move on up - Extended version',
        'Mos Def — Ms. Fat Booty',
        'Nicki Minaj — Feeling Myself',
        'Disclosure — You & Me (Flume remix)',
        'Otis Redding — My Girl - Remastered Mono'
    );
    
    $current = array(
        'Charles Aznavour — Emmenez moi',
        'Mos Def — Ms. Fat Booty',
        'Rag Bone Man — Human',
        'Bernard Lavilliers — Idées noires',
        'Julien Clerc — Ma préférence',
        'The Rolling Stones — Just Your Fool',
        'Dead Combo — Esse Olhar Que Era Só Teu',
        'Myron & E — If I Gave You My Love',
        'Hooverphonic — Badaboum',
        'Alain Chamfort — Bambou - Pilooski / Jayvich Reprise'
    );
    
    $old_tracks = find_old_tracks($previous, $current);
    $new_tracks = array_slice($current, 0, count($current) - count($old_tracks));
    
    // implode() takes the separator first; the reversed order was removed in PHP 8.
    print "NEW TRACKS: " . implode('; ', $new_tracks);
    print "<br /><br />OLD TRACKS: " . implode('; ', $old_tracks);
    

    My output is:

    NEW TRACKS: Charles Aznavour — Emmenez moi; Mos Def — Ms. Fat Booty; Rag Bone Man — Human; Bernard Lavilliers — Idées noires; Julien Clerc — Ma préférence; The Rolling Stones — Just Your Fool

    OLD TRACKS: Dead Combo — Esse Olhar Que Era Só Teu; Myron & E — If I Gave You My Love; Hooverphonic — Badaboum; Alain Chamfort — Bambou - Pilooski / Jayvich Reprise
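
Since the question notes that the same track can show up several times, it may be worth checking the overlap search against repeated entries. Below is a minimal non-recursive sketch of the same idea: it looks for the longest run at the end of $current that equals the start of $previous. The name count_new_tracks and the sample data are illustrative assumptions, not part of the code above.

```php
<?php
// Hypothetical helper (an assumption for illustration): returns how many
// leading entries of $current are new, i.e. everything before the longest
// suffix of $current that equals a prefix of $previous.
function count_new_tracks(array $previous, array $current): int
{
    $n = count($current);
    for ($start = 0; $start < $n; $start++) {
        $suffix = array_slice($current, $start);
        // Compare the suffix with the same number of leading items in $previous.
        if ($suffix === array_slice($previous, 0, count($suffix))) {
            return $start;
        }
    }
    return $n; // no overlap found: every entry is treated as new
}

// Made-up data, newest first: "A" was played again after the first grab.
$previous = ['A', 'A', 'B'];
$current  = ['C', 'A', 'A', 'A', 'B'];
$new = array_slice($current, 0, count_new_tracks($previous, $current));
print implode('; ', $new) . "\n"; // C; A
```

One case no comparison of two lists can settle: if the page shows a single track and it is identical in both grabs, a replay looks exactly like no update at all.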

    You can do what you like with that info on the database end.
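
As one possible starting point for the database side, here is a small sketch that turns the new entries into artist/title rows, ordered oldest first so they can be inserted in play order. It assumes the scraped list is newest first and that each entry has the form "Artist — Title", as in the sample data; tracks_to_rows is a hypothetical helper name.

```php
<?php
// Hypothetical helper (an assumption for illustration): splits
// "Artist — Title" strings into rows and orders them oldest first,
// assuming the scraped list is newest first.
function tracks_to_rows(array $new_tracks): array
{
    $rows = [];
    foreach (array_reverse($new_tracks) as $entry) {
        // Split on the first " — " only, so titles containing " - " stay intact.
        $parts = explode(' — ', $entry, 2);
        $rows[] = [
            'artist' => trim($parts[0]),
            'title'  => trim($parts[1] ?? ''),
        ];
    }
    return $rows;
}

$rows = tracks_to_rows([
    'Charles Aznavour — Emmenez moi',
    'Mos Def — Ms. Fat Booty',
]);
print $rows[0]['artist'] . ' / ' . $rows[0]['title'] . "\n"; // Mos Def / Ms. Fat Booty
```

Each row could then be looked up (or inserted) in the tracks table and a matching entry added to the history table, for example with prepared statements. If the page exposes no play times, the stored time can only be the time of the grab.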
