dongyangben6144 2017-04-03 15:16

Web scraping: how do you detect new items in a list?

I'm working on some PHP code that grabs a music playlist from a remote radio page, which means the list is continuously updated. I would like to store the track history in my database.

My problem is that I need to detect when new entries have been added to the remote tracklist, given that:

  • I don't know how often the remote page will be updated
  • I don't know how many tracks are displayed on the remote page. Sometimes it will be a single track, sometimes it will be a few dozen.
  • The same track could show up several times.

For example, I will get this data when grabbing the page for the first time:

  1. Dead Combo — Esse Olhar Que Era Só Teu
  2. Myron & E — If I Gave You My Love
  3. Hooverphonic — Badaboum
  4. Alain Chamfort — Bambou - Pilooski / Jayvich Reprise
  5. William Onyeabor — Atomic Bomb
  6. Curtis Mayfield — Move on up - Extended version
  7. Mos Def — Ms. Fat Booty
  8. Nicki Minaj — Feeling Myself
  9. Disclosure — You & Me (Flume remix)
  10. Otis Redding — My Girl - Remastered Mono

Then, the second time, I'll get:

  1. Charles Aznavour — Emmenez moi
  2. Mos Def — Ms. Fat Booty
  3. Rag'n'Bone Man — Human
  4. Bernard Lavilliers — Idées noires
  5. Julien Clerc — Ma préférence
  6. The Rolling Stones — Just Your Fool
  7. Dead Combo — Esse Olhar Que Era Só Teu
  8. Myron & E — If I Gave You My Love
  9. Hooverphonic — Badaboum
  10. Alain Chamfort — Bambou - Pilooski / Jayvich Reprise

As you can see, the second time, entries 7 to 10 appear to be the same as entries 1 to 4 of the first list (so entries 1 to 6 are the new ones). Track #2 was already in the first list but seems to have been replayed since.

The new entries here would be:

  1. Charles Aznavour — Emmenez moi
  2. Mos Def — Ms. Fat Booty
  3. Rag'n'Bone Man — Human
  4. Bernard Lavilliers — Idées noires
  5. Julien Clerc — Ma préférence
  6. The Rolling Stones — Just Your Fool

I store track entries in one table and the track history in another.

Structure of the tracks table

| ID |   artist   |     title     |     album     |
--------------------------------------------------
| 12 |   Mos Def  | Ms. Fat Booty |               |

Structure of the tracks history table

| ID |   track ID  |        time         |
------------------------------------------
| 24 |     12      | 2016-07-03 13:40:26 |

Do you have any ideas on how I could handle this?

Thanks!


1 answer

  • dongxian7489 2017-04-03 17:40

    I think you're trying to find the items at the end of the second list that match those at the beginning of the first?

    If you can store both lists as arrays (the old list in $previous and the new list in $current), this function should help:

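    // Returns the run at the start of $previous that also ends $current.
    function find_old_tracks($previous, $current)
    {
        for ($i = 0; $i < count($current); $i++)
        {
            if (isset($previous[$i]) && $previous[$i] == $current[$i]) continue;
            // Mismatch: drop only the first entry and retry, so that no
            // candidate starting point is skipped after a partial match.
            return find_old_tracks($previous, array_slice($current, 1));
        }
        return array_slice($previous, 0, $i);
    }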
    

    It scans through $current for a contiguous run that matches the start of $previous, recursing on the remainder every time it finds a mismatch. When I run this:

    $previous = array(
        'Dead Combo — Esse Olhar Que Era Só Teu',
        'Myron & E — If I Gave You My Love',
        'Hooverphonic — Badaboum',
        'Alain Chamfort — Bambou - Pilooski / Jayvich Reprise',
        'William Onyeabor — Atomic Bomb',
        'Curtis Mayfield — Move on up - Extended version',
        'Mos Def — Ms. Fat Booty',
        'Nicki Minaj — Feeling Myself',
        'Disclosure — You & Me (Flume remix)',
        'Otis Redding — My Girl - Remastered Mono'
    );
    
    $current = array(
        'Charles Aznavour — Emmenez moi',
        'Mos Def — Ms. Fat Booty',
        'Rag Bone Man — Human',
        'Bernard Lavilliers — Idées noires',
        'Julien Clerc — Ma préférence',
        'The Rolling Stones — Just Your Fool',
        'Dead Combo — Esse Olhar Que Era Só Teu',
        'Myron & E — If I Gave You My Love',
        'Hooverphonic — Badaboum',
        'Alain Chamfort — Bambou - Pilooski / Jayvich Reprise'
    );
    
    $old_tracks = find_old_tracks($previous, $current);
    $new_tracks = array_slice($current, 0, count($current) - count($old_tracks));
    
    // implode() takes the separator first; the reversed order was removed in PHP 8.
    print "NEW TRACKS: " . implode('; ', $new_tracks);
    print "<br /><br />OLD TRACKS: " . implode('; ', $old_tracks);
    

    my output is:

    NEW TRACKS: Charles Aznavour — Emmenez moi; Mos Def — Ms. Fat Booty; Rag Bone Man — Human; Bernard Lavilliers — Idées noires; Julien Clerc — Ma préférence; The Rolling Stones — Just Your Fool

    OLD TRACKS: Dead Combo — Esse Olhar Que Era Só Teu; Myron & E — If I Gave You My Love; Hooverphonic — Badaboum; Alain Chamfort — Bambou - Pilooski / Jayvich Reprise
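
A few edge cases are worth checking: an unchanged page, a replayed track sitting just above the overlap, and a page with no overlap at all. The sketch below is a non-recursive variant of the same idea, not the function above; `find_new_tracks` is a hypothetical name, and the single-letter entries stand in for real track strings.

```php
<?php
// Hypothetical helper: returns the entries of $current that come before the
// longest run at its end that matches the start of $previous.
function find_new_tracks(array $previous, array $current)
{
    $max = min(count($previous), count($current));
    // Try the longest possible overlap first, then shrink it.
    for ($n = $max; $n > 0; $n--)
    {
        if (array_slice($current, -$n) === array_slice($previous, 0, $n))
        {
            return array_slice($current, 0, count($current) - $n);
        }
    }
    return $current; // no overlap: every entry is treated as new
}

$previous = array('A', 'B', 'C');
print_r(find_new_tracks($previous, array('A', 'B', 'C')));      // page unchanged: nothing new
print_r(find_new_tracks($previous, array('D', 'A', 'A', 'B'))); // 'D' and the replayed 'A' are new
print_r(find_new_tracks($previous, array('X', 'Y')));           // no overlap: both are new
```

One limit of any overlap-based comparison: if the page shows a single track and that same track is played again, the result looks identical to an unchanged page, so the replay cannot be detected from the list alone.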

    You can do what you like with that info on the database end.
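
For the database end, here is a minimal sketch of one way to record the new entries, assuming PDO with an in-memory SQLite database. The table names (`tracks`, `tracks_history`), the column names, and the `store_new_tracks` helper are all assumptions modelled on the two tables in the question, not something the question specifies. The timestamp is the time of the scrape, since the page does not give a play time.

```php
<?php
// Hypothetical schema mirroring the question's two tables; names are assumptions.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE tracks (id INTEGER PRIMARY KEY, artist TEXT, title TEXT, album TEXT)');
$db->exec('CREATE TABLE tracks_history (id INTEGER PRIMARY KEY, track_id INTEGER, time TEXT)');

// Hypothetical helper: logs each new entry, reusing the track row if it already exists.
function store_new_tracks(PDO $db, array $new_tracks, $time)
{
    $find = $db->prepare('SELECT id FROM tracks WHERE artist = ? AND title = ?');
    $add  = $db->prepare('INSERT INTO tracks (artist, title) VALUES (?, ?)');
    $log  = $db->prepare('INSERT INTO tracks_history (track_id, time) VALUES (?, ?)');

    // In the example the newest track is listed first, so insert in reverse
    // to keep the history rows in play order.
    foreach (array_reverse($new_tracks) as $entry)
    {
        // Assumes entries look like "Artist — Title", as in the example data.
        $parts  = explode(' — ', $entry, 2);
        $artist = $parts[0];
        $title  = isset($parts[1]) ? $parts[1] : '';

        $find->execute(array($artist, $title));
        $track_id = $find->fetchColumn();
        $find->closeCursor();
        if ($track_id === false)
        {
            // First time this track is seen: create its row.
            $add->execute(array($artist, $title));
            $track_id = $db->lastInsertId();
        }
        $log->execute(array($track_id, $time));
    }
}

$new_tracks = array('Charles Aznavour — Emmenez moi', 'Mos Def — Ms. Fat Booty');
store_new_tracks($db, $new_tracks, date('Y-m-d H:i:s'));
```

Because a replayed track reuses its existing `tracks` row, repeats only add rows to the history table.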
