dongnue4923 2016-04-22 03:22

Comparing the similarity of two strings in PHP

I'm trying to find the best solution for comparing two similar strings and choosing the most similar it can find.

I have an array of straight movie names. I also have an array of movie names with additional text.

Example:

My straight movie name array contains strings like so:

"Super Troopers", 
"Everest", 
"Star Wars: Episode I The Phantom Menace"

The strings in my other array look similar to the following:

"Super Troopers (2001) 720P-AC3-x264", 
"Everest - 2015.1080p.DTS mkv", 
"Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD"

What I'm currently doing is looping through my first array, comparing each movie with the second array using strpos(). If I find an exact match, great. If not, I need to perform some other function to find which two strings are most similar. I have tried using similar_text() and levenshtein() with mixed results.

In my examples above, strpos() would have matched both Everest and Super Troopers just fine, but the Star Wars string needs additional checks. Hyphens and colons, "I" versus "1", and the extra information that follows the movie name all seem to give me sporadic results with similar_text() and levenshtein().
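One way to take punctuation and numeral style out of the picture is to normalize both strings before comparing them. The following is only a sketch: normalizeTitle() is a hypothetical helper, and its Roman numeral map is an assumption that covers I through IX only.

```php
<?php
// Hypothetical helper: reduce a title or release name to a canonical form
// so that punctuation and numeral style no longer affect the comparison.
function normalizeTitle(string $s): string
{
    // Assumed mapping; extend it if your collection needs larger numerals.
    $roman = ['i' => '1', 'ii' => '2', 'iii' => '3', 'iv' => '4', 'v' => '5',
              'vi' => '6', 'vii' => '7', 'viii' => '8', 'ix' => '9'];

    // Lowercase, then turn every run of non-alphanumeric characters into one space.
    $s = trim(preg_replace('/[^a-z0-9]+/', ' ', strtolower($s)));

    $words = [];
    foreach (explode(' ', $s) as $word) {
        // Replace standalone Roman numerals with digits; keep other words as-is.
        $words[] = $roman[$word] ?? $word;
    }
    return implode(' ', $words);
}

$title   = normalizeTitle('Star Wars: Episode I The Phantom Menace');
$release = normalizeTitle('Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD');

// After normalization the plain prefix check can be used again.
var_dump(strpos($release, $title) === 0);
```

The numeral mapping also rewrites words such as "I" or "V" when they are not numerals, but since both strings go through the same function the comparison stays consistent.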

Since the one thing these strings have in common is that the movie name comes first, I'm thinking of truncating each longer string before comparing: take strlen() of the movie name plus 5 or so extra characters for good measure, cut the longer string to that length with substr(), and only then run similar_text() or levenshtein(). Would this make the string similarity functions a bit more accurate?
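A minimal sketch of that truncation idea, assuming a hypothetical helper prefixSimilarity() and a slack of 5 characters (both are illustrative choices, not tested values):

```php
<?php
// Hypothetical helper: score a title against the start of a release name.
function prefixSimilarity(string $title, string $release, int $slack = 5): float
{
    // Keep only as many characters as the title has, plus some slack.
    $prefix = substr($release, 0, strlen($title) + $slack);

    // similar_text() reports the similarity as a percentage via its third argument.
    similar_text(strtolower($title), strtolower($prefix), $percent);
    return $percent;
}

$titles  = ['Super Troopers', 'Everest', 'Star Wars: Episode I The Phantom Menace'];
$release = 'Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD';

// Pick the title whose truncated comparison scores highest.
$best = null;
$bestScore = -1.0;
foreach ($titles as $title) {
    $score = prefixSimilarity($title, $release);
    if ($score > $bestScore) {
        $best = $title;
        $bestScore = $score;
    }
}
echo $best, PHP_EOL;
```

Truncating keeps the trailing codec and resolution text from dragging the score down, but the slack value probably needs tuning against real data.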

Or maybe some function that breaks up each word and checks to see how many match in the other string. Does such a function exist?
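As far as I know there is no single built-in for this, but it can be assembled from preg_split() and array_intersect(). A rough sketch, where words(), wordOverlap() and bestMatch() are hypothetical names:

```php
<?php
// Hypothetical helper: split a string into lowercase alphanumeric words.
function words(string $s): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
}

// Fraction of the title's distinct words that also occur in the release name.
function wordOverlap(string $title, string $release): float
{
    $titleWords = array_unique(words($title));
    if ($titleWords === []) {
        return 0.0;
    }
    $shared = array_intersect($titleWords, words($release));
    return count($shared) / count($titleWords);
}

// Return the title with the highest overlap, or null if nothing overlaps.
function bestMatch(string $release, array $titles): ?string
{
    $best = null;
    $bestScore = 0.0;
    foreach ($titles as $title) {
        $score = wordOverlap($title, $release);
        if ($score > $bestScore) {
            $best = $title;
            $bestScore = $score;
        }
    }
    return $best;
}

$titles = ['Super Troopers', 'Everest', 'Star Wars: Episode I The Phantom Menace'];
echo bestMatch('Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD', $titles), PHP_EOL;
```

Word order and the extra release information do not affect this score, but one-word titles can tie easily, so it may work best combined with another check.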

I'll mess around with it more, but if anyone has any input on how they might tackle the problem, I'd love to know.

Thanks.

1 answer

  • duano3557 2016-04-22 04:07

    I have an idea for an interesting solution. It uses a database. Every time you add a new movie to your collection, you separate the movie name into words. For instance:

    "Star Wars: Episode I The Phantom Menace"
    

    would be separated into:

    "Star", "Wars:", "Episode", "I", "The", "Phantom", "Menace"
    

    From there, you would have the following tables in your database:

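    -- One row per (keyword, movie) pair. The key covers both columns so
    -- the same keyword can point to several movies.
    CREATE TABLE movie_search (
    movie_keyword varchar(255) NOT NULL,
    movie_id INT NOT NULL,
    PRIMARY KEY (movie_keyword, movie_id)
    );
    
    CREATE TABLE movies (
    movie_id INT NOT NULL AUTO_INCREMENT,
    movie_name varchar(255) NOT NULL,
    PRIMARY KEY (movie_id)
    );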
    

    Example of the movie_search table:

    movie_keyword | movie_id
    star ---------- 1
    wars ---------- 1
    spider -------- 2
    man ----------- 2
    

    Example of the movies table:

    movie_id | movie_name
    1 -------- star wars
    2 -------- spider man
    

    Every time someone searches for a movie on your website, you would break their phrase into words using explode(" ", $searched_name);, lowercasing each word and stripping punctuation so that it matches the stored keywords. You would then look up every matching movie_keyword in the movie_search table and, each time a movie_id repeats, increase the count of keyword matches for that movie. With some good PHP behind the search, your result should be a multidimensional array with three elements in each row:

    array (
      [0] => array (
        [movie_id] => 1,
        [movie_name] => star wars,
        [count] => 2
      ),
      [1] => array (...),
      ...
    )
    

    where the movie with the most keywords (highest count) would appear at the top of your array, provided the query sorts by the count in descending order. You can also decide how many results you want to output by adding a LIMIT clause such as "LIMIT 10" to your SQL query.
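Here is a rough sketch of the whole idea, not a definitive implementation. It assumes the pdo_sqlite extension is available and uses an in-memory SQLite database in place of your real one, so the column types differ slightly from the MySQL-style schema above. The helpers keywords(), addMovie() and searchMovies() are hypothetical names, and match_count plays the role of count in the result array.

```php
<?php
// Assumption: pdo_sqlite is installed; swap the DSN for your own database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, movie_name TEXT NOT NULL)');
$db->exec('CREATE TABLE movie_search (
    movie_keyword TEXT NOT NULL,
    movie_id INTEGER NOT NULL,
    PRIMARY KEY (movie_keyword, movie_id)
)');

// Hypothetical helper: distinct lowercase alphanumeric words of a string.
function keywords(string $s): array
{
    $parts = preg_split('/[^a-z0-9]+/', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($parts));
}

// Store a movie and one movie_search row per distinct keyword.
function addMovie(PDO $db, string $name): void
{
    $db->prepare('INSERT INTO movies (movie_name) VALUES (?)')->execute([$name]);
    $id = (int) $db->lastInsertId();
    $insert = $db->prepare('INSERT INTO movie_search (movie_keyword, movie_id) VALUES (?, ?)');
    foreach (keywords($name) as $keyword) {
        $insert->execute([$keyword, $id]);
    }
}

// Rank movies by how many of the searched keywords they share.
function searchMovies(PDO $db, string $searchedName, int $limit = 10): array
{
    $words = keywords($searchedName);
    if ($words === []) {
        return [];
    }
    // One "?" placeholder per keyword for the IN (...) list.
    $placeholders = implode(',', array_fill(0, count($words), '?'));
    $stmt = $db->prepare(
        "SELECT m.movie_id, m.movie_name, COUNT(*) AS match_count
           FROM movie_search s
           JOIN movies m ON m.movie_id = s.movie_id
          WHERE s.movie_keyword IN ($placeholders)
          GROUP BY m.movie_id, m.movie_name
          ORDER BY match_count DESC, m.movie_id
          LIMIT " . (int) $limit
    );
    $stmt->execute($words);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

foreach (['Star Wars: Episode I The Phantom Menace', 'Everest', 'Super Troopers'] as $name) {
    addMovie($db, $name);
}

$results = searchMovies($db, 'Star Wars - Episode 1: The Phantom Menace 1080p h265 HEVC TrueHD');
print_r($results);
```

Letting the database do the counting with GROUP BY keeps the PHP side small. Movies that share no keyword with the search phrase simply do not appear in the result.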

    HOPE THAT HELPS! :)
