douchao1879 2015-03-23 01:56
Accepted

Fastest way to search for whole words in a 20MB flat-file database (PHP)

I have a 20MB flat-file database with about 500k lines. Only the characters [a-z0-9-] are allowed, each line averages 7 words, and there are no empty or duplicate lines:

Flat file database:

put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces

I'm searching for whole words only and extracting the first 10k results from this database.

So far this code works fine if the 10k matches are found in, say, the first 20k lines of the database. If the word is rare, though, the script must scan all 500k lines, and that is about 10 times slower.

Settings:

$cats = file("cats.txt", FILE_IGNORE_NEW_LINES);
$search = "end";
$limit = 10000;

Search:

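$cats_found = [];
// Escape the search term so it is matched literally inside the regex.
$pattern = '/\b' . preg_quote($search, '/') . '\b/';
foreach ($cats as $cat) {
    if (preg_match($pattern, $cat)) {
        $cats_found[] = $cat;
        // Stop as soon as $limit matches have been collected.
        if (count($cats_found) >= $limit) break;
    }
}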

My PHP skills and knowledge are limited, and I don't know how to use SQL, so this is the best I can do. I'd appreciate some advice:

  • Is this the right way to do it? Are foreach and preg_match the problem?
  • Should I split the large file into smaller files, and if so, into what sizes?
  • Finally, would SQL be faster, and by how much? (An option for the future.)

Thanks for reading this, and sorry for my bad English; it is my third language.

2 answers

  • duan0417 2015-03-23 03:19

    If most lines don't contain the search word, you can call preg_match() less often, like so:

    foreach ($lines as $line) {
        // fast prefilter...
        if (strpos($line, $word) === false) {
            continue;
        }
        // ... then proper search if the line passed the prefilter
        if (preg_match("/\b{$word}\b/", $line)) {
            // found
        }
    }
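
For reference, the loop above can be wrapped into a complete function that also applies the 10k limit from the question. This is only a minimal sketch: the function name findWholeWord(), the preg_quote() escaping, and the extra sample line never-ending-story are assumptions added for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original post): returns up to $limit
// lines that contain $word as a whole word.
function findWholeWord(iterable $lines, string $word, int $limit): array
{
    if ($limit <= 0) {
        return [];
    }
    // Escape the word so any regex metacharacters are matched literally.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $found = [];

    foreach ($lines as $line) {
        // Fast prefilter: skip lines that cannot match at all.
        if (strpos($line, $word) === false) {
            continue;
        }
        // Whole-word check only for lines that passed the prefilter.
        if (preg_match($pattern, $line) === 1) {
            $found[] = $line;
            if (count($found) >= $limit) {
                break; // stop as soon as the limit is reached
            }
        }
    }
    return $found;
}

// Sample lines from the question, plus one made-up line ("never-ending-story")
// that contains "end" only as part of a longer word.
$lines = [
    'put-returns-between-paragraphs',
    'for-linebreak-add-2-spaces-at-end',
    'indent-code-by-4-spaces-indent-code-by-4-spaces',
    'never-ending-story',
];
$result = findWholeWord($lines, 'end', 10);
print_r($result);
```

With these sample lines only for-linebreak-add-2-spaces-at-end is returned: never-ending-story passes the strpos() prefilter but fails the \b check. For the real data, the array returned by file("cats.txt", FILE_IGNORE_NEW_LINES) from the question can be passed as $lines.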
    

    That said, this needs benchmarking on your real data.
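
One rough way to run such a benchmark is to time both variants on the same input. The sketch below uses synthetic lines as a stand-in for the real file; the helper names (makeLines, countRegexOnly, countWithPrefilter, timeIt), the line count, and the match rate are all assumptions chosen for illustration, and the timings will depend on the machine and PHP version.

```php
<?php
// Synthetic stand-in for the real file: $total lines, of which every
// $every-th one contains the whole word "end". These values are assumptions.
function makeLines(int $total, int $every): array
{
    $lines = [];
    for ($i = 0; $i < $total; $i++) {
        $lines[] = ($i % $every === 0)
            ? "line-$i-at-the-end"
            : "line-$i-without-a-match";
    }
    return $lines;
}

// Baseline: preg_match() on every line.
function countRegexOnly(array $lines, string $word): int
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $hits = 0;
    foreach ($lines as $line) {
        if (preg_match($pattern, $line) === 1) {
            $hits++;
        }
    }
    return $hits;
}

// Variant: strpos() prefilter first, preg_match() only on the survivors.
function countWithPrefilter(array $lines, string $word): int
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $hits = 0;
    foreach ($lines as $line) {
        if (strpos($line, $word) === false) {
            continue;
        }
        if (preg_match($pattern, $line) === 1) {
            $hits++;
        }
    }
    return $hits;
}

// Runs $fn once and returns [result, elapsed seconds].
function timeIt(callable $fn): array
{
    $start = microtime(true);
    $result = $fn();
    return [$result, microtime(true) - $start];
}

$lines = makeLines(100000, 1000);

[$hitsRegex, $secRegex] = timeIt(function () use ($lines) {
    return countRegexOnly($lines, 'end');
});
[$hitsPrefilter, $secPrefilter] = timeIt(function () use ($lines) {
    return countWithPrefilter($lines, 'end');
});

printf("regex only:      %d hits in %.4f s\n", $hitsRegex, $secRegex);
printf("strpos + regex:  %d hits in %.4f s\n", $hitsPrefilter, $secPrefilter);
```

Both variants must report the same number of hits; only the elapsed time should differ. A single run is noisy, so repeating the measurement a few times and testing both a common and a rare word gives a fairer picture.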

    Selected by the asker as the best answer.