Fastest way to search for whole words in a 20MB flat file database (PHP)

I have a 20MB flat file database with about 500k lines. Only [a-z0-9-] characters are allowed, each line averages 7 words, and there are no empty or duplicate lines:

Flat file database:

put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces

I'm searching for whole words only and extracting the first 10k results from this db.

So far this code works OK if the 10k matches are found in, let's say, the first 20k lines of the db. But if the word is rare, the script must search all 500k lines, and this is 10 times slower.

Settings:

$cats = file("cats.txt", FILE_IGNORE_NEW_LINES);
$search = "end";
$limit = 10000;

Search:

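$cats_found = [];
foreach ($cats as $cat) {
    // \b also matches next to a hyphen, so hyphen-separated words count as whole words
    if (preg_match("/\b$search\b/", $cat)) {
        $cats_found[] = $cat;
        // stop once exactly $limit results are collected
        if (count($cats_found) >= $limit) break;
    }
}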
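
A minimal sketch of the same loop reading the file one line at a time, on the assumption that stopping early is worth more than having the whole file in memory (file() loads all 500k lines even when the matches sit in the first 20k). The function name search_file and its parameters are hypothetical and not part of the original code. preg_quote() is added as a precaution, although the allowed [a-z0-9-] characters contain no regex metacharacters.

```php
<?php
// Hypothetical helper: stream the file with fgets() instead of loading it with file().
function search_file(string $path, string $search, int $limit): array
{
    // Build the pattern once, outside the loop.
    $pattern = '/\b' . preg_quote($search, '/') . '\b/';
    $found = [];

    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $found; // file could not be opened
    }

    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n"); // same effect as FILE_IGNORE_NEW_LINES
        if (preg_match($pattern, $line)) {
            $found[] = $line;
            if (count($found) >= $limit) {
                break; // enough results, stop reading the file
            }
        }
    }

    fclose($handle);
    return $found;
}
```

With the settings above, it would be called as search_file("cats.txt", "end", 10000). This does not help the rare-word case, where every line still has to be read.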

My PHP skills and knowledge are limited, and I don't know how to use SQL, so this is the best I can do. I need some advice:

Is this the right code for the job? Are foreach and preg_match the problem?
Should I split the large file into smaller files? If yes, what sizes?
And finally, will SQL be faster, and by how much? (Option for the future)

Thanks for reading this, and sorry for my bad English; it is my 3rd language.

duan0417 2015-03-23 03:19
If most of the lines don't contain the searched word, you could execute preg_match() less often, like so:

foreach ($lines as $line) {
    // fast prefilter...
    if (strpos($line, $word) === false) {
        continue;
    }
    // ... then proper search if the line passed the prefilter
    if (preg_match("/\b{$word}\b/", $line)) {
        // found
    }
}

That said, it needs benchmarking on your real data before you rely on it.
This answer was selected by the asker as the best answer.
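
One way to run such a benchmark is sketched here. It times the plain preg_match() loop against the strpos() prefilter on the same array of lines. The function names search_regex, search_prefiltered and time_search are hypothetical, and the synthetic lines are only an assumption standing in for the real cats.txt, so the timings say nothing definite about the real data.

```php
<?php
// Baseline: run preg_match() on every line.
function search_regex(array $lines, string $word, int $limit): array
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $found = [];
    foreach ($lines as $line) {
        if (preg_match($pattern, $line)) {
            $found[] = $line;
            if (count($found) >= $limit) break;
        }
    }
    return $found;
}

// Prefiltered: skip lines that cannot match before running the regex.
function search_prefiltered(array $lines, string $word, int $limit): array
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $found = [];
    foreach ($lines as $line) {
        if (strpos($line, $word) === false) continue; // cheap substring check
        if (preg_match($pattern, $line)) {            // whole-word check
            $found[] = $line;
            if (count($found) >= $limit) break;
        }
    }
    return $found;
}

// Returns the wall-clock seconds one search takes.
function time_search(callable $fn, array $lines, string $word, int $limit): float
{
    $start = microtime(true);
    $fn($lines, $word, $limit);
    return microtime(true) - $start;
}

// Synthetic stand-in data (assumption): many non-matching lines, one rare match.
$lines = [];
for ($i = 0; $i < 100000; $i++) {
    $lines[] = "put-returns-between-paragraphs-$i";
}
$lines[] = 'for-linebreak-add-2-spaces-at-end';

printf("regex only:  %.4f s\n", time_search('search_regex', $lines, 'end', 10000));
printf("prefiltered: %.4f s\n", time_search('search_prefiltered', $lines, 'end', 10000));
```

To test against the real file, replace the synthetic array with file("cats.txt", FILE_IGNORE_NEW_LINES) and try both a common and a rare word, since the prefilter only pays off when most lines do not contain the word.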

Question asked 2015-03-23 01:56, tagged php.
