So I wrote a script to extract data from raw genome files. Here's what a raw genome file looks like:
# rsid chromosome position genotype
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
rs12124819 1 776546 AA
rs11240777 1 798959 AG
rs6681049 1 800007 CC
rs4970383 1 838555 AC
rs4475691 1 846808 CT
rs7537756 1 854250 AG
rs13302982 1 861808 GG
rs1110052 1 873558 TT
rs2272756 1 882033 GG
rs3748597 1 888659 CT
rs13303106 1 891945 AA
rs28415373 1 893981 CC
rs13303010 1 894573 GG
rs6696281 1 903104 CT
rs28391282 1 904165 GG
rs2340592 1 910935 GG
The raw text file has hundreds of thousands of these rows, but I only need about 10,000 specific ones. I have a list of rsids, and I just need the genotype from each matching line. So I loop through the rsid list and use preg_match to find the line I need:
$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();
while ($row = $rsids->fetch_assoc()) {
$searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";
if (preg_match($searchPattern, $rawData, $matchedGene)) {
    $genotype = $matchedGene[3];
// Do something with genotype
}
}
NOTE: I stripped out a lot of code to show just the regexp extraction I'm doing. I'm also inserting each row into a database as I go along. Here's the code with the database work included:
$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();
$query = "INSERT INTO wp_genomics_results (file_id,snp_id,genotype,reputation,zygosity) VALUES (?,?,?,?,?)";
$stmt = $ngdb->prepare($query);
$stmt->bind_param("iissi", $file_id,$snp_id,$genotype,$reputation,$zygosity);
$ngdb->query("START TRANSACTION");
while ($row = $rsids->fetch_assoc()) {
    $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";
    if (preg_match($searchPattern, $rawData, $matchedGene)) {
        $genotype = $matchedGene[3];
$stmt->execute();
$insert++;
}
}
$stmt->close();
$ngdb->query("COMMIT");
$rsids->free();
$ngdb->close();
}
So unfortunately my script runs very slowly: 50 iterations take 17 seconds, so you can imagine how long 18,000 iterations will take. I'm looking into ways to optimise this.
Is there a faster way to extract the data I need from this huge text file? What if I explode it into an array of lines, and use preg_grep(), would that be any faster?
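To make the question concrete, this is the kind of alternative I have in mind: parse the file once into an associative array keyed by rsid, then look each rsid up in O(1) instead of regex-scanning the whole string per iteration. A rough sketch (the function name and column handling are my own; I split on whitespace since the file may use tabs or spaces):

```php
<?php
// Parse the genome file once into a lookup table: rsid => genotype.
// Assumes four whitespace-separated columns: rsid, chromosome, position, genotype.
function build_genotype_index(string $path): array
{
    $index = [];
    $handle = fopen($path, 'r');
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and the header/comment line
        }
        $cols = preg_split('/\s+/', $line);
        if (count($cols) >= 4) {
            $index[$cols[0]] = $cols[3]; // rsid => genotype
        }
    }
    fclose($handle);
    return $index;
}
```

Then the main loop becomes a plain array lookup, e.g. `$genotype = $index["rs{$row['rsid']}"] ?? null;`, with no regex work at all.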
Something I tried is combining all 18,000 rsids into a single expression (i.e. (rs123|rs124|rs125)) like this:
$rsids = get_rsids();
$rsid_group = implode('|', $rsids);
$pattern = "~({$rsid_group})\t(.*?)\t(.*?)\t(.*?)\n~i";
preg_match_all($pattern, $rawData, $matches);
But unfortunately it gave me an error message about exceeding the PCRE expression limit. The needle was way too big. Another thing I tried was adding the S modifier to the expression. I read that it analyses the pattern in order to increase performance. It didn't speed things up at all; maybe my pattern isn't compatible with it?
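One variation I could try to get around the PCRE size limit: split the rsid list into chunks and run one combined pattern per chunk. This is my own sketch (the chunk size of 500 is a guess to stay under the limit, and I'm assuming the list holds the full "rs…" ids as they appear in the file):

```php
<?php
// Work around the PCRE pattern-size limit: run one alternation per chunk
// of rsids instead of one giant pattern for all 18,000.
function match_in_chunks(string $rawData, array $rsids, int $chunkSize = 500): array
{
    $found = [];
    foreach (array_chunk($rsids, $chunkSize) as $chunk) {
        $group = implode('|', array_map('preg_quote', $chunk));
        // ^/$ with the m modifier anchor each alternative to a whole line,
        // so "rs2" can't accidentally match the start of "rs22".
        $pattern = "~^({$group})\t(.*?)\t(.*?)\t(.*?)$~m";
        if (preg_match_all($pattern, $rawData, $m, PREG_SET_ORDER)) {
            foreach ($m as $match) {
                $found[$match[1]] = $match[4]; // rsid => genotype
            }
        }
    }
    return $found;
}
```

Note this needs preg_match_all rather than preg_match, since each chunk can match many lines.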
So then the second thing I need to try and optimise is the database inserts. I added a transaction hoping that would speed things up, but it made no difference. So I'm thinking maybe I should group the inserts together, so that I insert multiple rows at once rather than inserting them individually.
Then another idea is something I read about: using LOAD DATA INFILE to load rows from a text file. In that case I'd just need to generate the text file first. I wonder whether generating a text file would work out faster in this case.
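Roughly what I imagine for the LOAD DATA INFILE route: write the rows out as a tab-separated file, then bulk-load it. This is only a sketch (the file name is a placeholder, and LOCAL needs local_infile enabled on both client and server):

```php
<?php
// Write rows of [file_id, snp_id, genotype, reputation, zygosity]
// to a tab-separated file suitable for LOAD DATA INFILE.
function write_tsv(string $path, array $rows): void
{
    $fh = fopen($path, 'w');
    foreach ($rows as $row) {
        fwrite($fh, implode("\t", $row) . "\n");
    }
    fclose($fh);
}

// Sketch of the bulk-load statement ('results.tsv' is a placeholder name).
$loadSql = "LOAD DATA LOCAL INFILE 'results.tsv'
    INTO TABLE wp_genomics_results
    FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'
    (file_id, snp_id, genotype, reputation, zygosity)";
```

Bulk-loading a file like this is generally the fastest way to get a large number of rows into MySQL, so it may well beat even batched INSERTs.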
EDIT: It seems like what's taking up most of the time is the regular expressions. Running that part of the program by itself takes a really long time: 10 rows takes 4 seconds.