douzi9211 2016-10-08 16:05
浏览 153

PHP - 优化数千种模式的preg_match

So I wrote a script to extract data from raw genome files, heres what the raw genome file looks like:

# rsid  chromosome  position    genotype
rs4477212   1   82154   AA
rs3094315   1   752566  AG
rs3131972   1   752721  AG
rs12124819  1   776546  AA
rs11240777  1   798959  AG
rs6681049   1   800007  CC
rs4970383   1   838555  AC
rs4475691   1   846808  CT
rs7537756   1   854250  AG
rs13302982  1   861808  GG
rs1110052   1   873558  TT
rs2272756   1   882033  GG
rs3748597   1   888659  CT
rs13303106  1   891945  AA
rs28415373  1   893981  CC
rs13303010  1   894573  GG
rs6696281   1   903104  CT
rs28391282  1   904165  GG
rs2340592   1   910935  GG

The raw text file has hundreds of thousands of these rows, but I only need specific ones, I need about 10,000 of them. I have a list of rsids. I just need the genotype from each line. So I loop through the rsid list and use preg_match to find the line I need:

    $rawData = file_get_contents('genome_file.txt');
    $rsids = $this->get_snps();

    while ($row = $rsids->fetch_assoc()) {

        $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)

        if (preg_match($searchPattern,$rawData,$matchedGene)) {

            $genotype = $matchedGene[3]);

            // Do something with genotype



NOTE: I stripped out a lot of code to just show the regexp extraction I'm doing. I'm also inserting each row into a database as I go along. Heres the code with the database work included:

    $rawData = file_get_contents('genome_file.txt');
    $rsids = $this->get_snps();

    $query = "INSERT INTO wp_genomics_results (file_id,snp_id,genotype,reputation,zygosity) VALUES (?,?,?,?,?)";
    $stmt = $ngdb->prepare($query);

    $stmt->bind_param("iissi", $file_id,$snp_id,$genotype,$reputation,$zygosity);

    $ngdb->query("START TRANSACTION");

    while ($row = $rsids->fetch_assoc()) {

        $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)

        if (preg_match($searchPattern,$rawData,$matchedGene)) {

            $genotype = $matchedGene[3]);







So unfortunately my script runs very slowly. Running 50 iterations takes 17 seconds. So you can imagine how long running 18,000 iterations is gonna take. I'm looking into ways to optimise this.

Is there a faster way to extract the data I need from this huge text file? What if I explode it into an array of lines, and use preg_grep(), would that be any faster?

Something I tried is combining all 18,000 rsids into a single expression (i.e. (rs123|rs124|rs125) like this:

    $rsids = get_rsids(); 

    $rsid_group = implode('|',$rsids);
    $pattern =  "~({$rsid_group })\t(.*?)\t(.*?)\t(.*?)

But unfortunately it gave me some error message about exceeding the PCRE expression limit. The needle was way too big. Another thing I tried is adding the S modifier to the expression. I read that this analyses the pattern in order to increase performance. It didn't speed things up at all. Maybe maybe pattern isn't compatible with it?

So then the second thing I need to try and optimise is the database inserts. I added a transaction hoping that would speed things up but it didn't speed it up at all. So I'm thinking maybe I should group the inserts together, so that I insert multiple rows at once, rather than inserting them individually.

Then another idea is something I read about, using LOAD DATA INFILE to load rows from a text file. In that case, I just need to generate a text file first. Would it work out faster to generate a text file in this case I wonder.

EDIT: It seems like whats taking up most time is the regular expressions. Running that part of the program by itself, it takes a really long time. 10 rows takes 4 seconds.

  • 写回答

3条回答 默认 最新

  • douxingsuo8809 2016-10-08 16:45

    This is slow because you're searching a vast array of data over and over again.

    It looks like you have a text file, not a dbms table, containing lines like these:

    rs4477212   1   82154   AA
    rs3094315   1   752566  AG
    rs3131972   1   752721  AG
    rs12124819  1   776546  AA

    It looks like you have some other data structure containing a list of values like rs4477212. I think that's already in a table in the dbms.

    I think you want exact matches for the rsxxxx values, not prefix or partial matches.

    I think you want to process many different files of raw data, and extract the same batch of rsxxxx values from each of them.

    So, here's what you do, in pseudocode. Don't load the whole raw data file into memory, rather process it line by line.

    1. Read your rows of rsid values from the dbms, just once, and store them in an associative array.
    2. for each file of raw data....
      1. for each line of data in the file...
        1. split the line of data to obtain the rsid. In php, $array = explode(" ", $line, 2); will yield your rsid in $array[0], and do it fast.
        2. Look in your array of rsid values for this value. In php, if ( array_key_exists( $array[0], $rsid_array )) { ... will do this.
        3. If the key does exist, you have a match.
        4. extract the last column from the raw text line ('GC or whatever)
        5. write it to your dbms.

    Notice how this avoids regular expressions, and how it processes your raw data line by line. You only have to touch each line of raw data once. That's good, because your raw data is also your largest quantity of data. It exploits php's associative array feature to do the matching. All that will be much faster than your method.

    To speed the process of inserting tens of thousands of rows into a table, read this. Optimizing InnoDB Insert Queries

    本回答被题主选为最佳回答 , 对您是否有帮助呢?



  • ¥15 MddBootstrapInitialize2失败
  • ¥15 LCD Flicker
  • ¥15 Spring MVC项目,访问不到相应的控制器方法
  • ¥15 esp32在micropython环境下使用ssl/tls连接mqtt服务器出现以下报错Connected on发生意外错误: 5无法连接到 MQTT 代理,如何解决?
  • ¥15 关于#genesiscsheel#的问题,如何解决?
  • ¥15 Android aidl for hal
  • ¥15 STM32CubeIDE下载程序报错
  • ¥15 微信好友如何转变为会员系统?(相关搜索:小程序)
  • ¥15 c# 直接使用c++ 类库文件
  • ¥15 一个主机电脑有两个显示器,当前有两个软件主体是网页html打包的exe程序,如何通过cmd命令bat同时打开软件一个在主屏幕显示,另外一个软件在2副屏幕上显示