douzi9211 2016-10-08 16:05
浏览 153
已采纳

PHP - 优化数千种模式的preg_match

So I wrote a script to extract data from raw genome files, heres what the raw genome file looks like:

# rsid  chromosome  position    genotype
rs4477212   1   82154   AA
rs3094315   1   752566  AG
rs3131972   1   752721  AG
rs12124819  1   776546  AA
rs11240777  1   798959  AG
rs6681049   1   800007  CC
rs4970383   1   838555  AC
rs4475691   1   846808  CT
rs7537756   1   854250  AG
rs13302982  1   861808  GG
rs1110052   1   873558  TT
rs2272756   1   882033  GG
rs3748597   1   888659  CT
rs13303106  1   891945  AA
rs28415373  1   893981  CC
rs13303010  1   894573  GG
rs6696281   1   903104  CT
rs28391282  1   904165  GG
rs2340592   1   910935  GG

The raw text file has hundreds of thousands of these rows, but I only need specific ones, I need about 10,000 of them. I have a list of rsids. I just need the genotype from each line. So I loop through the rsid list and use preg_match to find the line I need:

    $rawData = file_get_contents('genome_file.txt');
    $rsids = $this->get_snps();

    while ($row = $rsids->fetch_assoc()) {

        $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)
~i";

        if (preg_match($searchPattern,$rawData,$matchedGene)) {

            $genotype = $matchedGene[3]);

            // Do something with genotype

        }   

    }   

NOTE: I stripped out a lot of code to just show the regexp extraction I'm doing. I'm also inserting each row into a database as I go along. Heres the code with the database work included:

    $rawData = file_get_contents('genome_file.txt');
    $rsids = $this->get_snps();

    $query = "INSERT INTO wp_genomics_results (file_id,snp_id,genotype,reputation,zygosity) VALUES (?,?,?,?,?)";
    $stmt = $ngdb->prepare($query);

    $stmt->bind_param("iissi", $file_id,$snp_id,$genotype,$reputation,$zygosity);

    $ngdb->query("START TRANSACTION");

    while ($row = $rsids->fetch_assoc()) {

        $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)
~i";

        if (preg_match($searchPattern,$rawData,$matchedGene)) {

            $genotype = $matchedGene[3]);

            $stmt->execute();
            $insert++;

        }   


    }   

    $stmt->close();
    $ngdb->query("COMMIT");

    $snps->free();
    $ngdb->close();

}

So unfortunately my script runs very slowly. Running 50 iterations takes 17 seconds. So you can imagine how long running 18,000 iterations is gonna take. I'm looking into ways to optimise this.

Is there a faster way to extract the data I need from this huge text file? What if I explode it into an array of lines, and use preg_grep(), would that be any faster?

Something I tried is combining all 18,000 rsids into a single expression (i.e. (rs123|rs124|rs125) like this:

    $rsids = get_rsids(); 

    $rsid_group = implode('|',$rsids);
    $pattern =  "~({$rsid_group })\t(.*?)\t(.*?)\t(.*?)
~i";
    preg_match($pattern,$rawData,$matches);

But unfortunately it gave me some error message about exceeding the PCRE expression limit. The needle was way too big. Another thing I tried is adding the S modifier to the expression. I read that this analyses the pattern in order to increase performance. It didn't speed things up at all. Maybe maybe pattern isn't compatible with it?

So then the second thing I need to try and optimise is the database inserts. I added a transaction hoping that would speed things up but it didn't speed it up at all. So I'm thinking maybe I should group the inserts together, so that I insert multiple rows at once, rather than inserting them individually.

Then another idea is something I read about, using LOAD DATA INFILE to load rows from a text file. In that case, I just need to generate a text file first. Would it work out faster to generate a text file in this case I wonder.

EDIT: It seems like whats taking up most time is the regular expressions. Running that part of the program by itself, it takes a really long time. 10 rows takes 4 seconds.

  • 写回答

3条回答 默认 最新

  • douxingsuo8809 2016-10-08 16:45
    关注

    This is slow because you're searching a vast array of data over and over again.

    It looks like you have a text file, not a dbms table, containing lines like these:

    rs4477212   1   82154   AA
    rs3094315   1   752566  AG
    rs3131972   1   752721  AG
    rs12124819  1   776546  AA
    

    It looks like you have some other data structure containing a list of values like rs4477212. I think that's already in a table in the dbms.

    I think you want exact matches for the rsxxxx values, not prefix or partial matches.

    I think you want to process many different files of raw data, and extract the same batch of rsxxxx values from each of them.

    So, here's what you do, in pseudocode. Don't load the whole raw data file into memory, rather process it line by line.

    1. Read your rows of rsid values from the dbms, just once, and store them in an associative array.
    2. for each file of raw data....
      1. for each line of data in the file...
        1. split the line of data to obtain the rsid. In php, $array = explode(" ", $line, 2); will yield your rsid in $array[0], and do it fast.
        2. Look in your array of rsid values for this value. In php, if ( array_key_exists( $array[0], $rsid_array )) { ... will do this.
        3. If the key does exist, you have a match.
        4. extract the last column from the raw text line ('GC or whatever)
        5. write it to your dbms.

    Notice how this avoids regular expressions, and how it processes your raw data line by line. You only have to touch each line of raw data once. That's good, because your raw data is also your largest quantity of data. It exploits php's associative array feature to do the matching. All that will be much faster than your method.

    To speed the process of inserting tens of thousands of rows into a table, read this. Optimizing InnoDB Insert Queries

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(2条)

报告相同问题?

悬赏问题

  • ¥15 js调用html页面需要隐藏某个按钮
  • ¥15 ads仿真结果在圆图上是怎么读数的
  • ¥20 Cotex M3的调试和程序执行方式是什么样的?
  • ¥20 java项目连接sqlserver时报ssl相关错误
  • ¥15 一道python难题3
  • ¥15 牛顿斯科特系数表表示
  • ¥15 arduino 步进电机
  • ¥20 程序进入HardFault_Handler
  • ¥15 oracle集群安装出bug
  • ¥15 关于#python#的问题:自动化测试