dsc862009 2015-01-12 20:37
浏览 31
已采纳

使用PHP读取两个文件,数学计算和写出结果的最有效方法[关闭]

I have written a PHP script which opens two very large files (>1gb), both of which contain 4 columns. The script does a calculation on the corresponding values of each row in each file, and writes out the result to a third file.

My method is incredibly slow. I am using the SplFileObject to read the origin files and move the internal pointer line by line, as described in the code below.

It then writes out the result, line by line. However both the calculation and the write-out are very slow (the script is slow even if I disable the writes). I presume my method of file reading/writing are very inefficient and I'd appreciate tips for optimization.

function generate_adjusted($WGFile, $RFile) {

        // File Read objects
        $WGObj = new SplFileObject($WGFile);
        $RObj = new SplFileObject($RFile);

        // File write object
        $adjHandle = fopen("outputfile.txt", 'w+');

        foreach ($WGObj as $line) {
            // Line 0: ID1 (int), 1: ID2 (int), 2: NSNPs (int), 3: Relationship (real)
            $WGline = explode("\t", $WGObj->current());

            // Seek to the same line of second file
            $RObj->seek($WGObj->key());

            $Rline = explode("\t", $RObj->current());
            $A1 = floatval($WGline[2] * $WGline[3]);
            $A2 = floatval($Rline[2] * $Rline[3]);
            $ANSNP = $WGline[2] - $Rline[2];
            $A3 = round(floatval(($A1 - $A2) / $ANSNP), 3);

            // Construct the adjusted line
            $adjLine = $WGline[0] . "\t" . $WGline[1] . "\t" . $ANSNP . "\t" . $A3 . "
";

            fwrite($adjHandle, $adjLine);
        }
        fclose($adjHandle);     
}

generate_adjusted('inputfile1.txt', 'inputfile2.txt');
  • 写回答

1条回答 默认 最新

  • doukuiqian5345 2015-01-12 21:15
    关注

    First and best advice: benchmark it!

    Do not take any advice you get as a definitive fact (not even mine). Performance will vary based on your operating system, hardware and PHP version.

    The following should be a fast approach and directly includes a micro-benchmark for you. Please test it and let us know.

    <?php
    
    $start = microtime(true);
    
    function get_file_handle($file, $mode) {
        $h = fopen(__DIR__ . DIRECTORY_SEPARATOR . $file, "{$mode}b");
        if (!$h) {
            trigger_error("Could not read {$file}.", E_USER_ERROR);
        }
    
        // Make sure nobody else is reading or writing to our file.
        if (flock($h, LOCK_SH | LOCK_EX) === false) {
            trigger_error("Could not acquire lock for {$file}", E_USER_ERROR);
        }
    
        return $h;
    }
    
    // We only want to read and not write.
    $input_handle1 = get_file_handle("input1", "r");
    $input_handle2 = get_file_handle("input2", "r");
    
    // We only want to write and not read.
    $output_handle = get_file_handle("output", "w");
    
    // Read from both files at the same time the next line.
    // NOTE: This only works if lines are always corresponding in both files.
    while (($buffer1 = fgets($input_handle1)) !== false && ($buffer2 = fgets($input_handle2)) !== false) {
        $buffer1 = explode("\t", $buffer1);
        $buffer2 = explode("\t", $buffer2);
    
        // Forget floatval, let PHP do its dynamic casting.
        // NOTE: If precision is important use e.g. bcmath!
        $a1 = $buffer1[2] * $buffer1[3];
        $a2 = $buffer2[2] * $buffer2[3];
        $ansnp = $buffer1[2] - $buffer2[2];
        $a3 = round(($a1 - $a2) / $ansnp, 3);
    
        if (fwrite($output_handle, "{$buffer1[0]}\t{$buffer1[1]}\t{$ansnp}\t{$a3}
    ") === false) {
            trigger_error("Could not write result to output file.", E_USER_ERROR);
        }
    }
    
    // Release locks on and close all file handles.
    foreach (array($input_handle1, $input_handle2, $output_handle) as $delta => $handle) {
        if (flock($handle, LOCK_UN) === false) {
            trigger_error("Could not release lock!", E_USER_ERROR);
        }
        if (fclose($handle) === false) {
            trigger_error("Could not close file handle!", E_USER_ERROR);
        }
    }
    
    echo "Finished processing after " , (microtime(true) - $start) , PHP_EOL;
    

    Of course this could be done in OO fashion as well with exceptions etc.

    Line Buffering

    // Determines how many lines to buffer between each calculation/write.
    $lines_to_buffer = 1000;
    
    while (!feof($input_handle1) && !feof($input_handle2)) {
        $c1 = $c2 = 0;
    
        // Read lines from first handle, then read files from second handle.
        // NOTE: Reading multiple lines from the same file in a row allows us to make best use of the hard disk, if it isn't
        // an SSD, since we consecutively read from the same location which yields minimum seeks. But also keep in mind that
        // this might not be true if multiple processes are running in parallel, since they might read from different files
        // at the same time.
        foreach (array(1 => $input_handle1, 2 => $input_handle2) as $i => $handle) {
            while (($line = fgets($handle)) !== false) {
                ${"buffer{$i}"}[] = explode("\t", $line);
    
                // Break if we read enough lines.
                if (++${"c{$i}"} === $lines_to_buffer) {
                    break;
                }
            }
        }
    
        // Validate?
        if ($c1 !== $c2) {
            trigger_error("Lines from input files differ, aborting.", E_USER_ERROR);
        }
    
        for ($i = 0; $i < $lines_to_buffer; ++$i) {
            $a1 = $buffer1[$i][2] * $buffer1[$i][3];
            $a2 = $buffer2[$i][2] * $buffer2[$i][3];
            $ansnp = $buffer1[$i][2] - $buffer2[$i][2];
            $a3 = round(($a1 - $a2) / $ansnp, 3);
            $result .= "{$buffer1[0]}\t{$buffer1[1]}\t{$ansnp}\t{$a3}
    ";
        }
        fwrite($output_handle, $result);
    
        // Reset
        $result = $buffer1 = $buffer2 = null;
    }
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 phython路径名过长报错 不知道什么问题
  • ¥15 深度学习中模型转换该怎么实现
  • ¥15 HLs设计手写数字识别程序编译通不过
  • ¥15 Stata外部命令安装问题求帮助!
  • ¥15 从键盘随机输入A-H中的一串字符串,用七段数码管方法进行绘制。提交代码及运行截图。
  • ¥15 TYPCE母转母,插入认方向
  • ¥15 如何用python向钉钉机器人发送可以放大的图片?
  • ¥15 matlab(相关搜索:紧聚焦)
  • ¥15 基于51单片机的厨房煤气泄露检测报警系统设计
  • ¥15 Arduino无法同时连接多个hx711模块,如何解决?