drouie2014 2016-10-04 10:23

How to search for a string in a large number of files using PHP

First, I am new to PHP, so I have little idea how to accomplish this. I have a folder in which txt files of varying size and content are constantly being created. I am trying to build a kind of "search engine" in PHP on a Linux system. So far I am using the code below.

if ( $_SERVER['REQUEST_METHOD'] == 'POST'){
    $path = '/example/files';
    $findThisString = $_POST['text_box'];
    $dir = dir($path);
    while (false !== ($file = $dir->read())){   
        if ($file != '.' && $file != '..'){
            if (is_file($path . '/' . $file)){
                $data = file_get_contents($path . '/' . $file);
                if (stripos($data, $findThisString) !== false){
                    echo '<p></p><font style="color:white; font-family:Arial">Found     Match - <a href="http://test.example.com/files/'. $file .'">'. $file .'</a>    <br>';
                }
            }
        }
    }
    // Close the handle inside the POST branch, where $dir is actually defined.
    $dir->close();
}

This code works great, with one problem: once the folder reaches around 40,000 files, the search takes a long time to return any results. I can't use shell commands such as grep. It has to be written in pure PHP like the code above.

Is there any way to optimize the code above so it runs faster? Or is there a better search function I can use in PHP?


  • dongzhu6900 2016-10-04 11:04

    There are many possible reasons why the script is slow, and what you need to do to speed it up depends entirely on which parts of the code cause the slowdown.
    That means you need to run the code through a profiler and then tweak the parts it reports as the bottleneck. Without a profiler, all we can do is guess, and not necessarily correctly.
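
If a full profiler is not available, a rough first measurement can be taken with a timer around each step. The sketch below is only an illustration: `time_call()` is a hypothetical helper that is not part of the original code, and it assumes PHP 7.3 or later for `hrtime()`.

```php
<?php
// Hypothetical helper (not from the original post): runs a callable and
// returns its result together with the elapsed wall-clock time in seconds.
function time_call(callable $fn): array
{
    $start = hrtime(true);                     // monotonic clock, in nanoseconds
    $result = $fn();
    $elapsed = (hrtime(true) - $start) / 1e9;  // convert nanoseconds to seconds
    return [$result, $elapsed];
}

// Example: time only the directory listing, separately from the file reads.
$path = sys_get_temp_dir();
[$files, $seconds] = time_call(function () use ($path) {
    // glob() returns false on error, so fall back to an empty list.
    return glob($path . '/*') ?: [];
});
printf("Listing %d entries took %.4f s\n", count($files), $seconds);
```

Timing the listing, the reads and the string search separately should show which of the guesses in this answer actually applies to your data.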

    As noted in the comments on your question, using a ready-made search engine would be the far better solution, especially one purpose-built for this kind of task, as it will cut the search time drastically.
    Even the grep command available in Linux shells would be an improvement.
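
To illustrate why an index-based engine is faster, here is a very small pure-PHP sketch of the idea. It is not a replacement for a real search engine: `build_index()` and `lookup()` are hypothetical names, the index matches whole words only (unlike the substring match of stripos()), and it would have to be rebuilt or updated whenever new files arrive.

```php
<?php
// Hypothetical sketch: build a "word => [file names]" map once, so that
// later lookups do not need to re-read every file.
function build_index(string $path): array
{
    $index = [];
    // glob() returns false on error, so fall back to an empty list.
    foreach (glob(rtrim($path, '/') . '/*') ?: [] as $file) {
        if (!is_file($file)) {
            continue;
        }
        // An unreadable file yields false, which the cast turns into ''.
        $text = strtolower((string) file_get_contents($file));
        // Split on anything that is not a "word character".
        $words = preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $word) {
            $index[$word][] = basename($file);
        }
    }
    return $index;
}

// Case-insensitive whole-word lookup; returns an empty list when not found.
function lookup(array $index, string $term): array
{
    return $index[strtolower($term)] ?? [];
}
```

Building the index still reads every file once, so the gain only appears when the same index answers many searches.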

    That said, I suspect your code is slow because you are reading and searching through the contents of every file in PHP. stripos() is a likely suspect here, as a case-insensitive search is rather slow.
    Another factor might be the read() calls in the loop, as I believe each call performs an I/O operation. A large number of echo calls can also slow a script down: a couple of hundred is not really noticeable, but a few thousand will be.
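
One way to reduce the cost of reading whole files is to scan each file in chunks and stop at the first match. The sketch below is an assumption-laden illustration, not a measured improvement: `file_contains()` and its 64 KiB default chunk size are hypothetical choices, and whether it is actually faster on your files is something only a measurement can tell.

```php
<?php
// Hypothetical helper: case-insensitive search of one file in fixed-size
// chunks, stopping at the first match instead of loading the whole file.
function file_contains(string $file, string $term, int $chunkSize = 65536): bool
{
    if ($term === '') {
        return false;
    }
    $handle = @fopen($file, 'rb');
    if ($handle === false) {
        return false; // unreadable files are treated as "no match"
    }
    $chunkSize = max(1, $chunkSize);
    $overlap = strlen($term) - 1;
    $tail = '';
    $found = false;
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $buffer = $tail . $chunk;
        if (stripos($buffer, $term) !== false) {
            $found = true;
            break;
        }
        // Keep the last bytes so a match spanning two chunks is not missed.
        $tail = $overlap > 0 ? substr($buffer, -$overlap) : '';
    }
    fclose($handle);
    return $found;
}
```

The main benefits are lower memory use on large files and an early exit once the term is found; for small files the difference is likely negligible.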

    Taking these last points into consideration, along with some general changes I recommend to make your code easier to maintain, I've made the following changes to your code.

    <?php
    
    if (isset ($_POST['text_box'])) {
        $path = '/example/files';
        $result = search_files ($_POST['text_box'], $path);
    }
    
    /**
     * Searches through the files in the given path, for the search term.
     *
     * @param string $term The term to search for, only "word characters" as defined by RegExp allowed.
     * @param string $path The path which contains the files to be searched.
     * 
     * @return string Either a list of links to the files, or an error message.
     */
    function search_files ($term, $path) {
        // Ensuring that we have a closing slash at the end of the path, so that
        // we can append a wildcard pattern for glob() to use.
        if (substr ($path, -1) != '/') {
            $path .= '/';
        }
    
        // If we don't have a valid/readable path we need to throw an error now.
        // This only happens if the code itself is wrong, as it's not user-supplied,
        // thus an exception is thrown.
        if (!is_dir ($path) || !is_readable ($path)) {
            throw new InvalidArgumentException ("Not a valid search path!");
        }
    
        // This should be validated to ensure you get sane input,
        // in order to avoid erroneous responses to the user and
        // possible attacks.
        // Added a simple test to ensure we only accept "word characters".
        if (!preg_match ('/^\w+\\z/', $term)) {
            // Invalid input. Show warning to user.
            return 'Not a valid search string.';
        }
    
        // Using glob so that we retrieve a list of all files in one operation.
        $contents = glob ($path.'*');
    
        // Using a holding variable, as this many echo statements take
        // noticeably longer than concatenating the strings and
        // echoing the result once.
        $output = '';
    
        // Using printf() templates to make the code easier to read.
        // Ideally the HTML-code shouldn't be in this string either, but adding
        // a templating system is far beyond the reach of this Q&A.
        $outTemplate = '<p class="found">Found Match - <a href="http://test.example.com/files/%1$s">%2$s</a></p>';
    
        foreach ($contents as $file) {
            // glob() returns each entry with $path already prepended, and the
            // '*' pattern does not match "." or "..", so we only skip non-files.
            if (!is_file ($file)) {
                continue;
            }
    
            // This one is the big issue. Reading all of the files one by one will take time!
            $data = file_get_contents ($file);
    
            // Skip files that could not be read.
            if ($data === false) {
                continue;
            }
    
            // Same with running a case-insensitive search!
            if (stripos ($data, $term) !== false) {
                // Only the file name belongs in the link, not the server-side path.
                $name = basename ($file);
    
                // Added output escaping to prevent issues with possible meta-characters.
                // (A problem also known as XSS attacks)
                $output .= sprintf ($outTemplate, htmlspecialchars (rawurlencode ($name)), htmlspecialchars ($name));
            }
        }
    
        // Lastly, if the output string is empty we haven't found anything.
        if (empty($output)) {
            return "Term not found";
        }
    
        return $output;
    }
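
One detail about glob() is easy to miss: it returns each match with the directory part of the pattern included, so the bare file name has to be recovered with basename() before it goes into a link. The sketch below shows this in isolation; `list_files()` is a hypothetical helper and `/example/files` is the placeholder path from the question.

```php
<?php
// Hypothetical helper: maps each regular file's name to the full path
// that glob() returned for it.
function list_files(string $path): array
{
    $files = [];
    // glob() returns false on error, so fall back to an empty list.
    foreach (glob(rtrim($path, '/') . '/*') ?: [] as $fullPath) {
        if (is_file($fullPath)) {
            $files[basename($fullPath)] = $fullPath;
        }
    }
    return $files;
}

// Prints one "name => full path" line per file; nothing if the folder is missing.
foreach (list_files('/example/files') as $name => $fullPath) {
    echo $name, ' => ', $fullPath, "\n";
}
```

Note that the `*` pattern also skips hidden files (names starting with a dot), which may or may not be what you want.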
    