dongliugu8843 2014-12-03 08:08
Views: 28
Accepted answer

How do I get the number of unique words from a directory of text files in PHP?

I have a directory of text files. I want to loop through every text file in the directory and get the overall count of unique words (the vocabulary size) for ALL the files together, not the count for each individual file.

For example, I have three text files in a directory. Here are their contents:

file1.txt -> here is some text.

file2.txt -> here is more text.

file3.txt -> even more text.

So the count of unique words for this directory of text files in this case is 6.

I have tried to use this code:

$files = glob("C:\\wamp\\dir");

$out = fopen("mergedFiles.txt", "w");

foreach ($files as $file) {
    $in = fopen($file, "r");
    while ($line = fread($in)) {
        fwrite($out, $line);
    }
    fclose($in);
}

fclose($out);

The plan was to merge all the text files into mergedFiles.txt and then apply array_unique() to its contents. However, the code does not work.

How can I get the unique word count of all the text files in the directory in the best way possible?


1 answer

  • douyalin0847 2014-12-03 08:18

    You can try this:

    $allWords = array();
    
    foreach (glob("*.txt") as $filename) // loop over each .txt file in the current directory
    {
        $contents = file_get_contents($filename); // read the whole file into a string
        $words = explode(' ', $contents); // split on single spaces into an array of words
    
        // append this file's words to the global word list
        $allWords = array_merge($allWords, $words);
    }
    
    // number of distinct words across all files
    var_dump(count(array_unique($allWords)));
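
The question's glob("C:\\wamp\\dir") has no wildcard, so it matches only the directory itself, and fread() also requires a length argument. As a rough sketch of the same counting idea pointed at a specific directory, the function below builds a "*.txt" pattern from a directory path and collects words as array keys, so memory grows with the vocabulary rather than with the total number of words. The name countUniqueTokensInDir is hypothetical, and splitting on any whitespace with preg_split() is a substitution for the explode(' ', ...) used above. Punctuation is not stripped here, so "text." and "text" count as different tokens.

```php
<?php
// Hypothetical helper (not from the original answer): counts distinct
// whitespace-separated tokens across every .txt file directly inside $dir.
function countUniqueTokensInDir($dir)
{
    $seen = array(); // token => true, used as a set

    // glob() needs a wildcard pattern; a bare directory path
    // would match only the directory itself.
    $files = glob(rtrim($dir, '/\\') . '/*.txt');
    if ($files === false) {
        return 0; // glob() reported an error
    }

    foreach ($files as $filename) {
        $contents = file_get_contents($filename);
        if ($contents === false) {
            continue; // skip files that cannot be read
        }

        // Split on any run of whitespace (spaces, tabs, newlines)
        // and drop empty tokens.
        $tokens = preg_split('/\s+/', $contents, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $token) {
            $seen[$token] = true;
        }
    }

    return count($seen);
}
```

It could be called as countUniqueTokensInDir('C:/wamp/dir'), assuming the text files sit directly in that directory.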
    

    EDIT: Another version, which:

    • removes trailing dots
    • collapses multiple whitespace characters into a single space
    • separates two words when the space is missing between the end of one sentence and the start of the next

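    // Strip trailing dots from a word ("text." becomes "text").
    function removeDot($string) {
        return rtrim($string, '.');
    }
    
    // Inside the loop, replace the explode() line with the following.
    // Trim the contents, collapse whitespace runs into single spaces, then
    // insert a space after any dot that is directly followed by a letter.
    $normalized = preg_replace('#\.([a-zA-Z])#', '. $1', preg_replace('/\s+/', ' ', trim($contents)));
    $words = explode(' ', $normalized);
    $words = array_map("removeDot", $words);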
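
Putting the pieces of the EDIT together, a minimal sketch might look like the following. The function names extractWords and countUniqueWords are hypothetical, and the sample strings are the three file contents from the question, passed in directly instead of being read from disk. Dropping tokens that end up empty after the dots are removed is an added assumption, not part of the original snippet.

```php
<?php
// Hypothetical helper: turns one file's contents into a list of words,
// following the normalization steps listed above.
function extractWords($contents)
{
    // Collapse runs of whitespace (including newlines) into one space.
    $text = preg_replace('/\s+/', ' ', trim($contents));
    // Insert the missing space in "end.Next" so it splits into two words.
    $text = preg_replace('#\.([a-zA-Z])#', '. $1', $text);
    if ($text === '') {
        return array();
    }

    $words = explode(' ', $text);
    // Remove trailing dots from every word.
    $words = array_map(function ($w) {
        return rtrim($w, '.');
    }, $words);

    // Drop tokens that consisted only of dots.
    return array_values(array_filter($words, 'strlen'));
}

// Hypothetical helper: counts distinct words across several texts.
function countUniqueWords(array $texts)
{
    $all = array();
    foreach ($texts as $contents) {
        $all = array_merge($all, extractWords($contents));
    }
    return count(array_unique($all));
}

$sample = array("here is some text.", "here is more text.", "even more text.");
echo countUniqueWords($sample), "\n"; // prints 6
```

To run it against real files, the $sample array could be replaced by the result of calling file_get_contents() on each path returned by glob(). The comparison is case-sensitive, so "Here" and "here" would count as two words.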
    
    This answer was selected by the asker as the best answer.