dongtang4954 2017-04-03 09:08

PHP - Text preprocessing for text mining is a slow process

I am doing text preprocessing for text mining on a large database. I want to build a word dictionary from all the articles in the database, stored as an array, but the process takes too long.

$multiMem   = memory_get_usage();
$xstart = microtime(TRUE);
$word = "";
$sql = mysql_query("SELECT * FROM tbl_content");
while($data = mysql_fetch_assoc($sql)){
  $word = $word."".$data['article'];
}

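// Note: in PHP 5 a method named after its class is also the constructor,
// so "new preprocess($word)" already runs the whole pipeline once.
$preprocess = new preprocess($word);
$word = $preprocess->preprocess($word);
print_r($word);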

$xfinish = microtime(TRUE);

Here is my preprocess class:

class preprocess {

  var $teks;

  function preprocess($teks){
  /*start process segmentation*/
  $teks = trim($teks);

  //remove punctuation
  $teks = str_replace("'", "", $teks);
  $teks = str_replace("-", "", $teks);
  $teks = str_replace(")", "", $teks);
  $teks = str_replace("(", "", $teks);
  $teks = str_replace("=", "", $teks);
  $teks = str_replace(".", "", $teks);
  $teks = str_replace(",", "", $teks);
  $teks = str_replace(":", "", $teks);
  $teks = str_replace(";", "", $teks);
  $teks = str_replace("!", "", $teks);
  $teks = str_replace("?", "", $teks);

  //remove HTML tags
  $teks = strip_tags($teks);
  $teks = preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $teks);
  /*end process segmentation*/

  /*start case folding*/
  $teks = strtolower($teks);

  $teks = preg_replace('/[0-9]+/', '', $teks);
  /*end case folding*/

  /*start of tokenizing*/
  $teks = explode(" ", $teks);

  /*end of tokenizing*/

  /*start of filtering*/
  //stopword
  $file = file_get_contents('stopword.txt', FILE_USE_INCLUDE_PATH);
  $stopword = explode("\n", $file);

  //remove stopword
  $teks = preg_replace('/\b('.implode('|',$stopword).')\b/','',$teks);

  /*end of filtering*/

  /*start of stemming*/
  require_once('stemming.php');
  foreach($teks as $t => $value){
    $teks[$t] = stemming($value);
  }
  /*end of stemming*/

  $teks = array_filter($teks);
  $teks = array_values($teks);

  return $teks;
 }
}

Does anyone have an idea how to make my program faster? Please help.
Thanks in advance.


1 answer

  • dongmaomou4117 2017-04-03 10:01

    There are a couple of things that might be improved:

    1. After building up $word, you can free the query result $sql and the row data:

      $word = '';
      $sql = mysql_query("SELECT * FROM tbl_content");
      while($data = mysql_fetch_assoc($sql)){
        $word = $word . $data['article'];
      }
      mysql_free_result($sql);
      unset($sql, $data);
      
    2. This block:

      $teks = str_replace("'", "", $teks);
      $teks = str_replace("-", "", $teks);
      $teks = str_replace(")", "", $teks);
      $teks = str_replace("(", "", $teks);
      $teks = str_replace("=", "", $teks);
      $teks = str_replace(".", "", $teks);
      $teks = str_replace(",", "", $teks);
      $teks = str_replace(":", "", $teks);
      $teks = str_replace(";", "", $teks);
      $teks = str_replace("!", "", $teks);
      $teks = str_replace("?", "", $teks);
      

    can be written like this:

        $teks = str_replace(array("'", '-', ')', '(', '=', '.', ',', ':', ';', '!', '?'), '', $teks);
    
    3. Since you later strip the digits with a regular expression, you could either add the digits to the str_replace call above or move the punctuation characters into the preg_replace:

      $teks = str_replace(array('0','1','2','3','4','5','6','7','8','9',"'",'-',')','(','=','.',',',':',';','!','?'), '', $teks);

      OR

      $teks = preg_replace('/[0-9\'\-()=.,:;!?]+/', '', $teks);

    4. $teks = strip_tags($teks); should be enough on its own. If it isn't, use only the preg_replace that follows it, since the two do roughly the same thing.

    5. Use file() instead of file_get_contents() followed by explode(), since file() returns an array directly. Pass FILE_IGNORE_NEW_LINES so that each entry does not keep its trailing newline, which would otherwise stop the patterns from matching. There is also no need to explode $teks before this step; preg_replace works on the whole string, and the split can happen afterwards, before stemming.

         $stopword = file('stopword.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
         array_walk($stopword, function(&$item1){
           $item1 = '/\b' . preg_quote($item1, '/') . '\b/';
         });
         $teks = preg_replace($stopword, '', $teks);

    6. In general, prefer single quotes over double quotes for plain strings: PHP parses double-quoted strings for variables and escape sequences, although the speed difference is small.

    7. If the stopword.txt list does not change, it is better and faster to keep it in the code as an array than to read it from the file system.
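
Putting the points above together, a minimal sketch (not the original poster's code) might handle each article separately instead of concatenating everything into one huge string, clean it with a single regular expression, and test stop words with an array-key lookup rather than one pattern per word. The key lookup is a swapped-in technique and is not part of the answer above. The function names tokenize_article and remove_stopwords, the sample articles, and the four-word stop list are hypothetical; the stemming() call from the question's stemming.php appears only as a comment because that file is not shown.

```php
<?php
// Sketch only: names, sample data, and the stop-word list are assumptions.

// Clean one article and split it into lowercase tokens.
function tokenize_article($text)
{
    // Strip tags first, while the angle brackets are still intact.
    $text = strip_tags($text);
    $text = strtolower($text);
    // One pass removes digits and the punctuation listed in the question.
    $text = preg_replace('/[0-9\'\-()=.,:;!?]+/', '', $text);
    // Split on any run of whitespace and drop empty tokens.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Keep only the tokens that are not keys of the stop-word set.
function remove_stopwords(array $tokens, array $stopwordSet)
{
    $kept = array();
    foreach ($tokens as $token) {
        if (!isset($stopwordSet[$token])) {
            $kept[] = $token;
        }
    }
    return $kept;
}

// Hypothetical stop words. A real list could be loaded with
// file('stopword.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES).
// array_flip turns the words into keys so isset() can look them up.
$stopwordSet = array_flip(array('the', 'a', 'is', 'of'));

// Stand-in for the rows fetched from tbl_content.
$articles = array(
    '<p>The price of Rice is 5000, a record!</p>',
    'Rain-fall (2017) is the lowest.',
);

$corpus = array();
foreach ($articles as $article) {
    $tokens = remove_stopwords(tokenize_article($article), $stopwordSet);
    foreach ($tokens as $token) {
        // The question's stemming($token) would be applied here.
        $corpus[] = $token;
    }
}

print_r($corpus);
```

With the sample input, $corpus ends up holding price, rice, record, rainfall, lowest. Working article by article keeps each string small, and the key lookup costs roughly the same per token no matter how long the stop-word list is.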
