PHP - 文本挖掘慢进程的文本预处理

i make a text preprocess on text mining with large database,, i want make a camus data from all article on database into array, but it take to long process.

$multiMem   = memory_get_usage();
$xstart = microtime(TRUE);
$word = "";
$sql = mysql_query("SELECT * FROM tbl_content");
while($data = mysql_fetch_assoc($sql)){
  $word = $word."".$data['article'];
}

$preprocess = new preprocess($word);
$word= $preprocess->preprocess($word);
print_r($kata);

$xfinish = microtime(TRUE);

here is my class preprocess

class preprocess {

  var $teks;

  function preprocess($teks){
  /*start process segmentation*/
  $teks = trim($teks);

  //menghapus tanda baca
  $teks = str_replace("'", "", $teks);
  $teks = str_replace("-", "", $teks);
  $teks = str_replace(")", "", $teks);
  $teks = str_replace("(", "", $teks);
  $teks = str_replace("=", "", $teks);
  $teks = str_replace(".", "", $teks);
  $teks = str_replace(",", "", $teks);
  $teks = str_replace(":", "", $teks);
  $teks = str_replace(";", "", $teks);
  $teks = str_replace("!", "", $teks);
  $teks = str_replace("?", "", $teks);

  //remove HTML tags
  $teks = strip_tags($teks);
  $teks = preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $teks);
  /*end proses segmentation*/

  /*start case folding*/
  $teks = strtolower($teks);

  $teks = preg_replace('/[0-9]+/', '', $teks);
  /*end case folding*/

  /*start of tokenizing*/
  $teks = explode(" ", $teks);

  /*end of tokenizing*/

  /*start of filtering*/
  //stopword
  $file = file_get_contents('stopword.txt', FILE_USE_INCLUDE_PATH);
  $stopword = explode("
", $file);

  //remove stopword
  $teks = preg_replace('/\b('.implode('|',$stopword).')\b/','',$teks);

  /*end of filtering*/

  /*start of stemming*/
  require_once('stemming.php');
  foreach($teks as $t => $value){
    $teks[$t] = stemming($value);
  }
  /*end of stemming*/

  $teks = array_filter($teks);
  $teks = array_values($teks);

  return $teks;
 }
}

anyone have any idea to make fast process on my program? pls help
thanks for advance

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

1条回答默认最新

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
dongmaomou4117 2017-04-03 10:01
关注
Ther are a couple of things that might be improoved...

After building up the $word you could free the query result $sql and the data

$word = ''; $sql = mysql_query("SELECT * FROM tbl_content"); while($data = mysql_fetch_assoc($sql)){ $word = $word . $data['article']; } mysql_free_result($sql); unset($sql, $data);

This block:

$teks = str_replace("'", "", $teks); $teks = str_replace("-", "", $teks); $teks = str_replace(")", "", $teks); $teks = str_replace("(", "", $teks); $teks = str_replace("=", "", $teks); $teks = str_replace(".", "", $teks); $teks = str_replace(",", "", $teks); $teks = str_replace(":", "", $teks); $teks = str_replace(";", "", $teks); $teks = str_replace("!", "", $teks); $teks = str_replace("?", "", $teks);

can be written like this:

$teks = str_replace(array('(','-',')',',','.','=',';','!','?'), '', $teks);

since you later in the code replace the numbers with a regular expression, you could add numbers in the upper str_replace call, or add the upper chars to the preg_replace

$teks = str_replace(array('0','1','2','3','4','5','6','7','8','9','(','-',')',',','.','=',';','!','?'), '', $teks);

OR

$teks = preg_replace('/[0-9,\-\=\.\,\;\!\?]+/', '', $teks);

$teks = strip_tags($teks); should be enough. If it isn'y then use just the preg_replace following, since it's doing kind of the same thing.

use file insted of the file_get_contentsfollowed by theexplodesince thefilereturns an array directly. Also there is no need to explode the $teks

$stopword = file('stopword.txt'); array_walk($stopword, function(&$item1){ $item1 = '/\b' . $item1 . '\b/'; }); $teks = preg_replace($stopword, '', $teks);

Generally don't use "" since the processor will try to evaluate the content and that takes longer.

If the stopword.txt list is not changing it is better and faster to have it in the code as an array directly then accessing the file system to read it.
解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

报告相同问题？

关注问题

中文文本分类数据预处理 python 有问必答
2022-04-13 06:36

回答 3 已采纳你的是简单清洗处理，如果要分词，用jieba模块可以满足你需求
php预处理语句抛出null错误的问题 php sql
2018-04-14 21:34

回答 2 已采纳 Bind param is all at once. bind_param("ssi", $string, $string, $number); S for string i for
如何在php预处理语句中显示值 mysql php
2016-09-22 09:08

回答 2 已采纳 It's strictly not needed to use prepared statements when not dealing with user-input, but there's
php psr 编码规范_PSR编码规范
2021-01-12 14:20

屠龙少女玖的博客目前我们国内比较出名的几个框架(Yii,Laravel) 都已经支持Composer并且加入了PHP-FI ... [转]PHP编码规范注:这是10年前的一篇PHP编码规范,最早发布于清华水木BBS,现在好像都找不到完整的版本了,但至今看起来仍是...
训练集文本分类预处理 python 有问必答
2022-04-17 08:58

回答 3 已采纳读取文档为字符串，使用re.sub剔除不需要的特殊符号，然后再剔除停用词，参考例子： import re import jieba data = 'Are these datasets? yes, t
带双引号的PHP / SQL预处理语句 mysql php sql
2016-09-29 15:59

回答 2 已采纳 Yes, indeed it's something very obvious. Just look at the generated HTML code and you will immedia
文本预处理，关键词提取时时报错 python 有问必答
2021-08-30 11:22

回答 3 已采纳 word, freq = line.strip().split(' ')这里报错是因为:一行字符串在分割后多于两个子字符串，所以报错。检查一下line的值，找出word和freq对应的索引，使用lin
PHP必备知识点
2019-07-30 10:03

千古~的博客当然有时候框架中的过滤方法我们不希望采用，比如使用文本编辑器时，我们可以使用自己的过滤方式。 15、mysql优化怎么做的？答：mysql优化主要从以下几个方面来实现： ①设计角度：存储引擎的选择，字段...
使用PHP预处理语句将数组插入MySQL数据库 mysql php
2016-05-04 22:14

回答 1 已采纳 Instead of replacing variable value you have to concatenate $changes = ''; foreach($_POST['chang
php更新查询使用预处理语句不起作用 mysql php
2017-04-09 16:51

回答 2 已采纳 Add $stmt->close(); before your next mysqli interaction. if(sendMail($email,$verify_key)) { /
英文文本统计及预处理 c++ 哈希算法数据结构有问必答
2021-07-01 10:13

回答 1 已采纳 #pragma warning(disable:4786) #include <iostream> #include <vector> #include <f
PHP 最新精品面试题
2019-01-13 21:13

咦呀的博客当然有时候框架中的过滤方法我们不希望采用，比如使用文本编辑器时，我们可以使用自己的过滤方式。 15、mysql优化怎么做的？答：mysql优化主要从以下几个方面来实现： ①设计角度：存储引擎的选择，...
如何在PHP预处理语句的每个步骤中添加一行？ php
2016-10-07 21:04

回答 1 已采纳 Declare a variable $total = 0 outside of the prepared statement and use it in the while() loop to
php面试题汇总
2018-10-12 13:57

faker_wang的博客当然有时候框架中的过滤方法我们不希望采用，比如使用文本编辑器时，我们可以使用自己的过滤方式。 15、mysql优化怎么做的？答：mysql优化主要从以下几个方面来实现： ①设计角度：存储引擎的选择，字段类型...
php面试题汇总（必会）
2017-11-08 19:51

pz-php的博客当然有时候框架中的过滤方法我们不希望采用，比如使用文本编辑器时，我们可以使用自己的过滤方式。 15、mysql优化怎么做的？答： mysql优化主要从以下几个方面来实现： ①设计角度：存储引擎的...
没有解决我的问题, 去提问

悬赏问题

¥20 有偿写代码要用特定的软件anaconda 里的jvpyter 用python3写
¥20 cad图纸，chx-3六轴码垛机器人
¥15 移动摄像头专网需要解vlan
¥20 access多表提取相同字段数据并合并
¥20 基于MSP430f5529的MPU6050驱动，求出欧拉角
¥20 Java-Oj-桌布的计算
¥15 powerbuilder中的datawindow数据整合到新的DataWindow
¥20 有人知道这种图怎么画吗？
¥15 pyqt6如何引用qrc文件加载里面的的资源
¥15 安卓JNI项目使用lua上的问题

PHP - 文本挖掘慢进程的文本预处理

1条回答 默认 最新

悬赏问题

1条回答默认最新