doujiang2643 2011-06-06 03:33

Extracting keywords from an article

I have articles and keywords stored inside MySQL. The site will preprocess the new articles to find how many matching keywords there are and then update a table which stores the relevant keywords related to the article. This will then be used on the front-end by highlighting keywords within the article and will link users to articles with the same matching keywords.

My concern here is how to do this processing efficiently. My idea is: when processing a new article, find the n-grams of the text (up to 3- or 4-grams) and then search for each one in the keywords table in the MySQL database. This may end up being a slow mess; I haven't tried it. But maybe I'm approaching this the wrong way?
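
For illustration, the n-gram step described above could be sketched roughly as follows. This is Python rather than PHP, and the function name `ngrams`, the regex tokenizer and the `max_n` parameter are all assumptions made for the example, not part of the actual site.

```python
import re

def ngrams(text, max_n=3):
    """Return every 1..max_n word n-gram of text, lowercased, in order."""
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams
```

An article of W words yields fewer than `max_n` × W n-grams, so generating them is cheap; the cost lies in how each one is looked up afterwards.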

Any resources on how to do this efficiently would be awesome. Language used here is primarily PHP.

2 answers

  • dongling2038 2011-06-06 12:37

    I've never used PHP to do it, but in .NET I usually do what samxli alluded to: I load all the keywords into a hashtable. I've done it with up to 120,000 keywords and it works pretty fast.

    The .NET Hashtable object has a Contains(key) method. So for each word in the article I'll just call:

    theHashTable.Contains(theWord)
    

    If it does contain the word, I'll index it. This has worked pretty well for me without having to use other frameworks. I haven't used hashtables in PHP myself, but its normal arrays are implemented as hash tables, so checking a key with isset() should give you the same kind of fast lookup.

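    The key to using a hashtable is that lookups are fast: the key is run through a hash function that points straight at the bucket where the entry lives, so checking a word takes roughly constant time no matter how many keywords are loaded. (B-trees are a different structure, the one database indexes such as MySQL's typically use.) If you're not familiar with hashing, you might want to look that up.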
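
A minimal sketch of the lookup described above, in Python rather than .NET or PHP. The keyword list is hard-coded here, where a real site would load it once from the MySQL keywords table, and the names `KEYWORDS` and `count_keywords` are hypothetical.

```python
import re
from collections import Counter

# Assumption: in practice these would be loaded once from the keywords table.
KEYWORDS = {"mysql", "php", "hashtable"}

def count_keywords(article, keywords=KEYWORDS):
    """Count how often each known keyword occurs as a word in article."""
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    words = re.findall(r"[a-z0-9']+", article.lower())
    # Set membership is a hash lookup, so each test is O(1) on average.
    return Counter(w for w in words if w in keywords)
```

This only matches single-word keywords. For multi-word keywords, the same membership test can be applied to each 2-, 3- or 4-gram of the article instead of each word. In PHP, an array keyed by keyword and tested with `isset($keywords[$word])` plays the role of the set.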

    Selected by the asker as the best answer.