doujiang2643 2011-06-06 03:33

Extracting keywords from articles

I have articles and keywords stored in MySQL. The site will preprocess each new article to find which keywords it contains, then update a table that stores the keywords related to that article. The front end will use this table to highlight keywords within the article and to link users to other articles that share the same keywords.

My concern here is how to do this processing efficiently. My idea is: when processing a new article, generate the n-grams of the text (up to 3- or 4-grams) and then search for each one in the keywords table in the MySQL database. This may end up being a slow mess; I haven't tried it. But maybe I'm approaching this the wrong way?
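To make the n-gram idea concrete, here is a minimal sketch of the generation step. It is written in Python for illustration only (the same loops translate directly to PHP), and the `tokenize` and `ngrams` helpers, the regex-based tokenization, and the sample sentence are all assumptions, not part of the original setup.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simple assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, max_n=3):
    """Return the set of distinct 1..max_n-grams as space-joined strings."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical sample text; a real article would come from the database.
candidates = ngrams(tokenize("MySQL full text search in PHP"), max_n=2)
```

Collecting the n-grams into a set removes duplicates, so each candidate phrase only has to be checked against the keywords once per article, however often it occurs.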

Any resources on how to do this efficiently would be awesome. Language used here is primarily PHP.


  • dongling2038 2011-06-06 12:37

    I've never used PHP to do it, but in .NET I usually do what samxli alluded to: I load all the keywords into a hashtable. I've done it with up to 120,000 keywords and it works pretty fast.

    The .NET Hashtable object has a Contains(key) method. So for each word in the article I just call:

    theHashTable.Contains(theWord)
    

    If it does contain the word, I index it. This has worked pretty well for me without having to use other frameworks. I haven't worked with hashtables in PHP, but PHP's normal arrays are implemented as ordered hash maps, so using the keywords as array keys should give the same kind of fast lookup.

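    The key to using a hashtable is that lookups are fast: the key is hashed to find its bucket directly, so a lookup takes roughly constant time on average no matter how many keywords are loaded. (Hashtables don't use B-trees; those are the sorted structures that database indexes, including MySQL's default indexes, are typically built on.)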
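The lookup described in this answer can be sketched as follows. This is an illustrative Python version under stated assumptions, not the answerer's actual .NET code: `load_keywords` and `match_keywords` are hypothetical helpers, the keyword list stands in for rows read from MySQL, and multi-word keywords are handled by checking n-grams up to `max_n` words. In PHP, the analogous approach would be to store the keywords as array keys and test them with `isset()`.

```python
import re
from collections import Counter

def load_keywords(rows):
    """Build a hash-based set of normalized keywords (rows would come from MySQL)."""
    return {row.strip().lower() for row in rows}

def match_keywords(text, keywords, max_n=3):
    """Count how often each known keyword (up to max_n words long) occurs in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in keywords:  # hash lookup, constant time on average
                counts[gram] += 1
    return counts

# Hypothetical sample data for illustration.
keywords = load_keywords(["PHP", "hash table", "MySQL", "full text search"])
article = "A hash table in PHP makes keyword lookup fast. PHP arrays are hash tables."
matches = match_keywords(article, keywords)
```

Because the keywords are loaded into memory once, processing an article needs no per-n-gram database query; the resulting counts could then be written to the article-keyword table in a single batch. Note that this exact-match approach treats "hash tables" and "hash table" as different phrases, so stemming or plural handling would have to be added separately if it is needed.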

    This answer was selected as the best answer by the asker.