dongsonglian7303 2012-05-23 14:16
Views: 119
Answer accepted

Keyword analysis in PHP

For a web application I'm building, I need to analyze a website, retrieve and rank its most important keywords, and display them.

Getting all words, their density and displaying those is relatively simple, but this gives very skewed results (e.g. stopwords ranking very high).

Basically, my question is: How can I create a keyword analysis tool in PHP which results in a list correctly ordered by word importance?


5 answers

  • douchuang1861 2012-05-23 14:16

    I've been working on this myself recently, and I'll try to explain what I did as best I can.

    Steps

    1. Filter text
    2. Split into words
    3. Remove 2 character words and stopwords
    4. Determine word frequency + density
    5. Determine word prominence
    6. Determine word containers
      1. Title
      2. Meta description
      3. URL
      4. Headings
      5. Meta keywords
    7. Calculate keyword value

    1. Filter text

    The first thing you need to do is make sure the encoding is correct, so convert the text to UTF-8:

    $text = iconv ($encoding, "UTF-8", $file); // $encoding is the current encoding of $file; iconv returns the converted string
    

    After that, you need to strip all HTML tags, punctuation, symbols and numbers. PHP's built-in strip_tags() and preg_replace() functions cover most of this.
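
    A minimal sketch of this clean-up, assuming the input is already UTF-8. The function name filterText is hypothetical and not part of my tool; it only illustrates one way to combine strip_tags(), html_entity_decode() and a Unicode-aware preg_replace():

    ```php
    <?php
    // Hypothetical helper: reduces UTF-8 HTML to lowercase words separated by single spaces.
    function filterText(string $html): string
    {
        // Drop <script> and <style> blocks first, because strip_tags() keeps their contents.
        $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', ' ', $html);
        // Put a space before every tag so "<p>a</p><p>b</p>" does not collapse into "ab".
        $text = strip_tags(str_replace('<', ' <', $text));
        // Turn entities such as &amp; into real characters.
        $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // Replace every run of non-letters (punctuation, symbols, digits) with one space.
        $text = preg_replace('/[^\p{L}]+/u', ' ', $text);
        return mb_strtolower(trim($text), 'UTF-8');
    }

    echo filterText('<h1>Hello, World!</h1><p>PHP 8 &amp; keywords.</p>');
    // prints: hello world php keywords
    ```

    Note that this also splits words at apostrophes ("don't" becomes "don" and "t"); the short fragments are removed in step 3 anyway.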

    2. Split into words

    $words = mb_split( ' +', $text );
    

    3. Remove 2 character words and stopwords

    Any word consisting of either 1 or 2 characters won't be of any significance, so we remove all of them.

    To remove stopwords, we first need to detect the language. There are a couple of ways we can do this:

    - Checking the Content-Language HTTP header
    - Checking the lang="" or xml:lang="" attribute
    - Checking the Language and Content-Language metadata tags

    If none of those are set, you can use an external API like the AlchemyAPI.
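
    The markup-based checks can be sketched with DOMDocument as follows. This is only an illustration: detectLanguage is a hypothetical name, the HTTP header and xml:lang checks are left out, and it assumes the page declares its language in one of the usual places:

    ```php
    <?php
    // Hypothetical helper: returns the primary language code declared in the
    // markup (e.g. "en" for "en-US"), or null when nothing is declared.
    function detectLanguage(string $html): ?string
    {
        libxml_use_internal_errors(true); // tolerate invalid real-world markup
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();

        $candidates = [];
        if ($doc->documentElement !== null) {
            $candidates[] = $doc->documentElement->getAttribute('lang');
        }
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            $name = strtolower($meta->getAttribute('name') ?: $meta->getAttribute('http-equiv'));
            if ($name === 'language' || $name === 'content-language') {
                $candidates[] = $meta->getAttribute('content');
            }
        }
        foreach ($candidates as $value) {
            // Keep the primary subtag only: "en-US" becomes "en".
            if (preg_match('/^\s*([a-z]{2,3})\b/i', $value, $m)) {
                return strtolower($m[1]);
            }
        }
        return null;
    }

    var_dump(detectLanguage('<html lang="en-US"><head><title>Demo</title></head><body></body></html>'));
    ```

    A null result is the signal to fall back to an external service.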

    You will need a list of stopwords per language, which can be easily found on the web. I've been using this one: http://www.ranks.nl/resources/stopwords.html
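
    As a rough sketch, both removals in this step can be done in one pass. The helper name filterWords and the four-word stopword list are made up for the example; a real list would be loaded per language:

    ```php
    <?php
    // Hypothetical helper: drops words of 1 or 2 characters and anything in $stopwords.
    function filterWords(array $words, array $stopwords): array
    {
        // Flip the list so each lookup is a hash check instead of a linear search.
        $stop = array_flip($stopwords);
        $kept = array_filter($words, function ($word) use ($stop) {
            return mb_strlen($word, 'UTF-8') > 2 && !isset($stop[$word]);
        });
        // array_filter() preserves keys; reindex so the result is a plain list.
        return array_values($kept);
    }

    // Tiny illustrative stopword list, not a real one.
    $stopwords = ['the', 'and', 'for', 'with'];
    $keywords = filterWords(['the', 'php', 'is', 'great', 'for', 'keyword', 'analysis'], $stopwords);
    print_r($keywords); // php, great, keyword, analysis
    ```

    The original $words array is left untouched, because the later steps still need the total word count and the original positions.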

    4. Determine word frequency + density

    To count the number of occurrences per word, use this:

    $uniqueWords = array_unique ($keywords); // $keywords is the $words array after the filtering from step 3
    $uniqueWordCounts = array_count_values ($words); // number of occurrences of every word, keyed by the word
    

    Now loop through the $uniqueWords array and calculate the density of each word, where $frequency is $uniqueWordCounts[$word]:

    $density = $frequency / count ($words) * 100;
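
    Put together, the loop for this step might look like the sketch below. The function name wordDensities and the sample arrays are only for illustration:

    ```php
    <?php
    // Hypothetical helper: maps each keyword to its frequency and its density in percent.
    function wordDensities(array $words, array $keywords): array
    {
        $total = count($words);
        if ($total === 0) {
            return []; // avoid dividing by zero on empty input
        }
        $counts = array_count_values($words);
        $result = [];
        foreach (array_unique($keywords) as $word) {
            $frequency = $counts[$word] ?? 0;
            $result[$word] = [
                'frequency' => $frequency,
                'density'   => $frequency / $total * 100,
            ];
        }
        return $result;
    }

    $words    = ['php', 'is', 'great', 'and', 'php', 'is', 'fast', 'too'];
    $keywords = ['php', 'great', 'php', 'fast', 'too'];
    print_r(wordDensities($words, $keywords));
    // php: frequency 2, density 25; great, fast and too: frequency 1, density 12.5
    ```

    The density is measured against all words, including the stopwords that were filtered out of $keywords.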
    

    5. Determine word prominence

    The word prominence is defined by the position of the words within the text. For example, the second word in the first sentence is probably more important than the 6th word in the 83rd sentence.

    To calculate it, add this code within the same loop from the previous step:

    $keys = array_keys ($words, $word); // zero-based positions of $word, the word we're currently at in the loop
    $positionSum = array_sum ($keys) + count ($keys); // sum of the one-based positions
    $prominence = (count ($words) - (($positionSum - 1) / count ($keys))) * (100 / count ($words)); // percentage; higher when the word appears earlier
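
    To see what the formula does, here it is wrapped in a small function and run on a ten-word text. The name wordProminence and the sample text are assumptions for this example; the only change to the formula is the strict flag on array_keys() and a guard for words that don't occur:

    ```php
    <?php
    // Hypothetical helper wrapping the prominence formula from this step.
    function wordProminence(array $words, string $word): float
    {
        $total = count($words);
        $keys = array_keys($words, $word, true); // zero-based positions, strict comparison
        if ($total === 0 || count($keys) === 0) {
            return 0.0; // the word does not occur at all
        }
        $positionSum = array_sum($keys) + count($keys); // sum of one-based positions
        return ($total - (($positionSum - 1) / count($keys))) * (100 / $total);
    }

    $words = ['php', 'keyword', 'analysis', 'tool', 'for', 'php', 'sites', 'and', 'more', 'tool'];
    echo wordProminence($words, 'php'), "\n";  // 70
    echo wordProminence($words, 'tool'), "\n"; // 35
    echo wordProminence($words, 'more'), "\n"; // 20
    ```

    A word that only occurs as the very first word scores 100, and the score drops as its average position moves towards the end of the text.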
    

    6. Determine word containers

    A very important part is to determine where a word resides - in the title, description and more.

    First, you need to grab the title, all metadata tags and all headings using something like DOMDocument or PHPQuery (don't try to use regex!). Then you need to check, within the same loop, whether these contain the word.
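
    A possible DOMDocument version is sketched below. The helper names extractContainers and containersFor, the sample markup and the example.com URL are all made up for the illustration, and the check is a plain case-insensitive substring match, so "php" would also match "phpunit":

    ```php
    <?php
    // Hypothetical helper: collects the text of each container from an HTML document.
    function extractContainers(string $html, string $url): array
    {
        libxml_use_internal_errors(true); // tolerate invalid real-world markup
        $doc = new DOMDocument();
        // loadHTML() assumes ISO-8859-1 unless the markup declares its charset.
        $doc->loadHTML($html);
        libxml_clear_errors();

        $containers = ['title' => '', 'description' => '', 'url' => $url, 'headings' => '', 'keywords' => ''];
        $title = $doc->getElementsByTagName('title')->item(0);
        if ($title !== null) {
            $containers['title'] = $title->textContent;
        }
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            $name = strtolower($meta->getAttribute('name'));
            if ($name === 'description' || $name === 'keywords') {
                $containers[$name] = $meta->getAttribute('content');
            }
        }
        foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
            foreach ($doc->getElementsByTagName($tag) as $heading) {
                $containers['headings'] .= ' ' . $heading->textContent;
            }
        }
        return $containers;
    }

    // Hypothetical helper: names of the containers whose text contains $word.
    function containersFor(string $word, array $containers): array
    {
        $found = [];
        foreach ($containers as $name => $text) {
            if ($text !== '' && mb_stripos($text, $word, 0, 'UTF-8') !== false) {
                $found[] = $name;
            }
        }
        return $found;
    }

    $html = '<html><head><title>PHP keyword tool</title>'
          . '<meta name="description" content="Analyse keywords in PHP"></head>'
          . '<body><h1>Keyword analysis</h1><p>Some text.</p></body></html>';
    $containers = extractContainers($html, 'http://example.com/php-keywords');
    print_r(containersFor('php', $containers));      // title, description, url
    print_r(containersFor('analysis', $containers)); // headings
    ```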

    7. Calculate keyword value

    The last step is to calculate a keyword's value. To do this, you need to weigh each factor: density, prominence and containers. For example:

    $value = (double) ((1 + $density) * ($prominence / 10)) * (1 + (0.5 * count ($containers)));
    

    This calculation is far from perfect, but it should give you decent results.
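
    To turn the values into a ranked list, something like the following sketch would work. The function name keywordValue and the numbers in $stats are illustrative only; real values would come from steps 4 to 6:

    ```php
    <?php
    // Hypothetical helper applying the example weighting from this step.
    function keywordValue(float $density, float $prominence, array $containers): float
    {
        return ((1 + $density) * ($prominence / 10)) * (1 + (0.5 * count($containers)));
    }

    // Made-up numbers for three words.
    $stats = [
        'more'     => ['density' => 12.5, 'prominence' => 20.0, 'containers' => []],
        'php'      => ['density' => 25.0, 'prominence' => 70.0, 'containers' => ['title', 'url']],
        'analysis' => ['density' => 12.5, 'prominence' => 80.0, 'containers' => ['headings']],
    ];

    $values = [];
    foreach ($stats as $word => $s) {
        $values[$word] = keywordValue($s['density'], $s['prominence'], $s['containers']);
    }
    arsort($values); // highest value first; the words stay attached as keys
    print_r($values); // php => 364, analysis => 162, more => 27
    ```

    Each container a word appears in adds 50% to its value, so the container weight is the first thing to tune if titles and headings dominate the list too much.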

    Conclusion

    I haven't mentioned every single detail of what I used in my tool, but I hope it offers a good view into keyword analysis.

    N.B. Yes, this was inspired by today's blog post about answering your own questions!
