duanpo7282 2014-03-31 08:08
Accepted

PHP: replacing common words in my file

I'm trying to build a tool where you enter a website URL and, when you click the submit button, it fetches all of the page's text with cURL.

After fetching the page, stripping the tags, and counting the words, I end up with an array named $frequency. If I echo it inside <pre> tags, it shows everything just fine. (NOTE: I save the contents to a file and read them back with $homepage = file_get_contents($file); that is what my code works with. I don't know whether this matters.)

However, I don't really care whether the word "or" appears 200 times on a website; I only want the important words. So I have made an array of all the common words, which ends up in the $common_words variable. But I can't find a way to replace every word in $frequency with "" if it also appears in $common_words.

I've found this piece of code after some research:

$string = 'sand band or nor and where whereabouts foo';
$wordlist = array("or", "and", "where");

foreach ($wordlist as &$word) {
    $word = '/\b' . preg_quote($word, '/') . '\b/';
}

$string = preg_replace($wordlist, '', $string);
var_dump($string);

If I copy and paste this, it works fine and removes "or", "and", and "where" from the string. But replacing $string with $frequency, or $wordlist with $common_words, either does not work or throws an error like: Delimiter must not be alphanumeric or backslash

I hope I've formulated my question properly. If not, please tell me!
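
A note on that error: preg_replace() expects every pattern to be wrapped in delimiters such as /.../. If a bare word like "or" is passed as a pattern, PCRE treats its first letter as the delimiter and raises exactly this warning. A minimal sketch, assuming $common_words is a flat array of plain words (the sample values here are hypothetical), builds a separate array of delimited patterns instead of modifying the list by reference:

```php
<?php
$string = 'sand band or nor and where whereabouts foo';
$common_words = array('or', 'and', 'where'); // hypothetical stand-in for the real list

// Wrap each word in delimiters and word boundaries; the original list stays untouched.
$patterns = array_map(function ($word) {
    return '/\b' . preg_quote($word, '/') . '\b/';
}, $common_words);

// Remove the common words, then collapse the gaps they leave behind.
$clean = preg_replace($patterns, '', $string);
$clean = trim(preg_replace('/\s+/', ' ', $clean));

var_dump($clean);
```

This only works on a string. An array of word counts needs a different approach, because the words there are stored as array entries rather than as text.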

Thanks in advance

EDIT: Alright, I've narrowed down the problem a lot. First of all, I forgot the & inside foreach ($wordlist as &$word) {

But since it was counting all the words, the words it replaced are still counted. See these 2 screenshots to see what I mean: http://imgur.com/oqqZR3h,xHEZKRz#0

3 answers

  • dousonghs58612 2014-03-31 08:45

    If I understand this correctly, you want to know how many occurrences each word has while ignoring the so-called common words.

    Assuming that $url is the page you will be running against and $common_words is your common words array, here is what you can do:

    // Get the page contents and strip the HTML tags
    $contents = strip_tags( file_get_contents($url) );
    
    // Split the contents into words; capture group 1 holds each word
    // without the trailing non-word character
    preg_match_all("/([\w]+[']?[\w]*)\W/", $contents, $words);
    
    $common_words = array('or', 'and', 'I', 'where');
    
    // Drop the common words before counting (the comparison is case-sensitive)
    $filtered = array_diff($words[1], $common_words);
    unset($words); // Release all that memory
    
    // Count occurrences of the remaining words
    $frequency = array_count_values($filtered);
    
    var_dump($frequency);
    

    At this point you will have an associative array that maps each non-common word to its number of occurrences.
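
    If $frequency has already been built, as in the question, the common words can also be dropped afterwards by key. This is a minimal sketch, assuming $frequency is keyed by word with the count as the value (the shape array_count_values() returns); the sample data is hypothetical:

    ```php
    <?php
    // Hypothetical sample data standing in for the real $frequency and $common_words.
    $frequency = array('or' => 200, 'php' => 12, 'and' => 150, 'curl' => 7);
    $common_words = array('or', 'and', 'where');

    // array_flip() turns the word list into keys, so array_diff_key()
    // can drop every entry of $frequency whose key is a common word.
    $important = array_diff_key($frequency, array_flip($common_words));

    arsort($important); // Most frequent words first, keys preserved
    print_r($important);
    ```

    The key comparison is case-sensitive, so "And" and "and" count as different words unless the text is lowercased before counting.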

    UPDATE

    A bit more about the RegEx. We need to match words. The simplest pattern is (\w+), but that won't match words like I've or haven't (notice the '). That is why I made it more complicated. Also, \w doesn't match dashes, so it misses words like 6-year-old.

    So I created a subgroup that matches word characters, including dashes and single quotes inside a word.

    (?:\w'\w|\w|-)
    

    The ?: at the beginning makes the group non-capturing, so it is not included in the results. That is because all I am doing is grouping the alternatives for word contents. To match an entire word, the RegEx matches one or more repetitions of the subgroup above:

    ((?:\w'\w|\w|-)+)
    

    So the preg_match_all() line should be:

    preg_match_all("/((?:\w'\w|\w|-)+)/", $contents, $words);
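
    To see what this pattern captures, here is a small sketch run against a hypothetical sample sentence (in the code above, $contents would come from strip_tags()):

    ```php
    <?php
    // Hypothetical sample text containing an apostrophe word and a dashed word.
    $contents = "I've seen a 6-year-old, haven't I?";

    preg_match_all("/((?:\w'\w|\w|-)+)/", $contents, $words);

    // $words[1] holds capture group 1, i.e. each matched word.
    print_r($words[1]);
    ```

    Here $words[1] contains I've, seen, a, 6-year-old, haven't and I. Note that a dash standing on its own also matches this pattern, so stray dashes in the text may need to be filtered out separately.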
    

    Hope this helps.

    This answer was selected by the asker as the best answer.