dongshi3605 2012-04-21 14:08
Views: 43
Accepted

Find the most frequently occurring string in a text file

I've seen earlier questions about finding the most frequent occurrence of a string within a file, but all of them rely on knowing what to look for.

I have what you might almost call a flat-file database: it grabs a bunch of input data and wraps different parts of it in HTML span tags with referencing ids.

Each line comes out in this kind of fashion:

<p>
<span class="ip">58.106.**.***</span> 
Wrote <span class='text'>some text</span>
<span class='effect1'> and caused seizures </span>
<span class='time'>23:47</span> 
</p>

How would I then go about finding the contents of the 'text' span that occur the most times?

For example, if I had

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>woof</span>
    <span class='effect1'> and caused seizures </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and caused mind-splosion </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and used no effect </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and used no effect </span>
    <span class='time'>23:47</span> 
    </p>

The output would be 'meow'.

How would I accomplish this in PHP?

2 answers

  • duanqu9292 2012-04-21 14:25

    First off: Your format is not conducive to this type of data manipulation; you might want to consider changing it.

    That said, given this structure, the logical solution is to use DOMXPath, as Dani says. The duplicate ids could have been a problem, but in practice it works (after emitting a boatload of warnings, which is one more reason the data structure deserves revision).

    Here's some code to go with the idea:

    // get_input() stands in for however you read the file contents
    $input = '<body>'.get_input().'</body>';
    $doc = new DOMDocument;
    $doc->loadHTML($input); // lots of warnings, duplicate ids!
    $xpath = new DOMXPath($doc);
    $result = $xpath->query("//*[@id='text']/text()");
    
    // Count how many times each distinct text occurs
    $occurrences = array();
    foreach ($result as $item) {
        if (!isset($occurrences[$item->wholeText])) {
            $occurrences[$item->wholeText] = 0;
        }
        $occurrences[$item->wholeText]++;
    }
    
    // Sort by count, highest first, and rewind to the first (most common) entry
    arsort($occurrences);
    reset($occurrences);
    
    echo "The most common text is '".key($occurrences).
         "', which occurs ".current($occurrences)." times.";
    

    See it in action.
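
The parser warnings mentioned above can be kept out of the output by switching libxml to internal error handling before loading, and the counting loop can be replaced with array_count_values(). What follows is only a minimal sketch, assuming the markup is already available as a string; the helper name most_common_text() and the sample input are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper: returns [text, count] for the most frequent
// text node matched by $query, or null if nothing matches.
function most_common_text(string $html, string $query): ?array
{
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument;
    $doc->loadHTML('<body>' . $html . '</body>');

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        // Surrounding whitespace is trimmed so " meow " and "meow" count together.
        $texts[] = trim($node->nodeValue);
    }
    if ($texts === []) {
        return null;
    }

    $counts = array_count_values($texts); // text => number of occurrences
    arsort($counts);                      // highest count first
    $top = array_key_first($counts);

    return [(string) $top, $counts[$top]];
}

// Made-up sample input with duplicate ids, as in the original markup.
$sample = "<p>Wrote <span id='text'>woof</span></p>"
        . "<p>Wrote <span id='text'>meow</span></p>"
        . "<p>Wrote <span id='text'>meow</span></p>";

[$text, $count] = most_common_text($sample, "//*[@id='text']/text()");
echo "The most common text is '$text', which occurs $count times.\n";
```

Because the query is a parameter, the same helper should also work when the span is identified by class, by passing "//*[@class='text']/text()" instead.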

    Update (seeing as you fixed the duplicate id issue): simply change the XPath query to "//*[@class='text']/text()" so that it continues to match. However, this way of doing things remains inefficient, so if one or more of these apply:

    • you are going to do this all the time
    • you have lots of data
    • you need it to be really fast

    then changing the data format is a good idea.
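
The answer does not say which format to switch to. Purely as an illustration, one possibility is to store one JSON object per line and count the "text" field while streaming the file, which avoids HTML parsing altogether. The field names, the helper names append_record() and count_texts(), and the temporary file are all assumptions made for this sketch.

```php
<?php
// Hypothetical format: one JSON object per line, e.g.
// {"ip":"58.106.**.***","text":"meow","time":"23:47"}

// Append one record to the log file as a single JSON line.
function append_record(string $file, array $record): void
{
    file_put_contents($file, json_encode($record) . "\n", FILE_APPEND | LOCK_EX);
}

// Stream the file line by line and count each value of the "text" field.
// Returns text => count, sorted with the highest count first.
function count_texts(string $file): array
{
    $counts = [];
    $handle = fopen($file, 'r');
    if ($handle === false) {
        return $counts;
    }
    while (($line = fgets($handle)) !== false) {
        $record = json_decode($line, true);
        if (!is_array($record) || !isset($record['text']) || !is_string($record['text'])) {
            continue; // skip blank or malformed lines
        }
        $text = $record['text'];
        $counts[$text] = ($counts[$text] ?? 0) + 1;
    }
    fclose($handle);

    arsort($counts);
    return $counts;
}

// Usage with made-up data written to a temporary file.
$file = tempnam(sys_get_temp_dir(), 'log');
foreach (['woof', 'meow', 'meow', 'meow'] as $text) {
    append_record($file, ['ip' => '58.106.**.***', 'text' => $text, 'time' => '23:47']);
}

$counts = count_texts($file);
$top = array_key_first($counts);
echo "The most common text is '$top', which occurs {$counts[$top]} times.\n";
unlink($file);
```

If the counts are needed very often, keeping a running tally at write time would be another option; which one fits depends on how the data is used elsewhere.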

    This answer was selected as the best answer by the asker.