dongshi3605 2012-04-21 14:08
Views: 43
Accepted

Finding the most frequently occurring string in a text file

I've seen earlier questions about finding the most frequently occurring string in a file, but all of them rely on knowing in advance what to look for.

I have what you might almost call a flat-file database: it grabs a bunch of input data and wraps different parts of it in HTML span tags with referencing ids.

Each line comes out in this kind of fashion:

<p>
<span class="ip">58.106.**.***</span> 
Wrote <span class='text'>some text</span>
<span class='effect1'> and caused seizures </span>
<span class='time'>23:47</span> 
</p>

How would I then go about finding the contents of the 'text' span that occur the most times?

i.e., if I had

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>woof</span>
    <span class='effect1'> and caused seizures </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and caused mind-splosion </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and used no effect </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and used no effect </span>
    <span class='time'>23:47</span> 
    </p>

the output would be 'meow'.

How would I accomplish this in PHP?

2 answers

  • duanqu9292 2012-04-21 14:25

    First off: Your format is not conducive to this type of data manipulation; you might want to consider changing it.

    That said, given this structure the logical solution is to use DOMXPath, as Dani says. The duplicate ids could have been a problem, but in practice it works (after emitting a boatload of warnings, which is one more reason the data structure warrants revision).

    Here's some code to go with the idea:

    // get_input() stands in for however you read the data,
    // e.g. file_get_contents() on the data file.
    $input = '<body>'.get_input().'</body>';
    $doc = new DOMDocument;
    $doc->loadHTML($input); // lots of warnings, duplicate ids!
    $xpath = new DOMXPath($doc);
    // Select the text nodes inside every element with id="text"
    $result = $xpath->query("//*[@id='text']/text()");

    // Tally how many times each distinct text occurs
    $occurrences = array();
    foreach ($result as $item) {
        if (!isset($occurrences[$item->wholeText])) {
            $occurrences[$item->wholeText] = 0;
        }
        $occurrences[$item->wholeText]++;
    }

    // Sort by count, highest first, and rewind to the first entry
    arsort($occurrences);
    reset($occurrences);

    echo "The most common text is '".key($occurrences).
         "', which occurs ".current($occurrences)." times.";

    See it in action.
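
    If the warnings mentioned above get in the way, one option is to have libxml collect them internally instead of printing them. This is only a minimal sketch: the $html string is a made-up sample with a deliberately duplicated id, not your real data.

```php
<?php
// Hypothetical sample with a deliberately duplicated id.
$html = "<body><p><span id='text'>woof</span></p>"
      . "<p><span id='text'>meow</span></p></body>";

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);

// Look at what the parser complained about, then clear the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error-handling mode was active before.
libxml_use_internal_errors($previous);

printf("Parser reported %d issue(s).\n", count($errors));
```

    This only hides the symptom; the markup is still invalid, so the advice about revising the format stands.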

    Update (seeing as you fixed the duplicate id issue): simply change the XPath query to "//*[@class='text']/text()" so that it continues to match. However, this way of doing things remains inefficient, so if one or more of these apply:

    • you are going to do this all the time
    • you have lots of data
    • you need it to be really fast

    then changing the data format is a good idea.
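
    For the class-based markup, the whole thing might be wrapped up roughly like this. It is a sketch under assumptions, not a definitive implementation: mostCommonText() and $sample are names invented here, the class name is assumed to contain no quote characters, and array_count_values() replaces the manual tally loop. array_key_first() needs PHP 7.3 or later.

```php
<?php
/**
 * Return the most frequent text inside elements carrying the given class,
 * or null when nothing matches. How ties are ordered is not specified here.
 */
function mostCommonText(string $html, string $class): ?array
{
    $previous = libxml_use_internal_errors(true); // keep parser noise quiet
    $doc = new DOMDocument;
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, so class="text big" also counts.
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    $values = [];
    foreach ($xpath->query($query) as $node) {
        $values[] = trim($node->textContent);
    }
    if ($values === []) {
        return null;
    }

    $counts = array_count_values($values); // value => number of occurrences
    arsort($counts);                       // highest count first
    $top = array_key_first($counts);
    return ['text' => (string) $top, 'count' => $counts[$top]];
}

// Hypothetical sample in the same shape as the question's data.
$sample = "<p>Wrote <span class='text'>woof</span></p>"
        . "<p>Wrote <span class='text'>meow</span></p>"
        . "<p>Wrote <span class='text'>meow</span></p>";

$best = mostCommonText($sample, 'text');
echo "The most common text is '{$best['text']}', which occurs {$best['count']} times.\n";
```

    This still parses the entire file on every call, so the efficiency caveat above is unchanged.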
