在Textblock中搜索给定关键字的最短通道

i have a task and i am not sure how i should solve the problem. I have an idead but i do not know if it is the best way to solve it.

Here is the task: Given is a block of text and some keywords to find. We need to find a passage where all words can be found and where least words are used. Only the letters from A-Z and a-z need to be taken into account.

Here is an example:
Textblock:
Ein toller Beispieltext ist der Blindtext. Er hat ein paar Wörter. Dies ist ein Beispieltext, der ein paar Wörter hat und auch noch ein paar mehr, um die Zeile etwas länger zu machen. Darüber hinaus ist er nur dafür da, um genügend Testtext zusammenzubekommen. Dem Text selbst macht das nicht so viel aus. Früher einmal mehr, als er noch nicht so selbstbewusst war. Heute kennt er seine Rolle als Blindtext und fügt sich selbstbewusst ein. Er ist ja irgendwie wichtig. Manchmal jedoch, ganz manchmal, weint er in der Nacht, weil er niemals bis zum Ende gelesen wird. Doch das hat ja jetzt zum Glück ein Ende.

And here the words that need to be found: ein Beispieltext der paar Wörter

The result would be Beispieltext der ein paar Wrter

Following passage would also be a passage where all the words can be found but it has more words inside of the passage and therefore is not the solution: Ein toller Beispieltext ist Blindtext. Er hat ein paar Wörter.

My idea is to cut all unnecessary letters and then split the textblock on spaces to have an array of all words. so i can get the position of the words and calculate how much words are inbetween the first occurence of one of the searched words and the first occurence of all other searched words. this way i would need to go through the whole array and compare all possible lengths of passages and just take the shortest one.

do you think this is the best approach or can you point me to a better idea how to solve this problem?

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

2条回答默认最新

douman9420 2016-05-21 12:34

关注

The algorithm you describe is could be OK but it is less clearly specified when it comes to "... this way I would need to go through the whole array".

Once you have done the clean up and splitting into words, It will be easier to create a map for the key words, so you can know quickly if a word from the text matches (with isset()). Then you could just reduce the text array to an array of matching words (using array_filter()), still keeping the index of where they appear in the original array of words.

The algorithm would then walk through that reduced array and keep track of a window (range) of these words. At the right side that window enlarges as long as not all necessary words are inside of it, and it shrinks at the left side when the left side word already occurs elsewhere in the window, or just after you have found a candidate solution. That way your window will travel through the whole (reduced) text array. You'll keep track only of the window that represents the shortest phrase. So at the end you have the optimal solution and just need to translate the window boundaries back to a phrase taken from the original text array.

Case insensitive matching can be achieved by storing things in lower case (with strtolower), and by using the original cased string (in array format) for generating the output.

Here is a function that implements the above described algorithm:

function findFragment($text, $words) {
    // Remove non-A-Z letters
    $text = preg_replace("/[^a-z ]/i", "", $text);
    $words = preg_replace("/[^a-z ]/i", "", $words);
    // Create a map keyed by the words to find, with as value 
    // the number of occurrences in current sub-phrase
    $words_map = array_fill_keys(str_word_count(strtolower($words), 2), 0);
    // Put all words of text in an array
    $text_arr = str_word_count($text, 1);
    $text_low_arr = str_word_count(strtolower($text), 1);
    // Filter only matching words from the text, keeping their original indexes.
    $matches = array_filter($text_low_arr, function ($word) use ($words_map) {
        return isset($words_map[$word]);
    });
    // How many distinct words need to be matched to have a candidate phrase
    $matches_left = count($words_map);
    // Keep track of how long the shortest phrase is
    $min_words = count($text_arr) + 1; // start "infinite"
    // Loop over all matching words as the last word of a possible phrase
    foreach($matches as $i => $match) {
        $phrase[$i] = $match; // Add to the phrase
        $words_map[$match]++; // Increase count for this particular word
        if ($words_map[$match] > 1) continue; // Nothing new was added
        // Additional word found
        $matches_left--;
        if ($matches_left) continue; // Still need more words
        // Phrase has all words
        // Remove words from left which occur elsewhere in the phrase
        while ($words_map[reset($phrase)] > 1) {
            $words_map[reset($phrase)]--;
            unset($phrase[key($phrase)]);
        }
        // How many words are in this phrase?
        $num_words = $i - key($phrase) +1;
        if ($num_words < $min_words) {
            // It is shorter than we had so far
            $min_words = $num_words;
            $best_start = key($phrase);
        }
        // Remove first word from phrase before finding new candidate phrases
        $words_map[reset($phrase)]--;
        unset($phrase[key($phrase)]);
        $matches_left++;
    }
    // return best result
    return implode(" ", array_slice($text_arr, $best_start, $min_words));
}

You would call it like this:

echo findFragment($text, $words);

For the sample data you have given in the question, it returns the desired answer:

Beispieltext der ein paar Wrter

See it run on eval.in.

本回答被题主选为最佳回答 , 对您是否有帮助呢?

查看更多回答(1条)

报告相同问题？

关注问题

在Textblock中搜索给定关键字的最短通道 php
2016-05-20 10:27

回答 2 已采纳 The algorithm you describe is could be OK but it is less clearly specified when it comes to "... t
如何在php中手动添加描述和关键字？ php
2018-03-22 12:57

回答 1 已采纳 If you just want to concatenate your meta to existing meta like with the title, define an array of
wpf textblock复制的时候卡住
2018-07-30 01:40

回答 2 已采纳 textbox赋值，超级大的字符串，的确是需要很长时间。记事本秒开是因为，它只是刷新了客户区可见的文本，而且如果你使用了“自动换行”，记事本同样不能秒开，因为这种情况，没办法知道文本的显示行数。
vue3中的ref，toRef，toRefs三个的作用
2022-08-02 16:34

周亚鑫的博客 ref、toRef、toRefs都可以将某个对象中的属性变成响应式数据ref的本质是拷贝，修改响应式数据，不会影响到原始数据，视图会更新的本质是引用，修改响应式数据，会影响到原始数据，视图不会更新toRef一次仅能设置一个...
在包含文件的文件夹中查找关键字 php
2017-06-21 10:38

回答 2 已采纳 Try this, $findThisString = stripcslashes($_POST["recherchemotcle"]); $path = "tuto"; $dir = open
使用php在数组中查找关键字？ php
2013-10-23 08:53

回答 2 已采纳 A simple preg_grep will do the job: $arr = array( 'Canon sample printing text', 'Captain
php什么是函数定义中的“database”关键字 php
2017-09-12 18:25

回答 1 已采纳 This is type hinting. It means that the variable, in this case $mysqli, should meet at least one o
PHP面试题(一)
2018-03-24 11:56

钟长森的博客双端队列中的元素可以从两端弹出，其限定插入和删除操作在表的两端进行。双向队列（双端队列）就像是一个队列，但是你可以在任何一端添加或移除元素。而双端队列是一种数据结构，定义如下： A deque is a data ...
如何在php mysql搜索中加粗关键字？ ajax javascript mysql php
2010-10-12 03:17

回答 3 已采纳 Just perform a str_replace on the returned text. $search = 'programming'; // $dbContent = the res
PHP - 在类之外使用关键字的区别，并在类中使用关键字 laravel php
2017-04-04 02:44

回答 1 已采纳 When you define a class, it can only have access to other classes within its namespaces. Your cont
如何在PHP中的关键字后面找到以下x个字符？ php
2014-04-24 18:45

回答 3 已采纳 Use strpos() to find the watch?v=, then use the substr() method to get 11 characters after. docs
php base
2010-09-23 01:52

折腾数据折腾代码的博客 PHPmySQL 函数库 ...建议在使用新版本的 PHP 前删除所有这些文件（比如从 Windows 系统文件夹删除 php.ini 和 PHP 相关的 DLL 文件）。务必在删除前备份这些文件，因为这一操作可能破坏整个系统。原
PHP在没有循环的情况下获取数组中最长的字符 php
2019-07-13 08:49

回答 4 已采纳 You could sort the strings by length using for example usort and get the first item using reset.
VLC帮助文档-中文详细版（vlc-2.2.6）
2018-12-24 16:52

回来的博客的博客 2017/8/24 vlc-help.txt ...您可以在命令行中指定多个流。他们将在播放列表中排队。指定的第一项将首先播放。选项样式： –option在程序持续时间内设置的全局选项。 -option全局–option的单字...
数据库系统工程师考点笔记
2021-04-14 10:52

@北海怪兽的博客本人在备战数据库系统工程师的考试，把刷历年真题碰到的知识点记录在这边。（目录是第四版教程）
没有解决我的问题, 去提问

悬赏问题

¥15 程序不包含适用于入口点的静态Main方法
¥15 素材场景中光线烘焙后灯光失效
¥15 请教一下各位，为什么我这个没有实现模拟点击
¥15 执行 virtuoso 命令后，界面没有，cadence 启动不起来
¥50 comfyui下连接animatediff节点生成视频质量非常差的原因
¥20 有关区间dp的问题求解
¥15 多电路系统共用电源的串扰问题
¥15 slam rangenet++配置
¥15 有没有研究水声通信方面的帮我改俩matlab代码
¥15 ubuntu子系统密码忘记

码龄粉丝数原力等级 --

在Textblock中搜索给定关键字的最短通道

2条回答默认最新

码龄粉丝数原力等级 --

悬赏问题

在Textblock中搜索给定关键字的最短通道

2条回答 默认 最新

悬赏问题

2条回答默认最新