dongshuo6503 2016-05-20 10:27
浏览 41
已采纳

在Textblock中搜索给定关键字的最短通道

i have a task and i am not sure how i should solve the problem. I have an idead but i do not know if it is the best way to solve it.

Here is the task: Given is a block of text and some keywords to find. We need to find a passage where all words can be found and where least words are used. Only the letters from A-Z and a-z need to be taken into account.

Here is an example:
Textblock:
Ein toller Beispieltext ist der Blindtext. Er hat ein paar Wörter. Dies ist ein Beispieltext, der ein paar Wörter hat und auch noch ein paar mehr, um die Zeile etwas länger zu machen. Darüber hinaus ist er nur dafür da, um genügend Testtext zusammenzubekommen. Dem Text selbst macht das nicht so viel aus. Früher einmal mehr, als er noch nicht so selbstbewusst war. Heute kennt er seine Rolle als Blindtext und fügt sich selbstbewusst ein. Er ist ja irgendwie wichtig. Manchmal jedoch, ganz manchmal, weint er in der Nacht, weil er niemals bis zum Ende gelesen wird. Doch das hat ja jetzt zum Glück ein Ende.

And here the words that need to be found: ein Beispieltext der paar Wörter

The result would be Beispieltext der ein paar Wrter

Following passage would also be a passage where all the words can be found but it has more words inside of the passage and therefore is not the solution: Ein toller Beispieltext ist Blindtext. Er hat ein paar Wörter.

My idea is to cut all unnecessary letters and then split the textblock on spaces to have an array of all words. so i can get the position of the words and calculate how much words are inbetween the first occurence of one of the searched words and the first occurence of all other searched words. this way i would need to go through the whole array and compare all possible lengths of passages and just take the shortest one.

do you think this is the best approach or can you point me to a better idea how to solve this problem?

  • 写回答

2条回答 默认 最新

  • douman9420 2016-05-21 12:34
    关注

    The algorithm you describe is could be OK but it is less clearly specified when it comes to "... this way I would need to go through the whole array".

    Once you have done the clean up and splitting into words, It will be easier to create a map for the key words, so you can know quickly if a word from the text matches (with isset()). Then you could just reduce the text array to an array of matching words (using array_filter()), still keeping the index of where they appear in the original array of words.

    The algorithm would then walk through that reduced array and keep track of a window (range) of these words. At the right side that window enlarges as long as not all necessary words are inside of it, and it shrinks at the left side when the left side word already occurs elsewhere in the window, or just after you have found a candidate solution. That way your window will travel through the whole (reduced) text array. You'll keep track only of the window that represents the shortest phrase. So at the end you have the optimal solution and just need to translate the window boundaries back to a phrase taken from the original text array.

    Case insensitive matching can be achieved by storing things in lower case (with strtolower), and by using the original cased string (in array format) for generating the output.

    Here is a function that implements the above described algorithm:

    function findFragment($text, $words) {
        // Remove non-A-Z letters
        $text = preg_replace("/[^a-z ]/i", "", $text);
        $words = preg_replace("/[^a-z ]/i", "", $words);
        // Create a map keyed by the words to find, with as value 
        // the number of occurrences in current sub-phrase
        $words_map = array_fill_keys(str_word_count(strtolower($words), 2), 0);
        // Put all words of text in an array
        $text_arr = str_word_count($text, 1);
        $text_low_arr = str_word_count(strtolower($text), 1);
        // Filter only matching words from the text, keeping their original indexes.
        $matches = array_filter($text_low_arr, function ($word) use ($words_map) {
            return isset($words_map[$word]);
        });
        // How many distinct words need to be matched to have a candidate phrase
        $matches_left = count($words_map);
        // Keep track of how long the shortest phrase is
        $min_words = count($text_arr) + 1; // start "infinite"
        // Loop over all matching words as the last word of a possible phrase
        foreach($matches as $i => $match) {
            $phrase[$i] = $match; // Add to the phrase
            $words_map[$match]++; // Increase count for this particular word
            if ($words_map[$match] > 1) continue; // Nothing new was added
            // Additional word found
            $matches_left--;
            if ($matches_left) continue; // Still need more words
            // Phrase has all words
            // Remove words from left which occur elsewhere in the phrase
            while ($words_map[reset($phrase)] > 1) {
                $words_map[reset($phrase)]--;
                unset($phrase[key($phrase)]);
            }
            // How many words are in this phrase?
            $num_words = $i - key($phrase) +1;
            if ($num_words < $min_words) {
                // It is shorter than we had so far
                $min_words = $num_words;
                $best_start = key($phrase);
            }
            // Remove first word from phrase before finding new candidate phrases
            $words_map[reset($phrase)]--;
            unset($phrase[key($phrase)]);
            $matches_left++;
        }
        // return best result
        return implode(" ", array_slice($text_arr, $best_start, $min_words));
    }
    

    You would call it like this:

    echo findFragment($text, $words);
    

    For the sample data you have given in the question, it returns the desired answer:

    Beispieltext der ein paar Wrter

    See it run on eval.in.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥15 程序不包含适用于入口点的静态Main方法
  • ¥15 素材场景中光线烘焙后灯光失效
  • ¥15 请教一下各位,为什么我这个没有实现模拟点击
  • ¥15 执行 virtuoso 命令后,界面没有,cadence 启动不起来
  • ¥50 comfyui下连接animatediff节点生成视频质量非常差的原因
  • ¥20 有关区间dp的问题求解
  • ¥15 多电路系统共用电源的串扰问题
  • ¥15 slam rangenet++配置
  • ¥15 有没有研究水声通信方面的帮我改俩matlab代码
  • ¥15 ubuntu子系统密码忘记