dongye9182 2017-10-13 11:13
浏览 69
已采纳

如何找到数组中每个元素中出现的最长子字符串?

I have a collection of texts from some authors. Each author has a unique signature or link that occurs in all of their texts.

Example for Author1:

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1 is: @jhsad.sadas.com


Example for Author2:

$texts=['This is some random string representative of non-signature text.

This is the
*author\'s* signature.',
'Different message body text.      This is the
*author\'s* signature.

This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];

Expected output for Author2 is:

This is the
 *author's* signature.

Pay particular notice to the fact there there are no reliable identifying characters (or positions) that signify the start or end of the signature. It could be a url, a Twitter mention, any kind of plain text, etc. of any length containing any sequence of characters that occurs at the start, end, or middle of the string.

I am seeking a method that will extract the longest substring that exists in all $text elements for a single author.

It is expected, for the sake of this task, that all authors WILL have a signature substring that exists in every post/text.

IDEA: I'm thinking of converting words to vectors and finding similarity between each texts. We can use cosine similarity to find the signatures. I think the solution must be some thing like this idea.

mickmackusa's commented code captures the essence of what is desired, but I would like to see if there are other ways to achieve the desired result.

  • 写回答

2条回答 默认 最新

  • doulianxi0587 2017-11-07 02:24
    关注

    Here is my thinking:

    1. Sort an author's collection of posts by string length (ascending) so that you are working from smaller texts to larger texts.
    2. Split each post's text on one or more white-space characters, so that you are only handling wholly non-white-space substrings during processing.
    3. Find matching substrings that occur in each subsequent post versus an ever-narrowing array of substrings (overlaps).
    4. Group the consecutive matching substrings by analyzing their index value.
    5. "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing white-space characters, of course).
    6. Sort the reconstituted strings by string length (descending) so that the longest string is assigned the 0 index.
    7. Print to screen the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.

    Code: (Demo)

    $posts['Author1']=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd
    
    @jhsad.sadas.com sdsdADSA sada',
    'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
    hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
    'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
    @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
    
    $posts['Author2']=['This is some random string representative of non-signature text.
    
    This is the
     *author\'s* signature.',
            'Different message body text.      This is the
     *author\'s* signature.
    
        This is an afterthought that expresses that a signature is not always at the end.',
            'Finally, this is unwanted stuff. This is the
     *author\'s* signature.'];
    
    foreach($posts as $author=>$texts){
        echo "Author: $author
    ";
    
        usort($texts,function($a,$b){return strlen($a)-strlen($b);}); // sort ASC by strlen; mb_strlen probably isn't advantageous
        var_export($texts);
        echo "
    ";
    
        foreach($texts as $index=>$string){
            if(!$index){
                $overlaps=preg_split('/\s+/',$string,NULL,PREG_SPLIT_NO_EMPTY);  // declare with all non-white-space substrings from first text
            }else{
                $overlaps=array_intersect($overlaps,preg_split('/\s+/',$string,NULL,PREG_SPLIT_NO_EMPTY));  // filter word bank using narrowing number of words
            }
        }
        var_export($overlaps);
        echo "
    ";
    
        // batch consecutive substrings
        $group=null;
        $consecutives=[];  // clear previous iteration's data
        foreach($overlaps as $i=>$word){
            if($group===null || $i-$last>1){
                $group=$i;
            }
            $last=$i;
            $consecutives[$group][]=$word;
        }
        var_export($consecutives);
        echo "
    ";
    
        foreach($consecutives as $words){
            // match potential signatures in first text for measurement:
            if(preg_match_all('/\Q'.implode('\E\s+\Q',$words).'\E/',$texts[0],$out)){  // make alternatives characters literal using \Q & \E
                $potential_signatures=$out[0];
            }
        }
        usort($potential_signatures,function($a,$b){return strlen($b)-strlen($a);}); // sort DESC by strlen; mb_strlen probably isn't advantageous
    
        echo "Assumed Signature: {$potential_signatures[0]}
    
    ";
    }
    

    Output:

    Author: Author1
    array (
      0 => 'sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd
    
    @jhsad.sadas.com sdsdADSA sada',
      1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
    @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
      2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
    hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
    )
    array (
      11 => '@jhsad.sadas.com',
    )
    array (
      11 => 
      array (
        0 => '@jhsad.sadas.com',
      ),
    )
    Assumed Signature: @jhsad.sadas.com
    
    Author: Author2
    array (
      0 => 'Finally, this is unwanted stuff. This is the
     *author\'s* signature.',
      1 => 'This is some random string representative of non-signature text.
    
    This is the
     *author\'s* signature.',
      2 => 'Different message body text.      This is the
     *author\'s* signature.
    
        This is an afterthought that expresses that a signature is not always at the end.',
    )
    array (
      2 => 'is',
      5 => 'This',
      6 => 'is',
      7 => 'the',
      8 => '*author\'s*',
      9 => 'signature.',
    )
    array (
      2 => 
      array (
        0 => 'is',
      ),
      5 => 
      array (
        0 => 'This',
        1 => 'is',
        2 => 'the',
        3 => '*author\'s*',
        4 => 'signature.',
      ),
    )
    Assumed Signature: This is the
     *author's* signature.
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥50 易语言把MYSQL数据库中的数据添加至组合框
  • ¥20 求数据集和代码#有偿答复
  • ¥15 关于下拉菜单选项关联的问题
  • ¥20 java-OJ-健康体检
  • ¥15 rs485的上拉下拉,不会对a-b<-200mv有影响吗,就是接受时,对判断逻辑0有影响吗
  • ¥15 使用phpstudy在云服务器上搭建个人网站
  • ¥15 应该如何判断含间隙的曲柄摇杆机构,轴与轴承是否发生了碰撞?
  • ¥15 vue3+express部署到nginx
  • ¥20 搭建pt1000三线制高精度测温电路
  • ¥15 使用Jdk8自带的算法,和Jdk11自带的加密结果会一样吗,不一样的话有什么解决方案,Jdk不能升级的情况