How do I find the longest substring that occurs in every element of an array?

I have a collection of texts from some authors. Each author has a unique signature or link that occurs in all of their texts.

Example for Author1:

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1 is: @jhsad.sadas.com

Example for Author2:

$texts=['This is some random string representative of non-signature text.

This is the
*author\'s* signature.',
'Different message body text.      This is the
*author\'s* signature.

This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];

Expected output for Author2 is:

This is the
 *author's* signature.

Pay particular attention to the fact that there are no reliable identifying characters (or positions) that signify the start or end of the signature. It could be a URL, a Twitter mention, or any kind of plain text, of any length, containing any sequence of characters, and it may occur at the start, end, or middle of the string.

I am seeking a method that will extract the longest substring that exists in all $text elements for a single author.

It is expected, for the sake of this task, that all authors WILL have a signature substring that exists in every post/text.

IDEA: I'm thinking of converting words to vectors and measuring the similarity between the texts, for example with cosine similarity, to find the signatures. I think the solution must be something along these lines.

mickmackusa's commented code captures the essence of what is desired, but I would like to see if there are other ways to achieve the desired result.


2 answers

doulianxi0587 2017-11-07 02:24


Here is my thinking:

1. Sort an author's collection of posts by string length (ascending) so that you work from the shortest text to the longest.
2. Split each post's text on one or more white-space characters, so that only wholly non-white-space substrings ("words") are handled during processing.
3. Intersect each subsequent post's words with an ever-narrowing array of the words matched so far (the overlaps).
4. Group the consecutive matching words by analyzing their index values.
5. "Reconstitute" the grouped consecutive words into their original string form (trimmed of leading and trailing white-space characters, of course).
6. Sort the reconstituted strings by string length (descending) so that the longest string is assigned the 0 index.
7. Print to screen the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.
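The matching and grouping steps depend on `array_intersect()` keeping the keys of its first argument, so adjacent keys mean the words were adjacent in the first text. A minimal sketch with made-up word lists (`$first`, `$second`, and `$groups` are illustrative names, not part of the code further down):

```php
<?php
// Made-up word lists, for illustration only.
$first  = ['hello', 'world', 'my', 'sig', 'here', 'bye'];
$second = ['other', 'my', 'sig', 'here', 'hello'];

// array_intersect() keeps the keys of its first argument.
$overlaps = array_intersect($first, $second);
// [0 => 'hello', 2 => 'my', 3 => 'sig', 4 => 'here']

// Adjacent keys mean the words were adjacent in the first text,
// so a gap larger than 1 starts a new group.
$groups = [];
$group = $last = null;
foreach ($overlaps as $i => $word) {
    if ($last === null || $i - $last > 1) {
        $group = $i;
    }
    $last = $i;
    $groups[$group][] = $word;
}
// [0 => ['hello'], 2 => ['my', 'sig', 'here']]
```

Note that adjacency is only checked against the first text; `array_intersect()` tests membership in the other texts, not word order. That is one reason the result should be treated as a best guess.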

Code:

$posts['Author1']=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

$posts['Author2']=['This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
        'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
        'Finally, this is unwanted stuff. This is the
 *author\'s* signature.'];

foreach($posts as $author=>$texts){
    echo "Author: $author\n";

    usort($texts,function($a,$b){return strlen($a)-strlen($b);}); // sort ASC by strlen; mb_strlen probably isn't advantageous
    var_export($texts);
    echo "\n";

    foreach($texts as $index=>$string){
        $words=preg_split('/\s+/',$string,-1,PREG_SPLIT_NO_EMPTY);  // -1 means "no limit"; passing NULL here is deprecated as of PHP 8.1
        if(!$index){
            $overlaps=$words;  // declare with all non-white-space substrings from first (shortest) text
        }else{
            $overlaps=array_intersect($overlaps,$words);  // filter word bank using narrowing number of words; keys are preserved
        }
    }
    var_export($overlaps);
    echo "\n";

    // batch consecutive substrings by their (preserved) index in the first text
    $group=null;
    $consecutives=[];  // clear previous iteration's data
    foreach($overlaps as $i=>$word){
        if($group===null || $i-$last>1){
            $group=$i;  // start a new group whenever the indexes are not adjacent
        }
        $last=$i;
        $consecutives[$group][]=$word;
    }
    var_export($consecutives);
    echo "\n";

    $potential_signatures=[];  // reset per author, then collect the matches from every group
    foreach($consecutives as $words){
        // match potential signatures in first text for measurement;
        // preg_quote() makes each word literal and also escapes the "/" delimiter (e.g. in URLs)
        $pattern='/'.implode('\s+',array_map(function($w){return preg_quote($w,'/');},$words)).'/';
        if(preg_match_all($pattern,$texts[0],$out)){
            $potential_signatures=array_merge($potential_signatures,$out[0]);
        }
    }
    usort($potential_signatures,function($a,$b){return strlen($b)-strlen($a);}); // sort DESC by strlen; mb_strlen probably isn't advantageous

    echo "Assumed Signature: {$potential_signatures[0]}\n\n";
}
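One thing to watch when building the pattern from the matched words: as far as I know, PHP looks for the closing pattern delimiter before PCRE ever sees `\Q...\E`, so a word containing `/` (a URL signature, say) needs the delimiter escaped as well. A small sketch using `preg_quote()` with its delimiter argument; the words and the subject string are made up:

```php
<?php
// Made-up words; the second one contains the "/" delimiter.
$words = ['see', 'http://example.com/me'];

// Quote each word (including "/"), then allow any white-space between words.
$quoted  = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
$pattern = '/' . implode('\s+', $quoted) . '/';

$found = preg_match($pattern, "bla see   http://example.com/me bla", $m);
// $m[0] holds the matched text with its original white-space intact
```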

Output:

Author: Author1
array (
  0 => 'sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
  1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
  2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
  11 => '@jhsad.sadas.com',
)
array (
  11 => 
  array (
    0 => '@jhsad.sadas.com',
  ),
)
Assumed Signature: @jhsad.sadas.com

Author: Author2
array (
  0 => 'Finally, this is unwanted stuff. This is the
 *author\'s* signature.',
  1 => 'This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
  2 => 'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
)
array (
  2 => 'is',
  5 => 'This',
  6 => 'is',
  7 => 'the',
  8 => '*author\'s*',
  9 => 'signature.',
)
array (
  2 => 
  array (
    0 => 'is',
  ),
  5 => 
  array (
    0 => 'This',
    1 => 'is',
    2 => 'the',
    3 => '*author\'s*',
    4 => 'signature.',
  ),
)
Assumed Signature: This is the
 *author's* signature.
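Since the question also asks about other ways, here is a rough character-level alternative that skips the word splitting entirely: take the shortest text and test its substrings, longest first, against every other text. `longestCommonSubstring()` is a hypothetical helper name, and this brute-force sketch assumes the texts are short, because its cost grows quickly with the length of the shortest text.

```php
<?php
// Hypothetical helper: longest substring shared by every string in $texts.
function longestCommonSubstring(array $texts): string
{
    if ($texts === []) {
        return '';
    }
    // Any common substring must fit inside the shortest text.
    usort($texts, function ($a, $b) { return strlen($a) - strlen($b); });
    $shortest = array_shift($texts);
    $max = strlen($shortest);

    // Try the longest candidates first, so the first hit is the answer.
    for ($len = $max; $len > 0; --$len) {
        for ($start = 0; $start + $len <= $max; ++$start) {
            $candidate = substr($shortest, $start, $len);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2; // missing from one text: try the next candidate
                }
            }
            return $candidate;
        }
    }
    return '';
}

echo longestCommonSubstring(['foo SIG-123 bar', 'SIG-123 baz qux', 'zzz SIG-123']);
```

Because this compares raw characters, the result may carry leading or trailing white-space that happens to be shared by every text, so you would probably want to `trim()` it before treating it as the signature.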

This answer was selected by the asker as the best answer.
