dqkf49487 2010-09-23 08:49
浏览 51
已采纳

比较类似内容的文本块

I have two text blocks, containing company names. Both contain names of hundreds of companies, the newer list has more companies. How do I remove the duplicate company names from the two lists , so that I am left with the new names only? Sample text block one:

Company Name One, Address line 1, line 2, phone, email
..
Random text 
..

Company Name Two 
Address, Phone,email
..
Random text 
..
Company Name 3 
Address, Phone,email

Sample text block two:

..Random Text..
M/s Company Name One Extra Random Text, Address line 1, line 2, phone, 
..random text...
M/s Company Name Two 
Address, Phone
...

The Company name, address etc are similar not same. The second block has the words M/s before all company names. I would like to do this in php, using regex perhaps.

I would like to out put company names which match, for eg. in the example given above I would like to output that Company names: Company name One, Company name two are common to both test blocks.

Update: thanks to @Wrikken, I have the text in two strings. I can explode the second block using the M/s, and get an array. How do I then check each item from this array to match the first text block which is one long string?

Although, I have since done the job manually, I would still like to know how two text blocks can be compared for similarity and hence the bounty.

Update: Output for @Joyce Babu code

..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone ... ... ... ... ... ... ... ... ... ... ... ... 

Output for @nikic

array(2) { [0]=>  string(17) "..Random Text.. " [4]=>  string(16) "Address, Phone " } 

Output for @Joyce Babu second post

andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...Company Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name Tworess, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phone

@Joyce Babu Final Code

<?php
set_time_limit(500);
$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
$G=0;
    $c=0;

    foreach($arNew as $line){
    if(substr($line, 0, 4) == 'M/s '){
    $c++;   
    echo "<BR/>".$c.".)";
        $line = trim(substr($line, 4));
        foreach($arOld as $old){
            similar_text($line, $old, $percentage);
            if ($percentage > 80){
                continue;
            }
        }
        echo $line;
    }else{
    $G++;
    }
}
echo "<br/>".$G . " DID NOT MATCH";
?>

@ Output From Joyce Babu final code

1.)Company Name One Extra Random Text, Address line 1, line 2, phone,
2.)Company Name Two
4 DID NOT MATCH
  • 写回答

5条回答 默认 最新

  • douxie7738 2010-10-04 11:29
    关注

    Try this

    set_time_limit(500)
    $arOld = file('olddata.txt');
    $arNew = file('newdata.txt');
    foreach($arNew as $line){
        if(substr($line, 0, 3) === 'M/s '){
            $line = trim(substr($line, 3));
            foreach($arOld as $old){
                similar_text($line, $old, $percentage);
                if ($percentage > 80){
                    continue;
                }
            }
            echo $line;
        }
    }
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(4条)

报告相同问题?

悬赏问题

  • ¥15 CSS实现渐隐虚线边框
  • ¥15 thinkphp6配合social login单点登录问题
  • ¥15 HFSS 中的 H 场图与 MATLAB 中绘制的 B1 场 部分对应不上
  • ¥15 如何在scanpy上做差异基因和通路富集?
  • ¥20 关于#硬件工程#的问题,请各位专家解答!
  • ¥15 关于#matlab#的问题:期望的系统闭环传递函数为G(s)=wn^2/s^2+2¢wn+wn^2阻尼系数¢=0.707,使系统具有较小的超调量
  • ¥15 FLUENT如何实现在堆积颗粒的上表面加载高斯热源
  • ¥30 截图中的mathematics程序转换成matlab
  • ¥15 动力学代码报错,维度不匹配
  • ¥15 Power query添加列问题