dongwai4434 2017-11-13 02:21 采纳率: 0%
浏览 153
已采纳

查找字符串中的相似性,然后将新数组与其他数据分组

I believe I have a complex issue which I'm struggling to find a solution for, and for the life of me I can't seem to achieve it.

I need to run an analysis on thousands of bank transactions to find similarities in their descriptions.

To start, I have an array of transactions, that are grouped by month, here's is a small sample:

$sample_transactions = array(
  ['Oct_2017']=>array(4) {  
    [0] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322"
      ["amount"]=>string(4) "1290"
    }
    [1] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876"
      ["amount"]=>string(4) "1000"
    }
    [2] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212"
      ["amount"]=>string(4) "2500"
    }
    [3] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553"
      ["amount"]=>string(4) "4000"
    }
  }
  ['Nov_2017']=>array(4) {  
    [0] => array(2) {
      ["desc"]=>string(55) "PHONE TRANSFER CREDIT FROM 65765544 REF NO 123444"
      ["amount"]=>string(4) "879"
    }
    [1] => array(2) {
      ["desc"]=>string(55) "EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU"
      ["amount"]=>string(4) "200"
    }
    [2] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212"
      ["amount"]=>string(4) "3200"
    }
    [3] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553"
      ["amount"]=>string(4) "6039"
    }
  }
);

Using the sample transactions above, the desire is to look for similarities in their desc, then group them together, add a count of how many there were, and increment the total figure.

Here is an example of the desired data:

Oct_2017
=================================================================
|        Desc.                            |  Amount   |  Count  | 
=================================================================
|  TRANSFER CREDIT FROM 34345555 REF NO       2290         2    |
-----------------------------------------------------------------
|  TRANSFER CREDIT FROM 785674556 REF NO      6500         2    |
=================================================================

Nov_2017
=================================================================
|        Desc.                            |  Amount   |  Count  | 
=================================================================
|  TRANSFER CREDIT FROM 785674556             9239         2    |
=================================================================

If you notice from the two tables of data, it does the following:

  • Skips the string if it only occurs once
  • Groups strings if they occur twice, but, they can have differences somewhere, e.g. Reference number at the end of the string
  • Increments the "amount" for that group
  • Increments the "count" for each time an occurrence was found.

Edit: These transactions are provided to us from a third party source and saved in our Mysql DB, so the strings could literally be anything. This means we can't have a collection of strings to look for, because we don't know what we are looking for. We need to highlight the "patterns" in the transactions rather than things we are looking for.

Edit2: Some more sample strings could be:
RETURNED CREDIT FROM Mr Nobody 9392 JKK freight ACCOUNT CLOSED
RETURNED CREDIT FROM Mrs Somebody Melbourne Aus INVALID ACCOUNT NUMBER
VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD
EFTPOS DEP Medicare Benefit
DIRECT CREDIT CBA TRANSFER
BPAY REV 3535333 KLM RENEW 4823
AGENT DEPOSIT 87
ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC
DIRECT CREDIT DONTY BENEFITS 23423322 EYWQ

Methods I've tried:

1 - I tried this approach to compare strings, however, using similar_text() is not specific enough - it results in groups being made which shouldn't be as it's just based on the similarity percentage, and doesn't group based on characters like required.

2 - I tried querying from the database using ORDER BY desc ASC which obviously orders them quite well, but includes everything, and doesn't group them like the desired result.

  • 写回答

1条回答 默认 最新

  • duanbiao4035 2017-11-13 04:59
    关注

    The bad news is: php doesn't offer anything out of the box to do this kind of process.

    The good news is: you can roll your own and modify it as you learn more about what you want to call "similar". As your bespoke method matures, its accuracy will ascend ever closer to 100%. Use the "leftovers" to discover similarities that can be built into the code.

    I hope this is enough to give you some traction:

    Code: (Demo)

    $sample_transactions=[
        'Oct_2017'=>[
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322","amount"=>"1290"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876","amount"=>"1000"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212","amount"=>"2500"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553","amount"=>"4000"],
            ["desc"=>"PHONE TRANSFER CREDIT FROM 65765544 REF NO 123444","amount"=>"879"],
            ["desc"=>"EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU","amount"=>"200"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212","amount"=>"3200"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553","amount"=>"6039"],
            ["desc"=>"RETURNED CREDIT FROM Mr Nobody 9392 JKK freight ACCOUNT CLOSED","amount"=>"123"],
            ["desc"=>"RETURNED CREDIT FROM Mrs Somebody Melbourne Aus INVALID ACCOUNT NUMBER","amount"=>"124"],
            ["desc"=>"VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD","amount"=>"1234"],
            ["desc"=>"EFTPOS DEP Medicare Benefit","amount"=>"999"],
            ["desc"=>"DIRECT CREDIT CBA TRANSFER","amount"=>"1050"],
            ["desc"=>"BPAY REV 3535333 KLM RENEW 4823","amount"=>"1175"],
            ["desc"=>"AGENT DEPOSIT 87","amount"=>"100"],
            ["desc"=>"ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC","amount"=>"200"],
            ["desc"=>"DIRECT CREDIT DONTY BENEFITS 23423322 EYWQ","amount"=>"300"]
        ]
    ];
    
    foreach($sample_transactions as $mo_year=>$trans_array){
        foreach($trans_array as $trans){
            if(!$key=strstr($trans['desc'],' FROM ',true)){ // declare $key if no ' FROM ' try something else
                if(strpos($trans['desc'],'DIRECT CREDIT')===0){  // try 'DIRECT CREDIT' at start of string
                    $key='DIRECT CREDIT';
                }else{
                    $key=$trans['desc'];  // if all attempts fail, default to fullstring value
                }
            }
            if(!isset($groups[$mo_year][$key])){
                $groups[$mo_year][$key]=['Amount'=>$trans['amount'],'Count'=>1];  //initialize the row
            }else{
                $groups[$mo_year][$key]=[
                    'Amount'=>$groups[$mo_year][$key]['Amount']+$trans['amount'],  // do the sum
                    'Count'=>++$groups[$mo_year][$key]['Count']  // increment by 1
                ];
            }
        }
    }
    
    // split results into two groups based on Count value
    foreach($groups as $mo_year=>$rows){
        foreach($rows as $desc=>$sums){
            if($sums['Count']<2){
                $singletons[$mo_year][$desc]=$sums;
            }else{
                $similarities[$mo_year][$desc]=$sums;
            }
        }
    }
    
    echo "Attempted Consolidation:
    ";
    var_export($groups);
    echo "
    
    Similarities:
    ";
    var_export($similarities);
    echo "
    
    The leftovers to review and try to isolate relevant similarities
    ";
    var_export($singletons);
    

    Output:

    Attempted Consolidation:
    array (
      'Oct_2017' => 
      array (
        'INTERNET TRANSFER CREDIT' => 
        array (
          'Amount' => 18029,
          'Count' => 6,
        ),
        'PHONE TRANSFER CREDIT' => 
        array (
          'Amount' => '879',
          'Count' => 1,
        ),
        'EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU' => 
        array (
          'Amount' => '200',
          'Count' => 1,
        ),
        'RETURNED CREDIT' => 
        array (
          'Amount' => 247,
          'Count' => 2,
        ),
        'VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD' => 
        array (
          'Amount' => '1234',
          'Count' => 1,
        ),
        'EFTPOS DEP Medicare Benefit' => 
        array (
          'Amount' => '999',
          'Count' => 1,
        ),
        'DIRECT CREDIT' => 
        array (
          'Amount' => 1350,
          'Count' => 2,
        ),
        'BPAY REV 3535333 KLM RENEW 4823' => 
        array (
          'Amount' => '1175',
          'Count' => 1,
        ),
        'AGENT DEPOSIT 87' => 
        array (
          'Amount' => '100',
          'Count' => 1,
        ),
        'ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC' => 
        array (
          'Amount' => '200',
          'Count' => 1,
        ),
      ),
    )
    
    Similarities:
    array (
      'Oct_2017' => 
      array (
        'INTERNET TRANSFER CREDIT' => 
        array (
          'Amount' => 18029,
          'Count' => 6,
        ),
        'RETURNED CREDIT' => 
        array (
          'Amount' => 247,
          'Count' => 2,
        ),
        'DIRECT CREDIT' => 
        array (
          'Amount' => 1350,
          'Count' => 2,
        ),
      ),
    )
    
    The leftovers to review and try to isolate relevant similarities
    array (
      'Oct_2017' => 
      array (
        'PHONE TRANSFER CREDIT' => 
        array (
          'Amount' => '879',
          'Count' => 1,
        ),
        'EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU' => 
        array (
          'Amount' => '200',
          'Count' => 1,
        ),
        'VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD' => 
        array (
          'Amount' => '1234',
          'Count' => 1,
        ),
        'EFTPOS DEP Medicare Benefit' => 
        array (
          'Amount' => '999',
          'Count' => 1,
        ),
        'BPAY REV 3535333 KLM RENEW 4823' => 
        array (
          'Amount' => '1175',
          'Count' => 1,
        ),
        'AGENT DEPOSIT 87' => 
        array (
          'Amount' => '100',
          'Count' => 1,
        ),
        'ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC' => 
        array (
          'Amount' => '200',
          'Count' => 1,
        ),
      ),
    )
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥20 java在应用程序里获取不到扬声器设备
  • ¥15 echarts动画效果的问题,请帮我添加一个动画。不要机器人回答。
  • ¥60 许可证msc licensing软件报错显示已有相同版本软件,但是下一步显示无法读取日志目录。
  • ¥15 Attention is all you need 的代码运行
  • ¥15 一个服务器已经有一个系统了如果用usb再装一个系统,原来的系统会被覆盖掉吗
  • ¥15 使用esm_msa1_t12_100M_UR50S蛋白质语言模型进行零样本预测时,终端显示出了sequence handled的进度条,但是并不出结果就自动终止回到命令提示行了是怎么回事:
  • ¥15 前置放大电路与功率放大电路相连放大倍数出现问题
  • ¥80 部署运行web自动化项目
  • ¥15 腾讯云如何建立同一个项目中物模型之间的联系
  • ¥30 VMware 云桌面水印如何添加