dongwai4434 2017-11-13 02:21 采纳率: 0%
浏览 153
已采纳

查找字符串中的相似性,然后将新数组与其他数据分组

I believe I have a complex issue which I'm struggling to find a solution for, and for the life of me I can't seem to achieve it.

I need to run an analysis on thousands of bank transactions to find similarities in their descriptions.

To start, I have an array of transactions, that are grouped by month, here's is a small sample:

$sample_transactions = array(
  ['Oct_2017']=>array(4) {  
    [0] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322"
      ["amount"]=>string(4) "1290"
    }
    [1] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876"
      ["amount"]=>string(4) "1000"
    }
    [2] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212"
      ["amount"]=>string(4) "2500"
    }
    [3] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553"
      ["amount"]=>string(4) "4000"
    }
  }
  ['Nov_2017']=>array(4) {  
    [0] => array(2) {
      ["desc"]=>string(55) "PHONE TRANSFER CREDIT FROM 65765544 REF NO 123444"
      ["amount"]=>string(4) "879"
    }
    [1] => array(2) {
      ["desc"]=>string(55) "EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU"
      ["amount"]=>string(4) "200"
    }
    [2] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212"
      ["amount"]=>string(4) "3200"
    }
    [3] => array(2) {
      ["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553"
      ["amount"]=>string(4) "6039"
    }
  }
);

Using the sample transactions above, the desire is to look for similarities in their desc, then group them together, add a count of how many there were, and increment the total figure.

Here is an example of the desired data:

Oct_2017
=================================================================
|        Desc.                            |  Amount   |  Count  | 
=================================================================
|  TRANSFER CREDIT FROM 34345555 REF NO       2290         2    |
-----------------------------------------------------------------
|  TRANSFER CREDIT FROM 785674556 REF NO      6500         2    |
=================================================================

Nov_2017
=================================================================
|        Desc.                            |  Amount   |  Count  | 
=================================================================
|  TRANSFER CREDIT FROM 785674556             9239         2    |
=================================================================

If you notice from the two tables of data, it does the following:

  • Skips the string if it only occurs once
  • Groups strings if they occur twice, but, they can have differences somewhere, e.g. Reference number at the end of the string
  • Increments the "amount" for that group
  • Increments the "count" for each time an occurrence was found.

Edit: These transactions are provided to us from a third party source and saved in our Mysql DB, so the strings could literally be anything. This means we can't have a collection of strings to look for, because we don't know what we are looking for. We need to highlight the "patterns" in the transactions rather than things we are looking for.

Edit2: Some more sample strings could be:
RETURNED CREDIT FROM Mr Nobody 9392 JKK freight ACCOUNT CLOSED
RETURNED CREDIT FROM Mrs Somebody Melbourne Aus INVALID ACCOUNT NUMBER
VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD
EFTPOS DEP Medicare Benefit
DIRECT CREDIT CBA TRANSFER
BPAY REV 3535333 KLM RENEW 4823
AGENT DEPOSIT 87
ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC
DIRECT CREDIT DONTY BENEFITS 23423322 EYWQ

Methods I've tried:

1 - I tried this approach to compare strings, however, using similar_text() is not specific enough - it results in groups being made which shouldn't be as it's just based on the similarity percentage, and doesn't group based on characters like required.

2 - I tried querying from the database using ORDER BY desc ASC which obviously orders them quite well, but includes everything, and doesn't group them like the desired result.

  • 写回答

1条回答 默认 最新

  • duanbiao4035 2017-11-13 04:59
    关注

    The bad news is: php doesn't offer anything out of the box to do this kind of process.

    The good news is: you can roll your own and modify it as you learn more about what you want to call "similar". As your bespoke method matures, its accuracy will ascend ever closer to 100%. Use the "leftovers" to discover similarities that can be built into the code.

    I hope this is enough to give you some traction:

    Code: (Demo)

    $sample_transactions=[
        'Oct_2017'=>[
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322","amount"=>"1290"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876","amount"=>"1000"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212","amount"=>"2500"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553","amount"=>"4000"],
            ["desc"=>"PHONE TRANSFER CREDIT FROM 65765544 REF NO 123444","amount"=>"879"],
            ["desc"=>"EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU","amount"=>"200"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212","amount"=>"3200"],
            ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553","amount"=>"6039"],
            ["desc"=>"RETURNED CREDIT FROM Mr Nobody 9392 JKK freight ACCOUNT CLOSED","amount"=>"123"],
            ["desc"=>"RETURNED CREDIT FROM Mrs Somebody Melbourne Aus INVALID ACCOUNT NUMBER","amount"=>"124"],
            ["desc"=>"VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD","amount"=>"1234"],
            ["desc"=>"EFTPOS DEP Medicare Benefit","amount"=>"999"],
            ["desc"=>"DIRECT CREDIT CBA TRANSFER","amount"=>"1050"],
            ["desc"=>"BPAY REV 3535333 KLM RENEW 4823","amount"=>"1175"],
            ["desc"=>"AGENT DEPOSIT 87","amount"=>"100"],
            ["desc"=>"ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC","amount"=>"200"],
            ["desc"=>"DIRECT CREDIT DONTY BENEFITS 23423322 EYWQ","amount"=>"300"]
        ]
    ];
    
    foreach($sample_transactions as $mo_year=>$trans_array){
        foreach($trans_array as $trans){
            if(!$key=strstr($trans['desc'],' FROM ',true)){ // declare $key if no ' FROM ' try something else
                if(strpos($trans['desc'],'DIRECT CREDIT')===0){  // try 'DIRECT CREDIT' at start of string
                    $key='DIRECT CREDIT';
                }else{
                    $key=$trans['desc'];  // if all attempts fail, default to fullstring value
                }
            }
            if(!isset($groups[$mo_year][$key])){
                $groups[$mo_year][$key]=['Amount'=>$trans['amount'],'Count'=>1];  //initialize the row
            }else{
                $groups[$mo_year][$key]=[
                    'Amount'=>$groups[$mo_year][$key]['Amount']+$trans['amount'],  // do the sum
                    'Count'=>++$groups[$mo_year][$key]['Count']  // increment by 1
                ];
            }
        }
    }
    
    // split results into two groups based on Count value
    foreach($groups as $mo_year=>$rows){
        foreach($rows as $desc=>$sums){
            if($sums['Count']<2){
                $singletons[$mo_year][$desc]=$sums;
            }else{
                $similarities[$mo_year][$desc]=$sums;
            }
        }
    }
    
    echo "Attempted Consolidation:
    ";
    var_export($groups);
    echo "
    
    Similarities:
    ";
    var_export($similarities);
    echo "
    
    The leftovers to review and try to isolate relevant similarities
    ";
    var_export($singletons);
    

    Output:

    Attempted Consolidation:
    array (
      'Oct_2017' => 
      array (
        'INTERNET TRANSFER CREDIT' => 
        array (
          'Amount' => 18029,
          'Count' => 6,
        ),
        'PHONE TRANSFER CREDIT' => 
        array (
          'Amount' => '879',
          'Count' => 1,
        ),
        'EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU' => 
        array (
          'Amount' => '200',
          'Count' => 1,
        ),
        'RETURNED CREDIT' => 
        array (
          'Amount' => 247,
          'Count' => 2,
        ),
        'VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD' => 
        array (
          'Amount' => '1234',
          'Count' => 1,
        ),
        'EFTPOS DEP Medicare Benefit' => 
        array (
          'Amount' => '999',
          'Count' => 1,
        ),
        'DIRECT CREDIT' => 
        array (
          'Amount' => 1350,
          'Count' => 2,
        ),
        'BPAY REV 3535333 KLM RENEW 4823' => 
        array (
          'Amount' => '1175',
          'Count' => 1,
        ),
        'AGENT DEPOSIT 87' => 
        array (
          'Amount' => '100',
          'Count' => 1,
        ),
        'ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC' => 
        array (
          'Amount' => '200',
          'Count' => 1,
        ),
      ),
    )
    
    Similarities:
    array (
      'Oct_2017' => 
      array (
        'INTERNET TRANSFER CREDIT' => 
        array (
          'Amount' => 18029,
          'Count' => 6,
        ),
        'RETURNED CREDIT' => 
        array (
          'Amount' => 247,
          'Count' => 2,
        ),
        'DIRECT CREDIT' => 
        array (
          'Amount' => 1350,
          'Count' => 2,
        ),
      ),
    )
    
    The leftovers to review and try to isolate relevant similarities
    array (
      'Oct_2017' => 
      array (
        'PHONE TRANSFER CREDIT' => 
        array (
          'Amount' => '879',
          'Count' => 1,
        ),
        'EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU' => 
        array (
          'Amount' => '200',
          'Count' => 1,
        ),
        'VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD' => 
        array (
          'Amount' => '1234',
          'Count' => 1,
        ),
        'EFTPOS DEP Medicare Benefit' => 
        array (
          'Amount' => '999',
          'Count' => 1,
        ),
        'BPAY REV 3535333 KLM RENEW 4823' => 
        array (
          'Amount' => '1175',
          'Count' => 1,
        ),
        'AGENT DEPOSIT 87' => 
        array (
          'Amount' => '100',
          'Count' => 1,
        ),
        'ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC' => 
        array (
          'Amount' => '200',
          'Count' => 1,
        ),
      ),
    )
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 Python中的request,如何使用ssr节点,通过代理requests网页。本人在泰国,需要用大陆ip才能玩网页游戏,合法合规。
  • ¥100 为什么这个恒流源电路不能恒流?
  • ¥15 有偿求跨组件数据流路径图
  • ¥15 写一个方法checkPerson,入参实体类Person,出参布尔值
  • ¥15 我想咨询一下路面纹理三维点云数据处理的一些问题,上传的坐标文件里是怎么对无序点进行编号的,以及xy坐标在处理的时候是进行整体模型分片处理的吗
  • ¥15 CSAPPattacklab
  • ¥15 一直显示正在等待HID—ISP
  • ¥15 Python turtle 画图
  • ¥15 stm32开发clion时遇到的编译问题
  • ¥15 lna设计 源简并电感型共源放大器