Find similarities in strings, then group the new array with other data

I have a complex problem that I'm struggling to find a solution for.

I need to run an analysis on thousands of bank transactions to find similarities in their descriptions.

To start, I have an array of transactions grouped by month. Here is a small sample:

$sample_transactions = array(
  'Oct_2017' => array(
    array("desc" => "INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322", "amount" => "1290"),
    array("desc" => "INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876", "amount" => "1000"),
    array("desc" => "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212", "amount" => "2500"),
    array("desc" => "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553", "amount" => "4000"),
  ),
  'Nov_2017' => array(
    array("desc" => "PHONE TRANSFER CREDIT FROM 65765544 REF NO 123444", "amount" => "879"),
    array("desc" => "EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU", "amount" => "200"),
    array("desc" => "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212", "amount" => "3200"),
    array("desc" => "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553", "amount" => "6039"),
  ),
);

Using the sample transactions above, I want to find similarities in their desc values, group similar transactions together, count how many fall into each group, and sum their amounts.

Here is an example of the desired data:

Oct_2017
=================================================================
|        Desc.                            |  Amount   |  Count  | 
=================================================================
|  TRANSFER CREDIT FROM 34345555 REF NO       2290         2    |
-----------------------------------------------------------------
|  TRANSFER CREDIT FROM 785674556 REF NO      6500         2    |
=================================================================

Nov_2017
=================================================================
|        Desc.                            |  Amount   |  Count  | 
=================================================================
|  TRANSFER CREDIT FROM 785674556             9239         2    |
=================================================================

As the two tables show, the process does the following:

Skips a string if it only occurs once
Groups strings that occur at least twice, even if they differ somewhere, e.g. in the reference number at the end of the string
Adds each transaction's "amount" to the total for its group
Increments the "count" each time an occurrence is found
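The four steps above can be sketched as follows. This is only a minimal sketch under one strong assumption: that the part that varies is a trailing all-digit token such as the reference number. The names normalizeDesc(), groupSimilarByMonth() and $sample are hypothetical and chosen for illustration; real data would need a smarter normalization step.

```php
<?php
// Hypothetical helper: drop one trailing all-digit token (e.g. a reference number).
// Assumption: the varying part of a description is always at the end.
function normalizeDesc(string $desc): string
{
    return preg_replace('/\s+\d+$/', '', trim($desc));
}

// Group each month's transactions by normalized description,
// sum the amounts, count the occurrences, and skip singletons.
function groupSimilarByMonth(array $transactions): array
{
    $result = [];
    foreach ($transactions as $month => $rows) {
        $groups = [];
        foreach ($rows as $row) {
            $key = normalizeDesc($row['desc']);
            if (!isset($groups[$key])) {
                $groups[$key] = ['amount' => 0, 'count' => 0];
            }
            $groups[$key]['amount'] += (int) $row['amount'];
            $groups[$key]['count']++;
        }
        // Keep only descriptions that occurred at least twice.
        $result[$month] = array_filter($groups, function ($g) {
            return $g['count'] >= 2;
        });
    }
    return $result;
}

$sample = [
    'Oct_2017' => [
        ['desc' => 'INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322', 'amount' => '1290'],
        ['desc' => 'INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876', 'amount' => '1000'],
        ['desc' => 'INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212', 'amount' => '2500'],
        ['desc' => 'INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553', 'amount' => '4000'],
        ['desc' => 'EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU', 'amount' => '200'],
    ],
];

print_r(groupSimilarByMonth($sample));
```

With this sample, the two transfers from each account collapse into one group per account and the EFTPOS row is dropped as a singleton. Unlike the desired table, the keys keep the leading "INTERNET" word, because nothing in this sketch knows that it should be removed.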

Edit: These transactions are provided to us by a third-party source and saved in our MySQL DB, so the strings could literally be anything. This means we can't keep a list of strings to look for, because we don't know in advance what we are looking for. We need to discover the "patterns" in the transactions rather than match strings we already know.

Edit2: Some more sample strings could be:
RETURNED CREDIT FROM Mr Nobody 9392 JKK freight ACCOUNT CLOSED
RETURNED CREDIT FROM Mrs Somebody Melbourne Aus INVALID ACCOUNT NUMBER
VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD
EFTPOS DEP Medicare Benefit
DIRECT CREDIT CBA TRANSFER
BPAY REV 3535333 KLM RENEW 4823
AGENT DEPOSIT 87
ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC
DIRECT CREDIT DONTY BENEFITS 23423322 EYWQ
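Since the strings are not known in advance, one possible way to surface a "pattern" is to compare two descriptions word by word and keep the leading words they share. The sketch below is only an illustration of that idea; commonWordPrefix() is a hypothetical helper, and it assumes that similar transactions always agree at the start of the string.

```php
<?php
// Hypothetical helper: return the leading words that two descriptions share.
// Assumption: words are separated by single spaces.
function commonWordPrefix(string $a, string $b): string
{
    $wordsA = explode(' ', $a);
    $wordsB = explode(' ', $b);
    $limit = min(count($wordsA), count($wordsB));
    $shared = [];
    // Walk both word lists in step and stop at the first mismatch.
    for ($i = 0; $i < $limit && $wordsA[$i] === $wordsB[$i]; $i++) {
        $shared[] = $wordsA[$i];
    }
    return implode(' ', $shared);
}

echo commonWordPrefix(
    'DIRECT CREDIT CBA TRANSFER',
    'DIRECT CREDIT DONTY BENEFITS 23423322 EYWQ'
), "\n";
```

Running this over descriptions sorted alphabetically, and comparing each one with its neighbour, would give candidate group keys. How many shared words count as "similar" is a threshold that still has to be chosen by hand.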

Methods I've tried:

1 - I tried comparing the strings with similar_text(), but it is not specific enough. Because it only returns a similarity percentage, it creates groups that shouldn't exist, and it doesn't group on the shared characters as required.
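A small illustration of why a percentage alone is hard to use here, based on three descriptions from the sample. The variable names are made up for this sketch; similar_text() is the built-in PHP function, which returns the number of matching characters and writes the percentage into its third argument.

```php
<?php
$a = 'INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322';
$b = 'INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876';   // same account as $a
$c = 'INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212'; // different account

// similar_text() fills the third argument with a similarity percentage.
similar_text($a, $b, $pctSameAccount);
similar_text($a, $c, $pctOtherAccount);

printf("same account:      %.1f%%\n", $pctSameAccount);
printf("different account: %.1f%%\n", $pctOtherAccount);
```

Both pairs share the long "INTERNET TRANSFER CREDIT FROM" prefix, so both scores come out high, and no single threshold cleanly says "same account" versus "different account".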

2 - I tried querying the database using ORDER BY `desc` ASC, which orders the rows quite well, but it includes everything and doesn't group them as in the desired result.


1 answer

duanbiao4035 2017-11-13 04:59


The bad news is: PHP doesn't offer anything out of the box to do this kind of processing.

The good news is: you can roll your own and modify it as you learn more about what you want to call "similar". As your bespoke method matures, its accuracy will ascend ever closer to 100%. Use the "leftovers" to discover similarities that can be built into the code.

I hope this is enough to give you some traction:

Code: (Demo)

$sample_transactions=[
    'Oct_2017'=>[
        ["desc"=>"INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322","amount"=>"1290"],
        ["desc"=>"INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876","amount"=>"1000"],
        ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212","amount"=>"2500"],
        ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553","amount"=>"4000"],
        ["desc"=>"PHONE TRANSFER CREDIT FROM 65765544 REF NO 123444","amount"=>"879"],
        ["desc"=>"EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU","amount"=>"200"],
        ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212","amount"=>"3200"],
        ["desc"=>"INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553","amount"=>"6039"],
        ["desc"=>"RETURNED CREDIT FROM Mr Nobody 9392 JKK freight ACCOUNT CLOSED","amount"=>"123"],
        ["desc"=>"RETURNED CREDIT FROM Mrs Somebody Melbourne Aus INVALID ACCOUNT NUMBER","amount"=>"124"],
        ["desc"=>"VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD","amount"=>"1234"],
        ["desc"=>"EFTPOS DEP Medicare Benefit","amount"=>"999"],
        ["desc"=>"DIRECT CREDIT CBA TRANSFER","amount"=>"1050"],
        ["desc"=>"BPAY REV 3535333 KLM RENEW 4823","amount"=>"1175"],
        ["desc"=>"AGENT DEPOSIT 87","amount"=>"100"],
        ["desc"=>"ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC","amount"=>"200"],
        ["desc"=>"DIRECT CREDIT DONTY BENEFITS 23423322 EYWQ","amount"=>"300"]
    ]
];

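$groups=[];  // initialize so the variable exists even when there are no transactions
foreach($sample_transactions as $mo_year=>$trans_array){
    foreach($trans_array as $trans){
        if(!$key=strstr($trans['desc'],' FROM ',true)){ // $key = text before ' FROM '; if there is no ' FROM ', try something else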
            if(strpos($trans['desc'],'DIRECT CREDIT')===0){  // try 'DIRECT CREDIT' at start of string
                $key='DIRECT CREDIT';
            }else{
                $key=$trans['desc'];  // if all attempts fail, default to fullstring value
            }
        }
        if(!isset($groups[$mo_year][$key])){
            $groups[$mo_year][$key]=['Amount'=>$trans['amount'],'Count'=>1];  //initialize the row
        }else{
            $groups[$mo_year][$key]=[
                'Amount'=>$groups[$mo_year][$key]['Amount']+$trans['amount'],  // do the sum
                'Count'=>++$groups[$mo_year][$key]['Count']  // increment by 1
            ];
        }
    }
}

// split results into two groups based on Count value
$singletons=$similarities=[];  // initialize both in case either group ends up empty
foreach($groups as $mo_year=>$rows){
    foreach($rows as $desc=>$sums){
        if($sums['Count']<2){
            $singletons[$mo_year][$desc]=$sums;
        }else{
            $similarities[$mo_year][$desc]=$sums;
        }
    }
}

echo "Attempted Consolidation:
";
var_export($groups);
echo "

Similarities:
";
var_export($similarities);
echo "

The leftovers to review and try to isolate relevant similarities
";
var_export($singletons);

Output:

Attempted Consolidation:
array (
  'Oct_2017' => 
  array (
    'INTERNET TRANSFER CREDIT' => 
    array (
      'Amount' => 18029,
      'Count' => 6,
    ),
    'PHONE TRANSFER CREDIT' => 
    array (
      'Amount' => '879',
      'Count' => 1,
    ),
    'EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU' => 
    array (
      'Amount' => '200',
      'Count' => 1,
    ),
    'RETURNED CREDIT' => 
    array (
      'Amount' => 247,
      'Count' => 2,
    ),
    'VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD' => 
    array (
      'Amount' => '1234',
      'Count' => 1,
    ),
    'EFTPOS DEP Medicare Benefit' => 
    array (
      'Amount' => '999',
      'Count' => 1,
    ),
    'DIRECT CREDIT' => 
    array (
      'Amount' => 1350,
      'Count' => 2,
    ),
    'BPAY REV 3535333 KLM RENEW 4823' => 
    array (
      'Amount' => '1175',
      'Count' => 1,
    ),
    'AGENT DEPOSIT 87' => 
    array (
      'Amount' => '100',
      'Count' => 1,
    ),
    'ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC' => 
    array (
      'Amount' => '200',
      'Count' => 1,
    ),
  ),
)

Similarities:
array (
  'Oct_2017' => 
  array (
    'INTERNET TRANSFER CREDIT' => 
    array (
      'Amount' => 18029,
      'Count' => 6,
    ),
    'RETURNED CREDIT' => 
    array (
      'Amount' => 247,
      'Count' => 2,
    ),
    'DIRECT CREDIT' => 
    array (
      'Amount' => 1350,
      'Count' => 2,
    ),
  ),
)

The leftovers to review and try to isolate relevant similarities
array (
  'Oct_2017' => 
  array (
    'PHONE TRANSFER CREDIT' => 
    array (
      'Amount' => '879',
      'Count' => 1,
    ),
    'EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU' => 
    array (
      'Amount' => '200',
      'Count' => 1,
    ),
    'VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD' => 
    array (
      'Amount' => '1234',
      'Count' => 1,
    ),
    'EFTPOS DEP Medicare Benefit' => 
    array (
      'Amount' => '999',
      'Count' => 1,
    ),
    'BPAY REV 3535333 KLM RENEW 4823' => 
    array (
      'Amount' => '1175',
      'Count' => 1,
    ),
    'AGENT DEPOSIT 87' => 
    array (
      'Amount' => '100',
      'Count' => 1,
    ),
    'ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC' => 
    array (
      'Amount' => '200',
      'Count' => 1,
    ),
  ),
)
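One way to keep growing the key logic as the leftovers reveal new patterns is to move the conditions into an ordered list of regular expressions. The following is only a sketch of that idea, not the code used above: $rules and descKey() are hypothetical names, and the first rule deliberately keeps the account number after ' FROM ' so that different accounts stay in separate groups, as in the question's desired tables.

```php
<?php
// Hypothetical rule table: the first pattern that matches wins,
// and capture group 1 becomes the grouping key.
$rules = [
    '/^(.+? FROM \d+)\b/',        // transfers: keep everything up to the account number
    '/^(RETURNED CREDIT) FROM /', // returned credits, whatever the sender name is
    '/^(DIRECT CREDIT)\b/',       // direct credits
];

function descKey(string $desc, array $rules): string
{
    foreach ($rules as $pattern) {
        if (preg_match($pattern, $desc, $m)) {
            return $m[1];
        }
    }
    return $desc; // no rule matched: fall back to the full string
}

echo descKey('INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322', $rules), "\n";
echo descKey('RETURNED CREDIT FROM Mr Nobody 9392 JKK freight ACCOUNT CLOSED', $rules), "\n";
echo descKey('AGENT DEPOSIT 87', $rules), "\n";
```

The return value of descKey() could replace the strstr()/strpos() checks when building $key. Adding a new similarity then means appending one pattern to $rules rather than adding another nested condition.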

This answer was selected as the best answer by the asker.
