I believe I have a complex issue which I'm struggling to find a solution for, and for the life of me I can't seem to achieve it.
I need to run an analysis on thousands of bank transactions to find similarities in their descriptions.
To start, I have an array of transactions, that are grouped by month, here's is a small sample:
$sample_transactions = array(
['Oct_2017']=>array(4) {
[0] => array(2) {
["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 34345555 REF NO 21283322"
["amount"]=>string(4) "1290"
}
[1] => array(2) {
["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 34345555 REF NO 8765876"
["amount"]=>string(4) "1000"
}
[2] => array(2) {
["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212"
["amount"]=>string(4) "2500"
}
[3] => array(2) {
["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553"
["amount"]=>string(4) "4000"
}
}
['Nov_2017']=>array(4) {
[0] => array(2) {
["desc"]=>string(55) "PHONE TRANSFER CREDIT FROM 65765544 REF NO 123444"
["amount"]=>string(4) "879"
}
[1] => array(2) {
["desc"]=>string(55) "EFTPOS JKL REV JANES HAIR MELBOURNE VIC AU"
["amount"]=>string(4) "200"
}
[2] => array(2) {
["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 46312212"
["amount"]=>string(4) "3200"
}
[3] => array(2) {
["desc"]=>string(55) "INTERNET TRANSFER CREDIT FROM 785674556 REF NO 977553"
["amount"]=>string(4) "6039"
}
}
);
Using the sample transactions above, the desire is to look for similarities in their desc, then group them together, add a count of how many there were, and increment the total figure.
Here is an example of the desired data:
Oct_2017
=================================================================
| Desc. | Amount | Count |
=================================================================
| TRANSFER CREDIT FROM 34345555 REF NO 2290 2 |
-----------------------------------------------------------------
| TRANSFER CREDIT FROM 785674556 REF NO 6500 2 |
=================================================================
Nov_2017
=================================================================
| Desc. | Amount | Count |
=================================================================
| TRANSFER CREDIT FROM 785674556 9239 2 |
=================================================================
If you notice from the two tables of data, it does the following:
- Skips the string if it only occurs once
- Groups strings if they occur twice, but, they can have differences somewhere, e.g. Reference number at the end of the string
- Increments the "amount" for that group
- Increments the "count" for each time an occurrence was found.
Edit: These transactions are provided to us from a third party source and saved in our Mysql DB, so the strings could literally be anything. This means we can't have a collection of strings to look for, because we don't know what we are looking for. We need to highlight the "patterns" in the transactions rather than things we are looking for.
Edit2: Some more sample strings could be:RETURNED CREDIT FROM Mr Nobody 9392 JKK freight ACCOUNT CLOSED
RETURNED CREDIT FROM Mrs Somebody Melbourne Aus INVALID ACCOUNT NUMBER
VISA CREDIT HERTZ GOKKO JIMBO 14/08 AU AUD
EFTPOS DEP Medicare Benefit
DIRECT CREDIT CBA TRANSFER
BPAY REV 3535333 KLM RENEW 4823
AGENT DEPOSIT 87
ANZ ATM PORTLAND 26 NOTHING ST PORTLAND VIC
DIRECT CREDIT DONTY BENEFITS 23423322 EYWQ
Methods I've tried:
1 - I tried this approach to compare strings, however, using similar_text()
is not specific enough - it results in groups being made which shouldn't be as it's just based on the similarity percentage, and doesn't group based on characters like required.
2 - I tried querying from the database using ORDER BY desc ASC
which obviously orders them quite well, but includes everything, and doesn't group them like the desired result.