I am trying to implement the Levenshtein algorithm with a small addition: I want to prioritize values that have consecutive matching letters. I've tried implementing my own version of it using the code below:
function levenshtein_rating($string1, $string2) {
$GLOBALS['lvn_memo'] = array();
return lev($string1, 0, strlen($string1), $string2, 0, strlen($string2));
}
function lev($s1, $s1x, $s1l, $s2, $s2x, $s2l, $cons = 0) {
$key = $s1x . "," . $s1l . "," . $s2x . "," . $s2l;
if (isset($GLOBALS['lvn_memo'][$key])) return $GLOBALS['lvn_memo'][$key];
if ($s1l == 0) return $s2l;
if ($s2l == 0) return $s1l;
$cost = 0;
if ($s1[$s1x] != $s2[$s2x]) $cost = 1;
else $cons -= 0.1;
$dist = min(
(lev($s1, $s1x + 1, $s1l - 1, $s2, $s2x, $s2l, $cons) + 1),
(lev($s1, $s1x, $s1l, $s2, $s2x + 1, $s2l - 1, $cons) + 1),
(lev($s1, $s1x + 1, $s1l - 1, $s2, $s2x + 1, $s2l - 1, $cons) + $cost)
);
$GLOBALS['lvn_memo'][$key] = $dist + $cons;
return $dist + $cons;
}
Note that the $cons -= 0.1; line is the part where I adjust the score to prioritize consecutive matches. This function will be checked against a large database of strings (as many as 20,000 - 50,000). I benchmarked it against PHP's built-in levenshtein:
Message   | Time Change | Memory
PHP       | N/A         | 9300128
End PHP   | 1ms         | 9300864
End Mine  | 20ms        | 9310736
Array
(
[0] => 3
[1] => 3
[2] => 0
)
Array
(
[0] => 2.5
[1] => 1.9
[2] => -1.5
)
Benchmark Test Code:
$string1 = "kitten";
$string2 = "sitter";
$string3 = "sitting";
$log = new Logger("PHP");
$distances = array();
$distances[] = levenshtein($string1, $string3);
$distances[] = levenshtein($string2, $string3);
$distances[] = levenshtein($string3, $string3);
$log->log("End PHP");
$distances2 = array();
$distances2[] = levenshtein_rating($string1, $string3);
$distances2[] = levenshtein_rating($string2, $string3);
$distances2[] = levenshtein_rating($string3, $string3);
$log->log("End Mine");
echo $log->status();
echo "<pre>" . print_r($distances, true) . "</pre>";
echo "<pre>" . print_r($distances2, true) . "</pre>";
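The Logger class is not shown here, and a single pass over three short strings gives fairly noisy timings. As a rough sketch only, a stand-in that averages over many passes might look like the following. The time_calls function and its $repeat parameter are hypothetical names introduced for illustration, and microtime(true) is assumed to be precise enough for this purpose.

```php
<?php
// Hypothetical timing helper (not the Logger used above): runs $fn over every
// string pair $repeat times and returns the average milliseconds per pass.
function time_calls(callable $fn, array $pairs, int $repeat = 1000): float {
    $start = microtime(true);
    for ($r = 0; $r < $repeat; ++$r) {
        foreach ($pairs as $pair) {
            $fn($pair[0], $pair[1]);
        }
    }
    // microtime(true) returns seconds as a float; convert to ms per pass.
    return (microtime(true) - $start) * 1000.0 / $repeat;
}

// Same three comparisons as the benchmark above.
$pairs = [["kitten", "sitting"], ["sitter", "sitting"], ["sitting", "sitting"]];
$ms = time_calls('levenshtein', $pairs);
printf("levenshtein: %.4f ms per pass\n", $ms);
```

Passing 'levenshtein_rating' instead of 'levenshtein' would time the custom function over the same pairs.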
I recognize that PHP's built-in function will probably always be faster than mine by nature, but is there a way to speed mine up?
My alternative is to run the built-in levenshtein first, then take the best X results and apply the additional prioritization only to those.
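The two-stage alternative could be sketched roughly as follows, assuming PHP 7 or later. Both longest_common_run and rank_candidates are hypothetical names, and the bonus used here (0.1 times the longest run of consecutive shared characters) is only one possible way to reward consecutive matches. It is not the same scoring as the recursive function above.

```php
<?php
// Hypothetical helper: length of the longest run of consecutive characters
// that appears in both strings (longest common substring), O(n*m) time.
function longest_common_run(string $a, string $b): int {
    $al = strlen($a);
    $bl = strlen($b);
    $best = 0;
    $prev = array_fill(0, $bl + 1, 0);
    for ($i = 0; $i < $al; ++$i) {
        $curr = array_fill(0, $bl + 1, 0);
        for ($j = 0; $j < $bl; ++$j) {
            if ($a[$i] === $b[$j]) {
                // Extend the run that ended at the previous character pair.
                $curr[$j + 1] = $prev[$j] + 1;
                if ($curr[$j + 1] > $best) $best = $curr[$j + 1];
            }
        }
        $prev = $curr;
    }
    return $best;
}

// Hypothetical two-stage ranking: the cheap built-in distance for every
// candidate, then the consecutive-match bonus only for the best $top of them.
// Returns a list of [candidate, score] pairs, lowest score first.
function rank_candidates(string $needle, array $candidates, int $top = 100, float $bonus = 0.1): array {
    $byScore = function ($x, $y) { return $x[1] <=> $y[1]; };

    $scored = [];
    foreach ($candidates as $candidate) {
        $scored[] = [$candidate, levenshtein($needle, $candidate)];
    }
    usort($scored, $byScore);

    $shortlist = array_slice($scored, 0, $top);
    foreach ($shortlist as $k => $row) {
        $shortlist[$k][1] = $row[1] - $bonus * longest_common_run($needle, $row[0]);
    }
    usort($shortlist, $byScore);
    return $shortlist;
}
```

With "sitting" as the needle and the three benchmark strings as candidates, this ordering puts "sitting" first, then "sitter", then "kitten", since "sitter" shares a four-character run ("sitt") with the needle and "kitten" only three ("itt"). The expensive part runs on at most $top strings, so the total cost stays close to that of the built-in function.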
Based on Leigh's comment, copying PHP's built-in implementation of Levenshtein brought the time down to 3ms. (EDIT: Posted the version with consecutive character deductions. This may need tweaking, but it appears to work.)
function levenshtein_rating($s1, $s2, $cons = 0, $cost_ins = 1, $cost_rep = 1, $cost_del = 1) {
$s1l = strlen($s1);
$s2l = strlen($s2);
if ($s1l == 0) return $s2l;
if ($s2l == 0) return $s1l;
$p1 = array();
$p2 = array();
for ($i2 = 0; $i2 <= $s2l; ++$i2) {
$p1[$i2] = $i2 * $cost_ins;
}
$cons = 0;
$cons_count = 0;
$cln = 0;
$tbl = $s1;
$lst = false;
for ($i1 = 0; $i1 < $s1l; ++$i1) {
$p2[0] = $p1[0] + $cost_del;
$srch = true;
for ($i2 = 0; $i2 < $s2l; ++$i2) {
$c0 = $p1[$i2] + (($s1[$i1] == $s2[$i2]) ? 0 : $cost_rep);
if ($srch && $s2[$i2] == $tbl[$i1]) {
$tbl[$i1] = "\0";
$srch = false;
$cln += ($cln == 0) ? 1 : $cln * 1;
}
$c1 = $p1[$i2 + 1] + $cost_del;
if ($c1 < $c0) $c0 = $c1;
$c2 = $p2[$i2] + $cost_ins;
if ($c2 < $c0) $c0 = $c2;
$p2[$i2 + 1] = $c0;
}
if (!$srch && $lst) {
$cons_count += $cln;
$cln = 0;
}
$lst = $srch;
$tmp = $p1;
$p1 = $p2;
$p2 = $tmp;
}
$cons_count += $cln;
$cons = -1 * ($cons_count * 0.1);
return $p1[$s2l] + $cons;
}