doubeng3216
2010-10-13 23:38

Schinke Latin stemming algorithm for PHP

  • function
  • php
  • algorithm
Accepted

This website offers the "Schinke Latin stemming algorithm" for download, for use with the Snowball stemming system.

I want to use this algorithm, but I don't want to use Snowball.

The good news: that page includes some pseudocode which can be translated into a PHP function. This is what I've tried:

<?php
function stemLatin($word) {
    // output = array(NOUN-BASED STEM, VERB-BASED STEM)
    // DEFINE CLASSES BEGIN
    $queWords = array('atque', 'quoque', 'neque', 'itaque', 'absque', 'apsque', 'abusque', 'adaeque', 'adusque', 'denique', 'deque', 'susque', 'oblique', 'peraeque', 'plenisque', 'quandoque', 'quisque', 'quaeque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quaque', 'quique', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'quotusquisque', 'quousque', 'ubique', 'undique', 'usque', 'uterque', 'utique', 'utroque', 'utribique', 'torque', 'coque', 'concoque', 'contorque', 'detorque', 'decoque', 'excoque', 'extorque', 'obtorque', 'optorque', 'retorque', 'recoque', 'attorque', 'incoque', 'intorque', 'praetorque');
    $suffixesA = array('ibus, 'ius, 'ae, 'am, 'as, 'em', 'es', ia', 'is', 'nt', 'os', 'ud', 'um', 'us', 'a', 'e', 'i', 'o', 'u');
    $suffixesB = array('iuntur', 'beris', 'erunt', 'untur', 'iunt', 'mini', 'ntur', 'stis', 'bor', 'ero', 'mur', 'mus', 'ris', 'sti', 'tis', 'tur', 'unt', 'bo', 'ns', 'nt', 'ri', 'm', 'r', 's', 't');
    // DEFINE CLASSES END
    $word = strtolower(trim($word)); // make string lowercase + remove white spaces before and behind
    $word = str_replace('j', 'i', $word); // replace all <j> by <i>
    $word = str_replace('v', 'u', $word); // replace all <v> by <u>
    if (substr($word, -3) == 'que') { // if word ends with -que
        if (in_array($word, $queWords)) { // if word is a queWord
            return array($word, $word); // output queWord as both noun-based and verb-based stem
        }
        else {
            $word = substr($word, 0, -3); // remove the -que
        }
    }
    foreach ($suffixesA as $suffixA) { // remove suffixes for noun-based forms (list A)
        if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
            $word = substr($word, 0, -strlen($suffixA)); // remove the suffix
            break; // remove only one suffix
        }
    }
    if (strlen($word) >= 2) { $nounBased = $word; } else { $nounBased = ''; } // add only if word contains two or more characters
    foreach ($suffixesB as $suffixB) { // remove suffixes for verb-based forms (list B)
        if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
            switch ($suffixB) {
                case 'iuntur', 'erunt', 'untur', 'iunt', 'unt': $word = substr($word, 0, -strlen($suffixB)).'i'; break; // replace suffix by <i>
                case 'beris', 'bor', 'bo': $word = substr($word, 0, -strlen($suffixB)).'bi'; break; // replace suffix by <bi>
                case 'ero': $word = substr($word, 0, -strlen($suffixB)).'eri'; break; // replace suffix by <eri>
                default: $word = substr($word, 0, -strlen($suffixB)); break; // remove the suffix
            }
            break; // remove only one suffix
        }
    }
    if (strlen($word) >= 2) { $verbBased = $word; } else { $verbBased = ''; } // add only if word contains two or more characters
    return array($nounBased, $verbBased);
}
?>

My questions:

1) Will this code work correctly? Does it follow the algorithm's rules?

2) How could the code be improved in terms of performance?

Thank you very much in advance!


2 answers

  • dongshiru5913 11 years ago

    No, your function will not work: it contains syntax errors. For example, several quotes are unclosed, and the switch statement uses invalid syntax (PHP does not accept a comma-separated list of values in a single case).

    Here is my rewrite of the function. Because the pseudocode on that page isn't very precise, I had to do some interpreting. I interpreted it so that the examples mentioned in the article work.

    I also did some optimizations. The first is that I declare the word and suffix arrays as static. All calls to the function therefore share the same arrays, which should be good for performance ;)

    Furthermore, I adjusted the arrays so they can be used more effectively. I changed the $queWords array so it can be used for a fast hash-table lookup (isset on the key) instead of a slow in_array. I have also stored the length of each suffix in the array, so you don't need to compute it at runtime (which is really, really slow). I may have made some other minor optimizations.
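
    The two array changes described above can be sketched in isolation as follows. This is only an illustration under assumptions: the three-word list, the variable names, and the helper stripFirstSuffix are invented for this example and are not part of the rewritten function.

    ```php
    <?php
    // Hypothetical three-word sample of the -que word list, for illustration only.
    $list = array('atque', 'quoque', 'neque');

    // in_array() compares against the values one by one (linear in the list size).
    $foundByScan = in_array('neque', $list);

    // array_fill_keys() turns the words into keys, so isset() is a single hash lookup.
    $set = array_fill_keys($list, 1);
    $foundByKey = isset($set['neque']);

    // Storing each suffix's length as the array value avoids calling strlen() in the loop.
    $suffixes = array('ibus' => 4, 'ius' => 3, 'us' => 2);

    // Hypothetical helper: strip the first listed suffix that the word ends with.
    function stripFirstSuffix($word, array $suffixes) {
        foreach ($suffixes as $suffix => $length) {
            if (substr($word, -$length) === $suffix) {
                return substr($word, 0, -$length);
            }
        }
        return $word; // no listed suffix matched
    }

    var_dump($foundByScan, $foundByKey);
    var_dump(stripFirstSuffix('portibus', $suffixes));
    ```

    Because the suffixes are listed longest first, 'ibus' is tried before 'us', which is why the order of the array entries matters.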

    I haven't measured how much faster this code is, but it should be much faster. It also now works on the examples provided.

    Here is the code:

    <?php
        function stemLatin($word) {
            static $queWords = array(
                'atque'         => 1,
                'quoque'        => 1,
                'neque'         => 1,
                'itaque'        => 1,
                'absque'        => 1,
                'apsque'        => 1,
                'abusque'       => 1,
                'adaeque'       => 1,
                'adusque'       => 1,
                'denique'       => 1,
                'deque'         => 1,
                'susque'        => 1,
                'oblique'       => 1,
                'peraeque'      => 1,
                'plenisque'     => 1,
                'quandoque'     => 1,
                'quisque'       => 1,
                'quaeque'       => 1,
                'cuiusque'      => 1,
                'cuique'        => 1,
                'quemque'       => 1,
                'quamque'       => 1,
                'quaque'        => 1,
                'quique'        => 1,
                'quorumque'     => 1,
                'quarumque'     => 1,
                'quibusque'     => 1,
                'quosque'       => 1,
                'quasque'       => 1,
                'quotusquisque' => 1,
                'quousque'      => 1,
                'ubique'        => 1,
                'undique'       => 1,
                'usque'         => 1,
                'uterque'       => 1,
                'utique'        => 1,
                'utroque'       => 1,
                'utribique'     => 1,
                'torque'        => 1,
                'coque'         => 1,
                'concoque'      => 1,
                'contorque'     => 1,
                'detorque'      => 1,
                'decoque'       => 1,
                'excoque'       => 1,
                'extorque'      => 1,
                'obtorque'      => 1,
                'optorque'      => 1,
                'retorque'      => 1,
                'recoque'       => 1,
                'attorque'      => 1,
                'incoque'       => 1,
                'intorque'      => 1,
                'praetorque'    => 1,
            );
            static $suffixesNoun = array(
                'ibus' => 4,
                'ius'  => 3,
                'ae'   => 2,
                'am'   => 2,
                'as'   => 2,
                'em'   => 2,
                'es'   => 2,
                'ia'   => 2,
                'is'   => 2,
                'nt'   => 2,
                'os'   => 2,
                'ud'   => 2,
                'um'   => 2,
                'us'   => 2,
                'a'    => 1,
                'e'    => 1,
                'i'    => 1,
                'o'    => 1,
                'u'    => 1,
            );
            static $suffixesVerb = array(
                'iuntur' => 6,
                'beris'  => 5,
                'erunt'  => 5,
                'untur'  => 5,
                'iunt'   => 4,
                'mini'   => 4,
                'ntur'   => 4,
                'stis'   => 4,
                'bor'    => 3,
                'ero'    => 3,
                'mur'    => 3,
                'mus'    => 3,
                'ris'    => 3,
                'sti'    => 3,
                'tis'    => 3,
                'tur'    => 3,
                'unt'    => 3,
                'bo'     => 2,
                'ns'     => 2,
                'nt'     => 2,
                'ri'     => 2,
                'm'      => 1,
                'r'      => 1,
                's'      => 1,
                't'      => 1,
            );
    
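            $word = strtr(strtolower(trim($word)), 'jv', 'iu'); // trim, lowercase and j => i, v => u
    
            if (substr($word, -3) == 'que') {
                if (isset($queWords[$word])) {
                    return array($word, $word);
                }
                $word = substr($word, 0, -3);
            }
    
            // Default both stems to the normalized word (after any -que removal),
            // so a word with no matching suffix is still returned normalized.
            $stems = array($word, $word);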
    
            foreach ($suffixesNoun as $suffix => $length) {
                if (substr($word, -$length) == $suffix) {
                    $tmp = substr($word, 0, -$length);
    
                    if (isset($tmp[1]))
                        $stems[0] = $tmp;
                    break;
                }
            }
    
            foreach ($suffixesVerb as $suffix => $length) {
                if (substr($word, -$length) == $suffix) {
                    switch ($suffix) {
                        case 'iuntur':
                        case 'erunt':
                        case 'untur':
                        case 'iunt':
                        case 'unt':
                            $tmp = substr_replace($word, 'i', -$length, $length);
                        break;
                        case 'beris':
                        case 'bor':
                        case 'bo':
                            $tmp = substr_replace($word, 'bi', -$length, $length);
                        break;
                        case 'ero':
                            $tmp = substr_replace($word, 'eri', -$length, $length);
                        break;
                        default:
                            $tmp = substr($word, 0, -$length);
                    }
    
                    if (isset($tmp[1]))
                        $stems[1] = $tmp;
                    break;
                }
            }
    
            return $stems;
        }
    
        var_dump(stemLatin('aquila'));
        var_dump(stemLatin('portat'));
        var_dump(stemLatin('portis'));
    

  • dpj83664 11 years ago

    As far as I can tell, this follows the algorithm described in your link, and should work correctly. (Apart from the syntax error you have in the definition of $suffixesA - you're missing a couple of apostrophes.)

    Performance-wise, it doesn't look like there's much to gain here, but there are a few things that come to mind.

    If this is going to get called many times during a single execution of the script, there might be something gained by defining these arrays outside of the function - I don't think PHP is smart enough to cache those arrays between calls to the function.
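
    Two ways of avoiding a rebuild of the arrays on every call can be sketched like this. It is only an illustration: the function names and the two-word list are invented here, and whether either variant is measurably faster would have to be benchmarked.

    ```php
    <?php
    // Hypothetical shortened word list, defined once at file scope.
    $queWords = array('atque' => 1, 'neque' => 1);

    // Option 1: define the array outside and pass it in as a parameter.
    // PHP arrays are copy-on-write, so passing one does not duplicate it.
    function isQueWordParam($word, array $queWords) {
        return isset($queWords[$word]);
    }

    // Option 2: a static local is initialized once and kept between calls.
    function isQueWordStatic($word) {
        static $words = array('atque' => 1, 'neque' => 1);
        return isset($words[$word]);
    }

    var_dump(isQueWordParam('atque', $queWords));  // bool(true)
    var_dump(isQueWordStatic('populusque'));       // bool(false)
    ```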

    You can also combine those two str_replaces into one: $word = str_replace(array('j','v'), array('i','u'), $word);, or, since you're replacing single characters with single characters, you can use $word = strtr($word,'jv','iu'); - but I don't think that will make much difference in practice. You'll have to try it out to be certain.
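
    A quick check that the three variants give the same result, using an arbitrary sample word:

    ```php
    <?php
    // Arbitrary sample word containing both letters to be replaced.
    $word = 'juvenis';

    // Variant from the question: two separate str_replace() calls.
    $viaTwoCalls = str_replace('v', 'u', str_replace('j', 'i', $word));

    // One str_replace() call with parallel search/replace arrays.
    $viaArrays = str_replace(array('j', 'v'), array('i', 'u'), $word);

    // strtr() maps single characters: j => i, v => u.
    $viaStrtr = strtr($word, 'jv', 'iu');

    var_dump($viaTwoCalls === $viaArrays && $viaArrays === $viaStrtr); // bool(true)
    ```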
