dongye9191
2014-04-27 04:13

GoLang PoS tagger script takes longer than it should, with no output in the terminal

Accepted

This script compiles without errors on play.golang.org: http://play.golang.org/p/Hlr-IAc_1f

But when I run it on my machine, it takes much longer than I expect, and nothing is printed in the terminal.

What I am trying to build is a PartOfSpeech Tagger.

I think the slowest part is loading lexicon.txt into a map and then comparing each word with every word there to see if it has already been tagged in the lexicon. The lexicon only contains verbs, but doesn't every word need to be checked to see if it is a verb?

The larger problem is that I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.


2 answers

  • dsadsadsa1231, 7 years ago

    You've got a large array argument in this function:

    func stringInArray(a string, list [214]string) bool {
        for _, b := range list {
            if b == a {
                return true
            }
        }
        return false
    }
    

    Because Go passes arrays by value, the whole 214-element array of stopwords gets copied each time you call this function.

    In Go, you should use slices rather than arrays most of the time. Change the parameter to list []string and define stopWords as a slice rather than an array:

    stopWords := []string{
        "and", "or", ...
    }
    

    Probably an even better approach would be to build a map of the stopWords:

    isStopWord := map[string]bool{}
    for _, sw := range stopWords {
        isStopWord[sw] = true
    }
    

    and then you can check if a word is a stopword quickly:

    if isStopWord[word] { ... }
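
    Putting these pieces together, a minimal sketch might look like the following. The helper names newStopSet and filterStopWords are hypothetical (they are not in the original script), and lowercasing and splitting on whitespace are assumptions about how the input is tokenized.

```go
package main

import "strings"

// newStopSet builds a lookup set from a slice of stopwords.
// (Hypothetical helper; the original script defines no such function.)
func newStopSet(words []string) map[string]bool {
	set := make(map[string]bool, len(words))
	for _, w := range words {
		set[w] = true
	}
	return set
}

// filterStopWords lowercases text, splits it on whitespace, and returns
// the words that are not in the stop set. Each lookup is a single map
// access instead of a scan over the whole stopword list.
func filterStopWords(text string, stop map[string]bool) []string {
	var kept []string
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if !stop[w] {
			kept = append(kept, w)
		}
	}
	return kept
}
```

    For example, with newStopSet([]string{"and", "or", "the"}), calling filterStopWords("The cat and the dog", stop) keeps only "cat" and "dog". Maps are reference-like values in Go, so passing the set to a function does not copy its contents.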
    
  • dth20986, 7 years ago

    (Quoting):

    I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.

    I can't speak to any issues in your Go implementation, but I'll address the larger problem of POS tagging in general. It sounds like you're attempting to build a rule-based unigram tagger. To elaborate a bit on those terms:

    • "unigram" means you're considering each word in the sentence separately. Note that a unigram tagger is inherently limited, in that it cannot disambiguate words which can take on multiple POS tags. E.g., should you tag 'fish' as a noun or a verb? Is 'last' a verb or an adverb?
    • "rule-based" means exactly what it sounds like: a set of rules to determine the tag for each word. Rule-based tagging is limited in a different way - it requires considerable development effort to assemble a ruleset that will handle a reasonable portion of the ambiguity in common language. This effort might be appropriate if you're working in a language for which we don't have good training resources, but in most common languages, we now have enough tagged text to train high-accuracy tagging models.

    State-of-the-art for POS tagging is above 97% accuracy on well-formed newswire text (accuracy on less formal genres is naturally lower). A rule-based tagger will probably perform considerably worse (you'll have to determine the accuracy level needed to meet your requirements). If you want to continue down the rule-based path, I'd recommend reading this tutorial. The code is written in Haskell, but it will help you learn the concepts and issues in rule-based tagging.

    That said, I'd strongly recommend you look at other tagging methods. I mentioned the weaknesses of unigram tagging. Related approaches would be 'bigram', meaning that we consider the previous word when tagging word n, 'trigram' (usually the previous 2 words, or the previous word, the current word, and the following word); more generally, 'n-gram' refers to considering a sequence of n words (often, a sliding window around the word we're currently tagging). That context can help us disambiguate 'fish', 'last', 'flies', etc.

    E.g., in

    We fish

    we probably want to tag fish as a verb, whereas in

    ate fish

    it's certainly a noun.
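
    To make the idea concrete, here is a minimal sketch of a bigram tagger in the sense used above (the previous word is the context), backing off to a unigram table when the pair is unknown. The tables are tiny, hand-written assumptions for illustration; a real tagger would estimate them from a tagged corpus. The names bigramKey, bigramTags, unigramTags and tagBigram are hypothetical.

```go
package main

// bigramKey pairs the previous word with the current word.
type bigramKey struct{ prev, word string }

// Toy lookup tables (illustrative assumptions, not trained from data).
var bigramTags = map[bigramKey]string{
	{"we", "fish"}:  "VERB",
	{"ate", "fish"}: "NOUN",
}

var unigramTags = map[string]string{
	"we":   "PRON",
	"ate":  "VERB",
	"fish": "NOUN",
}

// tagBigram tags each word using the previous word as context.
// It backs off to the unigram table, and finally to "UNK", when the
// (previous word, word) pair has not been seen.
func tagBigram(words []string) []string {
	tags := make([]string, len(words))
	prev := "" // empty string marks the start of the sentence
	for i, w := range words {
		if t, ok := bigramTags[bigramKey{prev, w}]; ok {
			tags[i] = t
		} else if t, ok := unigramTags[w]; ok {
			tags[i] = t
		} else {
			tags[i] = "UNK"
		}
		prev = w
	}
	return tags
}
```

    With these tables, tagBigram([]string{"we", "fish"}) tags fish as VERB, while tagBigram([]string{"ate", "fish"}) tags it as NOUN. Many practical n-gram taggers condition on the previous tags rather than the previous words, which needs far fewer table entries.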

    The NLTK tutorial might be a good reference here. A solid n-gram tagger should get you above 90% accuracy; likely above 95% (again on newswire text).

    More sophisticated methods (known as 'structured inference') consider the entire tag sequence as a whole. That is, instead of trying to predict the most probable tag for each word separately, they attempt to predict the most probable sequence of tags for the entire input sequence. Structured inference is of course more difficult to implement and train, but will usually improve accuracy vs. n-gram approaches. If you want to read up on this area, I suggest Sutton and McCallum's excellent introduction.
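
    As a rough illustration of predicting the whole tag sequence at once, the sketch below runs Viterbi decoding over a toy hidden Markov model. This is only one classic sequence model, chosen here for brevity, and is not necessarily the method covered in the reference above. All probabilities in toyModel are invented for the example, and the names hmm, logp, toyModel and viterbi are hypothetical.

```go
package main

import "math"

// hmm holds log-probabilities for a toy hidden Markov model.
type hmm struct {
	tags  []string
	start map[string]float64            // log P(tag at position 0)
	trans map[string]map[string]float64 // log P(next tag | previous tag)
	emit  map[string]map[string]float64 // log P(word | tag)
}

// logp returns m[k], or a fixed penalty for unseen events
// (a crude stand-in for proper smoothing).
func logp(m map[string]float64, k string) float64 {
	if v, ok := m[k]; ok {
		return v
	}
	return -20
}

// toyModel builds a three-tag model with made-up probabilities.
func toyModel() hmm {
	l := math.Log
	return hmm{
		tags:  []string{"PRON", "VERB", "NOUN"},
		start: map[string]float64{"PRON": l(0.6), "VERB": l(0.3), "NOUN": l(0.1)},
		trans: map[string]map[string]float64{
			"PRON": {"PRON": l(0.1), "VERB": l(0.8), "NOUN": l(0.1)},
			"VERB": {"PRON": l(0.2), "VERB": l(0.2), "NOUN": l(0.6)},
			"NOUN": {"PRON": l(0.2), "VERB": l(0.5), "NOUN": l(0.3)},
		},
		emit: map[string]map[string]float64{
			"PRON": {"we": l(1.0)},
			"VERB": {"ate": l(0.6), "fish": l(0.4)},
			"NOUN": {"fish": l(1.0)},
		},
	}
}

// viterbi returns the highest-scoring tag sequence for words under m.
func viterbi(m hmm, words []string) []string {
	if len(words) == 0 {
		return nil
	}
	n, k := len(words), len(m.tags)
	score := make([][]float64, n) // score[i][j]: best log-prob ending in tag j at word i
	back := make([][]int, n)      // back[i][j]: previous tag index on that best path
	for i := range score {
		score[i] = make([]float64, k)
		back[i] = make([]int, k)
	}
	for j, t := range m.tags {
		score[0][j] = logp(m.start, t) + logp(m.emit[t], words[0])
	}
	for i := 1; i < n; i++ {
		for j, t := range m.tags {
			best, arg := math.Inf(-1), 0
			for p, pt := range m.tags {
				if s := score[i-1][p] + logp(m.trans[pt], t); s > best {
					best, arg = s, p
				}
			}
			score[i][j] = best + logp(m.emit[t], words[i])
			back[i][j] = arg
		}
	}
	// Pick the best final tag, then follow back-pointers to the start.
	last := 0
	for j := range m.tags {
		if score[n-1][j] > score[n-1][last] {
			last = j
		}
	}
	out := make([]string, n)
	for i := n - 1; i >= 0; i-- {
		out[i] = m.tags[last]
		last = back[i][last]
	}
	return out
}
```

    With this toy model, viterbi(toyModel(), []string{"we", "fish"}) yields PRON VERB and viterbi(toyModel(), []string{"ate", "fish"}) yields VERB NOUN. Unlike the per-word approaches, the choice for each position is scored jointly with its neighbours through the transition probabilities.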
