> I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.
I can't speak to any issues in your Go implementation, but I'll address the larger problem of POS tagging in general. It sounds like you're attempting to build a rule-based unigram tagger. To elaborate a bit on those terms:
- "unigram" means you're considering each word in the sentence separately. Note that a unigram tagger is inherently limited, in that it cannot disambiguate words which can take on multiple POS tags. E.g., should you tag 'fish' as a noun or a verb? Is 'last' a verb or an adverb?
- "rule-based" means exactly what it sounds like: a set of rules to determine the tag for each word. Rule-based tagging is limited in a different way - it requires considerable development effort to assemble a ruleset that will handle a reasonable portion of the ambiguity in common language. This effort might be appropriate if you're working in a language for which we don't have good training resources, but in most common languages, we now have enough tagged text to train high-accuracy tagging models.
State-of-the-art for POS tagging is above 97% accuracy on well-formed newswire text (accuracy on less formal genres is naturally lower). A rule-based tagger will probably perform considerably worse (you'll have to determine the accuracy level needed to meet your requirements). If you want to continue down the rule-based path, I'd recommend reading this tutorial. The code is written in Haskell, but it will help you learn the concepts and issues involved in rule-based tagging.
That said, I'd strongly recommend you look at other tagging methods. I mentioned the weaknesses of unigram tagging. Related approaches are 'bigram' tagging, meaning we also consider the previous word when tagging word n, and 'trigram' tagging (usually the previous two words, or the previous word, the current word, and the following word). More generally, 'n-gram' tagging considers a sequence of n words, often a sliding window around the word currently being tagged. That context can help us disambiguate 'fish', 'last', 'flies', etc.
In a sentence like 'They fish for trout every weekend', we probably want to tag 'fish' as a verb, whereas in 'We ate fish for dinner' it's certainly a noun.
The NLTK tutorial might be a good reference here. A solid n-gram tagger should get you above 90% accuracy, likely above 95% (again, on newswire text).
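If you do go the NLTK route, a minimal sketch of a backoff n-gram tagger might look like the following; the Brown 'news' corpus and the 90/10 split are just illustrative choices, not recommendations:

```python
# Sketch of an n-gram tagger in NLTK, trained on the tagged Brown corpus
# ('news' category as a stand-in for newswire text). The corpus choice and
# the 90/10 split are illustrative assumptions.
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

tagged_sents = list(brown.tagged_sents(categories="news"))
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Back off from bigram to unigram to a default tag, so contexts unseen in
# training still receive a guess instead of None.
default = nltk.DefaultTagger("NN")
unigram = nltk.UnigramTagger(train_sents, backoff=default)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

print(bigram.accuracy(test_sents))  # called `evaluate()` in older NLTK releases
print(bigram.tag("They fish for trout".split()))
```

The backoff chain matters: a plain bigram tagger leaves any context it never saw in training untagged, so stacking it on a unigram and default tagger keeps coverage high.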
More sophisticated methods (known as 'structured inference') consider the entire tag sequence as a whole. That is, instead of trying to predict the most probable tag for each word separately, they attempt to predict the most probable sequence of tags for the entire input sequence. Structured inference is of course more difficult to implement and train, but will usually improve accuracy vs. n-gram approaches. If you want to read up on this area, I suggest Sutton and McCallum's excellent introduction.
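As a rough illustration of decoding a whole sequence at once, here's a sketch using NLTK's HMM tagger; Sutton and McCallum's introduction covers CRFs, and the HMM is only a simpler relative that also picks the jointly most probable tag sequence (via Viterbi decoding). The corpus and split below are again just assumptions:

```python
# Sketch of whole-sequence decoding: NLTK's HMM tagger uses Viterbi search
# to find the single most probable tag sequence for the whole sentence,
# rather than tagging each word independently. Corpus/split are assumptions.
import nltk
from nltk.corpus import brown
from nltk.tag import hmm

nltk.download("brown", quiet=True)

tagged_sents = list(brown.tagged_sents(categories="news"))
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_sents)

# The tag chosen for 'fish' now depends on its neighbours, because the
# decoder scores the tag sequence as a whole.
print(hmm_tagger.tag("They fish for trout".split()))
```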