Part of Speech Tagging and Brill Tagging
For some people, part of speech (POS) tagging is quite obviously a useful NLP feature. I would have been one of them in middle school, when I suffered through extremely hard grammar tests. I still shudder today when I recall those dreadful times spent learning, and being continuously tested on, the 4 levels of grammar: parts of speech, sentences, phrases, and clauses.
Yet here I am in this blog trying to learn, understand, and explain how the first level of grammar analysis (parts of speech) can be carried out by computers and machines, without human assistance.
Developing a part of speech tagging program is certainly interesting because it gives some insight into how we learned these concepts in the first place. The way the human mind learns is not at all similar to the way a computer program does, so we have to observe patterns in language ourselves and then apply them in the program.
However, there’s a limit to how far human-identified patterns and rules can take us. That limit marks the line between the two types of part of speech taggers:
- Rule-based POS taggers
- Stochastic POS taggers
Rule-based POS Taggers
Rule-based part of speech taggers rely on a set of rules that a person curates by hand in order to identify a word’s grammatical role in a sentence. However, these rules cannot apply to every use case, because the role of a word is not absolute and instead depends on its context and how it is used within the sentence. Thus, it can be extremely tedious to create a rule-based system whose rules fire only under certain conditions: it is not guaranteed to be very accurate, and it is not easy to analyze pure grammatical structure and rules.
For example, one rule could be based on first identifying articles. Articles are parts of speech like ‘a’, ‘an’, and ‘the’. They form a small, strict list of words that are known to come before nouns. Thus, they are a great starting point: running every word through an articles list identifies the articles and, in turn, points to likely nouns right after them.
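To make the article rule concrete, here is a minimal sketch of it. The tag names (ARTICLE, NOUN, UNKNOWN) and the function name are my own choices for illustration, and the "word right after an article is a noun" guess is deliberately naive.

```python
# Hypothetical tag names and function name, chosen only for this sketch.
ARTICLES = {"a", "an", "the"}


def tag_with_article_rule(tokens):
    """Tag articles from a fixed list, then guess that the next word is a noun."""
    tags = []
    for i, token in enumerate(tokens):
        if token.lower() in ARTICLES:
            tags.append("ARTICLE")
        elif i > 0 and tokens[i - 1].lower() in ARTICLES:
            # Naive assumption: whatever directly follows an article is a noun.
            tags.append("NOUN")
        else:
            tags.append("UNKNOWN")
    return list(zip(tokens, tags))


tagged = tag_with_article_rule("The dog chased a ball".split())
print(tagged)
```

Running the same function on "the big dog" tags "big" as a noun, which is exactly the kind of context problem that makes hand-written rules tedious to get right.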
Brill Tagging
The Brill tagger is an interesting rule-based POS tagger that works with multiple different sets of POS tagging rules. Despite being rule-based, Brill’s tagger acknowledges its limitations by “automatically recognizing and remedying its weaknesses”, in the words of Eric Brill himself.
Brill tagging starts with 2 initial procedures for words it has not seen before:
- Words that are not in the training corpus (the collection of tagged text the tagger learns from) and are capitalized tend to be proper nouns, so they are tagged accordingly.
- Every other word not in the training corpus is assigned the tag most common among words ending in the same 3 letters.
Ex. ______ous = adjective — enormous, numerous, tremendous, …
Ex. ______ion = noun — rejection, objection, inflammation, …
Ex. _______ly = adverb
According to Eric Brill, this simple algorithm itself has an error rate of about 5–8%!
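The two procedures above can be sketched in a few lines. This is only my reading of them, under some assumptions: words that are in the training corpus simply get their most frequent tag (as I understand Brill's initial tagger), the tag names are made up, and the fallback tag for a never-seen suffix is NOUN.

```python
from collections import Counter, defaultdict


def train_initial_tagger(tagged_corpus):
    """Collect the most frequent tag per word and per 3-letter ending."""
    word_counts = defaultdict(Counter)
    suffix_counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        word_counts[word][tag] += 1
        suffix_counts[word[-3:].lower()][tag] += 1
    word_tag = {w: c.most_common(1)[0][0] for w, c in word_counts.items()}
    suffix_tag = {s: c.most_common(1)[0][0] for s, c in suffix_counts.items()}
    return word_tag, suffix_tag


def initial_tag(word, word_tag, suffix_tag, default="NOUN"):
    """Tag one non-empty word; `default` is an assumption of this sketch."""
    if word in word_tag:
        # Assumption: known words get their most frequent training tag.
        return word_tag[word]
    if word[0].isupper():
        # Procedure 1: unknown and capitalized, so guess proper noun.
        return "PROPER_NOUN"
    # Procedure 2: unknown, so use the most common tag for the same ending.
    return suffix_tag.get(word[-3:].lower(), default)


# Tiny made-up training corpus of (word, tag) pairs.
corpus = [
    ("enormous", "ADJ"), ("numerous", "ADJ"),
    ("rejection", "NOUN"), ("objection", "NOUN"),
    ("run", "VERB"), ("run", "NOUN"), ("run", "VERB"),
]
word_tag, suffix_tag = train_initial_tagger(corpus)

for w in ["run", "tremendous", "inflammation", "Boston"]:
    print(w, initial_tag(w, word_tag, suffix_tag))
```

Note that a capitalized word at the start of a sentence would also be guessed as a proper noun here; a real tagger would need to handle that case.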
Next comes a patch acquisition procedure. The tagger is trained on one corpus and then run on a separate patch corpus that has already been correctly annotated. A list of tagging errors is compiled by comparing the output of the tagger to the correct tagging of the patch corpus. The list consists of elements <tag_a, tag_b, number>, meaning that a word was tagged tag_a when it should have been tagged tag_b, the specified number of times.
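Compiling the error list is mostly a counting exercise. Here is a small sketch, assuming the tagger's output and the correct tags are two parallel lists; the function name and the example tags are hypothetical.

```python
from collections import Counter


def compile_error_list(predicted_tags, gold_tags):
    """Return (tag_a, tag_b, number) triples, most frequent error first.

    tag_a is what the tagger assigned, tag_b is the correct tag.
    """
    errors = Counter(
        (pred, gold)
        for pred, gold in zip(predicted_tags, gold_tags)
        if pred != gold
    )
    return [(tag_a, tag_b, n) for (tag_a, tag_b), n in errors.most_common()]


# Made-up tagger output and correct tags for a 5-word patch corpus.
predicted = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]
gold = ["VERB", "VERB", "VERB", "ADJ", "ADJ"]

error_list = compile_error_list(predicted, gold)
print(error_list)
```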
As I mentioned when introducing the Brill tagger, there are different sets of POS tagging rules. For each error in the list, the Brill tagger computes how much each candidate rule would reduce the errors and selects the best rule for that error. The rules that result in the greatest improvement on the patch corpus are added to the set of POS tagging rules for that specific Brill model. Once the final set of rules has been acquired by training the Brill model, new text can be tagged by applying only those rules. Note that there is no need to be too cautious about specific rules, because only the rules that show the best results (the largest reduction in errors) are ever selected. This also makes it easy to experiment with new rules and check the results.
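The selection step can be sketched as a simple greedy loop. This is a simplified illustration and not Brill's exact procedure: I assume a single hypothetical rule format, (from_tag, to_tag, prev_tag), meaning "change from_tag to to_tag when the previous word is tagged prev_tag", and I score each rule by how many errors remain on the patch corpus after applying it.

```python
def apply_rule(tags, rule):
    """Apply one (from_tag, to_tag, prev_tag) rule and return new tags."""
    from_tag, to_tag, prev_tag = rule
    new_tags = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the tags as they were before the rule ran.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags


def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(initial_tags, gold, candidate_rules):
    """Greedily keep the rule that removes the most errors, until none helps."""
    learned = []
    current = list(initial_tags)
    while True:
        best_rule, best_errors = None, count_errors(current, gold)
        for rule in candidate_rules:
            errors = count_errors(apply_rule(current, rule), gold)
            if errors < best_errors:
                best_rule, best_errors = rule, errors
        if best_rule is None:
            return learned, current
        learned.append(best_rule)
        current = apply_rule(current, best_rule)


# Made-up patch corpus for "I want to race the car".
initial = ["PRON", "VERB", "TO", "NOUN", "ARTICLE", "NOUN"]
gold = ["PRON", "VERB", "TO", "VERB", "ARTICLE", "NOUN"]
candidates = [
    ("NOUN", "VERB", "TO"),       # fixes "race"
    ("NOUN", "VERB", "ARTICLE"),  # would break "car"
    ("VERB", "NOUN", "PRON"),     # would break "want"
]

learned, final_tags = learn_rules(initial, gold, candidates)
print(learned)
```

The two harmful candidates are never picked because they raise the error count, which is why it is safe to throw in speculative rules: a rule only survives if it measurably helps on the patch corpus.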
Although rule-based taggers are usually less accurate than stochastic taggers, the Brill tagger uses statistics about its own errors as a means of matching the results of stochastic taggers.