NLP | Training Tagger Based Chunker | Set 2

  • Last Updated : 25 Jul, 2022

The conll2000 corpus defines its chunks using IOB tags.

  • IOB tags specify where each chunk begins and ends, along with its type.
  • A part-of-speech tagger can be trained on these IOB tags and then used to power a ChunkParserI subclass.
  • First, the corpus's chunked_sents() method returns a list of Tree objects, which are then transformed into a format that a part-of-speech tagger can use.
  • tree2conlltags() converts a sentence Tree into a list of 3-tuples of the form (word, pos, iob).
    • pos: the part-of-speech tag.
    • iob: the IOB tag, for example B-NP or I-NP, indicating that the word is at the beginning of or inside a noun phrase, respectively.
  • conlltags2tree() is the reverse of tree2conlltags().
  • The 3-tuples are then converted into 2-tuples that the tagger can recognize.
  • The RegexpParser class uses part-of-speech tags for chunk patterns, so here too the part-of-speech tags are treated as if they were the words to tag.
  • conll_tag_chunks() takes a list of chunked sentence Trees, calls tree2conlltags() on each one, and returns a list of lists of 2-tuples of the form (pos, iob).
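
To make the data shapes concrete, here is a minimal, dependency-free sketch of the conversions described above. It does not use NLTK: a chunked sentence is represented as a plain list instead of a Tree, and the helper names chunks_to_iob(), iob_to_chunks() and drop_words() are hypothetical stand-ins for what tree2conlltags(), conlltags2tree() and conll_tag_chunks() do conceptually.

```python
# Illustrative only: plain lists stand in for nltk.tree.Tree objects.
# A sentence is a list whose items are either
#   (word, pos)                          -> a token outside any chunk
#   (chunk_type, [(word, pos), ...])     -> a chunk such as an NP

def chunks_to_iob(sentence):
    """Flatten a chunked sentence into (word, pos, iob) 3-tuples."""
    triples = []
    for item in sentence:
        if isinstance(item[1], list):
            chunk_type, tokens = item
            for i, (word, pos) in enumerate(tokens):
                prefix = "B-" if i == 0 else "I-"  # first token begins the chunk
                triples.append((word, pos, prefix + chunk_type))
        else:
            word, pos = item
            triples.append((word, pos, "O"))  # "O" marks a token outside all chunks
    return triples


def iob_to_chunks(triples):
    """Rebuild the chunked sentence from (word, pos, iob) 3-tuples."""
    sentence = []
    for word, pos, iob in triples:
        if iob == "O":
            sentence.append((word, pos))
            continue
        chunk_type = iob[2:]
        last = sentence[-1] if sentence else None
        continues = (iob.startswith("I-") and last is not None
                     and isinstance(last[1], list) and last[0] == chunk_type)
        if continues:
            last[1].append((word, pos))
        else:
            sentence.append((chunk_type, [(word, pos)]))
    return sentence


def drop_words(triples):
    """Keep only (pos, iob) pairs: the tagger's 'word' is the POS tag."""
    return [(pos, iob) for (_word, pos, iob) in triples]


sentence = [("NP", [("the", "DT"), ("book", "NN")]), ("fell", "VBD")]
triples = chunks_to_iob(sentence)
pairs = drop_words(triples)
print(triples)
print(pairs)
print(iob_to_chunks(triples) == sentence)
```

The (pos, iob) pairs are what the tagger is trained on, which is why a chunker built this way never sees the words themselves.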

Code #1: Converting between Trees and IOB tag tuples


from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tree import Tree
# conll_tag_chunks() is assumed to live in the same local chunkers module as TagChunker (see Code #2)
from chunkers import conll_tag_chunks

# A one-sentence tree containing a single noun-phrase chunk
t = Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])
print("Tree2conlltags : \n", tree2conlltags(t))

# Rebuild a tree from (word, pos, iob) 3-tuples
c = conlltags2tree([('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')])
print("\nconlltags2tree : \n", c)

# Converting 3-tuples to 2-tuples
print("\nconll_tag_chunks for tree : \n", conll_tag_chunks([t]))

Output : 

Tree2conlltags : 
[('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')]

conlltags2tree : 
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])

conll_tag_chunks for tree : 
[[('DT', 'B-NP'), ('NN', 'I-NP')]]

Code #2: TagChunker class using the conll2000 corpus 


from chunkers import TagChunker
from nltk.corpus import conll2000
# load the chunked training and test sentences
conll_train = conll2000.chunked_sents('train.txt')
conll_test = conll2000.chunked_sents('test.txt')
# train the chunker on the training sentences
chunker = TagChunker(conll_train)
# evaluate on the held-out test sentences
score = chunker.evaluate(conll_test)
a = score.accuracy()
p = score.precision()
r = score.recall()
print("Accuracy of TagChunker : ", a)
print("\nPrecision of TagChunker : ", p)
print("\nRecall of TagChunker : ", r)

Output : 

Accuracy of TagChunker : 0.8950545623403762

Precision of TagChunker : 0.8114841974355675

Recall of TagChunker : 0.8644191676944863
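
The three numbers above measure different things: accuracy is usually computed per token over the IOB tags, while precision and recall are computed over whole chunks. The following is a simplified sketch of that distinction in plain Python. It is not NLTK's ChunkScore implementation, and iob_to_spans() and chunk_scores() are hypothetical helper names.

```python
def iob_to_spans(iob_tags):
    """Return the set of (start, end, type) chunk spans; end is exclusive."""
    spans = set()
    start, ctype = None, None
    # A trailing "O" sentinel closes any chunk still open at the end.
    for i, tag in enumerate(list(iob_tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == ctype
        if start is not None and not continues:
            spans.add((start, i, ctype))
            start, ctype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ctype = i, tag[2:]
    return spans


def chunk_scores(gold_tags, guess_tags):
    """Token-level accuracy plus chunk-level precision and recall."""
    gold_spans = iob_to_spans(gold_tags)
    guess_spans = iob_to_spans(guess_tags)
    correct = len(gold_spans & guess_spans)  # chunks with identical boundaries and type
    matches = sum(g == p for g, p in zip(gold_tags, guess_tags))
    accuracy = matches / len(gold_tags)
    precision = correct / len(guess_spans) if guess_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return accuracy, precision, recall


gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP"]
guess = ["B-NP", "I-NP", "O", "B-NP", "B-NP"]  # last chunk wrongly split in two
accuracy, precision, recall = chunk_scores(gold, guess)
print(accuracy, precision, recall)
```

In this toy case a single wrong tag still leaves four of five tokens correct, but it destroys one gold chunk and produces two spurious ones, so precision and recall drop much further than accuracy.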

Note: The chunker scores lower on conll2000 than on treebank_chunk, but conll2000 is a much larger corpus.
Code #3 : TagChunker using UnigramTagger Class  


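# loading libraries
from chunkers import TagChunker
from nltk.tag import UnigramTagger
# train_chunks and test_chunks are assumed to be lists of chunked
# sentence Trees prepared beforehand (they are not defined in this snippet)
uni_chunker = TagChunker(train_chunks,
                         tagger_classes=[UnigramTagger])
score = uni_chunker.evaluate(test_chunks)
a = score.accuracy()
print("Accuracy of TagChunker : ", a)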

Output : 

Accuracy of TagChunker : 0.9674925924335466

The tagger_classes argument is passed directly to the backoff_tagger() function, so its elements must be subclasses of SequentialBackoffTagger. In testing, the default of tagger_classes = [UnigramTagger, BigramTagger] generally produces the best results, but this can vary across corpora.
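
To illustrate why the order and choice of tagger_classes matters, here is a self-contained sketch of a backoff chain trained on (pos, iob) pairs. The article does not show backoff_tagger() itself, so backoff_chain() below is an assumption about its general shape, and ToyUnigramTagger and ToyBigramTagger are toy stand-ins, not NLTK's classes.

```python
from collections import Counter, defaultdict


class ToyUnigramTagger:
    """Toy stand-in: assigns each token its most frequent training tag."""

    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for token, tag in sent:
                counts[token][tag] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        self.backoff = backoff

    def tag_one(self, tokens, index, history):
        tag = self.table.get(tokens[index])
        if tag is None and self.backoff is not None:
            tag = self.backoff.tag_one(tokens, index, history)  # defer to next tagger
        return tag

    def tag(self, tokens):
        history = []  # tags predicted so far
        for i in range(len(tokens)):
            history.append(self.tag_one(tokens, i, history))
        return list(zip(tokens, history))


class ToyBigramTagger(ToyUnigramTagger):
    """Toy stand-in: keys on (previous tag, current token)."""

    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            prev = None  # None marks the start of a sentence
            for token, tag in sent:
                counts[(prev, token)][tag] += 1
                prev = tag
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        self.backoff = backoff

    def tag_one(self, tokens, index, history):
        prev = history[index - 1] if index > 0 else None
        tag = self.table.get((prev, tokens[index]))
        if tag is None and self.backoff is not None:
            tag = self.backoff.tag_one(tokens, index, history)
        return tag


def backoff_chain(train_sents, tagger_classes, backoff=None):
    """Each class is trained in turn and backs off to the previous one."""
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff


# The "words" are POS tags and the "tags" are IOB tags.
train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O")],
    [("NN", "B-NP"), ("VBD", "O")],
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP")],
]
uni = backoff_chain(train, [ToyUnigramTagger])
chain = backoff_chain(train, [ToyUnigramTagger, ToyBigramTagger])
print(uni.tag(["NN", "VBD"]))
print(chain.tag(["NN", "VBD"]))
print(chain.tag(["VBD", "DT"]))
```

With only the unigram tagger, a sentence-initial NN gets the tag it has most often overall (I-NP). The chained version consults the bigram context first and falls back to the unigram table only for contexts it has never seen.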
