NLP | Location Tags Extraction

  • Last Updated : 26 Feb, 2019

A different kind of ChunkParserI subclass can be used to identify LOCATION chunks: one that uses the gazetteers corpus to identify location words. The gazetteers corpus is a WordListCorpusReader instance that contains the following kinds of location words:

  • Country names
  • U.S. states and abbreviations
  • Mexican states
  • Major U.S. cities
  • Canadian provinces

The LocationChunker class iterates over a tagged sentence, looking for words that are found in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The IOB LOCATION tags are produced by the iob_locations() method, and the parse() method converts these IOB tags into a Tree.
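To make the IOB convention concrete, here is a minimal plain-Python sketch of how B-LOCATION and I-LOCATION tags delimit chunks. The helper name group_locations is hypothetical and only illustrates the grouping idea; it is not part of NLTK, and the article's own parse() method relies on conlltags2tree instead.

```python
def group_locations(iob_triples):
    """Collect the (word, tag) pairs of each LOCATION chunk from IOB triples."""
    chunks, current = [], None
    for word, tag, iob in iob_triples:
        if iob == "B-LOCATION":
            # A B- tag always opens a new chunk
            current = [(word, tag)]
            chunks.append(current)
        elif iob == "I-LOCATION" and current is not None:
            # An I- tag extends the chunk that is currently open
            current.append((word, tag))
        else:
            # An O tag (or a stray I- tag) closes any open chunk
            current = None
    return chunks


# Hand-written IOB triples, assumed for illustration
triples = [
    ("San", "NNP", "B-LOCATION"),
    ("Francisco", "NNP", "I-LOCATION"),
    ("CA", "NNP", "I-LOCATION"),
    ("is", "BE", "O"),
    ("cold", "JJ", "O"),
]
print(group_locations(triples))
# [[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')]]
```

Each inner list corresponds to one LOCATION chunk; words tagged O fall outside every chunk.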

Code #1 : LocationChunker class

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        self.locations = set(gazetteers.words())
        self.lookahead = 0
        # Find how many extra words to look ahead for multi-word locations
        for loc in self.locations:
            nwords = loc.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords
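The lookahead value is simply the largest number of spaces in any gazetteer entry, that is, how many extra words beyond the current one may need to be joined to match a multi-word location. The sketch below shows the same calculation on a small hand-made set; the three entries are an assumed stand-in for set(gazetteers.words()), not the real corpus contents.

```python
# Hypothetical stand-in for set(gazetteers.words())
locations = {"CA", "San Francisco", "New York City"}

# The entry with the most spaces determines how far to look ahead
lookahead = max((loc.count(" ") for loc in locations), default=0)

print(lookahead)  # 2, because "New York City" contains two spaces
```

With a lookahead of 2, the chunker would try to match the current word alone, then joined with the next word, then joined with the next two words.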

Code #2 : iob_locations() and parse() methods

    # Methods of the LocationChunker class
    def iob_locations(self, tagged_sent):
        i = 0
        l = len(tagged_sent)
        inside = False
        while i < l:
            word, tag = tagged_sent[i]
            j = i + 1
            k = j + self.lookahead
            nextwords, nexttags = [], []
            loc = False
            # Look ahead in the sentence to find multi-word locations
            while j < k:
                if ' '.join([word] + nextwords) in self.locations:
                    # Adjacent locations are merged into a single chunk
                    if inside:
                        yield word, tag, 'I-LOCATION'
                    else:
                        yield word, tag, 'B-LOCATION'
                    for nword, ntag in zip(nextwords, nexttags):
                        yield nword, ntag, 'I-LOCATION'
                    loc, inside = True, True
                    # Skip past the words already consumed by this location
                    i = j
                    break
                if j < l:
                    nextword, nexttag = tagged_sent[j]
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break
            if not loc:
                inside = False
                i += 1
                yield word, tag, 'O'

    def parse(self, tagged_sent):
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)

Code #3 : Using the LocationChunker class to parse a sentence

# chunkers is assumed to be a local module that defines the
# LocationChunker class and a sub_leaves() helper
from chunkers import sub_leaves
from chunkers import LocationChunker

loc = LocationChunker()
t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'),
               ('CA', 'NNP'), ('is', 'BE'), ('cold', 'JJ'),
               ('compared', 'VBD'), ('to', 'TO'), ('San', 'NNP'),
               ('Jose', 'NNP'), ('CA', 'NNP')])
print("Location : \n", sub_leaves(t, 'LOCATION'))

Output :

Location : 
[[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')], 
[('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]
