POS Tagging

Part-of-speech (POS) tagging reads text in a given language and assigns a part-of-speech tag (noun, verb, adjective, and so on) to each word.

e.g.

Input: Everything to permit us.

Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]

Steps Involved:

  • Tokenize the text with word_tokenize.
  • Pass the resulting token list to pos_tag, that is, nltk.pos_tag(tokenize_text).

The most common tags, from the Penn Treebank tag set used by NLTK's default English tagger, are listed below:

Abbreviation Meaning
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list item marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun, plural (desks)
NNP proper noun, singular (Sarah)
NNPS proper noun, plural (Indians, Americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him, himself)
PRP$ possessive pronoun (her, his, mine, my, our)
RB adverb (occasionally, swiftly)
RBR adverb, comparative (better)
RBS adverb, superlative (best)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb, base form (ask)
VBG verb, gerund or present participle (judging)
VBD verb, past tense (pleaded)
VBN verb, past participle (reunified)
VBP verb, present tense, not 3rd person singular (wrap)
VBZ verb, present tense, 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh-pronoun (who)
WRB wh-adverb (how)
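
Because related tags share a prefix (NN, NNS, NNP and NNPS all begin with NN), tagged tokens can be grouped into coarse word classes with plain string checks. The sketch below is illustrative only: the `COARSE_CLASSES` mapping and the `group_by_class` helper are hypothetical names that are not part of NLTK, and the tagged tokens are typed in by hand rather than produced by a tagger.

```python
from collections import defaultdict

# Hypothetical mapping from a two-letter tag prefix to a coarse word class.
COARSE_CLASSES = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def group_by_class(tagged_tokens):
    """Group (word, tag) pairs by coarse class, based on the tag prefix."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        # Tags with no listed prefix (TO, PRP, IN, ...) fall under "other".
        word_class = COARSE_CLASSES.get(tag[:2], "other")
        groups[word_class].append(word)
    return dict(groups)

# Hand-written (word, tag) pairs in the format that nltk.pos_tag returns.
tagged = [("Everything", "NN"), ("to", "TO"), ("permit", "VB"), ("us", "PRP")]
print(group_by_class(tagged))
# {'noun': ['Everything'], 'other': ['to', 'us'], 'verb': ['permit']}
```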

A POS tagger assigns grammatical information to each word of a sentence. At this point, NLTK and the packages it needs should already be installed, imported, and downloaded.

Chunking

Chunking adds more structure to a sentence on top of part-of-speech (POS) tagging. It is also known as shallow parsing or light parsing, and the resulting groups of words are called "chunks." In shallow parsing there is at most one level between the root and the leaves, whereas deep parsing produces more than one level.

The primary use of chunking is to group words into "noun phrases." Chunk patterns are written by combining part-of-speech tags with regular expressions.

Rules for Chunking:

There are no predefined rules; you combine tags into rules according to your needs.

For example, suppose you need to chunk nouns, past-tense verbs, adjectives, and coordinating conjunctions in a sentence. You can use the rule below:

chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}

The following table shows what the various symbols mean:

Symbol Description
. Any character except a newline
* Match 0 or more repetitions
? Match 0 or 1 repetitions
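
To see how these symbols behave between the angle brackets, a pattern can be tried against individual tags with Python's `re` module. This is only an approximation of what the chunker does (NLTK matches against the whole tag sequence, and `matches_tag` is a hypothetical helper name), but it shows which tags a pattern such as `<NN.?>` accepts.

```python
import re

def matches_tag(pattern, tag):
    """Return True if a single tag fits the pattern written between < and >."""
    return re.fullmatch(pattern, tag) is not None

# Compare two patterns from the rule above against a handful of tags.
for tag in ["NN", "NNS", "NNP", "NNPS", "VBD", "JJR"]:
    print(tag, matches_tag("NN.?", tag), matches_tag("JJ.?", tag))
```

Under this approximation, `NN.?` accepts NN, NNS and NNP because `.?` allows at most one extra character, while NNPS has two extra characters and is rejected.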

Now let us write the code to understand the rule better.

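from nltk import pos_tag
from nltk import RegexpParser

# Split the sentence on whitespace to get a list of tokens
text = "learn php from guru99 and make study easy".split()
print("After Split:", text)

# Tag each token with its part of speech (needs the NLTK tagger model to be downloaded)
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)

# Define a chunk named "mychunk" and build a parser from the rule
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)

# Apply the rule to the tagged tokens to get the chunk tree
output = chunker.parse(tokens_tag)
print("After Chunking", output)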

Output

After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk learn/JJ)
  (mychunk php/NN)
  from/IN
  (mychunk guru99/NN and/CC)
  make/VB
  (mychunk study/NN easy/JJ))

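Conclusion from the above example: "make" is tagged VB (a base-form verb), which the rule does not cover because it only includes past-tense verbs (VBD), so "make" is not part of any mychunk. The same holds for "from" (IN).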
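
Once the tree is built, the chunks can also be pulled out as plain lists of words. The following is a minimal sketch, assuming the tagged tokens are copied by hand from the "After Token" output above so that no tagger model is needed; the variable names are illustrative.

```python
from nltk import RegexpParser

# Tagged tokens copied from the "After Token" output above (not re-tagged here).
tagged = [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'),
          ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]

chunker = RegexpParser("mychunk: {<NN.?>*<VBD.?>*<JJ.?>*<CC>?}")
tree = chunker.parse(tagged)

# Each chunk is a subtree labelled "mychunk"; its leaves are (word, tag) pairs.
chunks = [[word for word, tag in subtree.leaves()]
          for subtree in tree.subtrees(lambda t: t.label() == "mychunk")]
print(chunks)
# [['learn'], ['php'], ['guru99', 'and'], ['study', 'easy']]
```

The four lists correspond to the four mychunk groups in the output above; "from" and "make" appear in none of them.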

Use Case of Chunking

Chunking is used for entity detection. An entity is the part of the sentence from which the machine gets the value for a given intention.

Example:
Temperature of New York.
Here, Temperature is the intention and New York is the entity.
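
As a rough illustration of this idea, a chunk rule over proper-noun tags can pick "New York" out of the example. This is a sketch under stated assumptions: the tags are assigned by hand (a real tagger may label the words differently), and the `ENTITY` label and its rule are hypothetical choices, not a standard NLTK grammar.

```python
from nltk import RegexpParser

# Hand-assigned tags (an assumption): a real tagger may label these differently.
tagged = [("Temperature", "NN"), ("of", "IN"), ("New", "NNP"), ("York", "NNP")]

# Hypothetical rule: one or more consecutive proper nouns form an ENTITY chunk.
chunker = RegexpParser("ENTITY: {<NNP>+}")
tree = chunker.parse(tagged)

# Join the words of every ENTITY subtree back into a single string.
entities = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "ENTITY")]
print(entities)  # ['New York']
```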

In other words, chunking is used to select subsets of tokens. The code below shows how chunking selects tokens. In this example, you will see a graph that corresponds to the noun phrase chunks. We will write the code and draw the graph for better understanding.

Code to Demonstrate Use Case

import nltk

text = "learn php from guru99"
tokens = nltk.word_tokenize(text)   # split the text into tokens
print(tokens)
tag = nltk.pos_tag(tokens)          # tag each token with its part of speech
print(tag)
# NP chunk: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()    # draws the noun phrase chunk tree graphically (opens a window)

Output:

['learn', 'php', 'from', 'guru99']  -- These are the tokens
[('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN')]   -- These are the pos_tag
(S (NP learn/JJ php/NN) from/IN (NP guru99/NN))        -- Noun Phrase Chunking

Graph

[Figure: noun phrase chunking graph]

From the graph, we can conclude that "learn php" and "guru99" form two separate noun phrase (NP) chunks, whereas the token "from" does not belong to any noun phrase.

Chunking groups different tokens into the same chunk. The result depends on the grammar that has been selected. Chunking is also used to tag patterns and to explore text corpora.

 
