Natural Language Processing (NLP) is a complex topic, and entire books are devoted to this subject alone. In this book, an introductory survey is provided based on NLTK, the Python Natural Language Toolkit. Let us start.
Text is made up of sentences, and sentences are composed of words, so the first step in NLP is frequently to separate these basic units according to the rules of the chosen language. Very frequent words typically carry little information and are filtered out as stop words. The first code fragment splits text into sentences and then each sentence into words, from which stop words are removed.
In addition, it can be interesting to find out the meaning of words, and here WordNet can help with its organization of terms into synsets, which are arranged in inheritance trees where the most abstract terms are hypernyms and the more specific terms are hyponyms. WordNet can also help in finding synonyms and antonyms (words with the opposite meaning) of a given term. The second code fragment finds the synonyms of the word love in English.
Moreover, words can be stemmed, and the rules for stemming differ considerably from language to language. NLTK supports the SnowballStemmer, which handles multiple languages. The third code fragment finds the stem of the word volvi in Spanish.
In certain situations, it can be convenient to understand whether a word is a noun, an adjective, a verb, and so on. This is the process of part-of-speech tagging, and NLTK provides convenient support for this type of analysis, as illustrated in the last code fragment below.
text = "Poetry is the record of the best and happiest moments \
of the happiest and best minds. Poetry is a sword of lightning, \
ever unsheathed, which consumes the scabbard that would contain it."
# download the stopwords corpus (only needed once)
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
# download the punkt package (only needed once)
nltk.download('punkt')
# load the sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text)
# tokenize each sentence into words and remove stop words
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
for sentence in sentences:
    words = tokenizer.tokenize(sentence)
    words = [w for w in words if w not in stop]
    print(words)
# print the synonyms contained in each synset of the word 'love'
from nltk.corpus import wordnet
for synset in wordnet.synsets('love'):
    print("Synonyms:", ", ".join(synset.lemma_names()))
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
print("Spanish stemmer")
print(stemmer.stem('volvi'))
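To illustrate the multi-language support, the same class can be instantiated with different language names; the words below are chosen purely for illustration and are not part of the original example:

```python
from nltk.stem import SnowballStemmer

# the languages shipped with NLTK's Snowball implementation
print(SnowballStemmer.languages)

# the same API works across languages
for language, word in [('english', 'running'),
                       ('italian', 'mangiare'),
                       ('german', 'Häuser')]:
    stemmer = SnowballStemmer(language)
    print(language, stemmer.stem(word))
```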
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
# train a unigram tagger on the first 5000 tagged sentences of the treebank
train_sentences = treebank.tagged_sents()[:5000]
tagger = UnigramTagger(train_sentences)
tagged = tagger.tag(words)
print(tagged)
['Poetry is the record of the best and happiest moments of the happiest and best minds.', 'Poetry is a sword of lightning, ever unsheathed, which consumes the scabbard that would contain it.']
['Poetry', 'record', 'best', 'happiest', 'moments', 'happiest', 'best', 'minds', '.']
['Poetry', 'sword', 'lightning', ',', 'ever', 'unsheathed', ',', 'consumes', 'scabbard', 'would', 'contain', '.']
Synonyms: love, passion
Synonyms: beloved, dear, dearest, honey, love
Synonyms: love, sexual_love, erotic_love
Synonyms: sexual_love, lovemaking, making_love, love, love_life
Synonyms: love, enjoy
Synonyms: sleep_together, roll_in_the_hay, love, make_out, make_love, sleep_with, get_laid, have_sex, know, do_it, be_intimate, have_intercourse, have_it_away, have_it_off, screw, fuck, jazz, eff, hump, lie_with, bed, have_a_go_at_it, bang, get_it_on, bonk
[('Poetry', None), ('sword', None), ('lightning', None), (',', ','), ('ever', 'RB'), ('unsheathed', None), (',', ','), ('consumes', None), ('scabbard', None), ('would', 'MD'), ('contain', 'VB'), ('.', '.')]