Familiarizing myself with NLTK

This post is a reflection of writing my first code using NLTK. The code can be found here:
https://github.com/jayeetaroy/Text-analytics-Python/blob/master/Basic%20NLP%20Tasks%20with%20NLTK.ipynb

From this exercise, I could see that there are a few pre-flight (preprocessing) steps that can be taken:

Tokenization
First, the chunk of unstructured text is broken down into sentences, and each sentence is then broken down into words so that those words can be analyzed further. A paragraph or a text document is made up of sentences, which can in turn be broken down into clauses, phrases, and words. The most popular tokenization techniques are sentence and word tokenization, which break a text corpus down into sentences and each sentence into words. Tokenization, then, can be defined as the process of breaking textual data down into smaller, meaningful components called tokens.

Sentence tokenization is the process of splitting a text corpus into sentences, which act as the first level of tokens that the corpus is composed of. This is also known as sentence segmentation, since we try to segment the text into meaningful sentences.
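As a minimal sketch, NLTK ships a pre-trained Punkt model for exactly this; the sample text below is my own, not taken from the notebook:

```python
import nltk
nltk.download('punkt')  # one-time download of the pre-trained Punkt model

from nltk.tokenize import sent_tokenize

text = ("Mr. Smith went to Washington. He arrived on Tuesday. "
        "Was the trip worth it? Absolutely!")

# The Punkt model treats abbreviations like "Mr." as non-boundaries.
print(sent_tokenize(text))
# ['Mr. Smith went to Washington.', 'He arrived on Tuesday.',
#  'Was the trip worth it?', 'Absolutely!']
```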

Word tokenization is the process of splitting or segmenting sentences into their constituent words. A sentence is a collection of words, and with tokenization we essentially split a sentence into a list of words that can be used to reconstruct it. Word tokenization is very important in many processes, especially in cleaning and normalizing text, where operations like stemming and lemmatization work on each individual word based on its stem or lemma.

Some of the main features of a tokenizer include the following:
• Splits off periods that appear at the end of a sentence
• Splits off commas and single quotes when they are followed by whitespace
• Separates most punctuation characters into independent tokens
• Splits standard contractions; for example, don’t becomes do and n’t
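A minimal sketch of the behavior listed above, using NLTK's default word_tokenize (which follows Treebank conventions); the sample sentence is my own:

```python
from nltk.tokenize import word_tokenize

sentence = "Don't hesitate, it's easy!"

# Contractions are split (Don't -> Do + n't) and punctuation
# becomes its own token.
print(word_tokenize(sentence))
# ['Do', "n't", 'hesitate', ',', 'it', "'s", 'easy', '!']
```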

Normalization

Text normalization is a process consisting of a series of steps to wrangle, clean, and standardize textual data into a form that other NLP and analytics systems and applications can consume as input. Tokenization itself is often a part of text normalization. Besides tokenization, typical normalization steps include cleaning text, expanding contractions, case conversion, removing special characters, correcting spellings, removing stopwords and other unnecessary terms, stemming, and lemmatization.
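A rough sketch of a few of these steps chained together; the normalize helper and its regex are my own illustration, not an NLTK API:

```python
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words('english'))

def normalize(text):
    """Case conversion, special-character removal, stopword removal."""
    text = text.lower()                        # case conversion
    text = re.sub(r"[^a-z0-9\s']", ' ', text)  # drop special characters
    tokens = word_tokenize(text)               # tokenize
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("The quick, brown fox jumped over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```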

Stemming

Morphemes are the smallest meaningful units in any natural language. They consist of word stems and affixes. Affixes are units like prefixes and suffixes, which are attached to a word stem to change its meaning or to create a new word altogether. Word stems are also often known as the base form of a word, and we can create new words by attaching affixes to them in a process known as inflection. The reverse, obtaining the base form of a word from an inflected form, is known as stemming.

Consider the word JUMP. You can add affixes to it and form new words like JUMPS, JUMPED, and JUMPING. In this case, the base word JUMP is the word stem. If we were to carry out stemming on any of its three inflected forms, we would get back the base form.
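A minimal sketch using NLTK's PorterStemmer (one of several stemmers NLTK provides, alongside Lancaster and Snowball):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ['jumps', 'jumped', 'jumping']:
    print(word, '->', stemmer.stem(word))
# jumps -> jump
# jumped -> jump
# jumping -> jump

# The stem is not guaranteed to be a dictionary word:
print(stemmer.stem('happiness'))  # happi
```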

Lemmatization
The process of lemmatization is very similar to stemming: you remove word affixes to get to a base form of the word. In this case, however, the base form is known as the root word, as opposed to the root stem. The difference is that the root stem may not always be a lexicographically correct word; that is, it may not be present in the dictionary. The root word, also known as the lemma, will always be present in the dictionary.
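A minimal sketch using NLTK's WordNetLemmatizer, which looks lemmas up in WordNet; note that it treats every word as a noun unless you pass a part-of-speech tag:

```python
import nltk
nltk.download('wordnet')  # one-time download of the WordNet data

from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('cars'))              # car
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('ate', pos='v'))      # eat

# Contrast with a stemmer: the lemma is always a dictionary word.
print(PorterStemmer().stem('arguing'))           # argu
print(lemmatizer.lemmatize('arguing', pos='v'))  # argue
```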

After lemmatization, we can explore the relationships between words and perform some basic analysis.
