Processing Textual Data – An introduction to Natural Language Processing

Have you ever wondered how chatbots mimic human language? Or how Google Translate works? Are these questions going unanswered? Fear not, for you have come to the right place, my friend! Processing textual data is a significant part of machine learning and artificial intelligence today. In the following sections, I will give you a simple explanation of processing textual data and help you establish a platform to build upon.

For those of you who don’t know, Machine Learning algorithms are capable of handling only numeric data. So how do we convert text to numbers? The answer is simpler than you realize. Let me show you how!

Image displaying text in different languages

Natural Language Processing

Abbreviated as NLP, this is a subdomain of machine learning that deals with processing everyday language. Things like chatbots, text translation, and sentiment analysis are all possible because of NLP. But how exactly is textual data handled, you may wonder? Keep reading to find out!

Process Flow

Flowchart representing the steps involved in processing textual data

Textual Data

This is the input data format in NLP. It can be any textual data – movie reviews, comments, chats, etc. These text documents are passed on to the next stage – Tokenization/Segmentation.

Tokenization/Segmentation

This stage involves breaking down a text stream into words (word segmentation) and sentences (sentence segmentation). Tokenizing/segmenting the data lets us handle each token separately, easing the task of cleaning the data. For Example:

Input: He is running very fast! She is running very slow.

Output: [‘He is running very fast!’, ‘She is running very slow.’]
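To make this concrete, here is a minimal sketch of both kinds of segmentation using the popular NLTK library (the library choice is mine; any tokenizer will do):

    import nltk
    nltk.download('punkt')  # tokenizer models (one-time download)
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "He is running very fast! She is running very slow."

    # Sentence segmentation: split the stream into sentences
    print(sent_tokenize(text))
    # ['He is running very fast!', 'She is running very slow.']

    # Word segmentation: split the stream into word tokens
    print(word_tokenize(text))
    # ['He', 'is', 'running', 'very', 'fast', '!', 'She', 'is', 'running', 'very', 'slow', '.']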

Token Normalization

If you are familiar with the term Normalization, you know what this stage is about. For those of you who don’t, the word is self-explanatory! Normalization in this or any context means ‘making things normal’. Doesn’t really help, does it? Let me put it this way – it is the process of converting everything into a defined standard so that no element gets more preference than the others. Some words are indeed more important than others, but we will get to that in the vectorization part.

Building blocks symbolizing tokens

Below, I’ve detailed various methods used to Normalize the tokens.

  1. Punctuation Removal

This stage involves removing punctuation and other grammatical marks which aren’t necessary for NLP. It also converts the entire document to lowercase to establish a standard. For Example:

Punctuation Removal Example

Notice how the letters have been converted to lowercase and the punctuation – the exclamation mark and the full stop – has been removed.
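A minimal sketch of this step in plain Python, using only the standard library:

    import string

    text = "He is running very fast! She is running very slow."

    # Lowercase everything, then strip every punctuation character
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    print(cleaned)
    # he is running very fast she is running very slow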

  2. Stop Word Removal

This is one of the most well-known stages. As the name suggests, it involves removing “stop words” from the sentences. Stop words are words that do not provide valuable information in the sentence – articles, conjunctions, and prepositions are some common examples. Removing them filters out the unnecessary words and leaves us with the ones that carry the most information. Here’s a list of well-documented stop words for your reference – Stop Words

For Example:

Stop words Removal Example

Here, notice how the stop words – ‘he’, ‘she’, ‘is’, ‘very’ – have been removed, leaving us with only those words which are information-rich.
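Here is one such sketch using NLTK’s built-in English stop word list (stop word lists vary between libraries):

    import nltk
    nltk.download('stopwords')  # stop word list (one-time download)
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))

    tokens = ['he', 'is', 'running', 'very', 'fast', 'she', 'is', 'running', 'very', 'slow']
    print([t for t in tokens if t not in stop_words])
    # ['running', 'fast', 'running', 'slow']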

  3. Stemming vs Lemmatization

Stemming – This is a token normalization technique that reduces each word in a sentence to its base form by stripping affixes like prefixes and suffixes, leaving us with the root word. It takes less time than lemmatization and is used in tasks like spam classification and sentiment analysis. For Example:

Stemming Example

Here, notice how the word “running” has been converted into its root form – “run”. The words “fast” and “slow” are unaffected because they are already in their root forms and cannot be reduced anymore.

Some more examples:

More Stemming Examples
More Stemming Examples

Notice how the stem of “quickly” is “quickli”. Sometimes the stem is not a meaningful word, and that is the downside of stemming: it simply chops a word down to a root form without considering its context. Lemmatization overcomes this.
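Here is a sketch using NLTK’s Porter stemmer, which reproduces both the “running” → “run” and “quickly” → “quickli” behaviour described above:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ['running', 'fast', 'slow', 'quickly']:
        print(word, '->', stemmer.stem(word))
    # running -> run
    # fast -> fast
    # slow -> slow
    # quickly -> quickli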

Lemmatization – This is also a token normalization technique, similar to stemming, but it is considered a more sophisticated (and more time-consuming) way to determine the base form of a word. One advantage is that it is able to capture the context in which a word is used by considering things like tense and the words surrounding it. Moreover, unlike stemming, it produces a base form – called a lemma – that is a meaningful word. It is heavily used in chatbots and virtual assistants, which mimic a human conversation with users. For Example:

Lemmatization Example

Although in our case we get the same output as the stemmer, more complex words and sentences will produce different outputs.

Some more examples:

More Lemmatization Examples

Notice how a more semantically appropriate root word is chosen in lemmatization, instead of simply chopping the ends off a word.
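A sketch with NLTK’s WordNet lemmatizer. Note that it needs a part-of-speech hint (verb, noun, etc.) to pick the right lemma; the example words here are my own:

    import nltk
    nltk.download('wordnet')  # WordNet data (one-time download)
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize('running', pos='v'))  # run
    print(lemmatizer.lemmatize('was', pos='v'))      # be  (tense-aware, unlike a stemmer)
    print(lemmatizer.lemmatize('mice'))              # mouse (default part of speech is noun)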

Vectorizer

Now that we have properly normalized tokens, we have to convert them into vectors. We use measures like the frequency of a word and encoding methods to do this. Bag-of-Words and TF-IDF are two widely used vectorization methods.
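As an illustration, here is how scikit-learn’s Bag-of-Words and TF-IDF vectorizers (my choice of library) turn two cleaned sentences into numeric vectors:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = ['running fast', 'running slow']

    # Bag-of-Words: each column counts how often a word appears in a document
    bow = CountVectorizer()
    print(bow.fit_transform(corpus).toarray())  # [[1 1 0], [0 1 1]]
    print(bow.get_feature_names_out())          # ['fast' 'running' 'slow']

    # TF-IDF: down-weights words that appear in many documents ('running' here)
    tfidf = TfidfVectorizer()
    print(tfidf.fit_transform(corpus).toarray().round(2))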

Machine Learning Models

Now that we have converted sentences into vectors, we can treat each vector as a data point, just like in any other dataset. This data is then supplied to a machine learning algorithm; the model is trained on this data and evaluated using various measures to test its validity. For a more interesting read on Model Fitting in Machine Learning, you can refer to it here – Model Fitting
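To tie the whole flow together, here is a minimal end-to-end sketch – TF-IDF vectors feeding a Naive Bayes classifier through a scikit-learn pipeline. The tiny labelled dataset is invented purely for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy data: reviews labelled 1 (positive) or 0 (negative)
    texts  = ['great movie loved it', 'terrible plot boring',
              'loved the acting', 'boring and terrible']
    labels = [1, 0, 1, 0]

    # Vectorize, then classify, in one pipeline
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    print(model.predict(['loved it, great acting']))  # [1]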

Future Reads

  • HMMs, or Hidden Markov Models, are widely used in Speech Recognition and Handwriting Recognition, both of which fall under Natural Language Processing. If you would like to learn more, you can refer to this – HMM
  • Sentiment Analysis is a technique used to identify the tone behind a sentence and classify it as Positive, Negative, or Neutral. If you would like to learn more about sentiment analysis, this is a great article explaining everything a beginner needs – Sentiment Analysis
  • Recurrent Neural Networks, or RNNs for short, are a deep learning architecture inspired by the neurons in a human brain and designed to process sequential data, which makes them well-known in the field of NLP. Here’s a great read – RNN
