Have you ever wondered how chatbots mimic human language? Or how Google Translate works? Are these questions going unanswered? Fear not, for you have come to the right place, my friend! Processing textual data is a significant part of machine learning and artificial intelligence today. In the following sections, I will give you a simple explanation of how textual data is processed and help you establish a platform to build upon.
For those of you who don’t know, Machine Learning algorithms are capable of handling only numeric data. So how do we convert text to numbers? The answer is simpler than you realize. Let me show you how!
Natural Language Processing
Abbreviated as NLP, this is a subdomain of machine learning that deals with processing everyday language. Things like chatbots, text translation, and sentiment analysis are all possible because of NLP. But how exactly is textual data handled, you may wonder? Keep reading to find out!
This is the input data format in NLP: any textual data such as movie reviews, comments, chats, etc. These text documents are passed on to the next stage – Tokenization / Segmentation.
This stage involves breaking a text stream down into words (word segmentation) and sentences (sentence segmentation). Tokenizing/segmenting the data lets us handle each token separately, which eases the task of cleaning the data. For Example:
Input: He is running very fast! She is running very slow.
Output: [‘He is running very fast!’, ‘She is running very slow.’]
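A minimal sketch of this stage, using simple regular expressions (real tokenizers, such as those in NLTK or spaCy, handle many more edge cases like abbreviations and contractions):

```python
import re

def sentence_tokenize(text):
    # Split wherever sentence-ending punctuation is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def word_tokenize(sentence):
    # Pull out word tokens, treating punctuation as separators.
    return re.findall(r"[A-Za-z']+", sentence)

text = "He is running very fast! She is running very slow."
sentences = sentence_tokenize(text)
words = word_tokenize(sentences[0])
print(sentences)  # two sentence tokens
print(words)      # word tokens of the first sentence
```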
If you are familiar with the term Normalization, you know what this stage is about. For those of you who don’t, the word is self-explanatory! Normalization, in this or any context, means ‘making things normal’. Doesn’t really help, does it? Let me put it this way – it is the process of converting everything into a defined standard so that no element gets more preference than the others. Some words may well be more important than others, but we will get into that in the vectorization part.
Below, I’ve detailed various methods used to Normalize the tokens.
- Punctuation Removal
This stage involves removing punctuation and other grammatical symbols which aren’t necessary for NLP. It also converts the entire document to lowercase to establish a standard. For Example:
Notice how the letters have been converted to lowercase and the punctuation – the exclamation mark and the full stop – has been removed.
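A small sketch of this step in plain Python, using the standard library’s punctuation table:

```python
import string

def clean_tokens(tokens):
    # Lowercase every token and strip punctuation characters.
    table = str.maketrans("", "", string.punctuation)
    cleaned = [t.lower().translate(table) for t in tokens]
    # Drop tokens that were pure punctuation and are now empty.
    return [t for t in cleaned if t]

print(clean_tokens(["He", "is", "running", "very", "fast", "!"]))
```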
- Stop Word Removal
This is one of the most well-known stages. As the name suggests, it involves removing “Stop words” from the sentences. Stop words are words that do not provide valuable information in the sentence. Articles, conjunctions, and prepositions are some of the common stop words. Removing such words filters out the noise and leaves us with the words that carry the most information. Here’s a list of well-documented stop words for your reference – Stop Words
Here, notice how the stop words – ‘he’, ‘she’, ‘is’, ‘very’ – have been removed, leaving us with only those words which are information-rich.
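The filtering itself is a simple set lookup. Below is a sketch with a tiny illustrative stop-word set; libraries such as NLTK ship much longer curated lists:

```python
# A tiny illustrative stop-word set, not a complete list.
STOP_WORDS = {"he", "she", "is", "very", "a", "an", "the", "and"}

def remove_stop_words(tokens):
    # Keep only the tokens that are not stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["he", "is", "running", "very", "fast"]))
```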
- Stemming vs Lemmatization
Stemming – This is a token normalization technique that reduces each word in a sentence to its base form. It removes things like prefixes and suffixes and leaves us with the root word. It takes less time compared to Lemmatization. It is used in Spam Classification and Sentiment Analysis. For Example:
Here, notice how the word “running” has been converted into its root form – “run”. The words “fast” and “slow” are unaffected because they are already in their root forms and cannot be reduced anymore.
Some more examples:
Notice how the stem of “quickly” is “quickli”. Sometimes the stem is not a meaningful word, and that is the downside of Stemming: it simply chops a word down to its root form without considering the context of the word. This is overcome by Lemmatization.
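To make the idea concrete, here is a deliberately crude suffix-stripping stemmer. Real stemmers, such as the Porter stemmer, apply ordered rewrite rules and can produce non-words like “quickli”; this toy version only illustrates the mechanic of chopping suffixes without looking at context:

```python
def toy_stem(word):
    # Strip the first matching suffix, as long as enough of the word remains.
    # This is a crude illustration, not a real stemming algorithm.
    for suffix in ("ning", "ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["running", "fast", "slow", "jumps"]])
```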
Lemmatization – This is also a token normalization technique similar to stemming. It is considered a more sophisticated and time-consuming way to determine the base form of a word. One advantage is that it is able to capture the context in which the word is used, by considering things like tense and the words surrounding the word in consideration. Moreover, it gives a base form that is a meaningful word, unlike stemming. It is heavily used in Chat Bots and Virtual Assistants which mimic a human conversation with users. For Example:
Although in our case we get the same output as the stemmer, more complex words and sentences will produce different outputs.
Some more examples:
Notice how a more semantically appropriate root word is chosen in lemmatization, instead of simply chopping the ends off a word.
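A minimal sketch of the idea as a dictionary lookup. Real lemmatizers (for example, WordNet-based ones) combine large dictionaries with part-of-speech information rather than a hand-written table like this hypothetical one:

```python
# A toy lemma dictionary for illustration only.
LEMMAS = {"running": "run", "ran": "run", "better": "good", "geese": "goose"}

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself if it is unknown.
    return LEMMAS.get(word, word)

print([toy_lemmatize(w) for w in ["running", "better", "fast"]])
```

Note how “better” maps to “good” – a relationship a suffix-chopping stemmer could never recover, because it requires dictionary knowledge rather than string surgery.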
Now that we have properly formatted sentences, we have to convert them into vectors. We use parameters like the frequency of a word and methods like encoding to do this. There are various widely used vectorization methods, such as Bag-Of-Words and TF-IDF.
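A bare-bones Bag-Of-Words sketch in plain Python (in practice you would reach for something like scikit-learn’s CountVectorizer or TfidfVectorizer): each document becomes a vector of word counts over a shared vocabulary.

```python
from collections import Counter

def bag_of_words(documents):
    # Build a shared, sorted vocabulary across all documents,
    # then count word frequencies per document.
    vocab = sorted({w for doc in documents for w in doc.split()})
    vectors = [[Counter(doc.split())[w] for w in vocab] for doc in documents]
    return vocab, vectors

vocab, vectors = bag_of_words(["run fast", "run slow slow"])
print(vocab)    # shared vocabulary
print(vectors)  # one count vector per document
```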
Machine Learning Models
Now that we have converted sentences into vectors, we can treat each vector as a data point like in any other dataset. This data is then supplied to a Machine Learning algorithm, where the model is trained and evaluated using various measures to test its validity. For a more interesting read on Model Fitting in Machine Learning, you can refer to it here – Model Fitting
- HMMs, or Hidden Markov Models, are widely used in Speech Recognition and Handwriting Recognition, both of which fall under Natural Language Processing. If you would like to learn more about this, you can refer to this – HMM
- Sentiment Analysis is a technique used to identify the tone behind a sentence and classify it as Positive, Negative, or Neutral. If you would like to learn more about sentiment analysis, this is a great article explaining everything required for a beginner – Sentiment Analysis
- Recurrent Neural Networks, or RNNs for short, are a deep learning architecture that mimics the neurons in a human brain to process sequential data. RNNs are well known for their applicability in the field of NLP. Here’s a great read – RNN