One of the most popular subfields of data science today is Natural Language Processing (NLP). NLP combines computer science, data science, and linguistics to enable computers to understand and use human languages. It plays a part in many of the technologies we use every day.
If you’re considering learning NLP, you will probably face the same problem everyone faces when learning a new skill: understanding its basic concepts. You need to know the vocabulary of a topic before you can follow tutorials and videos about it, because their creators often use technical terms on the assumption that the reader or viewer already knows what they mean.
If you don’t know the meanings, it will be harder for you to follow any of these tutorials. This article covers six basic NLP concepts and what they mean, laying the groundwork for a deep dive into Large Language Models (LLMs) and making advanced tutorials easier to understand.
№1: CORPUS
In NLP, corpus (Latin for “body”) is a term used to refer to a body of text. The plural form of the word is corpora.
A corpus can contain one or more languages and can consist of written or spoken material. In addition, corpora can have a specific theme or be generalized text. Either way, corpora are used for statistical linguistic analysis and linguistic computing.
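To make the idea of statistical analysis over a corpus concrete, here is a minimal sketch. The three-sentence corpus and the function name `word_frequencies` are made up for illustration; a real corpus would be far larger and would usually be loaded from files.

```python
from collections import Counter

# A toy corpus: three short documents (illustrative data, not a real dataset).
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog barked",
]

def word_frequencies(documents):
    """Count how often each whitespace-separated word appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts

freqs = word_frequencies(corpus)
print(freqs["the"], freqs["cat"], freqs["barked"])
```

Word counts like these are one of the simplest statistics you can compute from a corpus, and many later steps build on them.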
№2: STEMMING
In NLP, stemming is a technique for reducing a word to its stem, or root form, by removing affixes (prefixes, infixes, and suffixes) so that related word forms can be treated as the same term. Some approaches used to perform stemming are:
- Lookup tables that contain all possible variations of all words, like a dictionary.
- Stripping suffixes from the word to construct its root form.
- Stochastic modeling. A probabilistic algorithm learns how suffixes attach to words, typically from example data, and uses that knowledge to predict a word’s stem.
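The suffix-stripping approach from the list above can be sketched in a few lines. The suffix list and the minimum stem length here are assumptions chosen for illustration; real stemmers use much larger, carefully ordered rule sets.

```python
# Hypothetical, deliberately tiny rule set; checked in this order.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def strip_suffix(word, min_stem_len=3):
    """Remove the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "jumped", "jumps", "quickly", "bus"]:
    print(w, "->", strip_suffix(w))
```

The minimum length check is what stops a short word such as bus from being cut down to bu. Rules this simple still make mistakes on irregular words.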
№3: LEMMATIZATION
Stemming is not always the best way to obtain a word’s root; sometimes, removing affixes is not enough to get the correct form. For example, a naive suffix-stripping stemmer may reduce paid to pai, which is wrong.
This is where lemmatization helps.
Lemmatization means reducing a word to its dictionary form, the lemma. So, in our previous example, a lemmatizer will return pay or paid depending on the word’s role in the sentence: pay when paid is used as a verb, and paid when it is used as an adjective.
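A minimal sketch of the idea, assuming the lemmatizer is given each word together with its part of speech. The `LEMMAS` table and the tag names are hypothetical; real lemmatizers rely on full dictionaries and morphological rules.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMAS = {
    ("paid", "VERB"): "pay",
    ("paid", "ADJ"): "paid",
    ("better", "ADJ"): "good",
    ("mice", "NOUN"): "mouse",
}

def lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(lemmatize("paid", "VERB"))  # "She paid the bill": paid is a verb
print(lemmatize("paid", "ADJ"))   # "a paid subscription": paid is an adjective
```

Unlike a stemmer, this lookup never produces a non-word such as pai, but it is only as good as its dictionary and the part-of-speech information it receives.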
№4: TOKENIZATION
In NLP, tokenization breaks a text down into individual units called tokens, which are typically words. Punctuation and special characters are often removed in the process.
Tokens are constructed from a specific body of text to be used for statistical analysis and processing. A token doesn’t necessarily need to be a single word; it could be a sentence or a commonly used phrase, like “rock ’n’ roll” and “3-D printer”.
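Here is one way this could look in code: a small regex-based tokenizer that drops punctuation but keeps a few known phrases together as single tokens. The `PHRASES` list and the choice to lowercase everything are assumptions made for this sketch.

```python
import re

# Phrases to keep as single tokens (illustrative list, written in lowercase).
PHRASES = ["3-d printer", "rock 'n' roll"]

# Try the known phrases first, then fall back to runs of letters, digits, or apostrophes.
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(p) for p in PHRASES) + r"|[a-z0-9']+"
)

def tokenize(text):
    """Lowercase the text and return its tokens, skipping punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("I bought a 3-D printer, and it rocks!"))
```

Because the phrases come first in the pattern, "3-d printer" is matched as a whole before the fallback rule gets a chance to split it.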
№5: LEXICONS
In linguistics and NLP, a lexicon is the part of a language’s grammar that includes all of its lexical entries. A lexical entry captures a word’s meanings, which can differ depending on the context.
Lexicons are essential for more accurate results from your NLP models, especially when dealing with spoken or more colloquial language. For example, if we are performing sentiment analysis of some tweets, knowing the topic around the tweets and the idiomatic ways of describing things can make a big difference in the analysis results.
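The tweet example can be sketched with two tiny sentiment lexicons. Both dictionaries and their scores are invented for illustration; the point is only that the same word can carry a different meaning in colloquial usage.

```python
# Hypothetical general-purpose lexicon: word -> sentiment score.
GENERAL_LEXICON = {"good": 1, "great": 1, "bad": -1, "sick": -1}

# Hypothetical colloquial lexicon: same entries, with slang meanings overriding.
SLANG_LEXICON = {**GENERAL_LEXICON, "sick": 1, "fire": 1}

def sentiment_score(text, lexicon):
    """Sum the scores of all known words; unknown words count as neutral (0)."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())

tweet = "that concert was sick"
print(sentiment_score(tweet, GENERAL_LEXICON))  # treats "sick" as negative
print(sentiment_score(tweet, SLANG_LEXICON))    # treats "sick" as positive
```

The same tweet flips from negative to positive once the lexicon reflects how the word is actually used in that setting.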
№6: WORD EMBEDDINGS
Since the whole purpose of NLP is to enable computers to understand human languages, we need to present those languages in a way a computer can work with. Computers run algorithms, and algorithms run on numbers.
In NLP, word embedding is a technique used to map words to numerical vectors for analysis purposes. Many algorithms can be used to produce word embeddings, such as Word2vec, which trains a shallow neural network on a corpus to learn the vectors efficiently.
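To show what mapping words to vectors makes possible, here is a small sketch that compares words using cosine similarity. The three-dimensional vectors are hand-written for illustration only; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (not learned from data).
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

royal = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
fruit = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["apple"])
print(round(royal, 2), round(fruit, 2))
```

With these toy vectors, king scores much closer to queen than to apple, which is the kind of relationship learned embeddings are meant to capture.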
Every field has its own terminology, which people use to describe specific processes and steps so that they can communicate with each other and explain their work efficiently.
When you are new to a field, learning this terminology can take a lot of time and effort, because you won’t be able to read and fully understand tutorials in the field without it.
This article presented the basic NLP terms that you will find in most articles and videos. Hopefully, knowing the meaning of these terms will make it easier for you to engage with those resources and understand them better.