Text Preprocessing in NLP with Python Code

This article was published as a part of the Data Science Blogathon

Introduction

Natural Language Processing (NLP) is a branch of data science that deals with text data. Alongside numerical data, large volumes of text data are available, and they can be analyzed to solve business problems. Before the data can be used for analysis or prediction, however, it has to be processed.

Text preprocessing prepares the text data for model building and is the very first step of an NLP project. Some common preprocessing steps are:

  • Removing punctuation such as . , ! $ ( ) * % @
  • Removing URLs
  • Removing Stop words
  • Lower casing
  • Tokenization
  • Stemming
  • Lemmatization
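
To make the list above concrete, here is a minimal sketch of the first five steps using only the Python standard library. The function name `preprocess`, the tiny `STOP_WORDS` set, and the sample message are assumptions made for illustration; they are not taken from the SMS Spam data. Stemming and lemmatization are left out because they usually rely on a library such as NLTK.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption for this sketch);
# a real project would use a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "to", "for", "and", "of", "at", "now"}

# Matches http(s) links and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def preprocess(text):
    """Remove URLs, lowercase, strip punctuation, tokenize, drop stop words."""
    text = URL_PATTERN.sub("", text)   # remove URLs first, while they are intact
    text = text.lower()                # lower casing
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()              # simple whitespace tokenization
    return [token for token in tokens if token not in STOP_WORDS]


# Hypothetical message, written only to exercise each step.
sample = "WINNER!! Claim the prize now at http://example.com, call 0800 today."
print(preprocess(sample))
# ['winner', 'claim', 'prize', 'call', '0800', 'today']
```

The order matters in this sketch: URLs are removed before punctuation, because stripping the punctuation first would break the pattern that recognizes a URL.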

Not every step is needed for every dataset; we choose the steps that suit the data at hand. In this article, we will use the SMS Spam data to understand the steps involved in text preprocessing.

Let’s start by importing the pandas library and reading the data.

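import pandas as pd

# Read the data; ISO-8859-1 (Latin-1) is passed explicitly instead of the default UTF-8
data = pd.read_csv("spam.csv", encoding="ISO-8859-1")
# Preview the first five rows
data.head()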
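
The `encoding` argument is easy to overlook, so here is a small sketch of what it does. It builds a hypothetical two-row file in memory; the column names `label` and `text` and the messages are placeholders, not the real contents of spam.csv. The assumption is that the real file contains Latin-1 characters such as the pound sign, which is common in SMS text.

```python
import io

import pandas as pd

# Hypothetical file contents, encoded as ISO-8859-1 (Latin-1).
# The column names and rows are placeholders for illustration only.
raw_bytes = "label,text\nham,See you at 5\nspam,Win £1000 now\n".encode("ISO-8859-1")

# Reading with the matching encoding recovers the pound sign correctly.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(sample)

# With the default UTF-8 decoding, the Latin-1 byte for "£" (0xA3) is invalid.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
except UnicodeDecodeError as err:
    print("UTF-8 decoding failed:", err.reason)
```

If a file loads without an explicit encoding, the argument is not needed; it only matters when the bytes on disk are not valid UTF-8.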