Part 3: Step by Step Guide to NLP – Text Cleaning and Preprocessing

This article was published as part of the Data Science Blogathon

Introduction

This article is part of an ongoing blog series on Natural Language Processing (NLP). In part-1 and part-2 of this series, we covered the theoretical concepts related to NLP. Continuing from there, this article introduces some new concepts.

In this article, we will first understand the required terminologies and then begin our journey into text cleaning and preprocessing, a crucial component of any NLP task.

This is part-3 of the blog series on the Step by Step Guide to Natural Language Processing.

 

Table of Contents

1. Getting Familiar with Terminologies

  • Corpus
  • Tokens
  • Tokenization
  • Text object
  • Morpheme
  • Lexicon

2. What is Tokenization?

  • White-space Tokenization
  • Regular Expression Tokenization
  • Sentence and Word Tokenization

3. Noise Entities Removal

  • Removal of Punctuation marks
  • Removal of stopwords, etc.

4. Data Visualization for Text Data

5. Parts of Speech (POS) Tagging

Getting Familiar with Terminologies
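As a quick preview of a few of these terms (corpus, tokens, tokenization), here is a minimal sketch in plain Python. The sample sentences and variable names are illustrative only, not taken from the series; it contrasts the white-space and regular-expression tokenization approaches listed in the table of contents above.

```python
import re

# A tiny corpus: a collection of text documents (here, two short sentences).
corpus = [
    "Natural Language Processing is fun!",
    "Text cleaning comes before modeling.",
]

# White-space tokenization: split each document on spaces.
# Note that punctuation stays attached to the neighboring word.
whitespace_tokens = [doc.split() for doc in corpus]

# Regular-expression tokenization: keep only runs of word characters,
# which also strips punctuation marks.
regex_tokens = [re.findall(r"\w+", doc) for doc in corpus]

print(whitespace_tokens[0])  # ['Natural', 'Language', 'Processing', 'is', 'fun!']
print(regex_tokens[0])       # ['Natural', 'Language', 'Processing', 'is', 'fun']
```

Each inner list is the sequence of tokens for one document; the full nested list is a tokenized representation of the corpus.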
