Tokenization and Text Normalization

Objective

  • Text data is a type of unstructured data used in natural language processing.
  • Understand how to preprocess text data before feeding it to machine learning algorithms.

Introduction

Text data is a form of unstructured data. The most prominent examples of text data available on the internet are social media data, such as tweets, posts, and comments, and conversation data, such as messages, emails, and chats. It can also be article data, such as news articles and blogs.

Text data is essentially a written form of a natural language such as Hindi, English, or Russian. It consists of characters or words arranged together in a meaningful and ordered manner. This means that text data is governed by grammar rules and defined structures.
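To make the idea of "characters or words arranged in order" concrete, here is a minimal sketch in Python. The sample sentence and the variable names are assumptions made for illustration, and splitting on whitespace is only a naive stand-in for real tokenization, not the method this article goes on to describe.

```python
# A sample sentence (hypothetical, chosen only for illustration).
text = "Text data is driven by grammar rules."

# Character-level view: every symbol in order, including spaces and punctuation.
characters = list(text)

# Word-level view: a naive split on whitespace.
# Note that punctuation stays attached to the neighbouring word ("rules.").
words = text.split()

print(len(characters), "characters")
print(words)
```

Even this simple split shows why preprocessing matters: the final token still carries its full stop, so "rules." and "rules" would be treated as different words unless the text is cleaned first.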

In order to work with text data, it is