Bag-of-words vs TFIDF vectorization –A Hands-on Tutorial

This article was published as a part of the Data Science Blogathon

Whenever we apply any algorithm to textual data, we need to convert the text to a numeric form. Hence, there arises a need for some pre-processing techniques that can convert our text to numbers. Both bag-of-words (BOW) and TFIDF are pre-processing techniques that can generate a numeric form from an input text.

Bag-of-Words:

The bag-of-words model converts text into fixed-length vectors by counting how many times each word appears.

Let us illustrate this with an example. Consider that we have the following sentences:

  • Text processing is necessary.
  • Text processing is necessary and important.
  • Text processing is easy.

We will refer to each of the above sentences as documents. If we take out the unique words in all these sentences, the vocabulary will consist of these 7 words: {‘Text’, ’processing’, ’is’, ’necessary’, ’and’, ’important, ’easy’}.

To carry out bag-of-words, we will simply have to count the number of times each word appears in each of the documents.