HashingVectorizer vs. CountVectorizer

Previously, we learned how to use CountVectorizer for text processing. In place of CountVectorizer, you also have the option of using HashingVectorizer.

In this tutorial, we will learn how HashingVectorizer differs from CountVectorizer and when to use which.

CountVectorizer vs. HashingVectorizer

HashingVectorizer and CountVectorizer are meant to do the same thing: convert a collection of text documents to a matrix of token occurrences. The difference is that HashingVectorizer does not store the resulting vocabulary (i.e. the unique tokens).
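
To make the difference concrete, here is a minimal sketch that runs both vectorizers on the same input. The two example documents and the choice of 16 columns are assumptions made for illustration only. Setting `alternate_sign=False` and `norm=None` is also a choice made here so that the hashed matrix holds plain counts, which is easier to compare with CountVectorizer; the defaults behave differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Two toy documents (illustrative data, not from the original tutorial).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# CountVectorizer learns a vocabulary during fit and keeps it around.
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(docs)
vocabulary = count_vec.vocabulary_  # dict mapping token -> column index

# HashingVectorizer is stateless: it hashes each token to a column index,
# and the number of columns is fixed up front with n_features.
n_columns = 16  # illustrative value; real use typically needs far more
hash_vec = HashingVectorizer(n_features=n_columns, alternate_sign=False, norm=None)
hash_matrix = hash_vec.transform(docs)  # no fit step is required

print(count_matrix.shape)                # (2, 7): one column per unique token
print(sorted(vocabulary))                # the stored tokens
print(hash_matrix.shape)                 # (2, 16): width set by n_features
print(hasattr(hash_vec, "vocabulary_"))  # False: nothing is stored
```

With so few columns, two different tokens may hash to the same position, so a column in the hashed matrix cannot be traced back to a single token.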

With HashingVectorizer, each token is hashed directly to a column position in a matrix whose number of columns is pre-defined. For example, if you

 

 

 
