Python for NLP: Tokenization, Stemming, and Lemmatization with SpaCy Library

python_tutorials

In the previous article, we started our discussion about how to do natural language processing with Python. We saw how to read and write text and PDF files. In this article, we will start working with the spaCy library to perform a few more basic NLP tasks such as tokenization, stemming and lemmatization.

Introduction to SpaCy

The spaCy library is one of the most popular NLP libraries along with NLTK. The basic difference between the two libraries is the fact that NLTK contains a wide variety of algorithms to solve one problem whereas spaCy contains only one, but the best algorithm to solve a problem.

NLTK was released back in 2001 while spaCy is relatively new and was developed in 2015. In this series of articles on NLP, we will mostly be dealing with spaCy, owing to its state of the art nature. However, we will also touch NLTK when it is easier to perform a task using NLTK rather than spaCy.

Installing spaCy

If you use the pip installer to install your Python libraries, go to the command line and execute the following statement:

To finish reading, please visit source site