Datasets for Natural Language Processing

Last Updated on August 14, 2020

You need datasets to practice on when getting started with deep learning for natural language processing tasks.

It is better to use small datasets that download quickly and that models can be fit on without taking too long. It is also helpful to use standard datasets that are well understood and widely used, so that you can compare your results with those of others and see whether you are making progress.

In this post, you will discover a suite of standard datasets for natural language processing tasks that you can use when getting started with deep learning.

Overview

This post is divided into 7 parts; they are:

  1. Text Classification
  2. Language Modeling
  3. Image Captioning
  4. Machine Translation
  5. Question Answering
  6. Speech Recognition
  7. Document Summarization

I have tried to provide a mixture of datasets that are modest in size and popular for use in academic papers.

Almost all datasets are freely available for download today.

If your favorite dataset is not listed or you think you know of a better dataset that should be listed, please let me know in the comments below.

Kick-start your project with my new book Deep Learning for Natural Language Processing.