Unsupervised text tokenizer focused on computational efficiency

PyPI Downloads Code style: black GitHub Build Status

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 90 times faster. Check out our benchmark results.

Key advantages:

  • Multithreading for training and tokenization
  • The algorithm has O(N) complexity, where N is the length of training data
  • Highly efficient implementation in C++
  • Python wrapper and command-line interface

Extra features:

As well as in the

 

 

 

To finish reading, please visit source site