Task-based datasets, preprocessing, and evaluation for sequence models

SeqIO

SeqIO is a library for processing sequential data to be fed into downstream sequence models. It uses tf.data.Dataset to create scalable data pipelines but requires minimal use of TensorFlow. In particular, the returned dataset can be converted to a NumPy iterator with one line of code, which makes it compatible with other frameworks such as JAX or PyTorch.
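As a rough illustration of the one-line conversion mentioned above, the sketch below builds a toy tf.data.Dataset directly and turns it into a NumPy iterator with `as_numpy_iterator()`, a standard tf.data method. It does not call SeqIO itself; the helper `make_toy_dataset`, the feature names "inputs" and "targets", and the token values are assumptions made up for this example.

```python
import numpy as np
import tensorflow as tf


def make_toy_dataset() -> tf.data.Dataset:
    """Builds a tiny dataset of variable-length 1-D integer features.

    The examples are hypothetical stand-ins for what a real pipeline
    would produce.
    """
    examples = [
        {"inputs": [3, 7, 1], "targets": [4, 1]},
        {"inputs": [9, 2], "targets": [5, 6, 1]},
    ]

    def gen():
        for ex in examples:
            yield ex

    return tf.data.Dataset.from_generator(
        gen,
        output_signature={
            "inputs": tf.TensorSpec(shape=(None,), dtype=tf.int32),
            "targets": tf.TensorSpec(shape=(None,), dtype=tf.int32),
        },
    )


ds = make_toy_dataset()

# The one-line conversion: each element becomes a dict of NumPy arrays,
# which JAX or PyTorch code can consume without touching TensorFlow.
numpy_examples = list(ds.as_numpy_iterator())

for ex in numpy_examples:
    print({name: value.tolist() for name, value in ex.items()})
```

Once the examples are plain NumPy arrays, they can be passed to `jax.numpy.asarray` or `torch.from_numpy` in the training loop.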

Currently, SeqIO assumes that the dataset is a sequence, i.e., each feature is a one-dimensional array. Modalities such as text or audio are naturally supported. Images are supported as long as they are represented as sequences (e.g., Image GPT). We will relax this constraint in the future in order to support higher-dimensional data.
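To make the one-dimensional constraint concrete, here is a minimal sketch of how an image might be turned into a sequence by flattening it in raster order. This only illustrates the general idea; it is not SeqIO code and not the exact preprocessing used by Image GPT. The function name `image_to_sequence` and the toy image are hypothetical.

```python
import numpy as np


def image_to_sequence(image: np.ndarray) -> np.ndarray:
    """Flattens a (height, width) or (height, width, channels) image
    into a 1-D array in row-major (raster) order."""
    if image.ndim not in (2, 3):
        raise ValueError(f"Expected a 2-D or 3-D image, got {image.ndim} dimensions")
    return image.reshape(-1)


# A toy 2x3 image with 3 channels, filled with increasing values.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
sequence = image_to_sequence(image)

print(sequence.shape)
```

The resulting array has a single dimension, so it satisfies the same shape requirement as a tokenized text or audio feature.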

SeqIO is a refactor of the t5.data library used (in conjunction with the Mesh Tensorflow