Large-scale Near-deduplication Behind BigCode

This post is for readers who are interested in document-level near-deduplication at a large scale and have some understanding of hashing, graphs, and text processing.

Motivations

It is important to take care of our data before feeding it to the model, at least Large Language