Large-scale Near-deduplication Behind BigCode

Chenghao Mou

This post is intended for readers who are interested in document-level near-deduplication at a large scale and who have some understanding of hashing, graphs, and text processing.



Motivations

It is important to take care of our data before feeding it to the model, at least Large Language

 

 

 
