From Chunks to Blocks: Accelerating Uploads and Downloads on the Hub

Content-defined chunking (CDC) plays a central role in enabling deduplication within a Xet-backed repository. The idea is straightforward: break each file’s data into chunks, store only unique ones, reap the benefits.
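To make the idea concrete, here is a minimal sketch of content-defined chunking with deduplication. It is not the Xet implementation: the gear-style rolling hash, the size limits, the boundary rule, and the names `chunk_data`, `store_file`, and `restore_file` are all assumptions chosen for illustration.

```python
import hashlib
import random
from typing import Dict, List

# Hypothetical parameters for illustration only; not the Hub's real settings.
MIN_SIZE = 64      # never cut a chunk shorter than this
MAX_SIZE = 1024    # always cut once a chunk reaches this length
BOUNDARY_BITS = 8  # cut when the top 8 hash bits are zero (about 1 in 256 positions)

# Fixed table of random 64-bit values, one per byte value (a "gear" table).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk_data(data: bytes) -> List[bytes]:
    """Split data at boundaries chosen from the content, not from fixed offsets."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling hash: each shift pushes older bytes out of the 64-bit window.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        at_boundary = length >= MIN_SIZE and (h >> (64 - BOUNDARY_BITS)) == 0
        if at_boundary or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store_file(data: bytes, chunk_store: Dict[str, bytes]) -> List[str]:
    """Store only chunks not seen before; return the ordered list of chunk hashes."""
    recipe = []
    for piece in chunk_data(data):
        key = hashlib.sha256(piece).hexdigest()
        chunk_store.setdefault(key, piece)  # no-op if the chunk already exists
        recipe.append(key)
    return recipe


def restore_file(recipe: List[str], chunk_store: Dict[str, bytes]) -> bytes:
    """Rebuild a file by concatenating its chunks in order."""
    return b"".join(chunk_store[key] for key in recipe)
```

Because boundaries depend on the bytes themselves, inserting a few bytes near the start of a file shifts later data without changing most chunk contents, so a second call to `store_file` on the edited file typically adds only a handful of new entries to `chunk_store`.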

In practice, it’s more complex. If we focused solely on maximizing deduplication, the design would call for the smallest possible chunk size. But doing so would create significant overhead for both the infrastructure and the builders on the Hub.
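A rough way to see the overhead is to count how many chunks a single file produces at different average chunk sizes. The file size and chunk sizes below are hypothetical values picked for illustration, not figures from the Hub, and `chunk_count` is a helper introduced only for this sketch.

```python
def chunk_count(file_size: int, avg_chunk_size: int) -> int:
    """Approximate number of chunks for a file, using ceiling division."""
    return -(-file_size // avg_chunk_size)


FILE_SIZE = 10 * 1024**3  # hypothetical 10 GiB file

# Hypothetical average chunk sizes: 1 KiB, 64 KiB, 1 MiB.
for avg in (1024, 64 * 1024, 1024 * 1024):
    print(f"avg chunk {avg:>8} bytes: {chunk_count(FILE_SIZE, avg):>10} chunks")
```

Every chunk is something to hash, index, and look up, so the per-file bookkeeping grows in proportion to the chunk count as the chunk size shrinks.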

On Hugging Face’s Xet team, we’re bringing CDC from theory to production to deliver faster uploads and downloads to AI builders (by a factor of 2-3x in some

 

 

 
