# Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm
This is the PyTorch implementation of Sparse Progressive Distillation (SPD). For more details on the motivation, techniques, and experimental results, please refer to our paper here.

## Environment Preparation

Requires Python 3. Install the dependencies with:

```bash
pip install -r requirements.txt
```

## Dataset Preparation

The original GLUE dataset can be downloaded here. We use finetuned BERT_base as the teacher. For each task of the GLUE benchmark, we obtain the finetuned model using the original Hugging Face Transformers code with the following script.
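The repository's exact finetuning script is not reproduced here; as an illustration, a typical invocation of the Transformers `run_glue.py` example for producing a finetuned BERT_base teacher might look like the sketch below. The task name, hyperparameters, and output path are assumptions for illustration, not the settings used in the paper.

```bash
# Illustrative only: finetune bert-base-uncased on one GLUE task with the
# stock Hugging Face Transformers run_glue.py example script.
# Hyperparameters and paths below are placeholders, not the paper's settings.
export TASK_NAME=sst2

python run_glue.py \
  --model_name_or_path bert-base-uncased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir ./finetuned_teacher/$TASK_NAME/
```

The resulting checkpoint in `--output_dir` can then serve as the teacher model for distillation.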