LAD: Layer-Wise Adaptive Distillation for BERT Model Compression
Blog Article
Recent advances in large-scale pre-trained language models (e.g., BERT) have brought significant gains to natural language processing. However, their large model size hinders deployment on IoT and edge devices.
Several studies have used task-specific knowledge distillation to compress pre-trained language models. However, when reducing the number of layers in a large BERT model, a sound strategy for distilling knowledge into a student model with fewer layers than the teacher is still lacking. In this work, we present Layer-wise Adaptive Distillation (LAD), a task-specific distillation framework for reducing the model size of BERT. LAD uses an iterative aggregation mechanism with multiple gate blocks to adaptively distill layer-wise internal knowledge from the teacher model to the student model.
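To make the idea concrete, here is a minimal PyTorch-style sketch of the gate-block aggregation concept. The class and function names (GateBlock, layer_wise_distillation_loss), the sigmoid gating formulation, and the MSE matching objective are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of layer-wise adaptive distillation with gate blocks
# (hypothetical names and objective; not the authors' actual code).
import torch
import torch.nn as nn


class GateBlock(nn.Module):
    """Fuses the running aggregate with the next teacher hidden state."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, aggregate: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
        # A sigmoid gate decides, per dimension, how much of the new
        # teacher-layer knowledge to mix into the running aggregate.
        g = torch.sigmoid(self.gate(torch.cat([aggregate, teacher_hidden], dim=-1)))
        return g * teacher_hidden + (1.0 - g) * aggregate


def layer_wise_distillation_loss(teacher_hiddens, student_hiddens, gate_blocks):
    """Iteratively aggregates every teacher layer (none are skipped) and
    matches each aggregate against a student layer with MSE (illustrative)."""
    mse = nn.MSELoss()
    loss = 0.0
    aggregate = torch.zeros_like(teacher_hiddens[0])
    layers_per_student = len(teacher_hiddens) // len(student_hiddens)
    for i, teacher_hidden in enumerate(teacher_hiddens, start=1):
        aggregate = gate_blocks[i - 1](aggregate, teacher_hidden)
        # Align the aggregate with a student layer every `layers_per_student`
        # teacher layers, so all teacher layers contribute to the transfer.
        if i % layers_per_student == 0:
            loss = loss + mse(student_hiddens[i // layers_per_student - 1], aggregate)
    return loss


# Example usage: 12-layer teacher, 6-layer student, hidden size 768.
hidden_size, seq_len, batch = 768, 16, 2
gate_blocks = nn.ModuleList(GateBlock(hidden_size) for _ in range(12))
teacher_hiddens = [torch.randn(batch, seq_len, hidden_size) for _ in range(12)]
student_hiddens = [torch.randn(batch, seq_len, hidden_size) for _ in range(6)]
loss = layer_wise_distillation_loss(teacher_hiddens, student_hiddens, gate_blocks)
loss.backward()  # the gate blocks are trained jointly with the student
```

The key design point this sketch tries to capture is that every teacher layer is folded into the aggregated representation before a student layer is supervised, rather than mapping each student layer to a single hand-picked teacher layer.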
The proposed method enables an effective knowledge transfer process for the student model without skipping any teacher layers. The experimental results show that both the six-layer and four-layer LAD student models outperform previous task-specific distillation approaches on the GLUE tasks.