LAD: LAYER-WISE ADAPTIVE DISTILLATION FOR BERT MODEL COMPRESSION


Recent advances in large-scale pre-trained language models (e.g., BERT) have brought significant potential to natural language processing. However, their large model size hinders deployment on IoT and edge devices.

Several studies have utilized task-specific knowledge distillation to compress pre-trained language models. However, a sound strategy for distilling knowledge from a large BERT model to a student model with fewer layers than the teacher is still lacking. In this work, we present Layer-wise Adaptive Distillation (LAD), a task-specific distillation framework that reduces the model size of BERT. LAD uses an iterative aggregation mechanism with multiple gate blocks to adaptively distill layer-wise internal knowledge from the teacher model to the student model.

The proposed method enables an effective knowledge transfer process to the student model without skipping any teacher layers. The experimental results show that both the six-layer and four-layer LAD student models outperform previous task-specific distillation approaches on the GLUE tasks.
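To make the idea of gated, layer-wise aggregation more concrete, here is a minimal PyTorch sketch. The abstract does not specify the gate blocks' internal design, so the `GateBlock` and `LayerwiseAggregator` modules below (a learned sigmoid gate that folds every teacher layer into a running aggregate, emitting one distillation target per student layer) are an illustrative assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class GateBlock(nn.Module):
    """Mixes the running aggregate with one teacher layer's hidden states (assumed design)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_size, 1)

    def forward(self, aggregate, teacher_hidden):
        # g in (0, 1): how much of the current teacher layer to absorb.
        g = torch.sigmoid(self.gate(torch.cat([aggregate, teacher_hidden], dim=-1)))
        return g * teacher_hidden + (1 - g) * aggregate


class LayerwiseAggregator(nn.Module):
    """Iteratively folds all teacher layers into targets for a student with fewer layers."""
    def __init__(self, num_teacher_layers, num_student_layers, hidden_size):
        super().__init__()
        assert num_teacher_layers % num_student_layers == 0
        self.ratio = num_teacher_layers // num_student_layers  # e.g. 12 / 6 = 2
        self.gates = nn.ModuleList(
            [GateBlock(hidden_size) for _ in range(num_teacher_layers)]
        )

    def forward(self, teacher_hiddens):
        # teacher_hiddens: list of [batch, seq_len, hidden] tensors, one per teacher layer.
        targets, aggregate = [], torch.zeros_like(teacher_hiddens[0])
        for i, hidden in enumerate(teacher_hiddens):
            aggregate = self.gates[i](aggregate, hidden)   # no teacher layer is skipped
            if (i + 1) % self.ratio == 0:                  # emit a target every `ratio` layers
                targets.append(aggregate)
        return targets  # one distillation target per student layer
```

In a sketch like this, the student would be trained with a layer-wise loss (e.g., MSE) between each student layer's hidden states and the corresponding aggregated target, alongside the usual task-specific objective.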
