Large pre-trained language models such as BERT have transformed natural language processing, but their size and computational cost hinder widespread use, especially on resource-constrained devices. This thesis addresses the problem of reducing the size of language models using quantization, a technique that lowers the numerical precision of model weights and activations. It focuses on post-training quantization (PTQ) methods, specifically dynamic and static quantization, implemented and evaluated on a BERT classification model using the ONNX Runtime library. Quantization-aware training (QAT) is also presented theoretically, along with other commonly used methods for reducing the size of language models. The impact of the implemented PTQ methods on model size, inference speed, and predictive accuracy is analyzed. The results show that quantization significantly reduces model size and speeds up inference. Dynamic quantization of the BERT model achieves a good balance between compression and accuracy preservation, while basic static quantization leads to a noticeable degradation in accuracy. The thesis thus provides an overview of quantization techniques and a practical assessment of the trade-offs involved in applying post-training quantization to the BERT model.
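As a minimal sketch of the dynamic PTQ path described above, the snippet below quantizes a BERT classifier that has already been exported to ONNX, using the ONNX Runtime quantization API; the file names are illustrative placeholders, not the thesis's actual artifacts.

```python
# Minimal sketch: post-training dynamic quantization with ONNX Runtime.
# Assumes a BERT classifier has already been exported to "bert_classifier.onnx"
# (file names are hypothetical examples).
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bert_classifier.onnx",         # original FP32 ONNX model
    model_output="bert_classifier_int8.onnx",   # quantized model written here
    weight_type=QuantType.QInt8,                # weights stored as signed 8-bit integers
)
```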