The use of large language models is becoming increasingly widespread among developers. However, commercial solutions and large models raise privacy concerns and demand substantial computational resources. In this work, we focus on language models with 160 million parameters that are suitable for local execution and for augmentation with retrieval from local projects. We train GPT-2 and RETRO models on open-source Python files, compare them experimentally, and confirm the benefits of retrieval based on vector embeddings. We further improve our models' performance with in-context retrieval, which selects context based on the Jaccard similarity of token sets. We also evaluate in-context retrieval on larger models and conclude that, despite its simplicity, this approach outperforms the RETRO architecture. Finally, we highlight the key role of proper tokenization in realizing the full potential of language models.
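To make the in-context retrieval idea concrete, the following is a minimal sketch of retrieval by Jaccard similarity over token sets, assuming candidate chunks are drawn from tokenized local project files; all function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch: in-context retrieval via Jaccard similarity of token sets.
# Names (jaccard_similarity, retrieve_context) are illustrative assumptions.

def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def retrieve_context(query_tokens, candidate_chunks, top_k=1):
    """Return the top-k candidate chunks most similar to the query tokens."""
    query_set = set(query_tokens)
    ranked = sorted(
        candidate_chunks,
        key=lambda chunk: jaccard_similarity(query_set, set(chunk)),
        reverse=True,
    )
    return ranked[:top_k]

# Usage: prepend the best-matching chunk to the model's prompt.
query = "def load_config(path):".split()
chunks = [
    "def load_config(file_path): return json.load(open(file_path))".split(),
    "class Trainer: pass".split(),
]
best = retrieve_context(query, chunks, top_k=1)[0]
prompt = " ".join(best) + "\n" + " ".join(query)
```

Because the score depends only on token overlap, this retrieval needs no trained encoder or vector index, which is consistent with the simplicity the abstract attributes to the approach.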