Cross-lingual text summarization is the task of generating a summary in a language different from that of the source document. It remains an under-researched area of natural language processing, as the majority of research focuses on English alone. We developed three models capable of direct summarization from Slovene to English, based on the pre-trained LongT5, PEGASUS-X, and BigBird models. For training we used the KAS 2.0 dataset, which contains 52,351 Slovene academic works and their corresponding English summaries. We conducted multiple experiments, fine-tuning the models on different portions of the training dataset. The models were quantitatively evaluated with the ROUGE-L and BLEURT metrics; the LongT5 model performed best, closely followed by the PEGASUS-X model. The BigBird model scored approximately 8% worse on the BLEURT metric, while remaining comparable to the other models on the other metrics. We also manually evaluated 30 generated summaries per model, classifying each as good or bad. The LongT5 model produced three good summaries, PEGASUS-X one, and BigBird none.
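As an illustration of the pipeline described above, the following minimal sketch shows how one of the pre-trained models (LongT5) can be loaded, used to generate a summary for a long Slovene document, and scored with ROUGE-L. It assumes the Hugging Face transformers and evaluate libraries; the public google/long-t5-tglobal-base checkpoint and the example texts are placeholders, not the fine-tuned models or data evaluated in this work.

```python
# Hypothetical inference-and-scoring sketch; checkpoint and texts are placeholders.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import evaluate

model_name = "google/long-t5-tglobal-base"  # assumption: stands in for our fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# LongT5's sparse attention allows much longer inputs than standard T5.
document = "Besedilo slovenskega akademskega dela ..."  # Slovene source text
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=4096)
summary_ids = model.generate(**inputs, num_beams=4, max_new_tokens=256)
prediction = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# ROUGE-L against the reference English summary; BLEURT can be scored
# analogously via evaluate.load("bleurt").
rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=[prediction], references=["Reference English summary ..."])
print(scores["rougeL"])
```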