Despite growing linguistic research on spoken Slovene, which seeks to catalogue the many previously overlooked features that distinguish the spoken language from its written form, most such studies rely on qualitative analysis of relatively small samples of language use that are limited in demographics or genre. This limits the replicability of the research and the extent to which findings can be generalized to spoken Slovene as a whole. To address this issue, this paper introduces the Spoken Slovene Treebank (SST), a freely accessible, morphologically and syntactically annotated representative sample of the Gos reference corpus of spoken Slovene, and illustrates its methodological potential for future corpus-based research on spoken Slovene. By examining three common spoken-language phenomena (self-repairs, discourse markers, and post-modifying adjectives), we showcase how the SST supports straightforward retrieval of numerous authentic examples. Furthermore, by analysing the distribution of self-repairs across various communicative settings, we highlight the treebank's utility for diverse statistical analyses of language practices. In addition to highlighting the major advantages of the SST, such as its balanced composition, open access, manual grammatical annotations, and direct comparability with similar corpora worldwide, we also address some of its limitations in the concluding section, notably its relatively small size and its robust, written-language-oriented annotation scheme.