In this thesis, we address the automatic classification of wetland animal sounds under conditions of limited labeled data. We applied few-shot and self-supervised learning approaches and evaluated three modern models (CLAP, BYOL-A, and M2D) for recognizing vocalizations of the frog \textit{Rana dalmatina} in both in-air and underwater recordings. M2D achieved the best performance, and additional training on unlabeled data raised its F1-score from 0.902 to 0.934. To organize the large volume of recordings, we clustered the audio representations with the HDBSCAN algorithm and visualized them with UMAP, enabling efficient analysis and interpretation of the unlabeled data.