Collecting and labeling data is costly and time-consuming. In this work, we present a framework that leverages large language models (LLMs) to generate synthetic data. We test it on three text classification tasks and achieve improvements over baseline results. We introduce several methods for evaluating the quality of artificial datasets and demonstrate how the resulting insights can guide the design of new synthetic-data generation approaches. Among the generation techniques we developed and tested, the most notable is the addition of frequent words to the prompt, which significantly improves results in scenarios where both a small labeled set and a large unlabeled set are available. The highest performance is achieved by combining artificially generated data with LLM-labeled samples drawn from the large set of unlabeled examples. The main contributions of this work are the implemented framework and the developed generation strategies, which we evaluate with multiple metrics across various scenarios.
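To illustrate the frequent-words technique, the minimal sketch below shows one plausible way to mine frequent words from an unlabeled corpus and inject them into a generation prompt. The function names, tokenization, prompt wording, and parameters (e.g. `top_k`) are our own illustrative assumptions, not the exact implementation described in this work.

```python
from collections import Counter
import re

def frequent_words(unlabeled_texts, top_k=10, stopwords=frozenset()):
    """Return the top_k most frequent words across the unlabeled corpus.

    Illustrative helper: tokenization and stopword handling are assumptions.
    """
    counts = Counter()
    for text in unlabeled_texts:
        counts.update(w for w in re.findall(r"[a-z']+", text.lower())
                      if w not in stopwords)
    return [word for word, _ in counts.most_common(top_k)]

def build_prompt(label, words):
    """Assemble a generation prompt that steers the LLM toward domain vocabulary.

    Hypothetical prompt template, not the paper's exact wording.
    """
    return (
        f"Write a short text that belongs to the class '{label}'. "
        f"Where natural, use some of these frequent domain words: {', '.join(words)}."
    )

# Toy usage: mine frequent words from a tiny unlabeled set, then build a prompt.
unlabeled = ["the delivery was late again", "late shipping and poor packaging"]
print(build_prompt("negative review", frequent_words(unlabeled, top_k=5)))
```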