Synthetic datasets are often used for testing and evaluating machine learning models. For quick testing, there are simple, pre-prepared datasets such as Iris or Wine from the Scikit-learn library. For competitions and industrial testing, refined versions of real datasets are used. Driven by concerns about data privacy, data accessibility, and model explainability, the demand for synthetic datasets is growing. Most machine learning libraries support generating basic synthetic datasets, but mainly for continuous data. However, the literature indicates a lack of tools for generating synthetic categorical datasets. Therefore, we developed, tested, and released an open-source library for generating synthetic datasets with categorical features. Our framework allows for the generation of both simple and complex datasets with full control over the generative process. We demonstrate its use in three use cases. The first showcases basic functionality, the second simulates a real dataset, and the third compares DeepFM and logistic regression models on sparse data with various feature interactions.