In this thesis, machine learning methods are used to classify chemical reactions. In addition, the changes in molecular structure that are most characteristic of chemical reactions associated with RNA-binding proteins are identified.
In the first part, six basic groups of chemical reactions were used to determine the optimal set of parameters for modeling and prediction. Three groups of parameters were tested: methods for balancing the learning set (seven methods),
methods for molecular fingerprinting (seven methods), and predictive models (five methods). The best combination was found empirically
to be random undersampling as the balancing method, Morgan+MorganBitVector for molecular fingerprinting, and
random forest as the predictive model, which achieved an average AUC of 0.97. In the second part, this optimal set of parameters was used
to discriminate between chemical reactions associated with RNA-binding proteins and those associated with non-RNA-binding proteins, achieving an AUC of 0.77.
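
The balancing and evaluation steps named above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the thesis: the function names `random_undersample` and `roc_auc` are hypothetical, and molecular fingerprinting and the random forest model are left out because they would require external libraries. The sketch only shows what random undersampling does to a learning set and how an AUC value can be computed from predicted scores.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly discard samples so that every class keeps as many
    items as the rarest class (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Size of the smallest class decides how many items each class keeps.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)

    xs = [sample for sample, _ in balanced]
    ys = [label for _, label in balanced]
    return xs, ys


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Ties count as one half. Labels are assumed to be 0 or 1."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under this definition, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which gives a scale for reading the 0.97 and 0.77 values reported above.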