In the past several years, machine learning techniques have played an important role in the modern drug discovery process and have become an absolute necessity for it. Multiple methods for predicting physicochemical and chemo-biological endpoints have proven robust and have significantly improved our understanding of the molecular features and properties associated with specific pharmacological features. Despite the considerable number of toolkits and methods supporting drug discovery that are available to the public, academia, and the pharmaceutical industry, there is demand for a tool that combines mining and curation of heterogeneous chemical data with multiple sophisticated molecular machine learning algorithms. Such a toolkit has to be able to train models using a variety of machine learning algorithms with minimal user intervention and/or provide access to ready-to-use pre-trained models. In this study we evaluated our toolbox (Open Science Data Repository, currently under development) for data curation and machine learning modelling for drug discovery. Heterogeneous publicly available datasets related to tuberculosis, malaria, bubonic plague, Chagas disease, and others were used to tune and train multiple machine learning methods, including traditional methods such as Naïve Bayes, k-Nearest Neighbors, Random Forest, Boosted Decision Trees, Regularized Logistic Regression, and Support Vector Machines, as well as novel deep learning methods based on neural network models of varying complexity. A wide range of evaluation metrics, such as Receiver Operating Characteristic (ROC) curves, Area Under the Curve (AUC), F1-score, Cohen’s kappa, and Matthews correlation coefficient, was used to evaluate and compare the performance of the machine learning models. A variety of molecular descriptors commonly used in cheminformatics for representing compounds is built into our methods, so an additional layer of tuning is available: searching for the best molecular descriptor for a particular model.
Most of the models performed well, and the developed workflows are ready to be used for QSPR and QSAR modelling. Moreover, all tuned and trained models from this study are publicly available at https://figshare.com/s/0286924045d50441bf98. We strongly believe that modern in silico approaches, combined with advances in data mining, curation, and machine learning methods, will accelerate the drug discovery process.
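
For readers unfamiliar with the evaluation metrics named above, the following minimal sketch shows how AUC, F1-score, Cohen’s kappa, and Matthews correlation coefficient can be computed for a binary active/inactive classifier. It is plain Python written for illustration only, not the toolbox's implementation; all function names, the toy labels, the toy scores, and the 0.5 decision threshold are assumptions introduced here.

```python
from math import sqrt

def roc_auc(y_true, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cohen_kappa(tp, fp, fn, tn):
    """Observed agreement corrected for the agreement expected by chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0

def matthews_corrcoef(tp, fp, fn, tn):
    """Correlation between true and predicted binary labels."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical toy data: 1 = active compound, 0 = inactive compound.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]   # assumed model outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
print("AUC  :", roc_auc(y_true, scores))
print("F1   :", f1_score(*counts))
print("kappa:", cohen_kappa(*counts))
print("MCC  :", matthews_corrcoef(*counts))
```

Unlike F1-score, Cohen’s kappa and Matthews correlation coefficient also account for true negatives, which is one reason they are often reported alongside it for imbalanced screening datasets.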