Over the past decade we have seen tremendous growth in publicly available chemical and bioactivity datasets and in tools for data mining, search, and modelling of physicochemical and chemo-biological endpoints. The advanced machine learning methods used in cheminformatics and drug research have become more accessible to researchers across academia and industry. Unfortunately, bringing together the mining and curation of heterogeneous datasets, novel machine learning algorithms, and end users is still a major problem, because many of the underlying cheminformatics approaches and techniques remain poorly delivered and hard for researchers to access. Our goal was therefore to develop a data mining/curation and machine learning framework embedded in a general research data management platform (Open Science Data Repository, OSDR) that can be used as an API, as a standalone tool, or integrated into new software as an autonomous module. In OSDR we have developed a number of pipelines that simplify the entire cheminformatics process. The first step is chemical processing, which covers importing datasets into our database. Notably, this process is automated and includes validation, standardization, and visualization steps. The system currently supports almost all available chemical formats. Second, we built in a set of machine learning algorithms that can train and tune models using parameter optimization based on cross-validation and generate a report detailing the performance of the different models. The available machine learning methods include traditional methods such as Naïve Bayes, k-Nearest Neighbors, Random Forest, Boosted Decision Trees, Regularized Logistic Regression, and Support Vector Machines, as well as novel deep learning methods with Neural Network models of varying complexity.
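As an illustration of the tuning step described above, the following is a minimal sketch of parameter optimization based on cross-validation using scikit-learn's `GridSearchCV` with a Random Forest classifier. It is not OSDR's actual pipeline: the helper name `tune_random_forest`, the parameter grid, the ROC AUC scoring choice, and the synthetic data standing in for molecular descriptors and activity labels are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def tune_random_forest(X, y, n_folds=5):
    """Grid-search Random Forest hyperparameters with k-fold cross-validation.

    Hypothetical helper for illustration; not part of OSDR.
    """
    # Assumed grid, kept deliberately small so the sketch runs quickly.
    param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="roc_auc",  # assumed metric; any scikit-learn scorer works
        cv=n_folds,
    )
    search.fit(X, y)
    return search


# Synthetic stand-in for a descriptor matrix and binary activity labels.
X, y = make_classification(n_samples=200, n_features=16, random_state=0)
search = tune_random_forest(X, y)

print("best parameters:", search.best_params_)
print("mean cross-validated ROC AUC:", round(search.best_score_, 3))

# Per-candidate scores, the kind of detail a training report could list.
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(score, 3))
```

The `cv_results_` dictionary holds the score of every candidate in the grid, so a report can show how each configuration performed rather than only the winner.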
All machine learning models were built using the open-source Scikit-learn (http://scikit-learn.org/stable/) for shallow learning methods, and Keras (https://keras.io/) and TensorFlow (www.tensorflow.org) for deep learning. A wide range of model evaluation metrics, such as Receiver Operating Characteristic, Regression Error Characteristic, Area Under Curve, F1-score, Cohen’s kappa, Matthews correlation coefficient, RMSE, R-squared, and others, together with some easy-to-read visualization plots, were implemented and are automatically included in the training or prediction reports.
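Several of the metrics listed above are available directly in `sklearn.metrics`. The sketch below computes a few of them on tiny hand-made arrays; the helper names `classification_metrics` and `regression_metrics` and the toy data are assumptions for illustration and do not reflect how OSDR assembles its reports.

```python
import math

from sklearn.metrics import (
    cohen_kappa_score,
    f1_score,
    matthews_corrcoef,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)


def classification_metrics(y_true, y_pred, y_score):
    """Collect binary-classification metrics into a dict (hypothetical helper)."""
    return {
        "roc_auc": roc_auc_score(y_true, y_score),  # needs scores, not labels
        "f1": f1_score(y_true, y_pred),
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }


def regression_metrics(y_true, y_pred):
    """Collect regression metrics into a dict (hypothetical helper)."""
    return {
        "rmse": math.sqrt(mean_squared_error(y_true, y_pred)),
        "r2": r2_score(y_true, y_pred),
    }


# Toy data: 8 compounds with true labels, predicted labels, and scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]
print(classification_metrics(y_true, y_pred, y_score))

# Toy data: 4 measured vs. predicted endpoint values.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0]))
```

Note that ROC AUC is computed from continuous scores, whereas F1-score, Cohen’s kappa, and the Matthews correlation coefficient are computed from thresholded class labels.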