Over the last decade, there has been increasing interest in using in silico tools to assess the potential risk of newly released chemicals, because a large number of chemicals enter the market every year and there is considerable uncertainty about their possible hazardous effects. Various tools and methods based on machine learning techniques already exist and have been used in a wide range of applications, starting from quantitative structure-property relationships and expanding into predictive toxicology. A large amount of historical data has accumulated across multiple publicly available databases and can be used with novel machine learning methods. Unfortunately, because datasets, metrics, and validation strategies differ, significant gaps remain in both the quantity and quality of the available data, as well as in identifying optimal predictive methods. This work is an attempt to develop a multitask system that can serve as a searchable, curated collection of multiple chemical datasets together with ready-to-use novel machine learning methods built solely with open source frameworks and libraries. We have implemented a set of traditional (shallow) machine learning methods, self-tuned using grid search and k-fold validation, such as Naïve Bayes, k-Nearest Neighbors, Random Forest, Boosted Decision Trees, Regularized Logistic Regression, and Support Vector Machines, based on the open source Scikit-learn library (http://scikit-learn.org/stable/). Novel Deep Neural Network models of varying complexity have also been implemented using Keras (https://keras.io/), an open deep learning library, with TensorFlow (www.tensorflow.org) as the backend. The machine learning models were trained and evaluated to predict measures of toxicity from the physical characteristics of chemical structures, using the same datasets as the Toxicity Estimation Software Tool (https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test).
The Deep Learning models performed very well on the evaluation metrics and were found to be useful in predicting toxicological and physicochemical endpoints. The results of this work support an optimistic view that some of the current obstacles in cheminformatics can be overcome by using Deep Learning methods.
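The self-tuning procedure mentioned above (grid search combined with k-fold validation) can be illustrated with a minimal sketch. It is written in plain Python rather than Scikit-learn so that the mechanics are visible, and it is not the authors' implementation: the k-Nearest Neighbors regressor, the synthetic data, the candidate grid, and all function names (`knn_predict`, `k_fold_indices`, `cv_mse`, `grid_search`, `make_toy_data`) are assumptions made for illustration only. In Scikit-learn, the same idea is provided by `GridSearchCV`.

```python
import random
from statistics import mean


def knn_predict(train_X, train_y, x, k):
    """Predict a target as the mean of the k nearest training targets."""
    # Squared Euclidean distance from x to every training row.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    return mean(y for _, y in dists[:k])


def k_fold_indices(n, n_folds, seed=0):
    """Shuffle indices 0..n-1 and split them into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def cv_mse(X, y, k, n_folds=5):
    """Mean squared error of k-NN averaged over the held-out folds."""
    errors = []
    for held_out in k_fold_indices(len(X), n_folds):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        errors.append(
            mean(
                (knn_predict(train_X, train_y, X[i], k) - y[i]) ** 2
                for i in held_out
            )
        )
    return mean(errors)


def grid_search(X, y, grid, n_folds=5):
    """Score every candidate k by cross-validation and return the best one."""
    scores = {k: cv_mse(X, y, k, n_folds) for k in grid}
    best_k = min(scores, key=scores.get)
    return best_k, scores


def make_toy_data(n=60, seed=1):
    """Synthetic stand-in for descriptor/endpoint data (hypothetical)."""
    rng = random.Random(seed)
    X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(n)]
    y = [2.0 * a - b + rng.gauss(0, 0.05) for a, b in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    best_k, scores = grid_search(X, y, grid=[1, 3, 5, 9, 15])
    for k, mse in scores.items():
        print(f"k={k:2d}  cross-validated MSE={mse:.4f}")
    print("selected k:", best_k)
```

Each candidate value is scored only on data that was held out during fitting, so the selected hyperparameter is not tuned to the training points themselves. A separate test set would still be needed for an unbiased estimate of the final model's performance.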