A new approach to finding potentially erroneous entries in a database using machine learning - conference talk | ISTINA – Intelligent System for Thematic Investigation of Scientometric Data

Authors: Khrisanfov M.D., Matyushin D.D., Samokhin A.S.
International conference (symposium): Fourteenth Winter Symposium on Chemometrics
Conference dates: 26 February - 1 March 2024
Talk date: 29 February 2024
Talk type: Oral
Speaker: Khrisanfov M.D.
Venue: Tsaghkadzor, Armenia
Abstract:
The NIST Retention Index (RI) database is one of the most widely used sources of retention indices. Its latest version contains more than 300 000 entries (combinations of compound, stationary phase, and RI value) from various sources: experimental articles, personal communications, etc. According to our estimates, about 80% of the compounds in both the NIST 17 RI and NIST 20 RI databases have only one RI value per stationary phase, which makes searching for erroneous values with statistical methods impossible. Manual inspection of the entries is also not viable because of the size of the database.
We designed a two-step approach to find potentially erroneous retention indices. The first step was to use five predictive models to obtain predicted RI values for the whole database. The models and their data preprocessing pipelines were heavily inspired by the article [1]. We trained each of the five models with 5-fold cross-validation, but the validation sets were used only to obtain predicted RI values: all tuning of hyperparameters and architectures had been done beforehand. At the end of this step, we had five predicted RI values for each compound-stationary phase-RI combination.
The second step was to compare these five predicted values with the experimental one. For each model, we treated the 5% of RI values with the largest difference between the predicted and the experimental value as outliers and marked them with a “yellow card”. Each entry thus received from zero to five “yellow cards”. To stay cautious, we deemed only entries with 5 “yellow cards” to be potentially erroneous. There are two main reasons for an entry to get 5 “yellow cards”: either the compound is too unusual for its RI value to be predicted adequately, or the RI value in the database is erroneous. In the former case, the predictions of the different models would likely be inconsistent: the standard deviation would be large, because the prediction errors of the models are not fully correlated. In the latter case, the predictions would likely agree better and the standard deviation would be much smaller. We found that both the median and the maximum of the distribution of the standard deviations of the five predicted values in the group with 5 “yellow cards” were significantly smaller than even in the group with 1 “yellow card”.
Overall, we detected 2093 outlier entries in the NIST 17 RI database, 566 of which were corrected or removed by the developers in NIST 20 RI.
The research is supported by the Russian Science Foundation (project No. 22-73-10053), https://rscf.ru/project/22-73-10053/
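The two-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the interleaved fold assignment, the use of the absolute difference as the error measure, and the `fit`/`predict` callables standing in for the five real models are all assumptions made for illustration.

```python
import statistics


def out_of_fold_predictions(features, targets, fit, predict, n_folds=5):
    """Predict every entry with a model that never saw it during training.

    `fit` and `predict` are hypothetical callables standing in for one of
    the five predictive models. Folds are assigned by index modulo
    n_folds, which is an assumption: the abstract does not say how the
    folds were formed.
    """
    n = len(targets)
    predicted = [None] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        model = fit([features[i] for i in train], [targets[i] for i in train])
        for i in held_out:
            predicted[i] = predict(model, features[i])
    return predicted


def count_yellow_cards(experimental, predictions_per_model, fraction=0.05):
    """Return the number of "yellow cards" (0..number of models) per entry.

    For each model, the `fraction` of entries with the largest absolute
    difference between predicted and experimental value gets one card.
    """
    n = len(experimental)
    n_flagged = max(1, int(fraction * n))  # flag at least one entry per model
    cards = [0] * n
    for predicted in predictions_per_model:
        errors = [abs(p - e) for p, e in zip(predicted, experimental)]
        worst = sorted(range(n), key=errors.__getitem__, reverse=True)[:n_flagged]
        for i in worst:
            cards[i] += 1
    return cards


def prediction_spread(predictions_per_model):
    """Standard deviation of the models' predictions for each entry."""
    return [statistics.pstdev(values) for values in zip(*predictions_per_model)]
```

Under these assumptions, an entry that collects a card from every model while the spread of its predictions stays small is the kind of entry the abstract treats as potentially erroneous; a large spread instead points to a compound that the models cannot predict reliably.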
Added to the system by: Khrisanfov Mikhail Dmitrievich
