Introduction. Oncological diseases are characterized by fast progression, which makes early diagnosis extremely important. Gene expression analysis is a promising approach to cancer diagnostics because it reveals slight changes in cellular behavior at the molecular level during the early stages of disease. Transcriptome technologies such as RNA-seq and microarrays produce large amounts of data that are well suited to analysis with machine learning (ML) approaches. Methods and Results. Random Forest is a widely used ML algorithm for bioinformatics tasks. However, its applicability to gene expression analysis is limited by the need to normalize the data being classified together with the training data. At the same time, there is a set of nonparametric methods based on pairwise comparisons between the expression levels of genes within a single sample, which makes them invariant to monotonic normalization. The aim of this study was to combine these approaches. To do this, we built rank and indicator models. Using datasets from the TCGA and GEO databases, we demonstrated that both model types were able to distinguish, with high efficiency, healthy samples from head and neck cancer samples, subtypes of breast cancer and of gliomas, and stages of ovarian cancer. The biological relevance of the rank model was confirmed by an analysis of its important features. In addition, the rank model revealed a potential connection between innate immunity genes (gene list from InnateDB) and tumor severity (glioma, breast cancer, ovarian cancer). Acknowledgments. The reported study was funded by RFBR, research project No. 19-29-01243.
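The idea behind the rank and indicator models can be illustrated with a minimal sketch. The abstract does not specify how the features were constructed, so everything below is an assumption made for illustration: the function names `rank_features` and `indicator_features` are hypothetical, the data are synthetic, and `log2(x + 1)` merely stands in for some monotonic normalization. The sketch shows only that within-sample ranks and pairwise "gene i > gene j" indicators do not change under a strictly increasing transform of the expression values.

```python
import numpy as np


def rank_features(expr):
    """Replace each sample's expression values with their within-sample ranks.

    expr: array-like of shape (n_samples, n_genes).
    Ties are broken by gene order (an assumption made here for simplicity).
    """
    expr = np.asarray(expr, dtype=float)
    # argsort of argsort yields the 0-based rank of each gene within its row
    order = np.argsort(expr, axis=1, kind="stable")
    return np.argsort(order, axis=1, kind="stable")


def indicator_features(expr):
    """Binary features 1[expr_i > expr_j] for every gene pair i < j, per sample."""
    expr = np.asarray(expr, dtype=float)
    i, j = np.triu_indices(expr.shape[1], k=1)
    return (expr[:, i] > expr[:, j]).astype(np.int8)


# Synthetic toy data: 4 samples x 5 genes (illustration only, not real data)
rng = np.random.default_rng(0)
raw = rng.gamma(shape=2.0, scale=50.0, size=(4, 5))

# A strictly increasing transform, standing in for a monotonic normalization
normalized = np.log2(raw + 1.0)

same_ranks = np.array_equal(rank_features(raw), rank_features(normalized))
same_indicators = np.array_equal(
    indicator_features(raw), indicator_features(normalized)
)
print(same_ranks, same_indicators)
```

Feature matrices of this kind could then be passed to a Random Forest classifier (for example, `RandomForestClassifier` from scikit-learn) in place of the raw expression values. Whether the models in the study were built exactly this way is not stated in the abstract.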