One of the most urgent problems in high-performance computing is the extremely low efficiency of supercomputer applications. The situation is complicated by the fact that in most cases users do not even know that their applications are running inefficiently. A special approach is therefore needed: one that analyzes the state of the entire supercomputer and detects such inefficient applications. A possible way to deal with this issue is to study the flow of tasks that run on a supercomputer in order to detect applications with unusual dynamic characteristics or activity (e.g., very low memory usage or deadlocks), in other words, applications with abnormal behavior. In this paper we propose a new intelligent algorithm for detecting such anomalies in the supercomputer task flow. It is based on data mining methods and is designed as follows. Using system monitoring data, the mean and spread of values of dynamic characteristics (such as CPU load, number of cache misses per second, and InfiniBand network load) are computed for each completed task. These data are used as input to a decision tree classification algorithm that assigns each task to a particular class. At the moment, each task falls into one of three classes: normal, suspicious, or abnormal. In the future we plan to distinguish between different classes of anomalies. Current results show that a surprisingly large percentage of tasks have abnormally low resource usage. We also show how different characteristics influence the choice of class, which helps to determine the root causes of anomalous program behavior. The results are demonstrated on actual data from the Lomonosov supercomputer (1.7 PFlops peak, 5000 nodes, 500 completed tasks per day).
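The two stages described above (per-task feature extraction, then tree-based classification) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The metric names (`cpu_load`, `ib_load`), the use of the population standard deviation as the "spread", and all thresholds are assumptions made for illustration. The tree here is written by hand, whereas the paper's classifier is a decision tree algorithm applied to monitoring data.

```python
from statistics import mean, pstdev


def task_features(samples):
    """Compute the mean and spread of each monitored metric for one completed task.

    `samples` maps a metric name to the list of values recorded while the task ran.
    Spread is taken to be the population standard deviation (an assumption).
    """
    features = {}
    for metric, values in samples.items():
        features[f"{metric}_mean"] = mean(values)
        features[f"{metric}_spread"] = pstdev(values)
    return features


def classify(features):
    """Hand-written stand-in for a decision tree; every threshold is hypothetical."""
    if features["cpu_load_mean"] < 5.0:
        # Almost no CPU activity: check whether the network is idle as well.
        if features["ib_load_mean"] < 1.0:
            return "abnormal"
        return "suspicious"
    if features["cpu_load_spread"] > 40.0:
        # Highly irregular CPU load over the task's lifetime.
        return "suspicious"
    return "normal"


# Hypothetical monitoring samples for two completed tasks.
idle_task = {"cpu_load": [1.0, 2.0, 1.0, 2.0], "ib_load": [0.0, 0.0, 0.0, 0.0]}
busy_task = {"cpu_load": [90.0, 95.0, 92.0, 93.0], "ib_load": [50.0, 60.0, 55.0, 58.0]}

for name, samples in (("idle_task", idle_task), ("busy_task", busy_task)):
    print(name, classify(task_features(samples)))
```

In practice the thresholds would come from a tree trained on labeled tasks. The learned splits then indicate which characteristics drive the assignment to each class, which is the kind of analysis the abstract refers to.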