Learning and Recognizing Failure Causes on OSG

Alina Bejan
23 Feb 2009, 14:18
13 May 2009, 12:25
From a survey of job failures in US-ATLAS, we found that jobs on OSG fail constantly for
many different reasons, some known and others not. System operators troubleshoot these
failures mostly by examining various logs and trying whatever small tests their experience
suggests. This is a complicated and time-consuming task. Moreover, some hidden
causes are very difficult to discover when they lie beyond the operators’ experience. It
would therefore be of great value to learn and build models from the large volume of
monitoring and logging data, in order to recognize failures and discover hidden causes. We
propose a mechanism, Learning and Recognizing Failure Causes (LRFC), that indexes system
states on the fly and retrieves similar ones in the future. Experimental results from testing
a partial implementation on US-ATLAS monitoring data look promising. Finally, the research
plan and the resources needed from OSG are listed.

Jing Tie's proposal for the CS Fellowship program
