OSG Document 829-v1

Learning and Recognizing Failure Causes on OSG

Submitted by:
Alina Bejan
Updated by:
Alina Bejan
Document Created:
23 Feb 2009, 14:18
Contents Revised:
23 Feb 2009, 14:18
Metadata Revised:
13 May 2009, 12:25
Viewable by:
  • Public document


From a survey of job failures in US-ATLAS, we found that jobs on OSG fail constantly for many
different reasons, some known and others not. System operators troubleshoot these failures mostly
by examining various logs and running whatever small tests their experience suggests. This is a
complicated and time-consuming task. Moreover, some hidden causes are very difficult to discover
when they lie beyond the operators' experience. It would therefore be of great value to learn and
build models from the large volume of monitoring and logging data, in order to recognize failures
and discover hidden causes. A mechanism for Learning and Recognizing Failure Causes (LRFC) is
proposed to index system states on the fly and retrieve similar ones in the future. Experimental
results from testing a partial implementation on US-ATLAS monitoring data look promising. Finally,
the research plan and the resources needed from OSG are listed.
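
The idea of indexing system states and later retrieving similar ones can be illustrated with a minimal sketch. The abstract does not say how LRFC represents a system state or measures similarity, so everything here is an assumption made for illustration: a state is a small numeric vector, similarity is Euclidean distance, and the class name, metric values, and cause labels are hypothetical.

```python
import math


class StateIndex:
    """Toy in-memory index of labeled system-state vectors (hypothetical)."""

    def __init__(self):
        # Each entry pairs a state vector with the failure cause recorded for it.
        self._entries = []

    def add(self, vector, cause):
        """Index a state as it is observed, together with its known cause."""
        self._entries.append((tuple(vector), cause))

    def nearest(self, vector, k=1):
        """Return the k indexed states closest to `vector` (Euclidean distance)."""
        ranked = sorted(self._entries, key=lambda e: math.dist(e[0], vector))
        return ranked[:k]


# Hypothetical states: three normalized monitoring metrics per observation.
index = StateIndex()
index.add([0.9, 0.1, 0.0], "disk full")
index.add([0.1, 0.8, 0.2], "network timeout")

# A newly observed state is matched against previously indexed ones.
match = index.nearest([0.85, 0.15, 0.05])[0]
print(match[1])
```

A linear scan like this only serves to show the lookup; how states are actually encoded and searched is left to the proposal itself.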

Notes and Changes:
Jing Tie's proposal for the CS Fellowship program

Supported by the National Science Foundation and the U.S. Department of Energy's Office of Science
