Known Unknowns in Large-Scale System Monitoring

This paper is available in a repository.

Preprint: policy unknown
Postprint: policy unknown
Published version: policy unknown

Abstract

This paper addresses a central challenge in PRISM, a large-scale distributed monitoring system: coping with the uncertainties and ambiguities introduced by network and node failures. In particular, in a large-scale monitoring system, such failures interact badly with techniques needed for scalability like hierarchy, arithmetic filtering, and temporal batching. For example, if a monitoring subtree is silent over an interval, it is difficult to distinguish between two cases: (a) the subtree has sent no updates because the inputs have not significantly changed or (b) the inputs have significantly changed but the subtree is unable to transmit its report. As a result, reported results can be arbitrarily far from their true values. To address this challenge, PRISM introduces Network Imprecision (NI), a new metric to characterize accuracy despite node failures, network disruptions, and system reconfigurations. PRISM leverages NI to flag potentially inaccurate results, allowing applications to differentiate between known-correct and likely-erroneous results as well as to correct distorted results by applying several redundancy techniques. Evaluation of our PRISM prototype shows that NI effectively flags inaccurate query results while incurring low overheads, and we find that using NI to automatically select the best results can reduce the inaccuracy in a PRISM-based monitoring service by nearly a factor of five.
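
The silent-subtree ambiguity described in the abstract can be illustrated with a minimal sketch. It is not PRISM's code and does not compute NI, whose definition is not given here. The class `FilteredSensor`, its fields, and the threshold and input values are all hypothetical, chosen only to show why a parent cannot tell case (a) from case (b) when arithmetic filtering is in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FilteredSensor:
    """Hypothetical leaf node that applies arithmetic filtering."""
    threshold: float             # report only if the value moved by more than this
    last_reported: float = 0.0   # last value the parent successfully received
    link_up: bool = True         # False models a network or node failure

    def observe(self, value: float) -> Optional[float]:
        """Return the update the parent receives, or None if it hears nothing."""
        if abs(value - self.last_reported) <= self.threshold:
            return None          # case (a): change is filtered as insignificant
        if not self.link_up:
            return None          # case (b): significant change, report is lost
        self.last_reported = value
        return value


# Two nodes that start from the same reported value of 100.0 (assumed numbers).
quiet = FilteredSensor(threshold=5.0, last_reported=100.0)
cut_off = FilteredSensor(threshold=5.0, last_reported=100.0, link_up=False)

quiet_update = quiet.observe(102.0)      # small change, suppressed by the filter
cut_off_update = cut_off.observe(500.0)  # large change, lost to the failure

# The parent hears nothing from either node, so it keeps its cached value.
parent_estimate = 100.0
error_quiet = abs(102.0 - parent_estimate)    # bounded by the filter threshold
error_cut_off = abs(500.0 - parent_estimate)  # unbounded in general

print(quiet_update, cut_off_update)
print(error_quiet, error_cut_off)
```

Both calls return `None`, so the parent's view is identical in the two cases even though the true error is within the filter threshold in one and far outside it in the other. This is the gap that a metric such as NI is meant to expose.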