comparemela.com


Seeing Like an SRE: Site Reliability Engineering as High Modernism

Rik Farrow

I recently spent some time trying to write a set of general guidelines for what to monitor in a software system. I came up with this list:

- Latency distribution and successful/unsuccessful request counts (plus error types) for all RPCs served.
- Latency distribution and success rate for all other services depended on, as well as circuit breakers tripping.
- Monitor the last success time for anything that’s supposed to happen periodically.
- Percentage utilisation for resources (quotas, rate limits, physical and logical system resources), as well as saturation signals for the same, and errors or timeouts.
- How many instances are up and healthy/unhealthy, restarts, running versions of binaries.
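A few of the guidelines above can be sketched in code. The following is a minimal, in-memory illustration in Python and not a real monitoring library: the names RpcStats, PeriodicJob, and utilisation_pct are hypothetical, the percentile uses a simple nearest-rank method, and the staleness threshold of twice the expected interval is an assumed default. It covers latency distribution and outcome counts for a served RPC, last success time for a periodic task, and percentage utilisation of a resource.

```python
import math
import time
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RpcStats:
    """Hypothetical per-RPC stats: latency samples plus outcome counts."""
    latencies_ms: List[float] = field(default_factory=list)
    outcomes: Counter = field(default_factory=Counter)  # "ok" or an error type

    def record(self, latency_ms: float, error_type: Optional[str] = None) -> None:
        # Every request contributes a latency sample and exactly one outcome.
        self.latencies_ms.append(latency_ms)
        self.outcomes[error_type or "ok"] += 1

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile of recorded latencies, p in (0, 100]."""
        if not self.latencies_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    def success_rate(self) -> float:
        total = sum(self.outcomes.values())
        if total == 0:
            raise ValueError("no requests recorded")
        return self.outcomes["ok"] / total


@dataclass
class PeriodicJob:
    """Tracks the last success time of something that should run periodically."""
    expected_interval_s: float
    last_success: Optional[float] = None

    def mark_success(self, now: Optional[float] = None) -> None:
        self.last_success = time.time() if now is None else now

    def is_stale(self, now: Optional[float] = None, slack: float = 2.0) -> bool:
        # Assumed rule: stale if it never succeeded, or if the last success
        # is older than `slack` times the expected interval.
        now = time.time() if now is None else now
        if self.last_success is None:
            return True
        return now - self.last_success > slack * self.expected_interval_s


def utilisation_pct(used: float, limit: float) -> float:
    """Percentage of a quota, rate limit, or other resource currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit
```

In a real system these values would more likely be exported through an existing metrics library as histograms, counters, and gauges, with alerting on the staleness and utilisation signals handled by the monitoring backend.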

© 2025 Vimarsana
