Rik Farrow
I recently spent some time trying to write a set of general guidelines for what to monitor in a software system. I came up with this list:
Latency distribution and successful/unsuccessful request counts (plus error types) for all RPCs served.
Latency distribution and success rate for every service this one depends on, plus a count of circuit-breaker trips.
Time of the last success for anything that’s supposed to happen periodically.
Percentage utilisation for resources (quotas, rate limits, physical and logical system resources), as well as saturation signals for the same, and errors or timeouts.
Number of instances up, split into healthy and unhealthy, plus restart counts and the versions of the binaries running.
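To make a few of these items concrete, here is a minimal sketch in Python of how the first, third, and fourth might be recorded in-process. It is illustrative only: the names `RpcMetrics`, `PeriodicJob`, and `utilisation_pct`, the bucket boundaries, and the staleness slack factor are all assumptions, not part of any particular monitoring library. A real system would more likely export these values through an existing metrics client.

```python
import time
from bisect import bisect_left
from collections import Counter


class RpcMetrics:
    """Request counts by outcome plus a fixed-bucket latency histogram."""

    def __init__(self, bucket_bounds_s=(0.005, 0.025, 0.1, 0.5, 2.5)):
        # Hypothetical bucket upper bounds in seconds; tune per service.
        self.bounds = tuple(bucket_bounds_s)
        # One bucket per upper bound, plus a final overflow bucket.
        self.buckets = [0] * (len(self.bounds) + 1)
        # Outcome label -> count, e.g. "ok", "timeout", "bad_request".
        self.outcomes = Counter()

    def observe(self, latency_s, outcome="ok"):
        # bisect_left returns the first bucket whose bound is >= latency_s.
        self.buckets[bisect_left(self.bounds, latency_s)] += 1
        self.outcomes[outcome] += 1

    def success_rate(self):
        total = sum(self.outcomes.values())
        return self.outcomes["ok"] / total if total else None


class PeriodicJob:
    """Tracks the last success time of something that should recur."""

    def __init__(self, expected_interval_s):
        self.expected_interval_s = expected_interval_s
        self.last_success = None  # never succeeded yet

    def mark_success(self, now=None):
        self.last_success = time.time() if now is None else now

    def is_stale(self, now=None, slack=2.0):
        # Stale if it never succeeded, or if more than `slack` intervals
        # have passed since the last success (slack is an assumed default).
        now = time.time() if now is None else now
        if self.last_success is None:
            return True
        return now - self.last_success > slack * self.expected_interval_s


def utilisation_pct(used, limit):
    """Percentage of a quota, rate limit, or resource currently in use."""
    return 100.0 * used / limit
```

Recording a latency histogram rather than a single average keeps the tail visible, and storing the last success time (rather than only counting failures) also catches a periodic job that has silently stopped running.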