About the last time I didn’t set up a dashboard and alarms for my project

Marco Suma
3 min read · Jul 31, 2022

The last time I didn’t set up a dashboard and alarms for my project, I was working at Amazon Web Services on a product called CloudWatch. CloudWatch offers customers a cloud-based monitoring service so that they can quickly put together their own monitoring, either custom or off-the-shelf.

Working on CloudWatch meant you had to have monitoring in your DNA.

One night I was on call and a company-wide “Game Day” was planned, in which an entire Availability Zone (AZ) would be purposely shut down to test the robustness of our services.

To give some more context: the engineering model at AWS works a bit differently from companies like Google or Meta. As a SWE (Software Engineer), you take care of the whole lifecycle of your service, including re-purposing the hosts it runs on, and you even build your own CD pipeline to release software to your servers.

My team was new and did not have a good dashboard to measure CPU and memory consumption at the host level. As the on-call engineer, I was asked whether I could ack and confirm that our service was ready to lose an entire AZ. I was unprepared to answer and had no quick place to look; moved by faith and laziness, I answered “yes, we are ready”.

You can imagine we were not… [Murphy’s Law]

In the middle of the night, my pager went off. An alarm fired because of an elevated number of failures while serving customer requests.

A quick RCA (root cause analysis) showed that the CPU of our servers was spiking to 100%, which was obviously causing failures due to machine overload. Thankfully that unrelated alarm fired, but the truth is that we were completely blind to these types of events. In the middle of the night I had to re-purpose new servers and wait for the Game Day to end.

That lazy “yes, we are ready” answer cost me a sleepless night and a high-impact event (also known as a SEV), and most importantly it caused customer impact.

As an engineer who learns from mistakes, that was the last time I took such questions lightly and, most importantly, the last time I was caught unprepared without a proper dashboard and a set of strong alarms.

So here’s my question to you as a software engineer: when was the last time you didn’t build a dashboard and set up proper alarms for your service? Hopefully it was a long time ago, but maybe it wasn’t. So here I am to give some high-level suggestions on setting up metrics, dashboards and alerts to guarantee high availability of your service (even if it is internal):

  • Make sure your service / product is publishing metrics somewhere
  • Be organized with your metrics; ideally they should follow a common, standard naming convention across your entire org
  • Analyze what can go wrong with your service, then group the metrics that help you answer whether it is healthy or not
  • Create a solid set of alarms watching those metrics (see the sketch after this list)
  • Be prepared to learn from mistakes
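To make the first and fourth points concrete, here is a minimal sketch of how a service could publish a custom host-level metric and alarm on it, using CloudWatch via boto3 in Python. The namespace, metric name, dimensions and SNS topic ARN are hypothetical placeholders, not values from the story above; your own naming convention and thresholds will differ.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# 1. Publish a custom host-level metric (hypothetical namespace and dimension).
cloudwatch.put_metric_data(
    Namespace="MyService",
    MetricData=[
        {
            "MetricName": "HostCPUUtilization",
            "Dimensions": [{"Name": "Host", "Value": "host-1"}],
            "Value": 73.5,
            "Unit": "Percent",
        }
    ],
)

# 2. Create an alarm on that metric: page on-call if average CPU stays
#    above 80% for 5 consecutive one-minute periods (placeholder SNS topic ARN).
cloudwatch.put_metric_alarm(
    AlarmName="MyService-HighHostCPU",
    Namespace="MyService",
    MetricName="HostCPUUtilization",
    Dimensions=[{"Name": "Host", "Value": "host-1"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],
)
```

In practice you would publish such metrics from an agent or from the service itself on a regular cadence, and manage the alarms as code (CloudFormation, CDK, Terraform) rather than creating them ad hoc, so they evolve together with the service.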

Monitoring is like gardening: you have to take care of it constantly, prune old alarms, add new ones as your product evolves, update dashboards and so on.

Overall, monitoring does not always prevent a problem, but it definitely helps minimize it. We can talk about testing, which certainly helps with prevention, another time.
