Warnings are everywhere. Warning signs on the road. Warning labels on nearly everything (thanks, California!). Family and friends warn you about decisions you are about to make, your computer’s battery is low, your check engine light is on… and there is a good chance that you have ignored all of them.
We are adults, we know better. We got to where we are by knowing which warnings to listen to and which ones not to. My guess? Being a teenager ruined it for us. We were surrounded by a sea of warnings and, just like ocean water, never able to consume any of it.
Warnings in logs and monitoring
Coming back to reality, let’s talk about warnings in monitoring and log messages in general. Think about all the warnings you see in your day-to-day job. Do you pay attention to them? For me, I generally ignore them. They all become noise. I am pretty busy with many different things, and investigating warnings that no one else seems to care about can be a waste of my time.
The reasons I generally hear are “We will know before it becomes a problem” or “We will investigate when we have time”. In the end, this is never the case. Even worse, when something fails and there was a warning, the question can be “Why didn’t you look into it?”
What warnings really mean…
When I think about how warnings have been used, the reality is that they are for “other people”. I am not talking about everyone but me, but that everyone views warnings as everyone else’s problem, or rather, responsibility. When I warn others, I am making the problem bigger than it really is. After the warning comes true, sometimes I hear “everyone was busy.” Or even worse, that email I sent out warning everyone, which no one acknowledged? I should have put it in the proper ticketing system.
The same thing can happen with reports. Let’s say you are a consultant listing security concerns in an environment. If the client is looking to become PCI compliant, then your report is more of an actual “error”. They will view everything you put down as needing to be done. If there is no real business need yet, the report is a “warning”. They will look at each and every item, weigh it against the cost and the resources available, and decide based on their gut what should be done. The end result is a shadow of the original recommendation.
Warnings should never be used for blame
Never use warnings as a way to blame someone. Never. Ever. You don’t listen to warnings, and I don’t listen to warnings. Expecting someone else to pay attention to a warning log message, or to a warning alert that also has a critical alert, is a bad idea. It is only asking for trouble, demotivation, and people focusing on all the wrong problems.
I am not saying to ignore warnings completely (I will get into that in the next section), but you do have to understand that no one pays attention to them, even if they should. Take your check engine light: chances are you have one on. And you do not care about it, because either you know what it is or your car still runs fine. I have even heard the phrase “I will worry about it when it turns off.”
Let’s use a simple and typical monitoring alert we can all relate to. Let’s say we have a warning (does NOT alert) at 75% used, and a critical (pages the on-call) at 90%. The warning was firing throughout the day, and no one looked into it. The critical alert paged the on-call in the evening. Let’s say they did not respond in time, and the critical application on this server went down.
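To make the setup concrete, here is a minimal sketch of that two-threshold alert. The 75% and 90% values come from the example above; everything else (the names `classify` and `should_page`, and the readings over the day) is hypothetical and only there for illustration.

```python
from enum import Enum


class Level(Enum):
    OK = "ok"
    WARNING = "warning"    # recorded only, nobody is notified
    CRITICAL = "critical"  # pages the on-call


# Thresholds from the example, as percent used.
WARNING_AT = 75.0
CRITICAL_AT = 90.0


def classify(percent_used: float) -> Level:
    """Map a usage reading to a level using the two thresholds."""
    if percent_used >= CRITICAL_AT:
        return Level.CRITICAL
    if percent_used >= WARNING_AT:
        return Level.WARNING
    return Level.OK


def should_page(level: Level) -> bool:
    """Only a critical reading reaches a human."""
    return level is Level.CRITICAL


# Hypothetical readings over one day: (hour, percent used).
readings = [(9, 72.0), (11, 76.5), (14, 81.0), (17, 86.0), (21, 91.5)]

for hour, used in readings:
    level = classify(used)
    action = "page on-call" if should_page(level) else "no notification"
    print(f"{hour:02d}:00 {used:5.1f}% {level.value:8s} {action}")
```

With readings like these, the warning level is reached three times during the day, yet the first time anyone is notified is the evening page.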
I have heard plenty of bad responses to this situation: create alerts for warnings, lower the critical threshold so we have more time to respond, or tell everyone working during the day to “do better” at paying attention to warnings.
The same goes for log messages. I have seen a warning (and specifically not an error or fatal) message show up. After experiencing a bug with the application, we go to the logs and find a warning pointing at the issue that no one noticed. Once again, the discussion turns to “how do we pay attention to this in the future?”
Warnings should tell a story
Warnings do have their place, but not in the way we currently use them. When something finally breaks, warnings should only be used to tell how it broke. What were the things that were missed? Warnings are a story and should be treated as such.
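As a rough sketch of what reading warnings as a story could look like after a failure, the snippet below pulls the warnings that led up to the first error into a timeline. The `LogEvent` type, the `story_before_failure` function, the 24-hour window, and the sample events are all hypothetical and not tied to any particular logging system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEvent:
    when: datetime
    level: str    # e.g. "WARNING" or "ERROR"
    message: str


def story_before_failure(events, window=timedelta(hours=24)):
    """Return the warnings that preceded the first error, oldest first."""
    ordered = sorted(events, key=lambda e: e.when)
    failure = next((e for e in ordered if e.level == "ERROR"), None)
    if failure is None:
        return []  # nothing broke, so there is no story to tell
    start = failure.when - window
    return [
        e for e in ordered
        if e.level == "WARNING" and start <= e.when < failure.when
    ]


# Hypothetical events, deliberately out of order.
events = [
    LogEvent(datetime(2024, 1, 1, 21, 0), "ERROR", "application stopped"),
    LogEvent(datetime(2024, 1, 1, 15, 0), "WARNING", "usage at 84%"),
    LogEvent(datetime(2023, 12, 30, 10, 0), "WARNING", "usage at 75%"),
    LogEvent(datetime(2024, 1, 1, 8, 0), "WARNING", "usage at 76%"),
]

story = story_before_failure(events)
for event in story:
    print(event.when.isoformat(), event.message)
```

The point of the timeline is to reconstruct how things broke, not to find out who should have been watching.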
In both the alerting and the log examples above, the RCA around warnings should be more about processes and long-term solutions:
- What made someone/everyone decide on the warning level?
- What were others working on that caused it to be missed?
- Is it, or are other warnings, adding to the overall noise?
- Is there ANY other way to make a warning more meaningful?
Only when a missed warning is viewed as a side effect of something else does it actually become beneficial. Never address warnings directly.