Category Archives: Design

Pertaining to software design

Warning about Warnings

Warnings are everywhere. Warning signs on the road. Warning labels nearly everywhere (thanks, California!). Family and friends warn you about decisions you are going to make, your computer’s battery is low, your check engine light is on… and there is a chance that you have ignored all of them.

We are adults, we know better. We got to where we are by knowing which warnings to listen to and which ones not to. My guess? Being a teenager ruined it for us. A sea of warnings, and just like ocean water, we are never able to consume it.

Warnings in logs and monitoring

Coming back to reality, let’s talk about warnings in monitoring and log messages in general. Think about all the warnings you see in your day-to-day job. Do you pay attention to them? For me, I generally ignore them. They all become noise. I am pretty busy with many different things, and investigating warnings that no one else seems to care about can be a waste of my time.

The reasons I generally hear are “We will know before it becomes a problem” or “We will investigate when we have time”. In the end, this is never the case. Even worse, when something fails and there was a warning, the question can be “Why didn’t you look into it?”

What warnings really mean…

When I think about how warnings have been used, the reality is that they are for “other people”. I do not mean everyone but me; I mean that everyone views warnings as someone else’s responsibility. When I warn others, I am seen as making the problem bigger than it really is. After the warning comes true, sometimes I hear “everyone was busy.” Or even worse, that email I sent out warning everyone, which no one acknowledged? I should have put it in the proper ticketing system.

The same thing can happen with reports. Let’s say you are a consultant listing security concerns in an environment. If the client is looking to become PCI compliant, then your report is more of an actual “error”: they will view everything you put down as needing to be done. If there is no real business need yet, the report is a “warning”. They will look at each and every item, weigh it against finances and resources, and decide based on their gut what should be done. The end result is a shadow of the original recommendation.

Warnings should never be used for blame

Never use warnings as a way to blame someone. Never. Ever. You don’t listen to warnings, and I don’t listen to warnings. Expecting someone else to pay attention to a warning log message, or to a warning alert that also has a critical alert, is a bad idea. It is only asking for trouble, demotivation, and people focusing on all the wrong problems.

I am not saying to completely ignore warnings (I will go into that in the next section), but you do have to understand that no one pays attention to them, even if they should. Take your check engine light: chances are you have one on. And you do not care about it, because either you know what it is or your car still runs fine. I have even heard the phrase “I will worry about it when it turns off.”

Let’s use a simple and typical monitoring alert we can all relate to. Let’s say we have a warning (does NOT alert) at 75% used, and a critical (pages the on-call) at 90%. The warning was firing throughout the day, and no one looked into it. The critical alert paged the on-call in the evening. Let’s say they did not respond in time, and the critical application on this server went down.
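To make the setup concrete, here is a minimal sketch of those two thresholds. The function name, the returned labels, and the sample readings are all made up for illustration; a real monitoring system would express this in its own configuration.

```python
WARNING_THRESHOLD = 75   # percent used; visible somewhere, but does NOT page
CRITICAL_THRESHOLD = 90  # percent used; pages the on-call


def classify_usage(percent_used):
    """Return (level, pages_oncall) for a single usage reading."""
    if percent_used >= CRITICAL_THRESHOLD:
        return "critical", True
    if percent_used >= WARNING_THRESHOLD:
        return "warning", False
    return "ok", False


# Hypothetical readings over one day: the warning fires for hours
# before anyone is actually paged.
readings = [60, 76, 81, 85, 88, 91]
for value in readings:
    level, pages = classify_usage(value)
    print(f"{value}% used -> {level} (pages on-call: {pages})")
```

Notice that four of the six readings are warnings, and none of them reaches a person.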

I have heard plenty of bad responses in this situation: create alerts for warnings, lower the critical threshold so we have more time to respond, or insist that everyone working during the day needs to “do better” at paying attention to warnings.

The same goes for log messages. I have seen a warning (and specifically not an error/fatal) message show up. After experiencing a bug with the application, we go to the logs and see this warning, which pointed to the issue and which no one noticed. Once again, the discussion goes to “how do we pay attention to this in the future?”

Warnings should tell a story

Warnings do have their place, but not in the way we currently use them. When something finally breaks, warnings should only be used to tell how it broke. What are the things that were missed? Warnings are a story and should be treated as such.

In both examples above, the root cause analysis (RCA) around warnings should focus on process and long-term solutions:

  • What made someone/everyone decide on the warning level?
  • What were others working on that caused it to be missed?
  • Is this warning, or others like it, adding to the overall noise?
  • Is there ANY other way to make a warning more meaningful?

Only when missed warnings are viewed as a side effect of something else does looking at them actually become beneficial. Never address warnings directly.
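As one way to picture “warnings as a story”, here is a minimal sketch that pulls the warnings from the hours before an incident into a timeline for the RCA. The log format, the sample lines, and the function name are assumptions made for illustration, not taken from any particular logging system.

```python
from datetime import datetime, timedelta

# Hypothetical log lines in an assumed "<ISO timestamp> <LEVEL> <message>" format.
LOG_LINES = [
    "2024-01-10T09:15:00 INFO service started",
    "2024-01-10T11:02:00 WARNING disk usage at 76%",
    "2024-01-10T15:40:00 WARNING disk usage at 84%",
    "2024-01-10T19:55:00 CRITICAL disk usage at 91%",
    "2024-01-09T08:00:00 WARNING certificate expires in 30 days",
]


def warning_timeline(lines, incident_time, lookback_hours=12):
    """Return WARNING entries from the window before the incident, oldest first."""
    window_start = incident_time - timedelta(hours=lookback_hours)
    timeline = []
    for line in lines:
        stamp, level, message = line.split(" ", 2)
        when = datetime.fromisoformat(stamp)
        if level == "WARNING" and window_start <= when <= incident_time:
            timeline.append((when, message))
    return sorted(timeline)


incident = datetime(2024, 1, 10, 20, 0)
for when, message in warning_timeline(LOG_LINES, incident):
    print(when.isoformat(), message)
```

The output is only read after the fact, as part of telling how things broke. Nobody is expected to have been watching it live.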

Think of Application Design like a big city

Countless times I have heard others view an application as complete and well thought out. When they look at the code, they do so with the assumption that every single line is intentional and in the proper place. I have had conversations where I try to explain the design and how well it works, and then the conversation deviates into explaining (and justifying) why a certain piece of code contradicts the design.

Applications and their designs are more akin to big cities. Cities (except maybe Dubai) are not built from scratch to become a city. Let’s look at the following:

Source: https://github.com/git/git/blame/master/commit.c

While I am not questioning the design of git itself (no reason to… yet), this is a great example of code that is layered over time. These first few lines span over a decade. In terms of a city, this file is an ancient district with a lot of history. There are lines that were added years ago that still exist. There are new lines that were added for various reasons recently. I bet there were lines removed for certain reasons a while back. Regardless, there is a story here to tell, just as there is with any big city.

The git code as a whole is just as rich with history, if not more. Some code is old and untouched, and some is newly created. While the goal of it all is to function, not all of it would be done the same way if it were built today.

Treat designs after the fact like city ordinances

Applications will invariably change over time the longer they are up and running. Whether the business is asking for a new feature, libraries need upgrading, or APIs are changing, there will be times when the design changes. As with every need, it is not feasible to re-design the application for every change.

Think of application “re-designs” as city ordinances. As a city grows, as time passes, as new needs arise or technology evolves… cities will pass new laws or ordinances. The goal, or at least the advertised goal, is to help improve the quality of life for everyone. When ordinances are passed, they usually include a grandfather clause, so whatever already exists is allowed to stay as it is for now.

When it comes to your application, you don’t need to re-write everything immediately. When deciding to go in the new direction, you should do the following:

  • Implement the new design in the section(s) you are working
  • Pay close attention to PRs to make sure they are using the new design
  • When the sections of the old design become manageable, work to remove the rest.

Example: A quick fix becomes a standard

Imagine that, in an application, I need to pull information from a new endpoint. Simple enough. I might not think much of it and just add the following to the code:

response = requests.get('https://badcure.io/v1/info')
my_needed_info = response.json()['info']['status']

Over time, I or others find we need the same information. So we copy those simple lines over to wherever we need them. Simple. No problems, and we have more important things to do than care about design.

Just as with any big city, this thing that “just worked” starts to break down and cause problems. The API is unstable at the scale we use it at. We have this code in many places. Handling API failures, and our growing need to be stable, requires us to now address this.

Example: Solving the problem as an “ordinance”

Sometimes, it may be quicker to “change all the things!” Sometimes, due to other needs and scope, it is better to go slow. So in the example above, let’s say this code is spread throughout the entire application and you don’t have time to fix it all. As with an ordinance, change what you need as you go forward. What I would do is first create an easy-to-use function which solves the problem:

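import time

import requests


def safe_info():
    """Return the status from the info endpoint, retrying on request failures."""
    retry_count = 3
    while retry_count > 0:
        retry_count -= 1
        try:
            response = requests.get('https://badcure.io/v1/info')
            response.raise_for_status()
            return response.json()['info']['status']
        except requests.RequestException:
            # Out of retries: re-raise the last error to the caller.
            if not retry_count:
                raise
            # Otherwise wait a few seconds before the next attempt.
            time.sleep(3)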

Safe enough. This function retries automatically and still throws an exception if it continues to fail. It also returns exactly what we need. Going forward, we should use this.
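If you want to check the retry behavior without hitting the real API, one option is to pull the loop into a small helper that takes the call as an argument. This is only a sketch of that idea: `call_with_retries`, its parameters, and the `flaky_status` stand-in are hypothetical names, not part of the function above.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=3, sleep=time.sleep,
                      retry_on=(Exception,)):
    """Call func(); retry on failure and re-raise once attempts run out."""
    for remaining in range(attempts - 1, -1, -1):
        try:
            return func()
        except retry_on:
            if remaining == 0:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)  # pause before the next attempt


# Stand-in for the unstable API: fails twice, then succeeds.
calls = {"count": 0}


def flaky_status():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated API failure")
    return "ok"


# Passing a no-op sleep keeps the example from actually waiting.
print(call_with_retries(flaky_status, sleep=lambda seconds: None))
```

The real lookup could then be passed in as `func`, with `retry_on` narrowed to the request exceptions you actually expect.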

Now, as we write code, we use this function. As we review PRs, we make sure others are using it over the old code. As you make changes around the old sections, change that code too. New and junior employees may not be aware, so make sure they don’t use the old way. This would be your responsibility as the lead developer.

Eventually, take care of the rest as you have time and the pending work becomes manageable.
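To know when the pending work has become manageable, it helps to count what is left. Below is a rough sketch that scans source text for the old direct call. The regular expression, the function names, and the `allowed` parameter are assumptions for illustration, and a plain text search will miss calls that are formatted differently.

```python
import re
from pathlib import Path

# Assumption: the "old way" can be spotted as a direct requests.get() call
# against the info endpoint from the example above.
LEGACY_PATTERN = re.compile(
    r"requests\.get\(\s*['\"]https://badcure\.io/v1/info['\"]"
)


def find_legacy_calls(source_by_name, allowed=()):
    """Return (name, line_number) pairs where the old pattern still appears."""
    hits = []
    for name, source in source_by_name.items():
        if name in allowed:
            continue  # e.g. the module that defines safe_info() itself
        for number, line in enumerate(source.splitlines(), start=1):
            if LEGACY_PATTERN.search(line):
                hits.append((name, number))
    return hits


def load_sources(root):
    """Read every .py file under root into a {path: text} mapping."""
    return {str(path): path.read_text(encoding="utf-8")
            for path in Path(root).rglob("*.py")}


# In-memory example with hypothetical file names.
example = {
    "billing.py": "response = requests.get('https://badcure.io/v1/info')\n",
    "reports.py": "status = safe_info()\n",
}
print(find_legacy_calls(example))
```

Run against a real checkout with `find_legacy_calls(load_sources("."), allowed={...})`, the list should shrink over time until removing the rest is a small task.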

In the end: Balance the needs

You don’t HAVE to follow this habit. Sometimes it would be beneficial to do it all in one fell swoop. The main thing to consider: Is your process the right one or is your process one where others may question your future decisions?

If people will question your quality, doing it all at once may be the answer. If people are questioning your time/effort, take a slower approach. The goal is to be able to keep designing and redesigning. Try to keep the business agreeing with your best judgement instead of judging your previous efforts.