Existing by coincidence, programming deliberately
As engineers we spend a lot of our time debugging problems, yet it's rarely taught as a skill in its own right. Some bugs are difficult enough that they can seem borderline impossible to solve, especially for devs toward the junior end of the spectrum. There's no worse feeling than being stuck on a hard problem, not knowing how to proceed. Of course, the right thing to do if you're stuck like that is ask for help: from your team, from other engineers in your org or social circle, from random strangers on the internet. As a random stranger on the internet, then, I offer this post as my attempt to help get you unstuck if you find yourself in that situation.
Tangent: It actually started as my attempt to write a "things I've learned from 25 years as an engineer" kind of post like some others I've seen recently. But it turns out I'm more inclined towards concrete, practical advice than I am to deep, philosophical wisdom. 🤷
The post is written as an ordered list but not every problem necessitates all the steps. Sometimes the correct fix arrives in your mind unbidden at step 1 or, even better, step 0! Other times you can skip a few steps, or do them in a different order. But broadly speaking, the order here is a framework I've gradually settled on since my first job working on a resource-handling module for a GSM base station controller frame at Lucent Technologies, back in 1997. In the intervening years I've worked in many different environments: systems programming, databases, desktop apps, web apps, backend and frontend. The steps are generalised and applicable across all of those, they're not specific to a particular language or paradigm.
The hardest problems often appear at times of greatest pressure. Something is broken in production and paying customers are complaining about it. Maybe they're asking for refunds. Your boss wants to know how long it will take to fix and you don't even know what's wrong yet.
If all that's going on, you're probably stressed, and stress will cause you to solve the problem more slowly, not faster. So before getting to the obvious step 1, we need to take care of step 0 first. It pays to make sure you're in a good frame of mind. Try to relax, be calm. Your production system might be down for an hour, but that's better than it being down for many hours because you rushed into the wrong action.
Equally important is being confident and optimistic in your outlook. Programming is not magic, systems follow rules even when those rules are mysterious and unknown to us. Each problem has a rational cause and resolution, which you'll discover in time. So persevere, don't give up.
Lastly, be honest with yourself about the problem. Don't kid yourself that you know something to be true if it's only an assumption. Test those assumptions because they will often surprise you. It's okay not to understand all parts of the problem at all times, as long as you acknowledge the parts you don't understand yet. Keep them in mind but compartmentalise and come back to them later. Divide and conquer.
Reproducing the problem seems such an obvious first step that it's almost not worth mentioning. It should be everyone's step 1, but I've often been surprised, in conversations with engineers, to learn that they never reproduced an issue themselves.
It's not enough to work from someone else's description of a bug, or what you think the problem is. Remember, you need to test your assumptions and there's no greater assumption than whether a problem exists as described or what the steps are to make it happen. Prove you understand those correctly first.
Tangent: In my second job, at Transoft, I worked on a text editor and received a bug report from the QA team about an "infinite loop" when right-clicking to bring up the context menu. I couldn't reproduce it so asked them to show me. The "infinite loop" turned out to be them right-clicking in a different area of the screen and expecting that to close the menu. But the software was working as intended, closing the original menu and opening a fresh one at the new click location. So the "bug" was really just a gap in expectations.
Great, so you reproduced the problem. But did you really or was it just a coincidence? Bugs can sometimes be the product of many interleaved factors and if you only have one data point, you can't be certain that you understand the root cause(s).
Reproducing it a second time can rule out the possibility that you made a silly mistake the first time round and increases your confidence that you're on the right path. Confidence, if it's tempered by honesty, is your best friend in this process. But it's also a delicate flower, so protect it at all costs; don't let anything trample over it.
If you know how to reproduce an issue, do you also know how not to reproduce it? That is to say, do you know which variables are at play in determining whether the problem occurs?
Experiment with those variables, change them and prove their significance. This can lead to reducing your steps to reproduce, which is absolutely what you want to do at this stage. It's not enough that you can reliably reproduce the problem, you want to isolate it to the fewest number of steps or the smallest amount of data.
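It can help to capture the suspected variables in a tiny throwaway harness so you can flip one at a time and record the outcome. A minimal sketch in Python; the price-parsing bug and its inputs are invented purely for illustration:

```python
# Stand-in for the code under investigation: it only handles
# comma-as-decimal-separator input when told the locale is "de".
def parse_price(text, locale="en"):
    if locale == "de":
        text = text.replace(".", "").replace(",", ".")
    return float(text)

def reproduces(text, locale):
    """Return True if the bug shows up for this combination of inputs."""
    try:
        return parse_price(text, locale) != 1234.56
    except ValueError:
        return True

# Flip one variable at a time to learn which ones actually matter.
cases = [
    ("1234.56", "en"),   # baseline: no bug
    ("1.234,56", "en"),  # change only the input format: bug appears
    ("1.234,56", "de"),  # change only the locale: bug disappears
]
results = {case: reproduces(*case) for case in cases}
```

Each case changes a single variable relative to its neighbour, so the results tell you directly which factors are significant.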
Now you're at the point where it's okay to look at code and try to figure out what's wrong, because now you really understand the nature of the problem.
Apply your knowledge of the variables at play to the system in front of you. What code operates those variables? How do they interact? If there's code you don't understand, try to find the person or team who worked on it. They'll be able to shortcut your path to enlightenment and perhaps they've even encountered issues like yours before.
Sometimes the code originates from opaque third-party sources. If you don't have access to those sources, there are still avenues of investigation open to you. Read the API reference or other documentation, and search the bug database if there is one. Are there related questions on Stack Overflow or elsewhere?
Tandetgent: In the early 2000s I worked on an application framework that operated as a Binary Behavior for Internet Explorer 6. That meant using a number of IE and Windows APIs which had limited documentation. Whenever reality failed to match our expectations for those APIs, we'd resort to searching usenet or other online forums for an answer. More often than not, when we eventually found the right answer it was posted by a mysterious genius with the name "Igor Tandetnik". It wasn't long before we started prefixing all our search terms with "Igor Tandetnik" by default. As a debugging accelerator, that totally worked.
After reasoning about the code in its static form, look at the dynamic state in memory when the problem occurs (before, during and after).
How you do this is up to you. Earlier in my career I preferred to use a debugger, but mostly these days I'll just print values to the console. Debuggers are great, but for certain classes of problem (e.g. concurrency, UI events) they are observation-as-interaction; hitting a breakpoint can itself change the conditions of the code you're trying to debug. Logging can be a more reliable tool under those conditions. On the other hand, inserting log statements gets tedious very quickly if your project has slow compile times. Pick whatever works best for the conditions.
Production logs are also there to help you in this step, don't forget to consult those. Ideally your logs are structured and searchable, so you can easily eliminate noise by using appropriate query terms. If you're not familiar with your production logging infrastructure, find someone who is and ask them to show you the ropes.
Whichever method you use, there are two types of state you're interested in: paths followed through the code and the values stored in any data. Make sure you look at both.
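A minimal sketch of what capturing both kinds of state looks like with print-style logging; the discount function and its bug (a case-sensitive comparison) are invented for illustration:

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("repro")

def apply_discount(price, customer_tier):
    # Record the path taken *and* the values in play, so one log line
    # answers both questions at once.
    if customer_tier == "gold":
        log.debug("path=gold price=%r", price)
        return price * 0.8
    log.debug("path=default tier=%r price=%r", customer_tier, price)
    return price

final = apply_discount(100.0, "Gold")  # the log reveals the unexpected path
```

Here the log line for the default path would show tier='Gold', immediately exposing the case mismatch that sent execution down the wrong branch.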
Writing stuff down, either on paper or electronically, can be a surprisingly effective analysis method. It works on two fronts, forcing you to actively consider the thing you're writing about and then later as an aide memoire when looking back at the information in your notes.
Try to resist the temptation to prematurely solutionise in those notes. If premature optimisation is the root of all evil (or at least most of it), then premature solutionisation is the root of all misdiagnosed bugs (or at least most of them). Focusing on just the things you've observed to be definitely true will help keep your assumptions and biases in check.
Force yourself to start some notes as soon as you begin to investigate a problem, even when the problem seems like it might be trivial. In the worst case, you can throw them away if they weren't useful. It can also be helpful to write them somewhere public, so other people can benefit from what you've learned and perhaps make suggestions about the problem you're working on. Transparency is a superpower.
Whenever I debug production incidents, or if I'm just performing routine maintenance on production infrastructure, I start a new thread in our #devops Slack channel and take live notes there. At the very least, these threads serve as a public record of everything I've done or observed, associated with a timestamp. Future engineers can find them using search and refer back to them if similar scenarios arise again. But on more than one occasion they've also been a trigger for helpful discussion about whatever it is I'm working on. We've fixed problems more quickly because of these threads.
Sometimes it helps to remove chunks of code so you can prove they're unrelated (or not). There are two dimensions along which you can do this: time-based and feature-based.
Time-based means using source control to gradually zero in on the changeset that introduced a bug. If you're using git, git bisect exists for exactly this purpose. It's a great weapon to have in your armoury and you should get familiar with it if you aren't already.
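git bisect run can even automate the search if you give it a script whose exit status classifies each revision. A hedged sketch of such a script in Python; the check itself is a placeholder you'd swap for your real reproduction steps:

```python
#!/usr/bin/env python3
# repro.py -- classify the current revision for git bisect.
# Usage:
#   git bisect start <bad-commit> <good-commit>
#   git bisect run python3 repro.py
import subprocess
import sys

def revision_is_good():
    # Placeholder: run your real reproduction command here and decide
    # from its output whether the bug is present at this revision.
    result = subprocess.run(
        [sys.executable, "-c", "print(2 + 2)"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() == "4"

def main():
    # git bisect run reads the exit status: 0 means "good",
    # 1-127 (except 125) means "bad", and 125 means "skip".
    return 0 if revision_is_good() else 1

if __name__ == "__main__":
    sys.exit(main())
```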
Feature-based means looking at the code and physically removing parts of it yourself. Delete it, comment it out, use conditional compilation, whatever. This is you testing your assumptions. Make sure you take baby steps when following this approach. It's too easy to change lots of things in one go and then be unsure which of them is responsible for any observed effects.
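One low-risk way to take those baby steps is to put the suspect code behind a single switch instead of deleting it outright. A sketch under invented conditions, where a caching layer is the suspect:

```python
# Hypothetical suspect: a caching layer we want to rule in or out.
# Flip this one switch to take the cache out of play, leaving
# everything else untouched -- one variable changed per experiment.
SUSPECT_CACHE_ENABLED = True

_cache = {}

def lookup(key, compute):
    if SUSPECT_CACHE_ENABLED and key in _cache:
        return _cache[key]
    value = compute(key)
    if SUSPECT_CACHE_ENABLED:
        _cache[key] = value
    return value

first = lookup("answer", lambda k: 42)
second = lookup("answer", lambda k: 43)  # returns the stale 42 while caching is on
```

If the problem vanishes with the switch off, you've implicated the cache; if it persists, you've ruled it out and can move on.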
If you focus on the same problem for too long, brain fog sets in and you become less effective. Walking away is the best thing you can do at this point but it can be hard to recognise when it's time for that. Try to consciously introspect on your performance whenever you take a step back from the coalface. Be honest in your appraisal.
I'm lucky enough to have a dog, Milo, who forces me to stop working at regular intervals so we can play or go out for walks. Those walks are sometimes the most productive part of my day, the number of times some fresh insight arrives during a walk is uncanny. If it's not the full solution, it might be some part of it or a theory that moves me one step closer.
The point is that your brain doesn't stop working on a problem just because you stopped actively thinking about it. It's still there, chugging away in the background. Give it some breathing space to do its thing.
While rewriting entire systems is rarely a good idea, rewriting small chunks of functionality can be a powerful way to uncover considerations that might otherwise hide out of sight. Sometimes you can stare at code for ages and it looks fine, but as soon as you try to re-implement it in your own terms you're confronted with tradeoffs that the original author had to make. Those tradeoffs are a great source of "aha!" moments for debugging.
It's important to point out that you're not aiming to replace the code you're rewriting here. The plan is to throw your rewrite away after it has done its job, which is purely to help you understand. Occasionally you might get lucky and discover the fix for your bug is lurking in the "replacement" code but it's best not to set out with that intention, as it can distract you from the real task at hand.
If there's one observation that's been thrown at me more than any other, both as compliment and criticism, it's that I write a lot of tests (too many for some people). But there's one kind of test I absolutely will not compromise on, and that's regression tests. They're like tech anti-debt, compound interest that pays out increasing amounts as it accumulates in your project.
Every time you fix a production bug, you should add at least one new test case to your regression suite. Things that go wrong once in software projects will often go wrong a second time. Lightning does strike twice. The easiest way to deal with that is by writing regression tests as you go. And the easiest way to be certain your regression tests really work is by writing the failing test case first, before you land the fix.
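As a concrete sketch of the habit (the rounding bug here is invented for illustration): encode the bug report as a test, watch it fail against the unfixed code, then land the fix alongside it.

```python
# Invented example: a bug report said order totals sometimes lost a
# cent. The fix accumulates in integer cents instead of floats.

def order_total_cents(prices):
    return sum(round(price * 100) for price in prices)

# Regression test, written first and seen to fail before the fix landed.
def test_total_does_not_lose_a_cent():
    # 0.1 + 0.2 != 0.3 in binary floating point, so a float-based
    # implementation is liable to drop a cent here.
    assert order_total_cents([0.1, 0.2]) == 30
```

A test runner like pytest will pick up the test_ function automatically, and it stays in the suite forever as insurance against the lightning striking twice.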
Writing tests like this is also a good way to coax any test-reluctant practitioners into contributing tested code. It's much harder for them to decline on grounds of time or effort, if you're only asking for a solitary test in their PR. Inch by inch, you can nudge them in the direction of better habits.
Eventually you'll understand the problem well enough that one or more fixes reveal themselves to you. If you know in your bones what The One True Fix is, then crack on with that, no problem. But if there's even the slightest sense of doubt, you should pause to think through your approach. If part of the solution seems clunky, it might be a sign that you're fighting against the surrounding code instead of working with it. It can be human nature to cling to our beliefs in the face of contradictory evidence, so be honest with yourself about any tradeoffs.
When the path forward is unclear, you should proactively seek alternative opinions. Don't think of uncertainty as a sign of weakness; instead your willingness to discuss it is a sign of strength. And all those discussions will pay forward to future bugs, putting you on stronger footing for challenges that lie ahead.