    Debugging

    What’s the issue?

    Write down the most basic description of the issue with

    • simple language
    • short sentences or lists
    • no assumptions

    For example:

    • some cron tasks are not running

    Common features?

    Have you received multiple reports of the same issue, or very similar issues?

    After writing down the issue above, make a separate list of the things the reports have in common. It doesn’t have to be exhaustive, and you might need to keep refining it as you go along, but it helps you spot patterns and gets your brain working on the problem in the background. For example:

    • time of day / month
    • system load / high traffic events (e.g. month end for accounting department)
    • user roles
    • new/old users
    • configuration options (user settings)
    • feature flags
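
    One way to make this list concrete is to tally the attributes of each report. This is a minimal sketch, assuming each report has been written down as a small dictionary; the field names (`role`, `feature_flag`, `hour`) and the sample values are hypothetical.

```python
from collections import Counter

# Hypothetical reports; in practice these come from your own notes or tickets.
reports = [
    {"role": "admin", "feature_flag": "new_billing", "hour": 23},
    {"role": "editor", "feature_flag": "new_billing", "hour": 23},
    {"role": "admin", "feature_flag": "new_billing", "hour": 9},
]


def common_features(reports):
    """Count how often each (field, value) pair appears across reports."""
    counts = Counter()
    for report in reports:
        counts.update(report.items())
    # Most frequent pairs first: these are candidate patterns, not proof.
    return counts.most_common()


for (field, value), count in common_features(reports):
    print(f"{field}={value}: {count} of {len(reports)} reports")
```

    A pair that shows up in every report (here the feature flag) is worth noting, but it only becomes interesting once you check whether unaffected users share it too.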

    You aren’t looking through code here, just looking for patterns. Can you find users/teams with similar setups that do not experience the same issue? Make a note of this for later!

    Information gathering

    A slightly deeper look into the issue where you will make some notes of specific information that you can use to narrow down the issue.

    • references (make a note of any affected IDs and timestamps)
    • when did the issue start? Check commits on or before this date
    • skim the codebase and casually make a note of anything that could potentially have an impact
    • known issues on any 3rd party dependencies or APIs?

    Manually check the records located around the same time of the issue. Did they experience the same issue but just didn’t report it?
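
    If the records are easy to export, the "around the same time" check could be sketched like this. The `created_at` field, the 30 minute window, and the sample records are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def records_near(records, reported_at, window=timedelta(minutes=30)):
    """Return records whose timestamp falls within `window` of the report."""
    return [
        r for r in records
        if abs(r["created_at"] - reported_at) <= window
    ]


# Hypothetical records and report time.
records = [
    {"id": 1, "created_at": datetime(2024, 5, 1, 8, 0)},
    {"id": 2, "created_at": datetime(2024, 5, 1, 8, 50)},
    {"id": 3, "created_at": datetime(2024, 5, 1, 9, 10)},
    {"id": 4, "created_at": datetime(2024, 5, 1, 12, 0)},
]
reported_at = datetime(2024, 5, 1, 9, 0)

print([r["id"] for r in records_near(records, reported_at)])  # [2, 3]
```

    The neighbours it returns are the ones to inspect by hand for the same symptom.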

    If you think the issue is in a well-tested area, with the majority of users not experiencing any issues, then you are probably looking at an edge case with a one-line (or one-character) fix! Keep your eyes peeled.

    Point of failure

    This stage is still an information gathering stage, but it’s a bit more specific. Using the information from the previous steps, try to narrow down the area/module of interest (e.g. cron jobs, race conditions, missing records) by ranking each candidate by how likely it is to be the cause.

    You should consider:

    • multiple entry points
    • issues that overlap with code without any test coverage
    • availability of relevant logs

    Also look out for:

    • complex conditionals (are you following the logic correctly?)
    • queries that rely on a number of variables
    • loops
    • events fired (are there any listeners you are missing)

    Make a hypothesis

    By this stage you have as much information as you think you need: you have selected the areas of interest and skimmed the codebase.

    Think of this like a PR summary but before you have fixed the issue.

    For example, in a web monitoring SaaS: “Users are only sometimes notified that a site has recovered because ‘downtime ended’ events use a cached value on the site record as well as a downtime log. If these two values fall out of sync, records might be updated but users will not be notified”.

    Then you could also suggest a fix: “We will no longer use the cached value and instead fire an event as a result of any downtime periods being updated. We will lock the row for updates and use database transactions to ensure notifications are fired after successful updates”.
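
    The suggested fix could look roughly like the sketch below. It is an illustration under assumptions, not the real application's code: the `downtime_log` table, its columns, and the `notify` callback are hypothetical, and it uses Python's built-in `sqlite3` module. SQLite has no row-level locks, so the sketch relies on a single atomic `UPDATE` inside a transaction; on databases that support it, a `SELECT ... FOR UPDATE` would play the row-locking role.

```python
import sqlite3


def end_downtime(conn, site_id, ended_at, notify):
    """Close the open downtime period for a site, then notify.

    `notify` is called only after the transaction has committed, and only
    if a row was actually updated.
    """
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        cursor = conn.execute(
            "UPDATE downtime_log SET ended_at = ? "
            "WHERE site_id = ? AND ended_at IS NULL",
            (ended_at, site_id),
        )
        updated = cursor.rowcount
    if updated:
        notify(site_id, ended_at)
    return updated


# Hypothetical schema and data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE downtime_log (site_id INTEGER, started_at TEXT, ended_at TEXT)"
)
conn.execute("INSERT INTO downtime_log VALUES (1, '2024-01-01T10:00:00Z', NULL)")
conn.commit()

sent = []
end_downtime(conn, 1, "2024-01-01T10:05:00Z", lambda s, t: sent.append((s, t)))
# Second call finds no open period, so no notification is sent.
end_downtime(conn, 1, "2024-01-01T10:06:00Z", lambda s, t: sent.append((s, t)))
print(sent)
```

    Driving the notification from the update itself, rather than from a separate cached value, means there is only one source of truth that can go stale.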

    Fix and repeat

    If your theory was incorrect and your fix didn’t work, go back to the previous steps and try again. This can be frustrating, especially after spending time planning and making code changes.

    Here are a few more things that can help.

    Production logging

    • send messages with important information
    • be selective (use conditionals) so you are not overwhelmed with information
    • log info at a single place where different paths could be taken
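
    As a rough illustration of the three points above, here is a sketch using Python's standard `logging` module. The task structure, the `AFFECTED_TASK_IDS` set, and the `choose_runner` function are all hypothetical.

```python
import logging

logger = logging.getLogger("cron")

# Hypothetical set of IDs taken from the bug reports.
AFFECTED_TASK_IDS = {42, 97}


def choose_runner(task):
    """Pick a runner for a task, logging the decision for affected tasks only."""
    runner = "queue" if task["retries"] > 0 else "inline"
    # One log call at the branching point, limited to the tasks under
    # investigation so production logs stay readable.
    if task["id"] in AFFECTED_TASK_IDS:
        logger.info(
            "task=%s retries=%s runner=%s", task["id"], task["retries"], runner
        )
    return runner


logging.basicConfig(level=logging.INFO)
choose_runner({"id": 42, "retries": 1})  # logged
choose_runner({"id": 7, "retries": 0})   # not logged
```

    Logging the inputs and the chosen path together in one message makes it possible to reconstruct the decision later without reading several log lines side by side.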

    Puzzle pieces

    • remove code (clutter)
    • drop your ego and write in pseudo-code
    • if you keep asking the same question without an answer consider external factors such as…
    • read it
    • read it again

    Indirect causes

    • API changes (undocumented or unannounced changes can be harder to find)
    • scheduled tasks (is everything running on time? do your tests take cleanup tasks into account?)
    • queues / race conditions (does this bug occur around particularly busy periods or in a write-heavy part of the application?)
    • timezones (do you have users in different locations? are you storing each user’s timezone? are you storing dates/times in UTC?)
    • server setup vs local config (are environment variables and dependency versions the same?)
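
    The timezone point is easy to underestimate, so here is a small sketch of how one UTC timestamp can land on different calendar days for different users. The timestamp is hypothetical, and fixed offsets are used to keep the example portable; real code would normally use named zones, for example through the standard-library `zoneinfo` module.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the sketch portable; these stand in for named zones.
NEW_YORK_SUMMER = timezone(timedelta(hours=-4))
TOKYO = timezone(timedelta(hours=9))

# Hypothetical timestamp, stored in UTC.
stored_at = datetime(2024, 3, 31, 23, 30, tzinfo=timezone.utc)


def local_date(utc_dt, user_tz):
    """Convert a UTC timestamp to the calendar date the user actually sees."""
    return utc_dt.astimezone(user_tz).date()


print(local_date(stored_at, NEW_YORK_SUMMER))  # 2024-03-31
print(local_date(stored_at, TOKYO))            # 2024-04-01
```

    A month-end job that compares dates without converting would treat these two users differently, which is exactly the kind of pattern the earlier "time of day / month" notes can reveal.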

    Amplify

    Try changing inputs or local code to extreme or unlikely values and observe the results. This works like a binary search for debugging: it lets you rule out a large chunk of potential causes at once and forces different errors to surface.

    For example, you could run your tests with scheduled tasks running and then again without any scheduled tasks running.
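
    The binary search comparison can be taken literally when the inputs can be ordered. The sketch below assumes that once one candidate fails, every later one fails too (for example growing batch sizes, or commits in chronological order); the `fails` check and the sample sizes are hypothetical.

```python
def first_failing(candidates, fails):
    """Return the first candidate for which `fails` is true, or None.

    Assumes candidates are ordered so that once one fails, all later
    ones fail too.
    """
    lo, hi = 0, len(candidates)
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(candidates[mid]):
            hi = mid        # the first failure is at mid or earlier
        else:
            lo = mid + 1    # everything up to mid is fine
    return candidates[lo] if lo < len(candidates) else None


# Hypothetical example: the bug only appears with large batches.
batch_sizes = [1, 10, 100, 1_000, 10_000]
print(first_failing(batch_sizes, lambda size: size > 500))  # 1000
```

    Starting from the extremes and halving the range means each run removes half of the remaining suspects, instead of one at a time.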

    Take a break

    If you have done at least one cycle and reached the point of frustration, take a longer break, do some manual work, or work on an easy task. You will come up with a new approach or hypothesis to test, but if you’re still stuck, take a look at the areas you quickly dismissed at the start.

    Write tests

    • if you can make a failing test you’re 95% there
    • unit tests for sense checking with multiple inputs (data providers help with coverage)
    • seed the test data with the same values as the real data
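
    A test driven by several inputs could be sketched as follows, using Python's `unittest` and `subTest` in place of a data provider. The `is_due` function and the case values are hypothetical stand-ins; in practice the cases would be seeded from the values seen in the real, affected records.

```python
import unittest
from datetime import datetime, timezone


def is_due(run_hour, now):
    """Hypothetical function under test: is a daily task due at `now`?"""
    return now.hour == run_hour


class IsDueTest(unittest.TestCase):
    # Each tuple plays the role of one data-provider row:
    # (run_hour, now, expected)
    cases = [
        (0, datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc), True),
        (0, datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc), False),
        (23, datetime(2024, 1, 31, 23, 59, tzinfo=timezone.utc), True),
    ]

    def test_is_due(self):
        for run_hour, now, expected in self.cases:
            # subTest reports each failing row separately.
            with self.subTest(run_hour=run_hour, now=now):
                self.assertEqual(is_due(run_hour, now), expected)
```

    Saved in a test file, this can be run with `python -m unittest`. Adding the exact values from a bug report as a new row is often the quickest route to a failing test.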

    And if you don’t have any tests, it might be a good time to start… It will help you when debugging the next issue, as well as give you confidence when refactoring or deploying.

    Summary

    It’s simple - find the inputs that cause failures without impacting the rest of the application or current users 😂 I said simple, not easy! It’s also not always a fast process but keep being stubborn and you might get lucky.