The bug referred to in this article: BUGFIX: Fix setting srcRecordInfo during interleaving of checkpointing and RMW by hamdaankhalid · Pull Request #956 · microsoft/garnet
The bug that haunted me for months? Yeah, it's fixed. You know what worked? Noise-cancelling headphones, a warm cup of coffee, and the courage to believe I am smart enough to deal with the interaction between our "Hybrid Log + Hash Index" and "Checkpoint Recovery Management".
This article, however, is not about what worked; it's about what didn't work.
- Timebox how long you spend trying to recreate the issue, and make it a hard timebox. Showing up to standup saying "trying to repro" is not something you can do forever. I spent too long trying to reproduce the bug, and this didn't work.
- Gather all you can from the process while it is in the buggy state, before rebooting it: the process dump, the logs, and the on-disk data. With these you have essentially captured everything you need to see the internal state of the program. Like an idiot, I rebooted the erroneous process the first time and had to wait a month for the issue to come back. Only I know how much I wished I had a process dump. I regret restarting the process; I should have collected everything the very first time.
- If you are debugging a multi-threaded program and suspect concurrency issues, check the invariants of your program. If you have written a multi-threaded program, you should be aware of all the states it can be in and which operations must be atomic. Nothing should be a "gray" area; if something is, double down, because there's a bug there. For me the gray area was the interaction between checkpointing and transactions: I understood each of them separately, but together it wasn't clear, and I eventually had to be brave enough to figure it out. I spent a little too long praying the issue wasn't here, and this didn't work.
- If you're working on a stateful program, don't be afraid to rip apart what's in the heap and what's on disk (if applicable). There are tools, perhaps in your IDE, that let you investigate your dumps and disk contents. This is intimidating, but if you're not willing to get your hands dirty, you shouldn't be programming. Using my programmer's calculator, I was able to verify the physical-address-to-logical-address translations and how the program is meant to flow. This helped me rule out an older issue where our allocator was handing out a record beyond the tail address (the highest possible address).
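The address arithmetic I was doing by hand on the calculator can be sketched in a few lines. This is only a rough illustration, assuming a paged log where a logical address is a page number plus an offset within that page; the page size, the `page_bases` map, and every function name here are hypothetical and not taken from Garnet's code.

```python
# Illustrative sketch only. PAGE_SIZE_BITS and all names below are
# assumptions for this example, not Garnet's actual layout or API.
PAGE_SIZE_BITS = 25                        # assumed page size: 32 MiB
PAGE_SIZE = 1 << PAGE_SIZE_BITS
PAGE_SIZE_MASK = PAGE_SIZE - 1


def split_logical(logical_address):
    """Split a logical address into (page number, offset within the page)."""
    return logical_address >> PAGE_SIZE_BITS, logical_address & PAGE_SIZE_MASK


def to_physical(logical_address, page_bases):
    """Translate a logical address using a map of page number -> base pointer."""
    page, offset = split_logical(logical_address)
    return page_bases[page] + offset


def to_logical(physical_address, page_bases):
    """Reverse translation: find the page whose buffer contains the pointer."""
    for page, base in page_bases.items():
        if base <= physical_address < base + PAGE_SIZE:
            return (page << PAGE_SIZE_BITS) | (physical_address - base)
    raise ValueError("pointer does not fall inside any known page buffer")


def is_within_log(logical_address, begin_address, tail_address):
    """Invariant: a live record sits at or after begin and strictly before tail."""
    return begin_address <= logical_address < tail_address
```

With a dump open, you would read the page base pointers and the begin and tail addresses out of the heap, translate the suspicious record's pointer back to a logical address, and check it against the tail. If `is_within_log` returns false, the allocator handed out something it shouldn't have.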
Preachy stuff:
- Don't neglect yourself. To think well, you need to feel good, and that won't come from skipping meals and sleep. I spent too long trying to push through late nights, and this didn't work.
- Get in the right headspace. You know this is going to be hard, so make sure everything else is sorted and all distracting software and devices (Teams, Outlook, your phone, your Apple Watch, EVERYTHING) are put away. I have found that I need a ritual to tackle hard issues. I call mine a Carmack soda (John Carmack said in an interview that a Diet Coke with ice was his go-to, and I copy it for the ritual). You need to lock in, so use your rituals to get into the right headspace. I worked from cafes and random locations when what I needed was to be in my own space.
- There are some parts of the codebase I am scared of: our custom concurrency control (epoch management), the checkpoint state machine, and the hybrid log allocator, to be precise. Fear will guide you away from those parts of your code, like it did me; I begged and prayed the bug was somewhere more familiar, like the API layer, above the storage engine. It wasn't until I accepted that I needed to be intimately familiar with those parts that I could make sense of this issue. Along the way I ended up writing a bunch of documentation, so another me can be a little less scared :)
- At some point you get tired and start "throwing things" to see what sticks. This is exactly what doesn't work, and it is exactly when you should step away. The night before cracking this bug, I spent the whole night alternating between "let me try this" and watching Modern Family on my couch. Be better than I was at this. I wish I had slept last night :)