We screw up at times, and it sucks. Here’s how we handle it at Oak City Labs.
A few months back, we were working on a large client project and deployed a change that escaped our testing process. Fortunately, it didn’t cause a huge impact, but it was embarrassing. We corrected the error, and we told the client we screwed up.
Early in my career I learned the importance of root cause analysis and reporting up the chain of command. In our case, the chain of command ends at the client. In a large organization, this might be a much more complex notification chain. In all cases, it helps to have some structure around both the process and the information shared. Here’s a brief look at what we provide our clients for those (rare!) major screw-ups:
- Root cause description
- Short-term fix – How we stopped the bleeding
- Impact – Affected users and systems, length of time, anything we can share to help message the outage to users
- How it got past testing – This is usually brief and then revisited during a retrospective or post-mortem
- Long-term fix – A brief description, revisited during a retrospective or post-mortem. We mostly want to answer: how will we keep this from happening again? If we’re unsure, we might provide a date/time for further discussion.
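To make the structure concrete, here is a minimal sketch of that report as a Python dataclass. The field names and the example values are illustrative, not a real Oak City Labs template; the only requirement is that every field above has a home, so nothing gets skipped when the report moves up the chain.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentReport:
    """One record per incident, mirroring the report fields above."""
    root_cause: str
    short_term_fix: str           # how we stopped the bleeding
    impact: str                   # affected users/systems and duration
    how_it_escaped_testing: str   # brief here; expanded in the retrospective
    long_term_fix: str            # how we keep this from happening again
    followup_date: Optional[str] = None  # set when the long-term fix is still undecided

    def summary(self) -> str:
        """One-line summary suitable for the first notification up the chain."""
        return f"Root cause: {self.root_cause}. Impact: {self.impact}."

# Hypothetical example incident
report = IncidentReport(
    root_cause="Config change bypassed staging",
    short_term_fix="Rolled back the deploy",
    impact="Roughly 5% of users saw errors for 20 minutes",
    how_it_escaped_testing="Staging config differed from production",
    long_term_fix="Add a config diff check to CI",
)
print(report.summary())
```

Because the fields are required, a report can’t be filed with the uncomfortable parts (like how it escaped testing) quietly left out.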
This is fairly standard practice in engineering, manufacturing and healthcare, and we believe it’s an important one for software application developers too. It’s part of our process and something you should ask about on any project, whether the work is outsourced or handled by an internal development team. If you’re growing a software company, you might want to ask yourself these questions:
- What does your company do for root cause analysis?
- Do you have Systems Engineers, DevOps or Support teams that handle the communication?
- What does it look like when something goes terribly wrong?
- How do you communicate that to management and then to your users?
It’s an incredibly humbling experience to tell someone you were wrong or that you screwed up. We’ve been fortunate to have clients that understand the precarious balance between speed and bugs introduced because of that speed. For that, we are thankful. For the patience and forgiveness, we are humbled.
We could have 98% test coverage, meaning automated tests exercise nearly every line of code. We could have a team of QA people that manually test every possible inch of an application. And we could still find ourselves facing some bug, a memory leak or something that goes wrong somewhere in the stack. It’s a fact of life with software development and not for the faint of heart. It’s also incredibly difficult to find the right balance between what the client can afford and the amount of testing we build into an application.
At the end of the day, people and businesses trust us to solve problems that are critically important to the growth of their business. Would you want a developer that tells you when they screwed up? Or one that finds a litany of reasons why something is not their fault?
We’re not perfect; we’re human. Having a process in place to handle those human errors can help guide your team, keep stress levels in check and ensure good communication between you and your customers.