Failure is an Option

Things will always go wrong, but excellent preparation and strong leadership can turn failure into a kind of success.

The story of Apollo 13 is a parable of gritty resolve, technological excellence, calm heroism and teamwork. For anyone focused on leadership, operations and program management it is absolutely the purest of inspirations.

The film of Apollo 13 centres around the phrase “Failure is Not an Option,” coined after the original drama in a conversation between Jerry Bostick – one of the great Apollo flight controllers – and the filmmakers. It summarises a key part of the culture of Apollo-era NASA, and it has found its way onto the walls or desks of many a leader’s office. It is part of the DNA of modern business culture, and of any sizeable delivery project.

Damaged Apollo 13 Service Module

Lessons from the Space Program

But one of the reasons that the crew was recovered was this: throughout its history, NASA and mission control knew that failure was precisely an option, and they designed, built and tested to deal with that simple truth. The spacecraft systems had – where physically possible – redundancy. The use of a Lunar Module as a lifeboat had already been examined and analyzed before Apollo 13. In the end, an old manufacturing defect caused an electrical failure with almost catastrophic consequences. It was precisely because Mission Control was used to dealing with issues that Apollo 13 became what has been called a “successful failure” and “NASA’s finest hour.”

The ability to respond like this was hard earned. The Gemini program – sandwiched between the first tentative manned flights of Mercury, and the Apollo program that got to the moon – was designed to test the technologies and control mechanisms needed for deep space. It was a very deliberate series of steps. Almost everything that could go wrong did: fuel cells broke, an errant thruster meant that Gemini 8 was almost lost, rendezvous and docking took many attempts to get right, and space walks (EVAs in NASA-speak) proved much harder than anybody was expecting. And then the Apollo 1 fire – where three astronauts were actually lost on the launch pad – created a period of deep introspection, followed by much redesign and learning. In 18 months, the spacecraft was fundamentally re-engineered. The final step towards Apollo was the hardest.

But after less than a decade of hard, hard work, NASA systems worked at a standard almost unique in human achievement.

So, with near infinite planning and rehearsal, NASA could handle issues and error with a speed and a confidence that is still remarkable. Through preparation, failure could be turned into success.

Challenges of a Life More Ordinary

All of us have faced challenges of a lesser kind in our careers. I was once responsible for a major software platform that, despite expensive testing, showed real but occasional and obscure issues the moment it went into production. We put together an extraordinary SWAT team. The problem seemed to be data-driven, software-related and simply embarrassing. I nicknamed it Freddie, after the Nightmare on Elm Street movies. It turned out to be a physical issue in wiring – which was hugely surprising and easily fixed. The software platform worked perfectly once that was resolved.

Another example: In the early days of Accenture’s India delivery centres, we had planned for redundancy and were using two major cables for data to and from the US and Europe. But although they were many kilometres apart, both went through the Mediterranean. A mighty Algerian earthquake brought great sadness to North Africa, and broke both cables. We scrambled, improvised, maintained client services, and then bought additional capacity in the Pacific. We now had a network on which the sun never set. It was a lesson in what resilience and risk management really mean.
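The cable story is, at heart, a check that redundant paths do not share a single point of failure. As a minimal sketch of that idea, and not a description of how the real network was planned, the Python below flags pairs of links whose routes pass through the same region. The link names, the region labels and both function names are hypothetical, invented purely for illustration.

```python
from itertools import combinations

# Hypothetical inventory: each link lists the regions its route passes through.
LINKS = {
    "cable_a": {"india", "mediterranean", "europe"},
    "cable_b": {"india", "mediterranean", "atlantic", "us"},
    "cable_c": {"india", "pacific", "us"},
}


def shared_failure_domains(links, ignore=frozenset()):
    """Return {(link_1, link_2): shared regions} for every overlapping pair.

    Regions in `ignore` (for example unavoidable endpoints) are not counted.
    """
    overlaps = {}
    for (name_a, regions_a), (name_b, regions_b) in combinations(sorted(links.items()), 2):
        common = (regions_a & regions_b) - set(ignore)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


def survives(links, failed_region):
    """Names of the links that stay up if everything in one region is lost."""
    return sorted(name for name, regions in links.items() if failed_region not in regions)


if __name__ == "__main__":
    endpoints = {"india", "us", "europe"}
    print(shared_failure_domains(LINKS, ignore=endpoints))
    print(survives(LINKS, "mediterranean"))
```

With only the first two links in the inventory, losing the shared region leaves nothing standing. Adding a route through a different region is what turns nominal redundancy into real resilience.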

Soon enough, and much more often than not, we learnt to handle most failures and problems with fluency. In the Accenture Global Delivery Network we developed tiered recovery plans that could handle challenges with individual projects, buildings, and cities. So we were able to handle problems that – at scale – happen frequently. These included transport issues, point technology failures, political actions and much more – all without missing a single beat. Our two priorities were firstly people’s safety and well-being, and secondly client service, always in that order.

Technology – New Tools and New Risks

As technology develops, there are new tools but also new risks. On the benefit side, the Cloud brings tremendous, generally reliable compute power at increasingly low cost. Someone else has thought through service levels and availability, and invested in gigantic industrialized data centres. The cloud’s elasticity also allows smart users to sidestep common capacity issues during peak usage. These are huge benefits we have only just started to understand.

But even the most reliable of cloud services will suffer rare failures, and at some point a major front-page incident is inevitable. The world of hybrid clouds also brings new points of integration, and interfaces are where things often break. And agile, continuous delivery approaches mean that the work of different teams must often come together quickly and – hopefully – reliably.

The recent Sony incident shows – in hugely dramatic ways – the particular risks around security and data. Our technology model has moved from programs on computers to services running in a hybrid and open world of Web and data centre. The Web reflects the overall personality of the human race – light and dark – and we have only just begun to see the long-term consequences of that in digital commerce.

Turning Failures into Success

What follows is my own summary view of the key steps required to handle the inevitability of challenges and problems. It is necessarily short.

1. Develop a Delivery Culture – Based on accountability, competence and a desire for peerless delivery and client service. Above all, there needs to be an acknowledgement that leadership and management are about both vision and managing and avoiding issues. Create plans, and then be prepared to manage the issues.

2. Understand Your Responsibilities – They will always be greater in number than you think. Some of them are general, often obvious and enshrined in law – if you employ people, handle data about humans, work in the US, work in Europe, work in India and work across borders you are surrounded by regulations. Equally importantly, the expectations of your business users or clients need to be set and mutually understood – there are many problems caused by costing one service level, and selling another. Solving a service problem might take hours or days. Solving a problem with expectations and contracts may be the work of months and years.

3. Architect and Design – Business processes and use cases (and indeed users!) need to account for failure modes. The design for technical architectures must acknowledge and deal with component and service failures – and they must be able to recover. As discussed above, cloud services can solve resilience issues by offering the benefits of large-scale, industrialised supply, but they also bring new risks around integration between old and new. Cloud brings new management challenges.

4. Automate – Automation (properly designed, properly tested) can be your friend. Automated recovery and security scripts are much less error prone than those done by people under stress. There are many automated tools and services that can help test and assess your security environment. Automated configuration management brings formal traceability – essential for the highest levels of reliability. Automated regression testing is a great tool to reduce the costs of testing in the longer term.

5. Test – Test for failure modes in both software and business process. Test at points of integration. Test around service and service failures. Test at, and beyond, a system’s capacity limits. Test security. Test recovery. Test testing.

6. Plan for Problems – Introduce a relevant level of risk management. Create plans for business continuity across technology systems and business processes. Understand what happens if a system fails, but also what happens if your team can’t get to the office, or a client declares a security issue.

7. Rehearse – Invest in regular rehearsals of problem handling and recovery. Include a robust process for debriefing.

8. Anticipate and Gather Intelligence – For any undertaking of significance, understand potential issues and risks. Larger organisations will need to understand emerging security issues – from the small, technical and specific to more abstract global threats. Truly global organisations will sometimes need to understand patterns of weather, for example to determine whether transport systems are under threat. (I even once developed personal expertise in seismic science and volcanism.)

9. Respond – Finally, acknowledge that major issues will happen, and that they will often be unexpected. So, a team must focus on:

  • Simply accepting accountability, focusing on resolution and accepting the short-term personal consequences. It is what you are paid for.
  • Setting up a management structure for the crisis, and triggering relevant business continuity plans.
  • Setting up an expert SWAT team, including what is needed from suppliers.
  • Reporting diagnosis and resolution: be accurate, be simple, avoid false optimism and be frequent.
  • Communicating with stakeholders in a way that balances information flow and the need for a core team to focus on resolution.
  • Handling the media, if you are providing a public service.
  • Learning as a team after the problem is solved and the coffee machine is temporarily retired.

And finally a Toast …

In previous articles, I have acknowledged the Masters of Delivery I have come across in my varied career.

In the domain covered by this article, I have worked with people in roles such as “Global Asset Protection”, “Chief Information Security Officer” and teams across the world responsible for business continuity, security and engineering reliable cloud services. They work on the kind of activity that often goes unacknowledged when things go well – but in the emerging distributed and open future technology world, they are all essential. To me, these are unsung “Masters of Delivery.” Given this is the start of 2015, let’s raise a virtual glass in celebration of their work. We all benefit by it.

Keith Haviland

This is a longer version of an article originally posted on LinkedIn. Keith Haviland is a business and technology leader, with a special focus on how to combine big vision and practical execution at the very largest scale, and how new technologies will reshape tech services. He is a former Partner and Global Senior Managing Director at Accenture, and founder of Accenture’s Global Delivery Network. He is also a published author and an active film producer, including Last Man on the Moon, and an advisor/investor for web and cloud-based start-ups.
