How you react when your systems fail may define your business

Just around 9:45 a.m. Pacific Time on February 28, 2017, websites like Slack, Business Insider, Quora and other well-known destinations
became inaccessible.
For millions of people, the internet itself seemed broken. It turned out that Amazon Web Services was having a massive outage involving S3
storage in its Northern Virginia datacenter, a problem that created a cascading impact and culminated in an outage that lasted four
agonizing hours. Amazon eventually figured it out, but you can only imagine how stressful it might have been for the technical teams who
spent hours tracking down the cause of the outage so they could restore service.
A few days later, the company issued a public post-mortem explaining what went wrong and which steps they had taken to make sure that
particular problem didn’t happen again.
Most companies try to anticipate these types of situations and take steps to keep them from ever happening.
In fact, Netflix came up with the notion of chaos engineering, where systems are tested for weaknesses before they turn into
outages. Unfortunately, no tool can anticipate every outcome. It’s highly likely that your company will encounter a problem of immense
proportions like the one that Amazon faced in 2017.
It’s what every startup founder and Fortune 500 CEO worries about, or at least they should.
What will define you as an organization, and how your customers will perceive you moving forward, will be how you handle it and what you
learn. We spoke to a group of highly trained disaster experts to learn more about preventing these types of moments from having a profoundly
negative impact on your business.

It’s always about your customers

Reliability and uptime are so essential to today’s digital businesses that
enterprise companies developed a new role, the Site Reliability Engineer (SRE), to keep their IT assets up and running. Tammy Butow,
principal SRE at Gremlin, a startup that makes chaos engineering tools, says the primary role of the SRE is keeping customers happy.
If the site is up and running, that’s generally the key to happiness.
“SRE is generally more focused on the customer impact, especially in terms of availability, uptime and data loss,” she says. Companies
measure uptime according to the so-called “five nines,” or 99.999 percent availability, but software engineer Nora Jones, who most recently
led Chaos Engineering and Human Factors at Slack, says there is often too much of an emphasis on this number.
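
To put the “five nines” figure in concrete terms, the short sketch below converts an availability percentage into the downtime it permits over a year, and converts a block of downtime back into an availability figure. The function names and the 365-day year are assumptions made for illustration; they do not come from any company quoted here.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime permitted over the period at this availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * MINUTES_PER_DAY


def availability_after_downtime(downtime_minutes: float, period_days: float = 365.0) -> float:
    """Return the availability percentage over the period given total downtime."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    # Assumed, illustrative targets: three, four and five nines.
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.2f} minutes of downtime per year")

    # A single four-hour outage, like the one described above, measured over a year.
    print(f"One four-hour outage leaves {availability_after_downtime(4 * 60):.3f}% availability")
```

By this arithmetic, five nines leaves a little over five minutes of downtime per year, while a single four-hour outage brings a year's availability to roughly 99.95 percent. The number says nothing about when the outage happened or who was affected.
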
According to Jones, the focus should be on the customer and the impact that availability has on their perception of you as a company and
your business’s bottom line. Someone needs to be calm and just keep asking the right questions. “It’s money at the end of the day, but also
over time, user sentiment can change [if your site is having issues],” she says.
“How are they thinking about you, the way they talk about your product when they’re talking to their friends, when they’re talking to their
family members?
The nines don’t capture any of that.” Robert Ross, founder and CEO at FireHydrant, an SRE as a Service platform, says it may be time to
rethink the idea of the nines.
“Maybe we need to change that term.
Maybe we can popularize something like ‘happiness level objectives’ or ‘happiness level agreements.’ That way, the focus is on our
products.”

When things go wrong

Companies go to great lengths to prevent disasters to avoid disappointing their customers and usually have
contingencies for their contingencies, but sometimes, no matter how well they plan, crises can spin out of control.
When that happens, SREs need to execute, which takes planning, too: knowing what to do when the going gets tough.