
The pager goes off, followed immediately by a dog barking. Scott leaps to silence the pager in the middle of the third beep. He turns to Trixy, his barking companion, and asks absentmindedly, “Does that mean you’re going to answer it?” Trixy barks again in dissatisfaction. Peering at the number and then back at the dog, Scott says, “If you can’t answer the pager, you can at least bury it so I can get some sleep.”
Percy, the technical lead at the computer operations center, answers the phone: “Hey Scott, sorry to wake you up, but we are getting that same damn system timeout on the SAP machine again. What should we do?” After a long pause Scott asks, “How many users are impacted?” Percy answers, “We are in a maintenance window, so the application is down. I think we managed to dodge this bullet.” “Thank God, Percy. The last thing we need is another user-impacting outage. I guess we have two choices: we can call the vendor and spend a few hours with them debugging again, or just reboot the damn thing. Ahh, just reboot, and I’ll call the vendor in the morning and get the status on that trouble ticket we opened with them last week.” Percy replies, “Will do, Scott. Get some sleep.” Scott replies, “As soon as Trixy and I take a little walk.”
This scene plays out thousands of times per night in every major corporation: the tier 1 support group detects a fault and pages someone in tier 2 support to resolve it. Industry research shows that 20% of unplanned downtime is attributed to technology and 80% is attributed to people and process. Companies spend millions of dollars on high-availability solutions so that they can tolerate a failure of technology, but solutions that help people and process are sorely lacking.
Poor Scott. He hasn’t had an uninterrupted night’s sleep in months. This outage is only one of several recurring problems. No one is sure what causes them but, like clockwork, Scott gets paged just about every night. On this night, Scott faced a decision.
Scott the fire-fighter can meet the Service Level Agreements and get back to bed by just rebooting the environment to restore service.
Scott the debugger can risk the Service Level Agreements and try to debug the problem, so that he might get a full night’s sleep tomorrow.
It is a tough choice that every system administrator like Scott faces on a regular basis. A fire-fighter is a hero when he restores service. On the other hand, no one ever says to the debugger, “Wow, it’s been a year since we’ve seen that problem.” Fast service restoration is recognized and rewarded, frequently at the expense of long-term stability.
In the typical production environment, the service level agreements do not allow time for debugging. However, not debugging increases the likelihood of future outages. Percy and Scott are measured on the Mean Time to Repair service, not on the mean time to avoid an outage. Scott lives this frustration: as long as chronic problems continue, he sacrifices sleep; if he takes the time to debug problems, he risks sacrificing service level agreements. In his crisis-driven job, Scott is recognized for fire-fighting, and no one remembers yesterday’s solutions.
Percy, the operations guy, sits in front of about six consoles in a room that looks like a modern air traffic control center. Fault and performance management tools flash alarms in every color of the rainbow. Percy’s tools are great at identifying problems so that people like Scott, and the processes around them, can respond. Companies are inundated with tools to help Percy, but the best tool in Scott’s arsenal is caffeine, which forces Scott to be an excellent fire-fighter.

In the real world, more lives are saved by changing smoke detector batteries than would ever be saved by dragging bodies out of buildings. But no one gets recognition for avoidance, not even Trixy. After all these years of late-night pages, you would think man’s best friend would have learned to proactively bury the pager.
Copyright Dave Nocera 2007