Saturday, September 22, 2007

Slowpoke





As he enters the lane in front of you, his left blinker continues to flash; you notice bony knuckles choking his steering wheel like a chicken neck; then some maniac passes you on the right. Finally the realization hits: “You are now stuck in the left lane driving behind a slowpoke”.

You can forget darting into the right lane; everyone else has already thought of that and you are the one car trapped. Study the situation carefully; it will take clear planning and perfect timing to free yourself. So sit back, relax and keep a safe distance; and be alert for an opportunity to escape.

Now he is riding his brake and the traffic behind you continues to build; entering the right lane at this speed would be a death sentence. Drivers zipping by on the right give you dirty looks, as if it is your fault that they were stuck in traffic. Perhaps it was your fault; perhaps you should have used your laser guided missile to destroy this slowpoke the moment he got in front of you, and all this would have been avoided.

If only when we received our driver’s license we all were issued one laser guided missile. The back of your driver’s license would read, “As a responsible driver you are expected to use your laser guided missile to destroy any cars that you deem unsafe. You only get one shot in a lifetime, so use it wisely. Thank you and shoot responsibly – The Division of Motor Vehicles”.




Other drivers would not know if you had already used your laser guided missile or if you have them targeted this instant. Imagine how nice the roads would be: no rude drivers, no speed limits, no police; driving would be managed by a totally self-regulating system of drivers who can destroy you or be destroyed based upon any infraction. All the slowpokes and maniacs on the road would be destroyed.

Then in my rearview mirror I spot an opportunity: a driver in the right lane has left just enough room for me to cut in front of him and escape this slowpoke. I wait for just the right second, hit the accelerator and now I am free. As I pass the slowpoke I look again in my rearview mirror to see an angry-faced driver staring back. Then I wonder about that missile idea: if that guy behind me still had his last shot, would he use it on me?


The next time you are tempted to reach for that imaginary missile, remember the truth: a slowpoke is anyone who drives slower than you and a maniac is anyone who drives faster; and if the driver missile system were a reality, long ago you too would have been eradicated.

Copyright Dave Nocera 2007

Sunday, September 16, 2007

Aspirin of Stability

Scott wondered if one donut was going to satisfy his hunger as he filled his brand new super jumbo coffee mug with some fresh brew.

Phil’s eyes drifted up from a pile of trouble tickets and focused on Scott, who walked towards him holding a silver coffee mug. Phil called, “Hey, nice coffee mug, how much?” Scott responded, “The coffee was $1.75, but the mug was free”. Phil said, “How did you get such a great mug for free?” “Well”, Scott replied, “I won this mug at an EMC vendor raffle”. Phil interrupted, “I cannot believe you let yourself become part of their propaganda machine!”

As the smiles faded from their faces, Scott sat and Phil continued, “As I was saying earlier, the plan is to escalate the problem to the vendor”. Scott said, “We’ve already had a severity one ticket open with them for a month”. Phil replied, “I thought so too, but the vendor informed me that our ticket was closed until we load the latest patches”. Scott interjected, “I told them a month ago, there is nothing in the patch readme files to suggest that our problem is resolved in those new patches”. Phil continued, “That is why we need to escalate. We pay for a premium support contract, and as soon as we get a tough problem they read from a script, ‘load two patches and call me in the morning’.”

The problem that Phil and Scott are faced with is not unique. Since ENIAC, one of the world's first electronic digital computers, vendors have been pushing the patch as the aspirin of infrastructure stability.

Phil said to Scott, “In reviewing these trouble tickets I noticed that only about one half of the nodes ever had the problem, any ideas?” Phil went off to refill his coffee; Scott carefully studied the list of problem nodes.

When Phil returned, Scott announced, “These servers are also running application training, so somehow application training might also be related to the problem”. Phil replied, “Scotty, you found the needle in the haystack!” To that Scott answered, “But Phil, you found the haystack”. After some mutual back patting, they each grabbed their coffee and departed.

Over the next few days Scott and the application group identified and resolved the problem, which actually turned out to be an error in the application installation scripts. The problem had nothing to do with the vendor patch recommendation. If Phil and Scott had followed the vendor’s advice, their efforts would have wasted a lot of time, generated a tremendous amount of change, and been totally unnecessary.

A few weeks after the problem was solved, Phil ran into Scott on the cafeteria checkout line, and now Phil was carrying the same super jumbo environmentally conscious coffee mug. Scott asked, “Did you win that from EMC?” Phil smiled and said these words of wisdom, “The moral of this story, Scott, is that vendors don’t always understand our problems but they sure do have some great coffee mugs.”

See my related article in IT Managers Journal - The Patch Paradox



Copyright Dave Nocera 2007

Pager Duty


The pager fires off, followed immediately by a dog barking. Scott leaps to silence the pager in the middle of the third beep. He turns to Trixy, his barking companion, and asks absentmindedly, “Does that mean you’re going to answer it?” Trixy barks again in dissatisfaction. Peering at the number and then back at the dog, Scott says, “If you can’t answer the pager you can at least bury it so I can get some sleep”.

Percy, the technical lead at the computer operations center, answers the phone saying, “Hey Scott, sorry to wake you up, but we are getting that same damn system timeout on the SAP machine again. What should we do?” After a long pause Scott asks, “How many users are impacted?” Percy answers, “We are in a maintenance window, so the application is down. I think we managed to dodge this bullet”. “Thank God, Percy; the last thing we need is another user-impacting outage. I guess we have two choices: we can call the vendor and spend a few hours with them debugging again, or just reboot the damn thing. Ahh, just reboot, and I’ll call the vendor in the morning and get the status on that trouble ticket we opened with them last week”. Percy replies, “Will do, Scott. Get some sleep”. Scott replies, “As soon as Trixy and I take a little walk”.

This scene plays out thousands of times per night in every major corporation: the tier 1 support group detects a fault and pages someone in tier 2 support to resolve it. Industry research shows that 20% of unplanned downtime is attributed to technology and 80% is attributed to people and process. Companies spend millions of dollars on high availability solutions to be able to tolerate a failure of technology, but solutions that help people and process are sorely lacking.

Poor Scott, he hasn’t had an uninterrupted night’s sleep in a few months. This outage is only one of several recurring problems. No one is sure what causes them, but like clockwork, Scott gets paged just about every night. On this night, Scott was faced with a decision.
Scott the fire-fighter can meet Service Level Agreements and get back to bed by just rebooting the environment to restore service.
Scott the debugger can risk the Service Level Agreements and try to debug the problem, so that he might get a full night’s sleep tomorrow.

It is a tough choice that every system administrator like Scott is faced with on a regular basis. A fire-fighter is a hero when he restores service. On the other hand, no one ever says to the debugger, “Wow, it’s been a year since we’ve seen that problem”. Fast service restoration is recognized and rewarded, frequently at the expense of long-term stability.

In the typical production environment, the service level agreements do not allow time for debugging. However, not debugging increases the likelihood of future outages. Percy and Scott are measured on the Mean Time to Repair service, not the mean time to avoid an outage. Scott lives this frustration: as long as chronic problems continue, he sacrifices sleep; if he takes the time to debug problems, he risks sacrificing service level agreements. In his crisis-driven job, Scott is recognized for fire-fighting, and no one remembers yesterday’s solutions.

Percy, the operations guy, sits in front of about six consoles in a room that looks like a modern air traffic control center. Fault and performance management tools flash alarms in every color of the rainbow. Percy’s tools are great at identifying problems, so that people and processes, such as Scott, can respond to them. Companies are inundated with tools to help Percy, but the best tool in Scott’s arsenal is caffeine, forcing Scott to be an excellent fire-fighter.

More lives are actually saved in the real world by changing smoke detector batteries than would ever be saved by dragging bodies out of buildings. But no one gets recognition for avoidance, not even Trixy. In all these years of late night pages you would think man’s best friend would have learned to proactively bury the pager.





Copyright Dave Nocera 2007





Saturday, September 15, 2007

Cube Wall

Sarah and Phil were system performance engineers who sat in the same corporate cubicle island with a shared adjoining back wall. They talked constantly about work and it was creating problems. The other local inhabitants of their area whined when they talked too loudly through their adjoining wall. So they stopped.







Building Services then protested that it was a safety hazard for them to stand on chairs and have conversations over their cube wall.






So they stopped.









And they were fresh out of ideas, when finally their manager complained that they were wearing out the carpet walking around from one cube to another.





So they stopped.



Although intended to promote productivity, modular office cubicle systems sometimes get in the way of how people do their work. Desperate for a solution, Sarah and Phil negotiated with Building Services to remove one of the cubicle wall panels between their cubes, making it possible for them to communicate in an effective manner.



Sarah: “Marketing sent out a confusing customer mailing, causing thousands of unnecessary calls to customer service. On the back end database we observed five times more activity than on any normal day, until database redo logs fell behind resulting in an outage. The emergency fix was a change to a database configuration parameter.”

Phil: “No performance simulation can predict the impact of a confusing customer mailing.”

It took production performance problems to provide the necessary feedback to stabilize the environment.


Phil: “In performance testing of the new web based document retrieval system, the CPUs were idle, but now in production the CPUs are pegged at 100%. What went wrong?”

Sarah: “The older system did not support search, so users were forced to navigate to documents using hyperlinks. When we simulated the users in performance testing, we falsely assumed the users would continue to use hyperlinks. However, those dang users changed their behaviors and started using the new search engine far more than we anticipated.”

In performance engineering, real users will frequently change their behaviors in creative and unexpected ways when they interact with new systems, creating workloads that effect change. No one was able to predict a change in the way users interacted on the new system, just as no one was able to predict the way Sarah and Phil interacted.

By removing the wall panel between their two cubes, Sarah and Phil achieved minor celebrity status with their co-workers. They demonstrated that they were able to apply their system performance engineering skills to solve a cubicle performance engineering problem. As the news of their solution spread, it drew attention to their new cubicle arrangement and people came by to admire their innovation.


But now a new problem emerged. People started using Phil and Sarah’s cubicle opening as a quicker thoroughfare to the company cafeteria.


Copyright Dave Nocera 2007





Office Terms




Seal the Deal

The meeting was dragging on and on; finally Scott whispers to Phil, “When will the boss stop talking?” Phil whispers back, “I am about to explode from that second monster cup of coffee”. Scott smothers a laugh and the boss abruptly wraps up the meeting and rushes out of the conference room. Scott and Phil waste no time following him into the men’s room. The boss parks himself in the center urinal and Scott and Phil occupy the two urinals to the right and left of him.






After an awkward moment of silence, the boss says, “What do you guys think, was that a good meeting?” Phil is first to respond, “It was good, lots of information to digest”, then Scott adds, “Yeah, good meeting”.


Then silence ….





Observing from the 5th dimension were aliens. One alien asks, “What does this mean?” Another responds, “We observe this human ritual thousands of times every day in every company we visit and it seldom varies:


  1. Humans arrive into the conference room,

  2. Then they drink coffee and listen to the proposal,

  3. Finally they go to the urinal to ‘Seal the Deal’.”


Later one alien breaks the silence at the de-ionization portal, “Your research on the office habits of humans is fascinating”.


Effective Meetings



Copyright Dave Nocera 2007