Plan Ahea…

Written February 16th, 2009 by Capn

You’ve seen that sign, right? “Plan Ahea … ” – no “d”, because they ran out of space. Yeah, haha, funny. Until it really happens. Let me tell you about what happened this past August in Syracuse, New York.

One Wednesday morning, one of the security guards was on rounds in the basement, which is where central supply is, along with materials management, infection control, health information management, medical coding and a host of other departments. The guard thought it was an earthquake, when in reality a 10″ water main had burst. Burst right through the floor, caving in part of it and raising another area; it capsized a couple shelving units and put everybody’s ankles underwater, covering 50% of the functional floorspace down there. We postponed only about 10 elective surgeries that morning, and things were back in action by lunch. It was horrendous down there, but central supply was relocated and functional in no time, and the recovery was underway almost immediately. By 6pm that night work crews were cutting out the bottom 2′ of sheetrock of every wall (note to self: must get a Sawz-All) and all the carpets were on their way out the door.

You can’t plan for that stuff, much less ‘test’ it. We had chiefs and day laborers shoulder-to-shoulder mucking out our new swamp, moving supplies and shelving units. It was an impressive display of teamwork and folks doing what had to be done to keep the ship afloat. Most nursing units didn’t even realize what went on, and even a couple days afterwards – with workers going 24/7 – some folks thought it was just part of the construction we have going on already (a whole new OR).

Then, the next day, just before lunch, I noticed a SharePoint site was acting peculiar. Then I heard from a couple other departments – “is something wrong/are things slow?” In a nutshell, the RAID on our old SQL2000 machine went belly up. Luckily, we had a SQL2005 cluster that we had been migrating things over to slowly … but the race was on. We couldn’t rejuvenate the old box, so we tore the band-aid the rest of the way off and did a forced-march into the new environment. (I can hear the IT folks’ and the project manager types’ fingernails peeling the finish off the desk already.) Oh yeah, let’s add some spice to this impromptu migration: send the database back-up guy & the network operations manager three hours away to a conference, and the two server managers out on vacation. Sound like fun yet?

Ironically, this machine had died a similar death only a month earlier (thanks, dude ;) ), and I spent the better part of a day rescuing all the web-related applications and porting them all over to the cluster. I had the two dozen or so databases I relied on back in operation before the crashed machine was rebuilt and back on the network. But this go-around, there was no back-up short of dumping everything into the cluster. Within 24 hours, we were 100% operational again, and a couple applications even got impromptu upgrades along the way to appease the SQL2005 architecture.

Along the way, I got a crash course (haha) in our back-up system and SQL2005 clusters, and learned some interesting SQL tricks. I also learned that – even though they weren’t available at the moment – the guys who assembled the recovery plan had tested it. And while that Thursday was definitely a trial-by-fire, it went – albeit somewhat accidentally – by the numbers, and we lost at most 90 minutes of live-time data (between the last timed sequential back-up and the actual server failure).

Of course, there are lessons to be learned with any disaster. The lesson to avoid is learning retroactively that you needed a disaster recovery plan. (What’s that saying? The bad thing about experience is that you only earn it right after the event where you could’ve used it.) One of the things we implemented beforehand – not as a disaster recovery measure, but simply as a means of amassing a ‘skills inventory’ – is our “Functional Directory” for our department (I’m in IT). Each team member started with the same template (a Word doc) and outlined all the systems and applications they’re responsible for, plus those they have experience in, along with their perceived level of competency, even if that experience didn’t relate to their job description here. These profiles are kept in a dept-only SharePoint site, so any one of us can quickly search on a keyword and pull up who knows something about it.

Granted, that SharePoint site wasn’t available when the SQL server it lived on took a dirt nap, but it was one of the first systems back on-line. Then I could query who in the department might know something about the other systems that were affected, and voilà – the team of one became a team of many, reaching out to the areas of the hospital that were feeling the effects of losing their applications.
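
For the curious, the kind of lookup the Functional Directory gave us can be sketched in a few lines of Python. This is only an illustration, not what we actually ran: ours was a pile of Word docs behind SharePoint search. The names, the skill labels and the 1-to-5 competency scale below are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """One team member's entry (hypothetical structure)."""
    name: str
    # system or application -> self-rated competency (assumed scale: 1 low, 5 high)
    skills: dict[str, int] = field(default_factory=dict)


def who_knows(profiles, keyword):
    """Return (name, system, level) for every skill whose label contains
    the keyword (case-insensitive), highest competency first."""
    needle = keyword.lower()
    hits = [
        (p.name, system, level)
        for p in profiles
        for system, level in p.skills.items()
        if needle in system.lower()
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Made-up sample data, purely for illustration.
directory = [
    Profile("Alice", {"SQL Server 2005 clustering": 4, "SharePoint": 2}),
    Profile("Bob", {"SQL Server 2000": 3, "Backup software": 5}),
    Profile("Carol", {"SharePoint": 5}),
]

if __name__ == "__main__":
    for name, system, level in who_knows(directory, "sql"):
        print(f"{name}: {system} (level {level})")
```

The point is less the code than the shape of the data: because everyone lists what they have touched and not just what they own, a keyword search turns up help you wouldn’t find on the org chart.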

Obviously, you can’t plan for everything. In retrospect, however, I can look back on those two unrelated and dissimilar ‘disasters’ and find one similarity – the one thing that I could count on: teamwork.
