Data centres harbour dark secrets… archaic, poorly-built equipment that’s too valuable to shut down, and will fail spectacularly when you least expect it. Business Technology’s resident U.S. blogger Keil Hubert explains this as a curse that’s passed from manager to manager like a terrible disease.
I shattered a tooth this morning. Caught me completely by surprise. One moment, I was waking my laptop up so that I could get back to work on my first column for July, and barely a moment later my mouth was full of enamel fragments. I’m typing this while trying very hard to ignore the fact that I have two-thirds of a molar covered in temporary dental cement aching in time to my pulse. The loss of a tooth completely threw off my morning – I’d had plans to get my car serviced, get a haircut, finish a couple more chapters for my next book, do some yard work… all of those plans went right out the window.
You might argue that I should have seen the dental damage coming – weren’t there symptoms like pain, discolouration, or other indicators that the tooth was about to fracture? The answer is, unfortunately, no; the tooth never hurt. It never looked or felt odd. As an American, I had it drilled into me at an early age to take obsessive care of my teeth (which is why I don’t apologize for that pun). The tooth that broke was, in fact, one that had been twice repaired when I was a small boy. The massive wedge of silver amalgam that made up the middle of the tooth is still there, still affixed to the root and the inner surface of the tooth. Only the outer surface shattered.
That – my dentist believes – accounts for why part of my tooth was weak and brittle enough to disintegrate, and why the rest of it could soldier on as if nothing ever happened. The dental services I received in the 80s were brutally barbaric compared to modern techniques. One of my dentists grumbled that those old filling techniques I’d endured were ‘darned near Soviet’… both for their imprecision and for their durability. So, I had a compromised tooth in my head for years, and I had no idea that it was about to fail until it failed catastrophically.
It may seem odd, but that’s an apt analogy for the sort of lurking danger that we all face in our data centres when we play the IT professional game: old, vulnerable devices that continue to soldier on only because they were designed to survive a clash of dreadnoughts. Servers, routers, cooling units… it doesn’t matter. We don’t know that they’re going to cataclysmically fail until it’s too late to do anything about it. The old steampunk-y black box that your predecessor’s predecessor signed off on finally goes out in a cloud of blue smoke and hard drive fragments, littering your surviving machines with unrecoverable dream splinters.
Everyone who has ever inherited a running IT plant has dealt with this nightmare scenario. It’s one thing to have absolute control over your operational environment. One of the happiest tours I ever did as a consultant was when I got to build a Dot Com from the carpet up – every piece of kit was shiny, new, tested and warrantied. I knew that my systems would work. I knew what everything in my racks was built for. Every other job I had in a ‘head of IT’ role involved taking over a place that someone else had built. Sure, I got to install my own systems and sometimes got to replace the more troublesome bits of legacy kit, but most of the time I was required to leave the working bits alone.
The biggest problem with these inherited systems is that you rarely ever have any idea what their vulnerabilities are. Most IT shops are rubbish at documenting their production gear; they’re overwhelmed just trying to get new systems online. Documentation is something that gets put off indefinitely. Soon, the bloke who cobbled together the temporary accounts-receivable server has quit and no one else on the team knows that he never got around to getting the server into the backup rota. That critical element of institutional knowledge doesn’t come out until the ‘whoopsie’ moment when the AR box goes (literally or figuratively) to pieces.
I experienced two incidents of this phenomenon, both in the same company, back when I was first trying my hand as a junior sysadmin. I was a junior tech writer on contract for a company in the aviation sector. I was returning to the office from a working lunch with my supervisor one afternoon when we passed an open comms closet. Curious, I scoped out the space to learn how my employer did things ‘behind the curtain.’ I noticed that the IT folks were inexplicably running a Macintosh SE/30 on the bottom of the switch rack, and that it was merrily chugging along with its screen off, resting atop an Iomega Bernoulli II cartridge drive. My boss noticed me lagging behind and asked what had captured my attention.
‘That,’ I said, pointing at the poor little Macintosh. ‘That’s going to fail soon.’
My boss looked at the setup, shook his head dismissively, and told me to get back to work. I dutifully obliged.
A week later, the boss summoned me down to his office and demanded to know how it was that I knew the little Mac in the comm closet was going to fail. He saw me raise an eyebrow and admitted that yes, it had failed that morning. I shrugged and explained to him that it was fairly obvious to me since I owned both of those items:
- There wasn’t any cooling in the comms closet – I’d looked. There weren’t any HVAC vents. That meant that the space would get very hot from all the electronics exhaust. Both devices’ manuals said that you really shouldn’t operate them above 40°C.
- The Bernoulli drive was a wicked cool piece of removable storage engineering, but it wasn’t designed to be left running continuously. You turned it on, slotted a cartridge, did your stuff, then ejected the cartridge and turned it off. Back then, if you wanted 24/7 data storage, you bought a hard disk drive.
- The fact that the machine was left running where it was and in that configuration meant that it wasn’t being looked after. Whoever had installed it either didn’t know what the hell they were doing, or didn’t care what happened to it. That suggested that no one was performing regular maintenance on it.
‘Put those three clues together,’ I said, ‘and it was a fair bet that the rig was going to fail soon; that it was going to fail was inevitable.’
My boss just shook his head and dismissed me. He didn’t forget our conversation, though. Three months later, we moved our department to a new building. Somehow, the move managed to cripple the graphics department’s ability to share files. Remembering that I knew something about Macintosh support (in an otherwise all-PC office), my boss volunteered me to troubleshoot the other group’s problem.
I was a bit out of my depth, but I was eager to take a crack at the problem. Since there was no documentation at all about what had been deployed, it took me a week to deconstruct how the illustrators were rigged up. I wound up tracing cables through the cubicles, calling finance for purchase invoices (in order to look up part numbers), and interviewing two of the illustrators who’d actually been around when the solution was first deployed.
It seemed that an outside IT contractor had designed an all-too-clever solution for joining up the writers’ DOS PCs and the illustrators’ Macintosh IIfx workstations. The consultant had jury-rigged an AppleTalk router in software… on that same SE/30 that I’d found months before in the comms closet. For reasons I still can’t fathom, the bloke had booted the Mac, mounted the Bernoulli cartridge, and had run the router software from it… and then (somehow) ejected the boot floppy (which promptly disappeared). You weren’t supposed to be able to eject Macintosh boot floppies back then. Nevertheless…
Once I figured out how the consultant had built his rig, I managed to cobble together the parts that I needed to duplicate it. Once I was sure it was working, I took it all down, documented the steps to bring it online, and then brought it up again to make sure that I’d gotten it right. Once it worked, I turned over a written systems maintenance plan to my boss and showed him how anyone could put Humpty Dumpty (the SE/30) back together again. He was impressed – so impressed that he gave me a 25 per cent pay rise and made me a permanent employee. 
Everywhere that my career took me after that incident involved some variation of the ‘inherited nightmare’ problem. I’d take over a new team, start poking around to learn what all I had in my data centre, and then discover that we were utterly dependent on some crucial black box that had been built out of duct tape and old crisp packets in the days before any of the current staff had been employed. At one place, the ‘black box’ was a clinical support server and a stack of insanely-configured network switches. In the next place, the ‘black box’ was the office’s Internet gateway server. The place after that did all of its payroll processing on a consumer-grade DOS box with a sparking hard drive, and so on. All of these problem systems were undocumented, and only got fixed because some clever lad or lass in IT (only sometimes me) recognized the signs and portents and initiated an emergency replacement plan before the office went up in flames.
It shouldn’t be this way. As IT pros, we’re supposed to work from approved systems architecture plans. We’re supposed to meticulously log all of our build decisions and lessons-learned. When we transfer ownership of a data centre to a new boffin, we’re supposed to hand over comprehensive records that detail everything that the new owner needs to be aware of for everything under their authority. Supposed to. Right. Guess how often that actually plays out?
I teach my new sysadmins and engineers to be cynical towards any bit of kit that they didn’t personally build. It’s not that the guy or gal who installed a given bit of kit had to have been a fool; rather, you have to constantly remind yourself that we don’t know what we don’t know. Any given machine could have been put together wrong by accident. Or repaired with inferior parts. Or misconfigured. Or dropped on the floor. Dozens of woes could have shortened the expected lifespan and/or reliability of a given device, and if no one bothered to tell us that it happened… well, that leaves us vulnerable to its forthcoming and inevitable collapse.
In keeping with the title of this column, I always found it easier to explain this phenomenon to soldiers and to former soldiers than to non-veterans. Just compare the potential hazards lurking inside inherited information systems to venereal disease transmission among people: ‘If you aren’t his or her first-and-only, then you have to consider that he or she picked up something before you met them. You won’t necessarily know that you’ve been exposed until the signs and symptoms of infection start to manifest.’ Yes, it’s a rather crude analogy, but it works because soldiers often have ribald senses of humour, and enjoy jokes that make civilians blush. If that analogy is too risqué for your workplace, find a different handle on the topic – but be sure to get your point across.
These days, we’re lucky to have any accurate information at all about what happened to our precious data centre before we came along. We don’t know how it might have been abused, ignored, or exploited… and we often won’t know what horrors are waiting for us, whispering their insidious threats to us just below the soporific whine of the cooling fans.
I’m not suggesting that we start a new gig as head of IT by purifying the previous team’s systems with cleansing fire (tempting as that may be). Rather, I’m suggesting that we enter into any new head of IT role with a healthy respect for everything that transpired before our arrival. If there aren’t comprehensive records or knowledge base entries, then it’s prudent to assume that most everything under our remit is likely to fail soon – so it behooves us to ensure that backups and redundant systems are validated ASAP, and that someone gets cracking on the systems’ run-books straightaway. That, and start drafting your contingency plans for replacing every business-critical system on short or no notice.
One last word of advice: no matter how angry you might be with your company when you decide to part ways, don’t do to your replacement what the guy or gal that you replaced did to you. Turn over honest and comprehensive records. Don’t be the villain in their personal adventure story.
 Working title: ‘50 Shades of Bray’. I’m not sure if I’m going to finish that piece or not.
 Apple’s proprietary networking protocol.
 Great fellow, that man. I really enjoyed working for him.
Keil Hubert is a retired U.S. Air Force ‘Cyberspace Operations’ officer, with over ten years of military command experience. He currently consults on business, security and technology issues in Texas. He’s built dot-com start-ups for KPMG Consulting, created an in-house consulting practice for Yahoo!, and helped to launch four small businesses (including his own).
Keil’s experience creating and leading IT teams in the defense, healthcare, media, government and non-profit sectors has afforded him an eclectic perspective on the integration of business needs, technical services and creative employee development… This serves him well as Business Technology’s resident U.S. blogger.