Posted on 5th May, 2021 in Production
After a painful series of days trying to respond to a catastrophic hardware failure, having to work through the weekend, and even showing up to the server room in person on a Sunday night, I thought the worst was behind me. Yet, all that was only the opening act of a protracted journey to hell.
As of midnight, we had successfully rebuilt all of our production systems in our newly built-out environment. It was now Monday morning, 5 days since our outage began. Today our Support Team would be testing everything to ensure it was fully operational. If all went according to plan, we'd be able to notify our end users that they could access our systems again. As for the cause of our problems, a SAN with multiple drive failures, it was sitting in the offices of a data recovery specialist awaiting analysis.
Ken, Ryu, Rashid, and I were working from home today as a "reward" for spending the night in the server room. What that really meant was that our boss, M. Bison, wanted us available to continue working on getting our systems back up and running from the moment we opened our eyes. Time spent commuting was time not spent restoring our systems. Despite everything that had come before, the day was relatively calm and I was almost feeling hopeful.
By mid-morning the Support Team had completed their testing and reported back with a handful of minor issues. By lunchtime everything had been resolved, and we were able to make most systems available to the users. As the users flooded back into the systems, some of them would encounter minor errors here and there. The users would report those errors to Support, and Support would report them to us for resolution. Then the cycle would begin again. The days that followed would repeat that pattern countless times, but we were making progress. As a result, the rest of our developers went back to their normal duties while we 4 senior-most members stamped out any remaining fires.
Have you ever seen the Final Destination series of films? The underlying premise is that Death has elaborate designs for people's deaths. However, Death is also vengeful when cheated. If you evade him now, he'll be back when you least expect it.
On the same day we restored service to our users, we received an email from Facilities. As I sat at my desk going through my inbox, I assumed it was about an issue with the work order management system that we run for them. Instead, it was the spectre of Death. According to the poor soul who was delivering this grim missive, one of the cooling units in our server room had failed. We had narrowly survived the first attempt on our server room's life, and in doing so we had messed with "Death's design". Now the Reaper was back to try and collect his due. We sent Guile down to the server room to handle the situation. After some considerable effort from Guile and the building's HVAC team, cooling was restored to full capacity.
This wouldn't be the end of our HVAC issues though. Death would continue to "circle around and [try to] get us all again". The following day the fire suppression contractors, who had already done their share of damage, would request that we "briefly" shut down all cooling so they could test what they were installing. While I might be able to cheat death once or twice, I didn't trust the contractors enough to want to roll the dice again. Fortunately, it was easy to convince M. Bison, who at this point was more paranoid than we were, to talk the contractors into doing whatever it was they needed to do the following week.
Creative Accounting 101
A concerning number of close calls aside, we finally made it to a full week from the onset of disaster. On the afternoon of Thursday, May 16, one of our users noticed an issue in our invoice processing software.
The software was cobbled together by Eagle as a prototype when he was a student employee, long before he flew the coop. The powers that be put it into production and then eventually handed it off to me for perpetual maintenance. While it wasn't a financial system of record, it supplemented a lot of the features our core financial systems were lacking. As such, it was used by our facilities staff to pay any invoices for all of our buildings' operating expenses. That includes minor things like paper towels for the restrooms as well as major things like water, electricity, oil, gas, and tax bills.
History aside, this system was very important (at the time at least; it's currently being retired). So what was the issue that the user noticed? The PDF copies of every invoice for the previous fiscal year were missing. That's hundreds if not thousands of invoices. For whatever reason this user really needed this historical information, but it was just not available.
How did this happen? Observant readers may have noticed that, in previous posts, I mentioned we had restored some backups that were a few days old. So, that accounts for about a week's worth of missing invoices from the current fiscal year. However, that doesn't explain why all of the previous fiscal year's data was missing. Well, the way this software was designed, the invoice PDFs are partitioned across 2 separate databases: one for whatever is part of the current fiscal year, and one for all of the PDFs from the previous fiscal years. At the end of every fiscal year, the current year's files are moved to the database with all of the previous years' data. The database is then backed up. This is the only time that database is backed up. Meaning, there is no usable backup left once the tape that backup lives on is overwritten. By my calculations, that's an event that happens roughly 4 times a year… whoops!
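For the curious, the gap in that design can be put into a few lines of arithmetic. What follows is only a rough sketch, not anything from the actual system: the function names are made up, the 91-day rotation is my "roughly 4 times a year" estimate turned into a number, and the dates in the example are purely hypothetical.

```python
from datetime import date, timedelta

# Assumption: tapes are reused about 4 times a year, i.e. roughly every 91 days.
ROTATION_DAYS = 91


def backup_survives(backup_date: date, today: date,
                    rotation_days: int = ROTATION_DAYS) -> bool:
    """Return True if a backup written on backup_date has not been overwritten yet."""
    return today - backup_date < timedelta(days=rotation_days)


def days_unprotected(backup_date: date, today: date,
                     rotation_days: int = ROTATION_DAYS) -> int:
    """Return how many days the data has existed with no backup on tape."""
    overwritten_on = backup_date + timedelta(days=rotation_days)
    return max((today - overwritten_on).days, 0)


if __name__ == "__main__":
    # Hypothetical dates: a once-a-year backup and a failure months later.
    yearly_backup = date(2018, 6, 30)
    failure_day = date(2019, 5, 16)
    print("Backup still on tape:", backup_survives(yearly_backup, failure_day))
    print("Days without a backup:", days_unprotected(yearly_backup, failure_day))
```

With a single backup per year and a tape that gets reused every quarter, the archive spends most of the year with no copy anywhere but the live database. Backing up the archive on the same schedule as everything else would have closed that window.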
This realization is what would put us on the road to hell. In order to retrieve the data, we would have to recover it from the SAN, which was still awaiting analysis by a data recovery specialist. We had no choice but to try and stall this user until we heard back from the data recovery company. I didn't know it yet, but what was about to occur would stay with me for the rest of my career.
Welcome back, dear readers! It's been a while since the last post, hasn't it? Unfortunately, I was swamped with writing for work and other activities, so I had to put this series on hold. This post is a little different since it was written under a time crunch and covers a series of bite-sized happenings. The upcoming entry will be the end of the Servergate saga though. In it, I hope to explain the most critical failing of all and the fate of the failed hard disks.