Posted on 31st Mar, 2021 in Production
May is a critical time in the life of a university. It's a time of transition. For students, it's a time for final exams and graduations. For administrators, it's the time for the students to begin moving out of the dorms. It's also the ideal time to plan any construction you need to get done while the campus is mostly empty. That makes it one of the worst times of year for a perfect storm of hardware failures to bring down your business-critical applications. Yet here we are on Day 2 of Servergate.
A (Small) Plague On Both Your Houses
On the day of the incident, I wound up working a solid 16 hours, but I went to bed at ease. I knew that reinforcements would arrive by morning and I would soon be relieved of my command. The Support Team, which bore the brunt of our clients' ire, couldn't say the same. After the previous day's abuse, approximately half of the team would call out sick before I had even woken up.
At 6 AM our boss (M. Bison) emailed the entire department to acknowledge the last status update and order Tsuchiya, the head of our Design Team, to take point on resolving the shipping issues with our replacement hard drives. The day before, we had to obtain emergency authorization to purchase replacement hard disks from a shady unregistered vendor on the internet. For a variety of bureaucratic reasons members of our department (specifically) aren't allowed to spend money ourselves. So we need to get someone from Purchasing to do it for us. Purchasing was so kind as to assign an absolute genius of a purchasing officer to our case who had the infinite wisdom to choose the slowest shipping speed available on an emergency hardware procurement. At 6:34 AM, Tsuchiya called out.
The Boys Are Back in Town
Aside from myself, only 3 other people had deep knowledge and experience with our servers and infrastructure. They were:
- Ken - the senior-most member of our department and the leader of all our developers
- Ryu - our software architect and senior anime advisor
- Rashid - a laid back software engineer
Up until yesterday all of them were in Washington state for a conference. Despite that, by 8:30 AM Ken was already on-site and drafting a game plan based on where I had left off. Ken even went so far as to jump on the live grenade that was M. Bison's hourly status updates. Ryu would be in the office shortly after that to guide our developers-turned-IT-operations-interns through rebuilding virtual machines and deploying our software. Unfortunately, Rashid picked up a cold at the conference and would call out sick. However, with Ken's return, I was no longer in command. Soon I would be able to focus on what I was (allegedly) good at.
Despite being out sick, Tsuchiya came through with the expedited shipping with minutes to spare. Our drives were 2 minutes away from being shipped out when he got a hold of someone. By 10:30 AM, he had the shipping speed upgraded to 1-day. He also updated the shipping address to Ken's house since the building our servers live in doesn't accept deliveries on weekends.
By that afternoon we had successfully finished building out a new production environment and restoring 2-day-old database backups to it. As for the applications, we had no way of knowing what versions were originally deployed. So we settled on deploying the most recent commit on master (this was before "main" became the default branch) from each application we could find on the laptop of the only developer who was still using their physical machine to write code.
As the team was finishing rebuilding our systems from scratch, Ken and I went over to our datacenter. Ken's return also meant that we finally had someone authorized to speak to tech support for our SAN, despite it being well out of warranty. When you’re a large enough institutional customer, they tend to give you some leeway. Yesterday the most we could do was find out if they had replacements for the hard drives we needed, but today we might find a way to save ourselves from this mess.
Ken got Dell Support on the line and they talked us through getting the SAN back up and running. Essentially, we reinserted the hard drives I had pulled the day before and prayed that at least one of them wasn't totally fried. After coaxing the unit into re-checking the health status of the drives, 1 drive came back from the dead.
Our SAN could tolerate up to 3 drive failures; a fourth meant that all of the data on it would be permanently lost. The fourth drive's return from the dead meant that the entire array could be saved by replacing the 3 fully-dead drives and rebuilding the array. We were stunned as the unit spun up and our servers detected the storage again. We thought our ordeal was finally over! Perhaps there really was a light at the end of the tunnel and a benevolent god of IT. However, before we had even finished celebrating, the drive died again. The person on the other end of the phone advised us that we should repeat our magic ritual and see if we could resurrect it once more. If it did revive, we should try to move any necessary data from the SAN to one of our healthy SANs.
Once again, we sacrificed our sanity to the machine gods and prayed, and once more the drive rose from the grave. It wouldn't last long though. The drive was running on fumes and we all knew it. So we started trying to migrate any critical virtual machine images and files off of the SAN while we had the chance. We couldn't move nearly as much data as we had hoped before the drive gave out again. We would only be able to revive it once more before it quit for good. There was in fact no benevolent god, just a petty and sadistic one.
At 4:30 PM, the ever-patient support agent from Dell advised us to start looking for a data recovery specialist. Since the drive was ready to give out at any moment, it was too risky to attempt to rebuild the array. Should that fourth drive fail at any point mid-rebuild, we would be guaranteed to lose all of our data. Chaos erupted in our makeshift war room once more. This time we were all furiously googling for data recovery vendors in New York City that were open past 5 PM. We found a handful of viable options. Between Ryu, Ken, and me, we called all of them to verify that they were open and could give us a quote. In the end, only 2 services responded. One was actually in the city. The other was just a receiving office that accepted devices and shipped them overnight (at the customer's expense) to their labs in California for processing.
We decided to go with the vendor who wasn't planning to ship our SAN across the country at our expense without even being able to provide a quote for the shipping costs. After a frantic exchange of emails and phone calls we received a quote and instructions. The vendor was located in Midtown Manhattan and would be closing at 8 PM. We needed to get the SAN to them before then if we wanted it evaluated. The vendor wanted $1000 upon receipt of the SAN for expedited evaluation. If they deemed recovery possible then they would send us a quote with the cost to perform the data recovery.
Just before 7 PM Ken and I ran over to the server room. We pulled our beleaguered SAN from the rack it had called home for years. Ken then princess-carried the 50+ lbs of steel and hard drives out the door and into the elevator. It was only as the elevator reached the ground floor that the thought of how absurd this must appear crossed my mind. Given that it was already after-hours on a Friday evening, I braced myself for a thorough interrogation from the lobby security guard. Oddly enough, the interrogation never came.
We worked our way out to the front of the building where I temporarily split from Ken. While he was finding a standpipe to precariously balance our SAN on, I was trying something equally difficult… finding a yellow cab in Manhattan. During rush hour. On a Friday. While visibly being a minority. I'm proud to say that it only took 3 tries and about 10 minutes before I caught one. They pulled up to the front of the building and we loaded the SAN in the trunk. Once in the cab we asked the guy to take us to Midtown and step on it. Naturally, we were ensnarled in traffic within blocks of our destination by a crane lifting construction materials. As the clock ran down on us, we seriously debated whether we should take a cue from television and make a run for it.
Eventually we pulled up to the recovery vendor's office building. It was a narrow doorway in a narrow building supervised by a squat security guard. He gruffly demanded to know what our business was at such an hour. We explained that we were there to drop off the very heavy SAN we were carrying to the offices of one of the building's tenants. After giving us the once over, he reluctantly took down our contact information and let us go on up in an elevator. The vendor's offices were modern and brightly lit with glass walls and doors. It was a world of difference compared to the portions of the building we had seen before. However, there was no one at the front desk to let us in. It was already 7:45. Ken joked about having to break into their offices as he grappled with the SAN. We made enough noise in the process to draw someone's attention though and soon we were let in to make the drop.
Since we were already several hours past the Purchasing department's availability, it was too late to find a way to pay the vendor. So just like with the cab ride, we charged the vendor's evaluation fee to Ken's personal credit card. Once everything was settled, we exited the building exhausted and fired off our mandatory update emails. Then we split ways and headed home. Tomorrow was a Saturday but we knew we'd be back at work anyway.
My dearest readers, I must apologize. I know I said this post would be shorter, less technical, and more lighthearted. Unfortunately, I remembered quite a few details that I had previously suppressed to protect my sanity. It looks like this series will run over its estimated length. That's okay though. I'm out of technical problems to solve, so the next post will cover "data recovery" and how 4 hard drives failed in the first place.