When I screw up, I screw up magnificently

If you noticed a post appearing here in the last couple of days then disappearing mysteriously, you were seeing the most public symptom of a complicated technical screw-up.

The details are a bit technical, but the story is absolutely heartbreaking, and I share it here in the hopes that someone else can learn from the mistakes.

In the simplest terms, what happened was my hosting provider got awesome new hardware for the virtual private server (VPS) that genesismachina.ca was running on, and moved genesismachina.ca from the old machine to the new one. You didn’t notice any change, because the genesismachina.ca address (with all subdomains, including blog.genesismachina.ca) was simply redirected to point to the new IP address (sort of! more on that in a minute!).

Unfortunately, I put in a week’s worth of work on the old server… thinking it was the new one.

Now, let me be clear: I wasn’t stupid here. I knew the server was moving, and I took care to update the IP address in the DNS records (so that the site being served online was being served from the new hardware, not the old), and in my SSH aliases (so that when I connected to the VPS to work on it I was connecting to the new one, not the old).

I checked and double checked to make sure everything had transitioned properly. I was well aware of the chance that I might do work on the old machine that would be lost, so I made sure I was working on the new machine. Or so I thought.

So how did I screw up despite my diligence? Well, it’s technical, but if you’re technically inclined… read on and learn from my misery.

I missed a DNS record

When I redirected *.genesismachina.ca from the old IP to the new one, I forgot that blog.genesismachina.ca was configured separately.

You see, I use a wildcard DNS A record for almost all of genesismachina.ca’s subdomains. It’s really convenient, because I can just set up *.genesismachina.ca, and www.genesismachina.ca works, cpp.genesismachina.ca works, test.genesismachina.ca works, and any other subdomain of genesismachina.ca will just work automatically.

The catch is, it only works for subdomains that you have no other DNS records for. Not just other A records; any other DNS record. So if you set up *.genesismachina.ca, foo.genesismachina.ca will work fine right off the bat. But if you then add a TXT record for foo.genesismachina.ca, it will stop working – you have to add an A record for foo.genesismachina.ca explicitly to get it to work again.
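If the rule is hard to picture, here is a toy model of it in Python. It is only a sketch: the zone is a plain dictionary, the TXT value is a placeholder, and the lookup only handles a single label under the wildcard (real DNS servers follow the fuller wildcard rules in the DNS standards). The point is just that a name owning any record at all never falls through to the wildcard.

```python
# Toy model of wildcard matching in a DNS zone (simplified: one label deep).
# The zone is a dict of {name: {record_type: [values]}}.

def lookup(zone, name, rtype):
    """Return the records of type rtype that a query for name would get."""
    # If the name owns ANY record, the wildcard is never consulted,
    # even when the record it owns is of a different type.
    if name in zone:
        return zone[name].get(rtype, [])
    # Otherwise fall back to the wildcard one label up, if there is one.
    parent = name.split(".", 1)[1] if "." in name else ""
    return zone.get("*." + parent, {}).get(rtype, [])


zone = {"*.genesismachina.ca": {"A": ["172.93.123.88"]}}

# Only the wildcard exists, so foo is answered by it.
print(lookup(zone, "foo.genesismachina.ca", "A"))

# Adding a TXT record (placeholder value) makes foo exist: no more A answer.
zone["foo.genesismachina.ca"] = {"TXT": ["placeholder-verification-string"]}
print(lookup(zone, "foo.genesismachina.ca", "A"))

# An explicit A record is needed to get it working again.
zone["foo.genesismachina.ca"]["A"] = ["172.93.123.88"]
print(lookup(zone, "foo.genesismachina.ca", "A"))
```

The catch for a migration is that last step: the explicit A record is now a second place where the IP address lives, and changing the wildcard does nothing to it.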

So some time ago, after I’d set up the *.genesismachina.ca A record, I added a TXT record for blog.genesismachina.ca, so that I could verify to Google that I own the domain. That broke the wildcard A record for blog.genesismachina.ca, so I duly added an explicit A record for blog.genesismachina.ca to make it work again… then forgot about it. The UI on my DNS manager shows the top few records, then has a small “show more” button at the bottom… which I never use because the bottom few records are uninteresting (just a few Google verification strings, the validation for my Rizon vHost, and so on)… so I totally forget that there are more DNS records than I normally see.

When my VPS was moved, I updated the IP addresses on all the other A records… and missed blog.genesismachina.ca. Of course, it still appeared to work, because it was loading it from the old hardware. Eventually it would have broken when the old hardware was finally taken offline, and I would have noticed it and fixed it. But in the meantime, everything I did on the blog was being done on the old, soon-to-be-gone machine.

That’s why that post appeared and disappeared. I posted it on the old machine, and when I fixed the blog.genesismachina.ca DNS record to point to the new machine… it was simply gone. It’s still there, on the old hardware. Not that that helps.
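In hindsight, a check along these lines would probably have caught it. This is a sketch, not something I actually ran: the function name and the list of hostnames are my own invention, and the demo uses a stand-in lookup table instead of real DNS so that it shows the situation described above. It resolves every hostname explicitly and reports any that do not land on the expected address.

```python
import socket


def find_stale_hosts(hostnames, expected_ip, resolve=socket.gethostbyname):
    """Return {hostname: address} for every name not resolving to expected_ip."""
    stale = {}
    for host in hostnames:
        try:
            ip = resolve(host)
        except OSError as err:  # socket.gaierror is a subclass of OSError
            ip = f"unresolved ({err})"
        if ip != expected_ip:
            stale[host] = ip
    return stale


# Stand-in for DNS as it looked right after the move: everything had
# followed the new address except the separately configured blog record.
dns_after_move = {
    "genesismachina.ca": "172.93.123.88",
    "www.genesismachina.ca": "172.93.123.88",
    "test.genesismachina.ca": "172.93.123.88",
    "blog.genesismachina.ca": "172.93.121.23",
}

stale = find_stale_hosts(
    list(dns_after_move), "172.93.123.88", resolve=dns_after_move.__getitem__
)
for host, ip in stale.items():
    print(f"{host} -> {ip} (expected 172.93.123.88)")
```

Leaving out the resolve argument makes it use the system resolver. The important part is that the list names every subdomain that matters, rather than a sample of two that happen to be covered by the wildcard.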

But that’s not the whole story….

I missed an SSH IP

If it were just a matter of a misdirected A record, I actually would have noticed it almost right away. But I didn’t. The reason I didn’t is why I’ve now worn a groove in my desk from banging my head on it in despair.

You see, when I work on genesismachina.ca, I connect to the VPS machine via SSH. I have it configured on all my machines so that I can just type ssh vps and it securely logs me in to the genesismachina.ca machine.

To make that work, I have SSH configured on almost all my working machines like this:

Host vps
    Hostname       = genesismachina.ca
    IdentityFile   = ~/.ssh/genesismachina.ca
    User           = ...
    Port           = ...
    IdentitiesOnly = yes

The first line makes it so that I can type “ssh vps”, and it will connect using the hostname (on the second line) “genesismachina.ca”. (For more details, check out the blog post that inspired me to set this up.)

That all works fine, and because it’s using genesismachina.ca, when I redirected the genesismachina.ca DNS record to point to the new IP address, all of my machines automatically worked with the new VPS.

Except one.

One machine was set up differently. Why? Because it was the machine I used when I first got the VPS, and was setting it up… before I got the genesismachina.ca domain. Because I had no domain name back then, I had to do everything via IP address. So my SSH configuration looked like this:

Host vps
    Hostname       = 172.93.121.23
    IdentityFile   = ~/.ssh/genesismachina.ca
    User           = ...
    Port           = ...
    IdentitiesOnly = yes

You see? Unlike all my other machines, this one was connecting using the IP address, and not the domain name. After I got the domain name, I just never went back and changed the configuration, because everything worked, and I forgot about it.

Before the VPS moved to the new hardware, the domain genesismachina.ca and the IP 172.93.121.23 both pointed to the same server. After the move, genesismachina.ca pointed to the new server (at 172.93.123.88), while 172.93.121.23 pointed to the old one. So all of my machines automatically just worked – when the server changed, they followed the domain to the new server – except that one… which was set up before the domain existed to point directly to the (now old) IP.
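A forgotten hard-coded address like that is easy to hunt for mechanically. Here is a rough sketch that scans SSH client configuration text and lists every Host alias whose Hostname is a literal IP address rather than a name. The function name is mine, and the parsing is deliberately simple: it handles the "Keyword value" and "Keyword = value" forms used above, and ignores quoting, Match blocks, and Include directives.

```python
import ipaddress
import re


def hosts_pinned_to_ip(config_text):
    """Return (alias, address) pairs whose Hostname is a literal IP address."""
    pinned = []
    alias = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split the keyword from its value on "=" or on whitespace.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "host":
            alias = value
        elif keyword == "hostname":
            try:
                ipaddress.ip_address(value)
            except ValueError:
                continue  # a domain name, which will follow DNS changes
            pinned.append((alias, value))
    return pinned


# The configuration from the one odd machine, elided values left elided.
odd_machine = """
Host vps
    Hostname       = 172.93.121.23
    IdentityFile   = ~/.ssh/genesismachina.ca
    User           = ...
    Port           = ...
    IdentitiesOnly = yes
"""

print(hosts_pinned_to_ip(odd_machine))
```

Fed the contents of ~/.ssh/config on each working machine, something like this would have flagged the one that could not follow the domain to the new server.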

So if you’ve been following:

  • All the subdomains at genesismachina.ca were correctly redirected to the new machine… except blog.genesismachina.ca.
  • The SSH logins on all my machines automatically switched to the new machine by following the domain redirect… except one, because it had been set up before the domain existed, and used the hard-coded (old) IP.

If I had done any work on blog.genesismachina.ca on any machine but the exception, I would have immediately noticed there was something wrong. Changes that I made to the server via SSH wouldn’t be showing up online. It would have been confusing as hell until I realized I’d missed an A record, but at least I would have seen there was a problem, and gone about looking for a fix. And if I’d never figured it out, the blog would have gone offline when the old hardware was shut down. At that point I surely would have noticed that the IP address was wrong, and soon clued in to why.

If I had used the problematic machine to do work on any genesismachina.ca domain except blog.genesismachina.ca, I would have immediately noticed something wrong. Again, changes I made to the server wouldn’t show up online. Again, it would have been confusing as all hell, but either I would have noticed that changes made on my other work machines stuck, or eventually the old hardware would have been shut down.

So the only way this could have been really catastrophic would be if I worked on the problematic machine and only worked on blog.genesismachina.ca (and no other parts of genesismachina.ca).

What are the chances of that, right?

Here are the chances of that

So, thinking I was working on brand new VPS hardware, I spent the better part of last week fixing up and polishing blog.genesismachina.ca… using that one particular computer and account that was still pointing to the old server. I didn’t notice anything strange: changes that I made to the server were reflected right away on the site online… because both my work machine’s SSH configuration and the DNS record for blog.genesismachina.ca were pointing to the old VPS.

And keep in mind that I’d checked on other working machines to confirm that the other genesismachina.ca sites were being served from the new VPS. Specifically, I checked www.genesismachina.ca and test.genesismachina.ca… and just assumed that since they worked, all the subdomains were working… and that if they weren’t, all my SSH configurations were working, so I’d catch any problem.

So I confidently spent a week solving problems that had been annoying me for months. Several problems. I tweaked file permissions, adjusted users and groups, set up aliases, reconfigured Postfix, reconfigured Apache, reconfigured WordPress, updated and tested everything… I fixed and polished the hell out of blog.genesismachina.ca. And then I wrote a post on blog.genesismachina.ca, crowing about how much I had accomplished and how great it felt to finally have everything just right. In fact, when my housemate Alyssa came home Friday evening, I even bragged to her about what a productive week it had been.

And then….

I decided to check in to blog.genesismachina.ca from another computer, just to download the most recent backups containing all the changes I’d made.

But, strangely, none of the changes I’d made during the last week were there.

“The hell?” I thought. “How is that possible? I’m looking at the results of the changes on the site right now! They must be there.”

Sure enough, any changes I made to the site on the server weren’t being reflected online. It was like I was staring at an alternate universe. Like the site I was tweaking on the server wasn’t the same site as the one available online. But then… if the site on the server wasn’t the one available online, who the hell was serving it? Where the hell was this phantom site being served from?

When I realized where it was being served from, it was like a punch in the face. I spent the next few hours trying to figure out how the mix-up had happened… and coming to terms with the fact that I’d just wasted a week’s worth of work on what was, in essence, the contents of the trash bin.

It’s never as much fun the second time you solve a problem

So now I’m looking at trying to figure out every little tweak and fix I did on the old server, and replicating them on the new one.

Though I had done like a hundred fixes, I hadn’t actually recorded anything because I was waiting for a specific cron job to run to confirm everything was actually fine – when it did, then I would have the backups right there to copy: everything would be fixed, and I would have it backed up in one fell swoop. It seemed like a clever idea at the time. When I didn’t get an error email from cron, I assumed everything was good, and I could go download the logs and backup at my leisure.

I spent all friggin’ day today with two SSH terminals open – one on the old server, one on the new – using the Bash history of the old server to try and reverse engineer the things I’d tweaked so I could repeat them on the new server. I was also cat-ing dozens of configuration files to see what changes I’d made. And of course, repeating the tests I ran before to make sure they pass on the new server.
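For anyone stuck doing the same thing, comparing the files by machine beats comparing them by eye. Here is a minimal sketch using Python's standard difflib. It assumes the text of the same configuration file has already been copied down from both servers; the function name, the file name, and the file contents are invented placeholders, not my actual configuration.

```python
import difflib


def config_diff(old_text, new_text, name):
    """Return a unified diff of one file as found on the old and new servers."""
    return "".join(
        difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=f"old:{name}",
            tofile=f"new:{name}",
        )
    )


# Placeholder contents standing in for one file fetched from each server.
on_old_server = "setting_a = 1\nsetting_b = tweaked\n"
on_new_server = "setting_a = 1\nsetting_b = default\n"

print(config_diff(on_old_server, on_new_server, "example.conf"))
```

Lines marked with a minus sign are what the old server has and the new one lacks, which in this situation is the list of tweaks still to be repeated. An empty result means the file never changed.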

(And in case you’re wondering why I didn’t just request that the hosting provider simply resynchronize my stuff on the old server with the new server, it’s because I’d done other updates on the new server to other parts of genesismachina.ca besides the blog that I didn’t want to lose.)

I wish I could offer a tech moral here, but honestly, the circumstances that tripped me up were so ridiculously rare – and only problematic when combined – that I’m not sure there’s anything one can learn from it. The best I can think of is: “If you are working on your server ±1 month around a migration, don’t just double and triple check; quadruple check.”

Or maybe just accept that sometimes luck will deal you an incredibly douchey hand. Weeks like this are what they invented booze for.
