Geekery – Page 14 – Rants and Revelations

No such thing as a smooth upgrade.

My colo box has started exhibiting this strange behaviour:

My “guest” (aka domU) OS will stop talking to the network. I can still log into it by going to the “host” (dom0) OS and issuing the xm console xen1 command.
The guest still thinks it’s connected to the network. ifdown eth0; ifup eth0 doesn’t accomplish anything.
If I reboot the guest, using shutdown -r now "", reboot, or, from the host, xm shutdown xen1; xm create xen1.cfg, it doesn’t come back up. xm gives an error about being unable to reserve enough memory.
If I reboot the host, it doesn’t come back, and I have to either go into the colo or put in a trouble ticket, wait a few hours and then phone them up to ask why they’re ignoring my trouble ticket. They always respond that they’re really swamped right now and they must have missed it in the rush. When I go in, they’re always bored out of their minds and playing games. Oh, and good fucking luck finding a phone number anywhere on their web site. I only found one because I had it in my phone from before they were taken over by Earthlink.

When it was happening every 4 or 5 months, I wasn’t worried. When it happened twice in one month, I got worried. Now that it has happened again 3 days after that “twice in one month”, I’m really worried.

Thinking that this might be a Xen problem, I decided to upgrade the host OS from Debian 6 to Debian 7. Mostly, it worked just fine except for two “small” problems:

I couldn’t figure out how to make it boot the Xen stuff automatically and
When I manually booted the Xen stuff, the network wouldn’t come up

The first problem is due to the way they re-arranged the grub menu – all the Xen stuff is under a submenu. The recommendation I found was to use dpkg-divert --divert /etc/grub.d/08_linux_xen --rename /etc/grub.d/20_linux_xen to put the Xen stuff ahead of the non-Xen stuff in the Grub menu. That seems like a cheesy hack, but I’ll take it for now.

The second problem appears to be because of changes in the way Xen does bridging – evidently they bring up eth0 before /etc/network/interfaces brings it up, or something like that, and everybody gets all confused. The extremely dubious hack I found on-line to fix that is to add a pre-up ip addr del xx.xxx.xxx.xxx/255.255.244.0 dev eth0 || true to the definition of eth0 in /etc/network/interfaces. I suspect a better long term answer will be to figure out how to set up the proper bridging for the Xen stuff.
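A netmask typed by hand into a pre-up line like that is easy to get wrong. Here is a minimal sketch of a sanity check, in Python, that turns a dotted netmask into a prefix length and rejects masks whose one-bits are not contiguous. The function name is made up for illustration and is not part of any Debian or Xen tooling.

```python
def prefix_length(netmask):
    """Return the CIDR prefix length for a dotted IPv4 netmask.

    Returns None if the string is not four octets in 0..255, or if the
    one-bits of the mask are not contiguous from the left.
    """
    try:
        octets = [int(part) for part in netmask.split(".")]
    except ValueError:
        return None
    if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
        return None
    # Write the mask as a 32-character string of bits.
    bits = "".join(format(o, "08b") for o in octets)
    # A valid mask is a run of ones followed only by zeros.
    ones = bits.rstrip("0")
    if "0" in ones:
        return None
    return len(ones)
```

With this check, 255.255.254.0 comes out as a /23 and 255.255.255.0 as a /24, while a mask such as 255.255.244.0 is rejected because its bits are not contiguous.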

Now that’s all hacked together to work, fingers crossed that it actually reduces the freeze-up problem. Meanwhile, all the guest OSes are still running a 2.6.32-5-xen-amd64 kernel and I’d like to switch to a 3.2.0-4-amd64 kernel. Hopefully I can do that without another long night of standing in a hot colo facility.

What a day so far

First off, our power went out. A quick survey of the neighborhood showed it was out all up and down the street, and a call to RG&E revealed that it was a tree down over a power line.

Then while I was taking a break in the back yard, unable to work because of the lack of power (although in retrospect I probably should have mowed the grass since I’m going to have to work tomorrow to make up lost time), I got an email from a member of the flying club asking why my email address (and a non-functioning email address at that) was listed as a technical contact for their domain, and can I help them transfer the domain over to their control. Doing what googling I could do on my phone showed that the current registrar are notorious domain hijackers. Oh oh.

Once the power came on, the main router was flashing a green “power light” and not connecting. Again, doing what limited web searching I could do on a tiny smartphone screen showed that this means the firmware is corrupt, and it can happen if the router loses power (which seems like a pretty shitty failure mode – you’d think the firmware could only be corrupted if it were in the process of updating the firmware, otherwise it’s not exactly what you’d call “firm”, now is it?). The solution is to download the latest firmware and reflash the ROMs, which is difficult if you don’t have an internet connection. Fortunately I have two of these routers, one at the other end of the house to act as a wireless repeater. So I grabbed that one and did a factory reset, and then reconfigured it as best I could. That was a bit of a hassle because at some time in the past I changed the name of our wifi from either Robinson_Tomblin to Tomblin_Robinson or vice versa, and I couldn’t recall which, and so when I got it wrong the iPad and iPhone happily connected to it, but the printer, the TiVos and the Nexus 7 wouldn’t.

With network connections re-established (sort of – every router configuration change seemed to involve losing it again for a time up to a minute or so), it was time to download the new firmware, enable tftp in the Windows laptop, and flash it. Amazingly enough, it actually worked. Then I reconfigured that router, and everything was back in business.

Except now my security camera isn’t working. Down to the basement to unplug the POE cable, plug it back in, and it’s working.

Now it’s time to look into the flying club business. Thank goodness for searchable mail archives – the club asked me to transfer the domain to them in February 2011, and I did. And they were using that infamous domain thief as their registrar. And at the time I pointed out that they’d need to reset all the various contact email addresses. I also gave them a list of email forwards I had set up for their domain, and they decided to turn them all off. So phew, it’s not my problem and not my fault and if they can’t remember how to log into their registrar account and change the email address, too bad for them. I feel sorry for them, and I don’t wish them ill will, but the relief of it not being something I have to help fix is overpowering all that.

Why I want to punch Microsoft in the face

Ok, not the company, just anybody who was ever involved in their web browsers.

I’m writing a web application. I’m trying to make it modern with good UX (User Experience). Sometimes my boss’s decisions go against that desire, but I do what I can. Real world requirements aren’t always as straightforward as the stuff you read in “Design For Hackers”.

So this week, I did a new part of the app. It was finally working the way I wanted it to on real browsers, so then I turned to IE testing. It didn’t work right on anything older than IE 10. After two days of screwing around, I had a workaround that worked ok on IE 8 and 9 – it didn’t look too much worse than it does on real browsers, just different. That’s good, because the boss says that IE 8, because it comes on Windows 7, is the corporate standard and I don’t have to support IE 6 or IE 7. So I uploaded my test code to their server and clicked on the link, and it looked like a dog’s breakfast. Turns out that Microsoft, in their infinite wisdom, have decided that when something is on your intranet, it should run in “compatibility mode”, which basically means it acts like IE 7.

IE is supposed to recognize a header, “X-UA-Compatible”, which is there so the web developer can tell the browser which version of IE it’s written for, but because Microsoft are a bunch of idiots, they decided that the “use compatibility mode on the intranet” setting should override this. I can’t think of a single reason for this, other than sheer idiocy.
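For reference, here is roughly what sending that header looks like from the server side. This is a minimal sketch as a Python WSGI wrapper, not the code from the app in question. The names add_ua_compatible and hello_app are invented for the example, and IE=edge is just one value the header can carry (it asks for the newest rendering engine the browser has). As described above, the intranet compatibility setting can still override whatever the server sends.

```python
def add_ua_compatible(app, mode="IE=edge"):
    """Wrap a WSGI app so every response carries an X-UA-Compatible header."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Drop any existing copy of the header, then add ours.
            headers = [(k, v) for k, v in headers
                       if k.lower() != "x-ua-compatible"]
            headers.append(("X-UA-Compatible", mode))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped


def hello_app(environ, start_response):
    """Stand-in application used only to demonstrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<!DOCTYPE html><title>demo</title><p>hello</p>"]


application = add_ua_compatible(hello_app)
```

The wrapped application can be served by any WSGI server, for example wsgiref.simple_server.make_server from the standard library.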

On StackOverflow, a user offered up a “simple” workaround – all you need to do is get every web server on the corporate intranet except yours to change to serve up a “X-UA-Compatible” that specifies compatibility mode, and then the sysadmins to change the default setting on the Active Directory servers (and probably Citrix as well) to make sure people’s logins allow the setting from the web server to take precedence over their login settings. That of course pre-supposes that you can even find every web server on the corporate intranet. And find their owners. And get those owners to sign anything without 12 years of running around making business cases and getting manager approvals. And then get the web servers actually configured that way.

I think it would just be faster to wait for every computer in the company to be replaced by one running a better OS. Or the heat death of the universe.

So off I go to try to find a work-around that works on IE 7 as well.

Upgrades are never easy

Debian stable just updated. Usually when Debian drops a new “stable”, it means it’s bombproof as hell and tested out the wazoo. This time, I’m not so sure that is true.

First candidate is a virtualbox that I use to keep some client data on an encrypted partition and safer than just leaving it on my desktop machine.

First attempt threw some errors about problems with “default-jre” and “openjdk-6-jre”, but I don’t use java on this virtualbox so I just removed them.

Second attempt gave a huge problem because of some conflict between CPAN installed Perl modules in /usr/local/share/perl/5.10… and the new 5.14 modules. It seems to me that the installer should just remove /usr/local from the Perl paths and ignore any locally installed stuff.

I tried removing that directory manually, but by that time the install was so screwed up that I actually went back to a clone I’d made of the virtualbox and tried again. This time I removed the JRE stuff and moved /usr/local/share/perl out of the way. The upgrade went much more smoothly, except the screen goes totally blank for a long time during the upgrade, and when it’s done the reboot prompt is showing empty boxes instead of letters. Fortunately I guessed correctly as to which box was the “ok” button.

After it upgraded, I discovered that Postgres 8 was marked as deprecated, so I did a pg_dumpall, removed it, imported the dump into Postgres 9, and all was well, no problems. Then I had to get RT working again, so I used aptitude to install as many of the packages as I could that formerly had been in /usr/local/share/perl. The only one I couldn’t find a deb for was Plack::Handler::Starlet, so I let CPAN install it.

Once that was up and running to my satisfaction, I figured it was time to move on to my linode. The linode hosts my navaid.com databases and a bunch of mailman mailing lists, and not much else. Remembering the Postgres 8 to 9 thing, I made sure to pg_dumpall before I started. There were no files in any local perl directories, and no jre, so I was good to go.

As it was updating, I saw it removing the Postgres 8 version of postgis. Oh oh, I thought, that’s not good. I’ve discovered in the past that you can’t simply recreate a postgis database using a pg_dumpall dump. So after the upgrade, I of course tried to install postgis for PostgreSQL 9, and once again panicked as it dragged in a ton of X11 crap I don’t need. Then I tried and failed to do a restore of the dump file. What I ended up doing was

creating the database user for that site
creating the databases for that site
running the scripts that come with postgis for creating the spatial functions
copying the pg dump file, and cutting out anything related to other DBs, and cutting out the drop and creation of these DBs.
running this cut-down version of the dump file
making another copy of the dump file that includes all the other DBs, including the drop and create commands, and running it.
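The cutting-out step can be done by hand in an editor, but it can also be sketched in a few lines of Python. This is only a sketch under an assumption: that the dump is plain-text SQL in which each database's contents follow a line of the form \connect dbname, so that everything before the first such line for a wanted database (including the role and CREATE DATABASE statements) can be skipped. Newer versions of pg_dumpall may write the \connect line differently. The function name and the database names in the usage below are invented for the example.

```python
import re

# Matches lines like:  \connect navaid   or   \connect "navaid"
CONNECT_RE = re.compile(r'^\\connect\s+"?([^"\s]+)"?')


def extract_databases(dump_lines, wanted):
    """Yield only the lines that belong to the databases named in `wanted`.

    A database's section starts at its \\connect line and runs until the
    next \\connect line. Everything outside the wanted sections is skipped,
    including the global section at the top of the dump.
    """
    wanted = set(wanted)
    keep = False
    for line in dump_lines:
        match = CONNECT_RE.match(line)
        if match:
            keep = match.group(1) in wanted
        if keep:
            yield line
```

Usage would be along these lines, with hypothetical file and database names: open the full dump, pass the file object and ["navaid"] to extract_databases, and write the yielded lines to a second file that is then fed to psql after the spatial functions are in place.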

Everything seems to be running now.

Some time I’ve got to go on and upgrade my xen host and guest OSes on my colo box, but I’m really reluctant to do that one because if something goes wrong, I’ll have to drive in and try to fix it while standing in a freezing cold server rack farm.

Something strange is going on…

There is something strange going on with my colo box. I tried to reboot it last month and it didn’t come up – I had to call my provider and get them to power cycle it. Nothing useful in the logs.

Yesterday I had to install a security update to the xen hypervisor, but I didn’t reboot. This morning, I discovered that the websites running on the xen guest (the domU in xen parlance) were not working. So I tried to log in, or ping, and discovered it wasn’t talking to the network. Fortunately the xen host (aka dom0) was working – I could log into it, then use xm console xen1 to log into the guest. Couldn’t find anything wrong, except it’s not talking to the network. Even “ifdown eth0; ifup eth0” doesn’t cure it. So I tried to reboot the guest, but it didn’t seem to come back up. I wondered if the hypervisor update I installed yesterday was the problem, so then I rebooted the whole computer, and it didn’t come back up either.

I drove down to the colo facility, and connected a monitor and keyboard, but nothing showed up. On the front panel, there are a couple of blinking lights. I power cycled. It came up just fine. Logged into the host, xm consoled into the guest, verified that I could ssh out, and from my home computer I could wget a few web pages from it. Issued a reboot command, and it booted just fine. Poked around the BIOS settings to see if there was something about not booting if there wasn’t a keyboard or something stupid like that, but couldn’t find anything. Booted, verified once more, and came home.

Until the next time, I guess.
