That can’t be good

My dom0 is only responding to the network some of the time, as evidenced by how stuttery its munin graphs have been for the last 36 hours or so, and the fact that it took 4 tries before I could ssh into it. Meanwhile, the domU, which relies on the dom0 for network bridging, is going just fine with no evidence of network problems.

Just in case, I’ve set up an “at” job to reboot the dom0 at 9:45. I’ll kill the job if I can figure out what is making it flaky in the meantime, but because it’s an at job it will still run even if I manage to totally pooch the network while I’m working. Cross your fingers and hope it comes up again.
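
For anyone who hasn’t used this trick, here is a rough sketch of the dead-man’s-switch idea, wrapped in Python so it can be dry-run safely. It is not what I actually typed: the function names, the job number 7 and the exact shutdown command are assumptions for illustration. The real pieces are at, which reads the job from standard input, atq, which lists queued jobs, and atrm, which removes one.

```python
import subprocess

# Assumed reboot command; the real job may have used something different.
REBOOT_JOB = "/sbin/shutdown -r now\n"


def schedule_reboot(when, dry_run=True):
    """Queue a reboot with at(1) and return the argv and job text used."""
    argv = ["at", when]
    if not dry_run:
        # at reads the commands to run from standard input.
        subprocess.run(argv, input=REBOOT_JOB, text=True, check=True)
    return argv, REBOOT_JOB


def cancel_job(job_id, dry_run=True):
    """Remove a queued job with atrm(1); atq shows the job numbers."""
    argv = ["atrm", str(job_id)]
    if not dry_run:
        subprocess.run(argv, check=True)
    return argv


# Dry run: only show what would be executed.
print(schedule_reboot("9:45"))
print(cancel_job(7))  # 7 is a made-up job number
```

The point of the design is that the reboot is queued on the box itself, so it doesn’t depend on my ssh session surviving.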

Can’t always be the hero

I am mostly responsible for the scripts that upgrade our software from one version to the next. For the most part, it’s pretty straightforward, since we use apt-rpm to take care of upgrading rpms and making sure their dependencies are fulfilled. There is a bit of a hitch in that some of our rpms were made by people who don’t really get rpm, so the rpm just installs a tar file and the %post script unpacks the tar. Trust me, that’s more bizarre than you think, even if you think it’s pretty bizarre. Because of that and because of dependencies, we can’t just do a “dist-upgrade”, but have to upgrade our rpms one at a time.

But the biggest twist on the upgrade is that going from 3.3 to 3.6 of our software involves going from RedHat 7.3 to CentOS 3.4. (Don’t ask what happened to versions 3.4 and 3.5 of our software – it’s too painful.) I am not exactly proud of the horrible hack I put together to do that, but it took a bunch of work to get it to the point where they can now just put a DVD in the drive and type /mnt/cdrom/upgrade, and it mostly works. It uses grub on the hard disk to boot the DVD with a custom kickstart file, formats all the partitions but one, installs the new CentOS, and then uses a custom finish script to reinstall our software and restore the backed up configuration off that one partition that wasn’t formatted.

Actually, as an aside, all of the upgrades only “mostly work”. Partly that’s because we were stupid enough to put Dell computers on the customer sites, so it’s hit or miss whether they’ll even reboot after a year or two’s continuous use. But mostly it’s because rpm has an intermittent bug in the locking code that RedHat was told about at least 3 years ago, and still hasn’t fixed. Which means that sometimes apt-get fires up rpm -U, and rpm just hangs. And of course I get blamed because the upgrade doesn’t entirely work, although I think they’re starting to realize it’s not my fault.

When the upgrades don’t work, I usually get dragged in by the customer support people to log into the customer site to fix it. And usually it’s pretty straightforward – reboot the machine with the locked rpm database, and then manually step through the steps the upgrade script would have taken on that machine. Although the customer support people have recently learned that they can just re-run the whole upgrade script, because the apt-get install portion won’t do anything if it was already done.
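
The one-rpm-at-a-time approach, with a guard against the rpm hang, could be sketched roughly like this. This is not our actual upgrade script: the function name, the ten minute timeout and the injectable `run` parameter are all made up for illustration. It leans on the behaviour described above, that apt-get install does nothing for a package that is already up to date, which is what makes a re-run harmless.

```python
import subprocess


def upgrade_one_at_a_time(packages, run=subprocess.run, timeout=600):
    """Install packages one by one; return the names that hung or failed.

    Hypothetical sketch, not the real upgrade script.
    """
    needs_attention = []
    for pkg in packages:
        try:
            # Safe to repeat: apt-get does nothing for a package that is
            # already installed at the newest version.
            run(["apt-get", "install", "-y", pkg], check=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            # Possibly the rpm locking hang. Stop here, on the assumption
            # that later rpm calls would block on the same stale lock.
            needs_attention.append(pkg)
            break
        except subprocess.CalledProcessError:
            # apt-get exited non-zero; note it and carry on with the rest.
            needs_attention.append(pkg)
    return needs_attention
```

Whatever comes back in the list is what somebody still has to look at by hand, and after a reboot the same call can simply be repeated with the full package list.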

Today I got dragged in because of a bigger problem – a customer site that is running version 6.2 of our software had a hardware crash on the main server. They sent the customer a new server, but for some stupid reason the fulfillment house that sent out the new server had version 3.3 of our software installed on it. I guess customer support sent them a 3.3->3.6 upgrade DVD, and tried to upgrade it to 3.6 remotely, and then upgrade from 3.6 to 6.2. But at that point they noticed nothing was working.

I investigated, and discovered that most of the rpms weren’t installed. Also, nothing was backed up properly in the partition that doesn’t get reformatted, so it hadn’t been restored correctly. So I decided to go back to the 3.6 version to see if I could get it working there and then go forward. Fortunately they’d left the 3.3->3.6 upgrade DVD in the drive. I ran the /mnt/cdrom/upgrade script and came back a few hours later. And sure enough, only 3 of the 8 rpms that are normally installed were installed. I tried to manually install them, but the first one failed because it had a dependency on the mozilla rpms, and the mozilla rpm was corrupt. It didn’t matter whether I tried the one in the apt repository, or on the DVD itself. It was hosed.

At this point, I gave up and said that they’d have to ship out another replacement unit with the proper version of our software installed. So much for being the hero.

But as I was leaving work, I had a few ideas on how I could have repaired that mozilla rpm and gotten it working. But I was too late. I figured even if I went back everybody would be gone, and besides they’d already ordered the replacement – my reputation as a hero is ruined. Sigh.

I guess that answers that question.

Yesterday, I asked the musical question “Just how fucked am I?”. I woke up to the answer this morning – sometime around 3:30am my domU stopped working. Good bye web sites, good bye mailing lists, good bye picture gallery, good bye blog. Dammit.

I emailed Dave at Anexxa, and found out that when they’d moved to the new rack, they hadn’t labelled the ports on the managed power, so they couldn’t power cycle my machine. But they were going out to the facility at 1:30pm and could do it then. So I volunteered to come out with them. Probably just as well that I did.

The box power cycled fine, and I thought I’d try the upgrade from there at the terminal, thinking maybe apt turned off the network connections just before it asked a question or something. They’ve done dumber things – remind me to tell you the story about trying to upgrade Morphix from an X console some day. No dice – it got to about the same place, but hung just as hard. I suppose if I really want to do the upgrade, I’ll have to try booting to the non-Xen kernel and do it from there, but for now I’ve marked the kernel packages as “don’t upgrade” and I’ll leave it alone until I have occasion to be back at the colo.
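
One way to mark packages as “don’t upgrade” on Debian is to feed “hold” lines to dpkg --set-selections. The sketch below is an illustration rather than a record of what I ran: the helper names are invented, and the package name is simply the one from the apt output quoted in the post below, so the real list of held packages may well differ.

```python
import subprocess


def hold_selections(packages):
    """Build the "<package> hold" lines that dpkg --set-selections reads."""
    return "".join(f"{pkg} hold\n" for pkg in packages)


def hold_packages(packages, dry_run=True):
    """Put packages on hold; with dry_run, only return the text to be sent."""
    text = hold_selections(packages)
    if not dry_run:
        # dpkg --set-selections takes the selection list on standard input.
        subprocess.run(["dpkg", "--set-selections"],
                       input=text, text=True, check=True)
    return text


# Dry run with an assumed package name taken from the apt output.
print(hold_packages(["linux-image-2.6.18-4-686"]), end="")
```

Replacing “hold” with “install” in the same input undoes it when it’s time to try the upgrade again.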

Just how fucked am I?

Unpacking linux-image-2.6.18-4-686 (from …/linux-image-2.6.18-4-686_2.6.18.dfsg.1-11_i386.deb) …
Done.

My colo box consists of xen.xcski.com, the dom0 which controls the others, and then xen1, xen2 and xen3 which are the domUs. Because it was way easier to do it this way, the dom0 is running Debian “etch” (aka “testing”), while the domUs are running Debian “sarge” (aka “stable”). The problem with using “testing” is that there are frequent updates, way more frequent than with “stable”. The problem with remote updates is if something fucks up, there isn’t any easy way to fix it. Usually that’s not a problem.

Today’s upgrades include a new xen kernel. But it says it’s installing a new kernel, leaving the existing one there. So it shouldn’t be a problem, right? Well, I was wrong. It downloaded the upgrades, then got to the “unpacking” stage and hung. I can’t ssh to the dom0. I can’t kill the upgrade. It’s not responding to the munin probes. The only thing I can think of is doing a power cycle and maybe scheduling a site visit. But the domUs are running fine. So why would I do anything drastic while the real meat of the colo box is still going fine?

I don’t know what to do. Wait and see, I guess.

Hours of boredom punctuated by minutes of terror?

I’ve heard commercial flying described as hours of boredom punctuated by minutes of sheer terror. Private flying, on the other hand, especially in winter, sometimes seems like hours and hours of work on the ground punctuated by a few blissful minutes in the air.

Yesterday, we had two missions in mind – we needed to get the Archer N9105X out to Batavia for its annual, and I wanted to investigate a month old report (but not a formal squawk) that the Lance was impossible to start. I’ve been meaning to look into that but the weather has either been low clouds and snow or high winds and bitter cold, so I haven’t been inclined to go to the airport. And neither has anybody else it seems – there has been almost no flying of club aircraft this winter. Not like last year when it seemed like every weekend was a good one for flying.

It’s a bit of a problem when you want to get two aircraft ready for flight but you’ve only got one pre-heater cart.

I should mention that we’ve got the coolest pre-heater cart in existence. It’s got the standard propane bottle, battery and Red Dragon heater, and some fancy ductwork to duct the heat from the heater into the cowling of the aircraft you’re trying to heat. But it’s also got an electrical panel so you can plug it into the wall to keep the battery charged up, and it’s also got a connector for a Piper External Power plug so you can use the pre-heater cart to jump start airplanes.

Anyway, only having one of them is sub-optimal when you’re trying to get two planes ready at the same time. Especially when you’re dubious about starting both of them and jump starting an aircraft requires one person at the controls and another person to remove the external power plug and stow the pre-heater cart once it’s going. So this is what we did:

First I pre-heated the Lance. I dragged it out of the hangar (man, that plane is heavy compared to an Archer) and into the sunshine so that it would hopefully not get totally cold soaked while we pre-heated the Archer. Paul P is very new as the Maintenance Coordinator for the Archer and he had an email from the previous one saying that if you hooked up the external power with the battery master on, it would actually charge the internal battery, so we did that while pre-heating it.

Once it was pre-heated and had been on this “charge” for a while, we decided that we’d try to jump start the Lance, and then once it was running, Paul P would try to start the Archer and if that didn’t work I’d idle the Lance for a while to warm it up and recharge the battery, then shut down and jump start him. But it didn’t work like that – the Lance wouldn’t crank at all, even with the external power. The prop wouldn’t move far enough to kick through the compression of one cylinder. Ok, time for plan B. I’ll have to deal with the Lance later, but right now we’ve got to get that Archer moving.

I moved the pre-heater over to start pre-heating the Dakota while Paul was to get the Archer started and go ahead. We knew the Dakota wouldn’t be a problem starting because he flew it two days ago. But unfortunately, Paul didn’t know the first rule of winter starts, which is you start the damn thing as soon as you get the pre-heater off it, and then you do the cockpit preparation. Instead, he must have sat there for 10-15 minutes with the fin strobe going, which meant that marginal battery was using power to spin gyros and the engine was getting colder. No doubt he also had the radios going and was getting the ATIS and contacting clearance as well. So by the time he tried to start, he got one good spin, but it didn’t catch that time, and it didn’t have enough juice for a second spin. So once again, it was disconnect the pre-heater cart from one plane and drag it over to another. I jump started him, and he left almost immediately after, which confirms my suspicions that he’d used battery power to get ATIS and his departure clearance.

Anyway, the Dakota was warm enough, so I dragged it out of the hangar and started it. No problems starting, and I did my pre-flight cockpit preparation with the engine running and left. It’s kind of amazing that the Dakota has almost the same engine as the Lance (it’s got an O-540 de-rated to 235 horsepower, while the Lance has an IO-540, the I standing for fuel injection, at the full 300 horsepower), but the Dakota turns over so easily while the Lance is a hard cranker even at the best of times.

I got to Batavia while Paul was just finishing up talking to Jeff Boshart about the squawks on the Archer, so I had almost no shut-down time there. We got up and going again, and I put on my foggles and flew an ILS for practice. It’s still nice how much more situational awareness you’ve got with the Garmin 530 there – Paul pointed out that we were heading to a cloud bank in a mile or two, but I pointed out on the 530’s screen that we were just opposite the FAF and we’d get turned 90 degrees very shortly. And sure enough, we got the turn almost as soon as I finished speaking.

When we got back, I discovered that although there hadn’t been any bookings for the Dakota early in the morning when I’d booked the Lance for the ferry flight, by the time we made the quick decision to take it, somebody had already booked it. D’oh! I guess I should have checked. He was waiting for us when we got there, and he was surprisingly good natured about it.

Anyway, after we got back, I decided that as Maintenance Coordinator I needed to do something about the Lance. I grabbed the battery tester and battery charger from the line shed and decided to try to charge it up. That’s when I discovered I had to remove 18 screws from an external access panel, and two screws from the forward baggage compartment floor, and 4 quarter turn fasteners on the battery cover just to get access to the battery. Before I started, the hydrometer was showing 0% charge – none of the balls were floating at all. After an hour or so, the hydrometer was showing 25% charge. Progress of a sort, anyway. I figured that the guy with the Dakota should probably be getting back soon, so I started pre-heating the Lance again as well as charging it.

The Maintenance Coordinator for the other Archer (39Z) showed up – he’d been planning to fly, but had gotten delayed so he wasn’t going to fly but wanted to check out the plane on the ground. I prevailed upon him to help me jump start the Lance, and he agreed, so I put the battery charger away, put back all those damn screws, and then disconnected the pre-heater ducts and hooked up the jump start cables. I jumped in the plane and tried, and dammit, it still wouldn’t crank with just the power from the pre-heater cart battery. However, the POH says that if you flip on the battery master, you get the power of the external power battery and the internal battery, so I tried that and it actually started. Woo hoo. Did my cockpit preparation with the engine running, and away I went.

I took it for an hour flight to charge up the battery. It was a great day for it, sunny and the air was still and smooth. I buzzed around the Finger Lakes, and practiced flying a DME arc using just the DME instead of that horrible “turn 10 twist 10” method they teach when you’re an IFR student. Works pretty well, although you really need good situational awareness to make it work. At one point, just for the hell of it I tried a steep turn – of course that’s the only time in the whole flight that Rochester Approach felt the need to point out some traffic. And boy was he confused trying to give me a clock position.

The Lance has a graphic engine monitor which also shows you some other facts, like the outside air temp and the oil temp. One of the things it shows you is the voltage level. Early in the flight I switched it to showing the voltage level and it was a nice 13.6 volts. Then I turned off the alternator master, and it quickly dropped to about 10.5 volts. Near the end of the flight I did the same experiment and this time the voltage only dropped to about 12.3 volts, so I think that proves that the battery was charging.
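
The comparison boils down to one subtraction, sketched here with the readings quoted above. The function name is made up, and treating the alternator-off reading as a stand-in for battery state is my own rough assumption, not a calibrated measurement.

```python
def battery_only_gain(early_volts, late_volts):
    """Rise in the alternator-off bus voltage between two readings, in volts."""
    return round(late_volts - early_volts, 1)


# Readings from the engine monitor with the alternator master switched off.
gain = battery_only_gain(early_volts=10.5, late_volts=12.3)
print(f"Battery-only voltage rose by {gain} V over the flight")
```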

Afterwards, every muscle in my body was complaining about all the pushing and pulling of aircraft out of hangars, running back and forth to the line shed to get appropriate keys and tach books and battery chargers and the like, and pushing the start cart around, and just standing in the freezing cold waiting for batteries to charge and engines to warm up.

I was at the airport from about 10am to about 5:30pm. In that time, I flew for nearly two hours. The work to reward ratio isn’t what I would call optimal, but I’d do it again in a minute.