Archive for January, 2007

Moment of truth

Tuesday, January 16th, 2007

Tomorrow morning, I go out to the colo and replace the existing box with my spiffy “new” Yellow Menace. I’ve tested the hell out of this one and it can handle all three domUs and the dom0 all doing dd’s from /dev/zero to the hard disk over and over again, which will be a nice change from the existing one freezing up and throwing ext3 errors whenever I’m doing something disk intensive.

My plan is to pull out the old one, move the disks to the new one, boot it up, make sure I can log into the dom0 from home, and then go home to tinker with it. If things go as I expect, the services that live on here, like this blog, my mailing lists, my photo gallery, the Rochester Flying Club web site and others should only be off the air for an hour or so.

Then I’ll test the hell out of the old box when I get it home to see if I can reproduce the problem, and then see if I can fix the problem somehow, maybe with a new IDE cable. Then I’ll know how to advertise it on eBay - either as a working box, or a box with a suspect IDE controller but a probably fine SCSI controller.

And the productivity hits just keep on coming

Tuesday, January 16th, 2007

Evidently it’s company policy that working at home must be requested 24 hours in advance, in writing. So if I find myself unable to come into work for some reason, they want me to stay home and do something else rather than doing useful work on our project. Well, I’ll miss the money, but I think they’re going to miss the work more.

Ice day

Tuesday, January 16th, 2007

When this morning’s alarm clock went off, the radio was saying that Vicki’s place of work was closed because of an overnight ice storm. I looked outside and there was a good half-inch of clear ice on the trees, roads, and my car. And the local news web sites said that the state police were telling people not to drive if they could avoid it.

So I thought about what I’d be doing if I went to work, and it was just working on design documents. I had most of the documents I needed at home, so I thought “screw it” and decided to stay home.

I wanted to email people to tell them that I was going to do that, but I only had a few of their addresses. So I emailed the ones I had, and one of them emailed my new official direct supervisor (she signs my time sheets, even though I really get my job assignments and direct supervision from somebody else).

She wrote me back. She’s evidently mad that I didn’t follow her new procedure and phone her for permission *before* I decided to stay home. In the past, I’ve always been trusted to work at home if I had work that could be done at home, so this seems like a real lack of trust on her part. But then again she’s new to the project and doesn’t know any of us that well - plus she has little to no day-to-day contact with us developers, so maybe she doesn’t know us well enough to know who to trust.

So instead of having a nice day at home where I could work productively but in a relaxed environment, I had to struggle to produce work while worrying if I’d just jeopardized my job.

Just for the record, I got more work done than I would have if I’d been at work.

My new colo box

Wednesday, January 10th, 2007

It arrived last night. So of course I had to spend the whole damn night setting it up, in spite of the fact that the colo guys won’t be there to let me in for a few days.

Isn’t it pretty? As you can see, it’s a former Google Search Appliance, but without the Google software. Google evidently didn’t want anybody messing with it, and they even included a modem that you were supposed to connect to it in the case of problems so they could remotely diagnose and/or fix it. Which leads to problem number 1.

See that faceplate offset from the box in this picture? When it first came, that faceplate obscured access to the disk caddies, the floppy and the CD-ROM. Which would cause some difficulties installing new hard drives and booting from a CD to install the OS. The screws holding the faceplate on had had their (whatever you call the part the screwdriver goes in) rounded out so you couldn’t use a screwdriver to remove them. That necessitated solution number 1.

We ran out to the hardware store and bought a Dremel tool, and used that to cut grooves in the top of the screws so I could get them off. Once that was done, it was remarkably easy to get everything up and running. The drives are in caddies that slide out the front, and each drive is on a separate IDE controller - it appears that /dev/hda is the first disk, /dev/hdc is the second, /dev/hdd is the CD-ROM, /dev/hde is the third disk. The only catches in the installation are:

  • There is a BIOS password that I don’t know - there is a jumper that says it will clear the CMOS, but I haven’t had the nerve to use it in case it clears something else that I don’t want cleared.
  • There are two ethernet jacks, and the one recognized by Linux as eth0 is not the one you’d expect. Also, if you set the jumper to disable the ethernet controller, both seem to stop working. Either that, or the other one is a different type of ethernet controller that the Xen kernel can’t deal with.

I’d been using my Windows box as a sacrificial test machine, with a couple of scratch disks to emulate what I have at the colo facility. For the last couple of weeks, I’d been convinced that I was going to have to take the box home to transfer the disks and install the latest Xen kernel, because I couldn’t get it to work consistently. But the new box has two things going for it - it has three disk slots, and the current colo boxes are using LVM. Experimenting with the new box showed that I could install a basic Debian Sarge on it, upgrade it to Debian Etch, use the Xen packages in Debian Etch to get up and running, and then slap the disks from my test machine in. Even though they’re on different devices than they were on the test machine, LVM automagically recognized them and I was able to mount the domU partitions exactly the same as I had in the test machine. From there, it was a simple matter to copy the new /lib/modules/2.6.18-3-xen-686 to the partitions, make a few small changes to the domU config files, and start them up.

This is great - it means that the way things are now, I can take the new box over to the colo, swap the hard drives over, and boot it up and it will be in a state that I can do the rest of the setup from home, with no data loss and only a few hours of downtime. That is currently scheduled for next Wednesday.

Technolust

Tuesday, January 9th, 2007

I’m watching the macrumors.com coverage of Steve Jobs’s keynote, and I want one of everything he’s demoed so far, especially the internet communicator thing. This is, to quote a friend, “the sci-fi phone”.

By all means, Paul, …

Monday, January 8th, 2007

Yesterday I went to CompUSA to buy a new PCI IDE controller - one of my external USB drives that I bought to perform networked backups of my colo keeps losing its mind, and I was thinking that it might be that the USB controller (either the interface card or the one in the external box) isn’t up to heavy data transfer, so I thought it might be good to move it “indoors” as it were.

I installed the controller and a 250GB hard drive. The system found it at /dev/hdg - I guess I put it in the second of the two IDE controllers on the card. I made /dev/hdg a physical volume (pv) under LVM2, made it a volume group (vg) and put a logical volume (lv) on it for my mp3 collection. After I moved my mp3s from where they’d lived before, on /dev/hdb, I wiped /dev/hdb and made it part of the vg, and made another vg for the colo backup. Yesterday I also discovered the “--link-dest” argument to rsync, so I can keep several days’ worth of backups in much less space.
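For anyone who hasn’t met that rsync option: when a file is unchanged since the previous snapshot, rsync hard-links it into the new snapshot directory instead of copying it again, so each extra day only costs the space of what changed. Here is a rough Python sketch of building such a command. It is not the actual backup script; the function name, the host name, the paths and the one-directory-per-date layout are all made up for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def rsync_snapshot_command(source, backup_root, today, previous=None):
    """Build the rsync argument list for one dated snapshot.

    The date-named directory layout is an illustrative assumption,
    not something taken from a real backup setup.
    """
    root = PurePosixPath(backup_root)
    dest = root / today.isoformat()
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        # Files unchanged since the previous snapshot become hard links
        # into it rather than fresh copies. An absolute path is used
        # because a relative --link-dest is resolved against the
        # destination directory.
        cmd.append(f"--link-dest={root / previous.isoformat()}")
    # The trailing slash on the destination makes it a directory target.
    cmd += [source, f"{dest}/"]
    return cmd


if __name__ == "__main__":
    # Hypothetical host and paths, purely for demonstration.
    argv = rsync_snapshot_command(
        "colo.example.com:/srv/",
        "/backup/colo",
        today=date(2007, 1, 8),
        previous=date(2007, 1, 7),
    )
    print(" ".join(argv))
```

The list could be handed to subprocess.run() as-is. Deleting an old snapshot directory only frees the files that no newer snapshot still links to.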

Tonight I’m going to rip that hard disk out of the external USB drive and put it on my currently eviscerated Windows machine to see if IBM/Hitachi’s “DFT” drive function tester can find any problems with it. If not, I’ll add it to the vg and increase the size of the mp3 lv.

Tomorrow my new colo box should arrive, unless UPS does their customary screw-ups. I’m scheduled to go out to the colo facility on Thursday. I’m going to move the old drives to the new box, and upgrade to the newest version of Xen. I’ve practiced upgrading to the new Xen on my currently eviscerated Windows box (that’s why it’s eviscerated, I had to put scratch disks in it) and it didn’t go well, but I think I know what I did wrong. I also tried a full install of Debian on the dom0, and was able to save the domUs when I tried that.

If that goes well, I’ll be up again in a few hours. If it doesn’t go well, I’ll bring it home and work on it overnight, and I’m tentatively scheduled to go back to the colo on Friday.

The ultimate Heisenbug?

Saturday, January 6th, 2007

We’ve got a problem that happens apparently at random times at a few customer sites, but which we’ve been unable to reproduce in the lab. I’m not sure if that means it’s a Heisenbug or just a really nasty Bohr-bug.

The affected part of the system consists of three programs:

  • One that generates events, called “tixd”
  • One that is responsible for collecting events from all the programs in the system (not just these three) and delivering them to subscribers, called the “EventBroker” or “eb”
  • One that subscribes to the events that the “tixd” generates, which we call the “scheduled”

What has been happening on these customer sites is that after days or weeks of proper operation, for no apparent reason, the “tixd” would say that it was generating events, but the “scheduled” wasn’t getting them any more. The customer would notice the problem, sometimes a day or two later, complain that things weren’t happening that were supposed to happen, our service people would restart the whole system, and everything would start working again.

This bug has been happening for ages now, and every time I get called in to look at their logs because I wrote the “scheduled” and all the fingers point to me. But I couldn’t find any reason why “scheduled” would stop responding to events, or would unsubscribe from events. A few builds ago, Tom put some debug into his “eb” that would log every event that came in and which subscribers it was being delivered to. He also logged subscribes and unsubscribes. And so we waited.

Today, it finally happened again. And this time, I’ve got the logs that show:

  • At 6am, an event is generated by the tixd, and the eb delivers it to the subscriber scheduled
  • Between 10am and 11am, there is a flurry of event subscribes and unsubscribes, all unrelated to scheduled. But some of these unsubscribes are caused when events are being delivered to subscribers that have exited without unsubscribing.
  • At about 1am, there is another event generated by the tixd, and the eb receives it but says there are no subscribers found.

At this point, because the eb log shows no unsubscribe coming from scheduled, I’d say it’s not my bug and pass it off to Tom, the author of the eb. But unfortunately, my employer declined to renew Tom’s contract at the end of last year, so he no longer works here. He dodged this bullet by only 5 days. And so I’ve got to figure out why this is happening. Lucky me.
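Purely as an illustration of how a broker could lose a live subscriber while cleaning up dead ones, here is a toy sketch in Python. It is not the real eb, whose internals aren’t shown in this post, and nothing here claims to be the actual cause. The class names, the list-based subscriber table and the index-shifting mistake are all assumptions invented for the example.

```python
class Subscriber:
    """A stand-in for a subscribing process such as scheduled."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.received = []

    def deliver(self, event):
        # A subscriber that exited without unsubscribing fails on delivery.
        if not self.alive:
            raise ConnectionError(self.name)
        self.received.append(event)


class ToyBroker:
    """Hypothetical broker, invented for this example; not the real eb."""

    def __init__(self, buggy):
        self.buggy = buggy
        self.subscribers = []

    def subscribe(self, subscriber):
        self.subscribers.append(subscriber)

    def publish(self, event):
        """Deliver event, prune dead subscribers, return the delivery count."""
        dead = []
        delivered = 0
        for index, subscriber in enumerate(self.subscribers):
            try:
                subscriber.deliver(event)
                delivered += 1
            except ConnectionError:
                dead.append(index)
        # Deleting in ascending order shifts the later indexes, so the
        # buggy variant removes the wrong entries. Descending order is safe.
        order = dead if self.buggy else reversed(dead)
        for index in order:
            del self.subscribers[index]
        return delivered


def run_scenario(buggy):
    """Two dead subscribers ahead of a live one, then two events."""
    broker = ToyBroker(buggy)
    scheduled = Subscriber("scheduled")
    broker.subscribe(Subscriber("gone1", alive=False))
    broker.subscribe(Subscriber("gone2", alive=False))
    broker.subscribe(scheduled)
    broker.publish("morning event")  # delivered, then cleanup runs
    broker.publish("later event")    # only arrives if scheduled survived cleanup
    return scheduled.received


if __name__ == "__main__":
    print("buggy cleanup:  ", run_scenario(buggy=True))
    print("correct cleanup:", run_scenario(buggy=False))
```

In the buggy variant the live subscriber is silently dropped during cleanup without ever sending an unsubscribe, which would look the same in a log as the pattern above: a delivery, some unrelated cleanup, then “no subscribers found”.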

Another damn failure on my colo box

Thursday, January 4th, 2007

Yesterday I had a panic attack - suddenly a certain repository of binary files was empty where it had been nearly 300GB a few hours previously. I knew that Vicki was uploading some stuff to it that day using an “ncftpput” command that I’d shown her, though I knew she didn’t understand what each command line argument meant. So I’m sorry to say my first reaction was “I bet she somehow wiped it”. But I looked in her ~/.ncftp/spool/log file and couldn’t see anything unusual. I guess I owe her an apology for that thought.

I looked on the domU, and “df” showed the partition still mounted, and still 91% full. But nothing showed up to “ls”. I unmounted it and shut off the nightly backups so that they wouldn’t delete the backup. “fsck.ext3 /dev/hdb” gave an error about a zero length partition. Then I thought I should probably be doing this on the dom0, and so I logged into it and had the same error with “fsck.ext3 /dev/hdb1”. “fdisk -l /dev/hdb” on that entire drive showed that it didn’t think the drive was there at all. Oh oh. Moment of panic time - one of the other domUs has some of its disk space on that drive as well, thanks to LVM. I wonder what’s screwing up on his domU if it can’t get to some of its disk space. Time to shut them all down and reboot.

I did an “xm console xen1” to connect to my domU and that’s where I saw the oh-so-familiar ext3 errors. But everything shut down relatively cleanly and rebooted. I saw one message in the log files about resetting the ide0 controller, and I’m not sure whether that was caused by the problem or was the cause of it. And after the reboot, all 300GB of files were back. Thank goodness, because the upload bandwidth I’ve got at home these days means it would have taken months to get that partition restored from my backup.

The partition that screwed up this time is a normal primary disk partition, not an LVM logical volume, and it’s on a different physical drive than the other failures, so at least I’ve eliminated LVM and the disk hardware as a cause. But that leaves the IDE controller and Xen.

I can’t wait for my new 1U server to come. Still not sure whether I should try Xen again or VMware. VMware probably isn’t as fast and it’s a lot more difficult to manage without getting the for-pay version, but at least it’s “ready for prime time”.

Some professors get all the breaks

Thursday, January 4th, 2007

Psychology researcher studies air quality in Irish pubs around the world.

I can just see that grant application process:

“Hello, NSERC? I want money to spend the year at Irish pubs. Why? Umm, ah, to… study… air quality. That’s right, air quality. Certainly not to get drunk off my ass. Nosiree.”

Are you a pilot who blogs, or a blogger who flies?

Thursday, January 4th, 2007

I got an email today from “IFR Pilot” (who also signs off as Darrell) cc’ed to a bunch of other pilot-bloggers proposing that we all have a fly-in and get to know each other. After a few massively cc’ed exchanges where people seemed enthusiastic about the idea, I set up a mailing list so that other pilot-bloggers could find this list and sign up. If you are in that category, you can sign up at this link.

A lot of the people on “IFR Pilot”’s list were people I’d never heard of, so I can see I’m going to be adding a whole bunch of new blogs to my RSS reader.

So how’d I do? (Aviation edition)

Tuesday, January 2nd, 2007

For 2006 I set myself a few goals for my flying. If I recall correctly, they were:

  • Fly 50 hours this year.
  • Do some airwork and get more proficient at smooth flight, especially the use of the rudder.
  • Start work towards a Commercial or Float Plane rating.

Well, it didn’t quite work out that way. I only got 37.9 hours flying time (25.3 complex), although I would have been 5 or so hours closer to my goal if the Lance hadn’t been broken on the day we departed for Oshkosh, and maybe another 3 hours if we’d been able to fly to Albany on Thanksgiving weekend. Oh well. That’s still up from the 20-25 hours I normally put in a year. I also didn’t do much airwork, mostly cross country. So I still find myself having to look at the ball and putting in rudder as an afterthought rather than feeling what needs to be put in. However, I did get training in the Garmin 530, and I think I’m getting more precise in my approaches and IFR en-route flying. I also had a little adventure with ice avoidance and negotiating with ATC for what I needed on my way home from Pinckneyville. So while I didn’t meet my goals, I think I had a pretty satisfying flying year.

I’m not sure if I’m going to get to Oshkosh this year - this is our 10th anniversary and I think I’m going to be spending my vacation time on a cruise or something. So I probably won’t be heading down to Florida for Jack Brown’s Seaplane Base or up to Parry Sound for Georgian Bay Airways for a float rating either.

So my goals for this year remain:

  • Become a more proficient yoke and rudder pilot.
  • Continue to fly more than I have been in the past.