I needed that like I needed a hole in my head

As I was reading my email this morning, I noticed that 3 or 4 trackback spams had gotten through SpamKarma2, all from an IP in the UAE. I went to the SpamKarma2 page and found that as well as the 3 or 4 that had gotten through, there were also a few hundred that hadn’t gotten through. I took care of that, and was reading the rest of my email when 3 more got through SpamKarma2. All still from this IP in the UAE. Ok, this calls for bigger guns than SK2. I went to the terminal window that was tailing my logs from the colo box, all ready to “iptables” this IP out of my hair, when suddenly my terminal window stopped responding. So did my other terminal window on the dom0 of the colo box. So did all my web sites. So did my mailing lists.

I went off to work wondering if this was just a DDoS and it would come back up when they got bored of me, or if the box was truly locked up and would need a power cycle. If it was locked, I was seriously considering throwing in the towel on colo, because obviously I can’t keep the sort of uptime I demand. Even Linode was better than this, and they were getting hit by DDoSes all the time. The only thing I didn’t like about the Linode was the piss-poor amount of memory I got – 128MB versus the 1000MB I have on my domU.

On my way to work, I got an email from Vicki saying my blog was back up, and at the next traffic light I was able to verify that some of my other web sites were still running. Looks like I weathered the storm.

Photoshop is hogging my disk! Help!

Yesterday I was editing a gigantic Photoshop file (100,000 pixels by 2500 pixels) that I’d put 48 shots from my 8 megapixel camera into, by opening the shots 10-20 at a time, going into each one and doing a select all (splat A) and copy (splat C), closing the file in question, then going into the big file and doing a paste (splat V). Along the way I’d saved the big file a few times. Along the way I’d also done some experimenting with cropping the small jpegs, and the big file, although I’d ended up rolling back all the changes.

This morning I decided I needed to crop some files and overlay them on the big file. First thing I did was flatten the existing 48 layers on the big file. Then I opened up 10 of the jpegs, just as I had before. But when I attempted to crop one of the jpegs, I got a message that I was out of swap space. Actually, I got two popup messages. The first looks like an OS message:

Your startup disk is almost full.
You need to make more space available on your
startup disk by deleting files.
[ ] Do not warn me about this disk again

The second came from Photoshop:

Adobe Photoshop
Could not complete your request
because the scratch disks are full.

At this point, I tried a bunch of things. I exited Photoshop, I rebooted, and I started up Photoshop. I opened only one jpeg. I verified that Photoshop said that the file took up 21 megabytes in memory and there were 30+ gigabytes of disk space free. Then I tried the crop tool. And I got the same popups. Before I dismissed them, sure enough “df” and “Activity Monitor” both verified that all 30+ gigabytes of disk were gone. Other tests with other files have given exactly the same results. Even if I resize the file to half the size (and it says it’s only taking up 6 megabytes in memory) it still consumes all the disk space when I attempt to crop it.

Can anybody tell me what would make Photoshop suddenly change so that cropping a file should cause it to use over 1000 times as much disk space as the size of the original file?

That can’t be good

My dom0 is only responding to the network some of the time, as evidenced by how stuttery its munin graphs are for the last 36 hours or so, and the fact that it took 4 tries before I could ssh into it. Meanwhile, the domU, which relies on the dom0 for network bridging, is going just fine with no evidence of network problems.

Just in case, I’ve set up an “at” job to reboot the dom0 at 9:45. I’ll kill the job if I can figure out what is making it flaky in the meantime, but because it’s an at job it will continue even if I manage to totally pooch the network while I’m working. Cross your fingers and hope it comes up again.
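
A rough sketch of that safety net in Python, assuming at(1) is installed and atd is running on the box. The function name, the dry_run flag and the /sbin/reboot path are illustrative assumptions, not what was actually typed on the dom0.

```python
import shlex
import subprocess


def schedule_failsafe_reboot(when="9:45", command="/sbin/reboot", dry_run=True):
    """Queue `command` with at(1) so it runs at `when` unless the job is removed.

    With dry_run=True nothing is scheduled; the equivalent shell pipeline
    is returned so it can be inspected first.
    """
    argv = ["at", when]
    if dry_run:
        # Show what would be run instead of touching the at queue.
        return f"echo {shlex.quote(command)} | {' '.join(argv)}"
    # at(1) reads the commands to execute from standard input.
    result = subprocess.run(
        argv, input=command + "\n", text=True, capture_output=True, check=True
    )
    # at typically reports the queued job number on stderr.
    return result.stderr.strip()


if __name__ == "__main__":
    print(schedule_failsafe_reboot())
```

Once the job is queued, atq lists it and atrm followed by the job number cancels it. The point of using at rather than a command typed later is that the reboot no longer depends on the ssh session surviving.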

Can’t always be the hero

I am mostly responsible for the scripts that upgrade our software from one version to the next. For the most part, it’s pretty straightforward, since we use apt-rpm to take care of upgrading rpms and making sure their dependencies are fulfilled. There is a bit of a hitch in that some of our rpms were made by people who don’t really get rpm, so the rpm just installs a tar file and the %post script unpacks the tar. Trust me, that’s more bizarre than you think, even if you think it’s pretty bizarre. Because of that and because of dependencies, we can’t just do a “dist-upgrade”, but have to upgrade our rpms one at a time.

But the biggest twist on the upgrade is that going from 3.3 to 3.6 of our software involves going from RedHat 7.3 to CentOS 3.4. (Don’t ask what happened to versions 3.4 and 3.5 of our software – it’s too painful.) I am not quite proud of the horrible hack I put together to do that, but it took a bunch of work to get it working so that they can just put a DVD in the drive and type /mnt/cdrom/upgrade now and it mostly works. It uses grub on the hard disk to boot the DVD with a custom kickstart file, formats all the partitions but one, installs the new CentOS, and then uses a custom finish script to reinstall our software and restore the backed up configuration off that one partition that wasn’t formatted.

Actually, as an aside, all of the upgrades only “mostly work”. Partly that’s because we were stupid enough to put Dell computers on the customer sites, so it’s hit or miss whether they’ll even reboot after a year or two’s continuous use. But mostly it’s because rpm has an intermittent bug in the locking code that RedHat was told about at least 3 years ago, and still hasn’t fixed. Which means that sometimes apt-get fires up rpm -U, and rpm just hangs. And of course I get blamed because the upgrade doesn’t entirely work, although I think they’re starting to realize it’s not my fault.

When the upgrades don’t work, I usually get dragged in by the customer support people to log into the customer site to fix it. And usually it’s pretty straightforward – reboot the machine with the locked rpm database, and then manually step through the steps the upgrade script would have taken on that machine. Although the customer support people have recently learned that they can just re-run the whole upgrade script, because the apt-get install portion won’t do anything if it was already done.
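
The re-run trick works because each step is safe to repeat. A minimal sketch of that shape in Python, with a timeout so a hung step gets noticed instead of sitting there forever. The function names, the step commands and the 30 minute default are hypothetical, not the actual upgrade script, and killing a hung rpm this way does nothing about the locked database, so the reboot is still needed.

```python
import subprocess
import sys


def run_step(argv, timeout_s):
    """Run one upgrade step; return True on success, False on failure or hang."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        subprocess.run(argv, check=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        print(f"step hung after {timeout_s}s: {argv}", file=sys.stderr)
        return False
    except subprocess.CalledProcessError as err:
        print(f"step failed with exit code {err.returncode}: {argv}", file=sys.stderr)
        return False


def run_upgrade(steps, timeout_s=1800):
    """Run steps in order; return the index of the first bad step, or None.

    Every step is assumed to be a no-op when its work is already done,
    so the whole list can simply be run again after a failure.
    """
    for index, argv in enumerate(steps):
        if not run_step(argv, timeout_s):
            return index
    return None
```

A caller would pass one argv list per package, for example one apt-get install invocation each (package names left out here on purpose), and re-run the same list after fixing whatever stopped it.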

Today I got dragged in because of a bigger problem – a customer site that is running version 6.2 of our software had a hardware crash on the main server. They sent the customer a new server, but for some stupid reason the fulfillment house that sent out the new server had version 3.3 of our software installed on it. I guess customer support sent them a 3.3->3.6 upgrade DVD, and tried to upgrade it to 3.6 remotely, and then upgrade from 3.6 to 6.2. But at that point they noticed nothing was working.

I investigated, and discovered that most of the rpms weren’t installed. Also, nothing was backed up properly in the partition that doesn’t get reformatted, so it hadn’t been restored correctly. So I decided to go back to the 3.6 version to see if I could get it working there and then go forward. Fortunately they’d left the 3.3->3.6 upgrade DVD in the drive. I ran the /mnt/cdrom/upgrade script and came back a few hours later. And sure enough, only 3 of the 8 rpms that are normally installed were installed. I tried to manually install them, but the first one failed because it had a dependency on the mozilla rpms, and the mozilla rpm was corrupt. It didn’t matter whether I tried the one in the apt repository, or on the DVD itself. It was hosed.

At this point, I gave up and said that they’d have to ship out another replacement unit with the proper version of our software installed. So much for being the hero.

But as I was leaving work, I had a few ideas on how I could have repaired that mozilla rpm and gotten it working. But I was too late. I figured that even if I went back, everybody would be gone, and besides they’d already ordered the replacement – my reputation as a hero is ruined. Sigh.