More upgrades

So back when I wrote this post, my system had two 21″ 1080p monitors, a pair of 500Gb drives, and another pair of 1Tb drives. But I didn’t stand still.

Over time, I replaced one of those monitors with a 27″ WQHD IPS LED monitor. That’s a lot of letters, but the important thing is that it’s very big, it has a lot of pixels, and it’s beautifully sharp. I had hoped when I got it that I’d be able to keep both of the 21″ monitors as well, because the motherboard has built-in HDMI and DisplayPort, but it appears that when you have an external graphics card the built-in graphics get shut off. I may be wrong about that, but I never found a way to re-enable them.

A few weeks ago I decided to treat myself and bought a second video card (this time an NVIDIA GeForce GT 620) to go into the second PCIe slot. That took a bit of fiddling until I discovered that I needed to enable “Xinerama” in the NVIDIA Settings tool to get all three monitors behaving as one desktop, so I could drag windows from one to another and cut and paste between them. Without that setting, the two original monitors acted the way they had before, and the third was a separate screen that would have nothing to do with them. Interestingly enough, though, KDE’s keyboard settings no longer work – I had to go into the xorg.conf file and manually add a setting to swap the Control and Caps Lock keys. Num Lock is also no longer on by default, although I haven’t manually fixed that. And there are some weird little graphical glitches, especially on the login screen.
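For reference, the keyboard fix is just an extra option in xorg.conf – something along these lines, though the exact section layout will depend on how nvidia-settings wrote your file, so treat this as a sketch rather than a copy-and-paste recipe:

    Section "InputClass"
        Identifier      "keyboard defaults"
        MatchIsKeyboard "on"
        Option          "XkbOptions" "ctrl:swapcaps"   # swap Control and Caps Lock
    EndSection

The Xinerama toggle that nvidia-settings writes ends up as Option "Xinerama" "1" in the ServerLayout section.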

I also replaced my lovely, clicky Unicomp keyboard – which had started behaving oddly after I spilled Diet Coke on it – with a Das Keyboard Professional Model S. It has Cherry MX Blue switches, which means it has almost as good a feel as the Unicomp, but it’s not as noisy.

Over the intervening years, I’ve also upgraded the disks. I haven’t really needed more space (although I’m getting more profligate about keeping stuff I formerly would have deleted), but as disks have gotten old I’ve replaced them with bigger ones. So I went from 2x500Gb + 2x1Tb to

  • 2x1Tb + 2x2Tb
  • 2x2Tb + 2x2Tb

There might have been an intermediate step along the way that I’ve left out. In each case, I’ve used fdisk to put a single partition on each new disk (although I could just add the raw device to a RAID, I’ve found that making a partition lets the disk be booted from later – which matters, because as disks age out the second pair eventually becomes the first, boot pair, and so on). Then I’ve made them into a RAID-1 using mdadm --create /dev/mdNNN --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1, and migrated everything off the old pair onto the new pair using pvcreate /dev/mdNNN; vgextend lvm2 /dev/mdNNN; pvmove /dev/mdMMM; vgreduce lvm2 /dev/mdMMM. Then I’ve made sure grub knows about the new disks using grub-install /dev/sda, and usually I’m good to go.
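Spelled out as a sequence (the device names and md numbers are placeholders, just as in the text – the new disks appear as /dev/sdc and /dev/sdd, the old mirror is /dev/mdMMM, and my volume group happens to be named lvm2):

    # mirror the two new partitions
    mdadm --create /dev/mdNNN --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

    # add the new mirror to the volume group, drain the old mirror, then drop it
    pvcreate /dev/mdNNN
    vgextend lvm2 /dev/mdNNN
    pvmove /dev/mdMMM            # the slow part: hours, not minutes
    vgreduce lvm2 /dev/mdMMM

    # reinstall grub's boot code on a disk the BIOS will actually boot from
    grub-install /dev/sda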

A couple of weeks ago, one of my 2Tb drives reported some SMART errors. Nothing bad enough to trigger an email report, but enough to turn on an orange warning flag in munin (basically smartctl returns an exit status of 192). I took the machine down and ran SeaTools, which found a couple of sector errors and offered to repair them. I repaired them, and everything was fine for a week or so, and then the same thing happened. At that point I decided to buy new disks. And since 3Tb disks are now as cheap as the 2Tb disks were back when I bought them, I figured it was time to upgrade. So I bought a pair of 3Tb drives.
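That 192 is smartctl’s exit status, which is a bit mask: bits 6 and 7 (64 + 128) mean the drive’s error log and self-test log contain entries – exactly the “worrying but not yet fatal” state that munin flags. Checking by hand looks roughly like this (substitute whichever drive is suspect):

    smartctl -H /dev/sdc             # the drive's overall health self-assessment
    smartctl -l error /dev/sdc       # the drive's own error log
    smartctl -a /dev/sdc; echo $?    # exit status 192 = bits 6 and 7 set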

When they arrived, I pulled out one pair of 2Tb drives and checked the model numbers. They agreed with the model numbers SeaTools had reported for the bad drives, so I put the new 3Tb drives in their place and connected the old ones up to the “other” SATA cables – the ones that don’t correspond to a drive carrier, so the disks just sit on the floor with the cables hanging out. I booted everything up and went through the whole process – the pvmove is the worst part, because it takes a few hours. After it was done, I was looking at things and realized that the new 3Tb drives had only made a 2Tb RAID! Wish I’d noticed that before I’d done all the time-consuming stuff. It turns out that fdisk only does MBR partition tables, which top out at 2Tb, and it doesn’t give you any warning before it makes a 2Tb partition on your 3Tb drive. So I did the same commands in reverse to move all the content back off the new drives onto the old ones. Then I used GNU parted instead of fdisk to give each disk a GPT partition table with one 3Tb partition. Went through the hours of migration again, and it wouldn’t boot. It would boot with the old disks hanging off the side, but not if I took them out. A bit of reading revealed that there’s a wrinkle with grub and GPT disks – you need to create a small first partition (flagged bios_grub) for grub to embed its core image in, and then the second, big partition to put the RAID on. So off I went again: migrating back, repartitioning, migrating forward. All in all, a lot of hours wasted on this.
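For the record, the partitioning that finally worked looks roughly like this in parted (the device name, offsets, and sizes here are illustrative – the important part is the tiny bios_grub partition ahead of the big RAID partition):

    parted -s /dev/sdc mklabel gpt
    parted -s /dev/sdc mkpart primary 1MiB 3MiB
    parted -s /dev/sdc set 1 bios_grub on     # grub embeds its core image here
    parted -s /dev/sdc mkpart primary 3MiB 100%
    parted -s /dev/sdc set 2 raid on          # this partition goes into the RAID-1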

But after all that work, the damn thing still wouldn’t boot. I could plug the old disks back in and boot, but as far as I could tell, the RAID that included those two disks wasn’t being used for anything – it didn’t show up in pvdisplay, and I could mdadm --stop it and the system would keep going. But if I unplugged them, it wouldn’t boot – it would show the grub prompt screen, tell me it was unable to read /dev/fd0 (which is odd, because I don’t have a floppy), say it was unable to read lvm/lvm2-boot, and then dump me at the grub rescue prompt, which is utterly useless. But as long as I kept those two drives plugged in, it booted, so who was I to complain?
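By “wasn’t being used for anything” I mean checks along these lines – this isn’t the machine’s actual output, just the shape of the check, with /dev/mdMMM standing in for the old array:

    pvs                          # the old array doesn't show up as a physical volume
    mdadm --detail /dev/mdMMM    # assembles cleanly, but nothing maps to it
    mdadm --stop /dev/mdMMM      # stops cleanly, and the running system doesn't care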

I asked on askubuntu.com, and didn’t get any response. I asked on the ubuntu forums, and got a response from a moderator who said “I don’t know much about RAID and lvm”, but who then proceeded to assert about 7 different things that were completely untrue about RAID and lvm. He also demanded that I run this tool, “boot-repair”, which would magically cure everything. Except for a couple of problems:

  • The documentation for the tool says that you can start it “from the command line”, but what they really mean is that you can start it “from the command line” if you’re running in a graphical environment. It doesn’t work if you’re away from home and sshed in. Minor, but annoying.
  • It wants to destroy your existing mdadm setup and replace it with dmraid. That’s a big nope.
  • It sends a lot of information about your system to a pastebin file, without giving you any option to edit or redact any of it before it shares it with the world. Hey, look at that, it dumped out some disk sectors that have an email on them!
  • After gratuitously mucking about with my system, it didn’t actually fix anything.

Oh, and when you write to the authors of the tool, as the tool itself recommends you do if it didn’t fix your problem, you get an email that basically demands you donate some money to them before they’ll look at your pastebin file.

I tried asking on G+ as well, but the only advice I got there was a suggestion that I give up trying to boot from the GPT drives and install an SSD to boot from.

Anyway, in order to diagnose some more, I booted from the Kubuntu 13.10 install disk and used the “Try” option to get a live environment going. With the “old” disks not installed, I was able to assemble two RAID-1s, one 3Tb and one 2Tb. So far so good. But then I noticed something that made my heart sink – pvdisplay was showing the 3Tb pair, but it was listing the other physical volume as “missing”. I suddenly realized why I hadn’t been able to boot after taking out what I thought was the older, failing pair of disks – because I’d actually taken out the newer, non-failing pair. Because mdadm and lvm successfully insulate you from worrying about getting disks in the right place and the right order, I had assumed that because the pair I was migrating away from showed up as /dev/sde1 and /dev/sdf1, they were the ones sitting outside the disk caddies on the ground. But in actual fact, the ones sitting on the ground were /dev/sdb1 and /dev/sdd1. I was fooled because device letters don’t map 1:1 to specific SATA cables on the motherboard – if you put in extra drives, they might end up between the ones you had before. With a growing mixture of trepidation and excitement, I checked the part numbers as well as the model numbers on the four 2Tb disks, and confirmed my mistake. I put the newer 2Tb drives back in the caddies, removed the older 2Tb drives, and everything booted correctly. God alone knows how many steps back things would have worked if I’d bothered to check those part numbers earlier. I probably wouldn’t have had to inadvertently post the contents of an old email on pastebin, that’s for sure.
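The moral for me: before pulling anything, tie each /dev/sdX back to a physical disk by model and serial number rather than trusting the letters. Something along these lines (the device name is just an example):

    ls -l /dev/disk/by-id/ | grep -v part     # symlinks named after model and serial, pointing at sdX
    smartctl -i /dev/sdb | grep -E 'Model|Serial'
    hdparm -I /dev/sdb | grep 'Serial Number'

The by-id names stay with the disk no matter which SATA port it lands on, which makes them a much safer thing to write on a sticky note than the sdX letters.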

As of right now, /dev/sda and /dev/sdb are the 3Tb drives and /dev/sdc and /dev/sdd are 2Tb drives, and now that I have an extra Tb to play with, I have to figure out how to allot it. I currently have no way of knowing whether the system is booting from /dev/sda or from /dev/sdc, and I’m also not 100% sure that the 2Tb pair I removed is the one that had the problems in the first place. I think I’ve got a few extended SeaTools sessions ahead of me.
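One rough way to narrow down the booting question without a trip into the BIOS is to look for GRUB’s boot code in the first sector of each drive – it doesn’t say which drive the BIOS actually picks, but it rules out the drives that couldn’t be booted at all (a sketch, run as root):

    for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        printf '%s: ' "$d"
        dd if="$d" bs=512 count=1 2>/dev/null | strings | grep -q GRUB \
            && echo "has GRUB boot code" || echo "no GRUB here"
    done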