Wednesday, November 26, 2008

good practices: protect/manage your linux server for/over 366 days

After 366 days of uptime, my dear Debian server sent me an email via SMARTd and you can imagine my joy when I read the title ("back up data NOW", haha).

Luckily, I learned something 366 days ago (when I just reinstalled on another drive, no real data to worry about): 
  • use hddtemp to monitor the HDDs' temperature (e.g. via syslog); if it goes up, your drives will start singing different tunes, and none of them is pleasing to your ears/data :-)
  • install SMART monitoring tools
  • employ mirroring / redundancy via RAID 1, 5, or better
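
To make the temperature watching a bit more concrete, here is a minimal Python sketch that picks out hot drives from hddtemp output. It assumes each line looks like "/dev/sda: MODEL: 41°C" (check what your hddtemp version actually prints), and the 50°C limit and the function names are purely hypothetical, not something hddtemp gives you.

```python
import re

# Assumed line shape of hddtemp output: "/dev/sda: MODEL NAME: 41°C".
# Lines that do not fit (e.g. a sleeping drive) are simply skipped.
LINE_RE = re.compile(
    r"^(?P<dev>/dev/\S+):\s*(?P<model>.*):\s*(?P<temp>\d+)\s*°?C\s*$"
)


def parse_hddtemp(line):
    """Return (device, temperature in Celsius), or None if the line does not fit."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("dev"), int(match.group("temp"))


def hot_drives(lines, limit_c=50):
    """Return the (device, temperature) pairs at or above limit_c.

    limit_c=50 is a hypothetical threshold; pick one that suits your drives.
    """
    hot = []
    for line in lines:
        parsed = parse_hddtemp(line)
        if parsed is not None and parsed[1] >= limit_c:
            hot.append(parsed)
    return hot


if __name__ == "__main__":
    sample = ["/dev/sda: ST3500320AS: 41°C", "/dev/sdb: ST3500320AS: 55°C"]
    for dev, temp in hot_drives(sample):
        print(f"{dev} is running hot: {temp}C")
```

Feed it the lines from a cron job or from your syslog and mail yourself the result; the point is only to hear about the heat before smartd starts shouting.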
Luckily (again), I got a spare drive some days ago and had already started the magnificent mirroring process on a running server (many thanks, Falko!). Unfortunately (finally! :-), I didn't complete it then, since it required downtime and I wasn't that eager to reboot. Now, with the main drive failing, I had to reconsider...

So I went on to shut down the services, bring the system to its knees (almost) and issue the last rsync (yep, Falko teaches you how to do it in a working/ideal environment). Yes, there's a chance of ruining your good copy by rsyncing from a zombie device, but hey, I just let the blonde moments take over... and they didn't disappoint (at this point you should stop thinking about "good practices" and just harbor a smiley :-)

What Falko doesn't tell you is that you should stop your services before taking a hot copy (via cp, see close to the bottom of the second page), or else you'll be fixing your services (DBs, etc.) once the RAID is ready.

Rsyncing went fine, but the devil's tail came into play: I got this funny idea that I should start the services again (keeping uptime, heh?). Do NOT do it on a failing drive! In fact, you should concentrate ONLY on the important data, back that up and, once that's safe, try to recover more. Sometimes you're lucky, other times you're not, and I was stuck with the latter; I couldn't do a thing other than reboot. To my surprise, the system came up and I had a new chance to fix it by following Falko's steps.
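
The "important data first" rule can be sketched as a tiny Python helper that builds one rsync command per directory, in the order you care about. The directory names and the /mnt/good mount point are hypothetical examples; -a (archive) and -R (keep the full source path under the destination) are standard rsync options. The helper only builds the commands; it does not run anything.

```python
def rescue_plan(sources, dest_root):
    """Build one rsync argv per source directory, most important first.

    sources   -- directories in priority order (hypothetical examples below)
    dest_root -- mount point of the healthy drive (also hypothetical)
    """
    dest = dest_root.rstrip("/") + "/"
    plan = []
    for src in sources:
        # Trailing slash: copy the directory's contents; -R recreates the
        # full source path (e.g. /home/...) underneath dest.
        plan.append(["rsync", "-aR", src.rstrip("/") + "/", dest])
    return plan


if __name__ == "__main__":
    # Irreplaceable data first, things you can reinstall last.
    for argv in rescue_plan(["/home", "/etc", "/var/lib/mysql"], "/mnt/good"):
        print(" ".join(argv))
```

Run the commands one at a time (by hand, or via subprocess.run) and check each result before moving on, so a drive that dies halfway still leaves you with the part that matters.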

Remember, once the new/RAID drive is ready, make sure you install the bootloader on it, or else the reboot will teach you that. What else does the reboot teach you? Make sure you have RAID stanzas with both (hd0,0) and (hd1,0) as root! If the failing drive (i.e. hd0) dies for good, your BIOS will see your second drive as (hd0) and the grub config is now a bit off...
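
Here is a rough Python sketch of what "both as root" means in practice. It assumes GRUB legacy (menu.lst), /boot on the first partition of each disk, and made-up kernel, initrd and md device names; adjust everything to your own setup. It prints one stanza per disk with "fallback 1", so GRUB tries the second stanza if the first fails to boot, plus the commands to type in the grub shell to install the bootloader on both disks.

```python
def raid_stanzas(kernel, initrd, root_dev, disks=("hd0", "hd1")):
    """Return menu.lst text with one boot stanza per disk (GRUB legacy syntax)."""
    lines = ["default 0", "fallback 1", ""]
    for disk in disks:
        lines += [
            f"title Debian GNU/Linux RAID ({disk})",
            f"root ({disk},0)",
            f"kernel {kernel} root={root_dev} ro",
            f"initrd {initrd}",
            "",
        ]
    return "\n".join(lines)


def grub_shell_commands(disks=("hd0", "hd1")):
    """Return the grub shell commands that install the bootloader on each disk."""
    commands = []
    for disk in disks:
        commands += [f"root ({disk},0)", f"setup ({disk})"]
    commands.append("quit")
    return commands


if __name__ == "__main__":
    # Hypothetical kernel/initrd/md names, for illustration only.
    print(raid_stanzas("/vmlinuz-2.6.26-1-686", "/initrd.img-2.6.26-1-686", "/dev/md2"))
    print("\n".join(grub_shell_commands()))
```

With both stanzas in place and the bootloader on both disks, it no longer matters which drive the BIOS decides to call (hd0) after the other one dies.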

I still have one reboot underway (let's call it a "maintenance break", shall we? :-), but with these new lessons learned I hope I can safely close this chapter... Happy rsyncing/mirroring to everyone!