Critical system disaster prevention

Synopsis

In order to achieve a higher level of reliability and availability on our main
production systems we have rolled out a number of initiatives that either help
prevent system failure, or assist us in determining root cause more quickly in
order to rectify the issue(s).

The main areas we have concentrated on are redundancy for the root filesystem,
system change logs to help track down possible root causes, and a timeout on
root shells to limit the time a user spends as root. We also use the Explorer
tool to gather system configuration information, and back up particular system
files to an NFS filesystem for global access.

Solstice LVM

Through SLVM we maintain a detached / filesystem image that is preserved between
major system changes, allowing a fall back to a known 'good' boot image in case
of a major system failure. Although the image is out of date, it offers a very
quick way of recovering a critical system, and services can be restored much
sooner than if tape recovery were required. The image is left detached in a
bootable state (its /etc/vfstab and /etc/system are modified to boot from the
submirror rather than the main mirror), and is usually only resynced with the
main mirror prior to a major update, eg. an OS upgrade.

We also use 3-way mirrors across the root filesystem, which gives us the option
of detaching one of the three submirrors prior to any system update. This image
is then made bootable in the same way as the permanently detached image (the
4th mirror) described above. We use this option routinely whenever system
changes are made, such as patch updates, configuration changes and OS upgrades,
and it allows us to recover to the very latest 'good' system configuration.

A simple script is used to drop and reattach the 3rd mirror, automating all the
configuration changes needed to make a submirror bootable. We also make use of
check scripts run from cron to ensure administrators are informed when a
filesystem is running with fewer than 3 submirrors, after allowing a grace period
for system maintenance.
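
As an illustration only, a check of this kind might look like the sketch below.
It is not our production script: the function names are hypothetical, the grace
period is left out, and it assumes that metastat(1M) prints one
"Submirror N: dXX" line for each submirror attached to a mirror.

```sh
#!/bin/sh
# Sketch: warn when a mirror has fewer submirrors attached than expected.
# Assumption: metastat(1M) prints one "Submirror N: dXX" line per submirror.

# count_submirrors: read metastat output for one mirror on stdin and print
# the number of attached submirrors.
count_submirrors()
{
        grep -c 'Submirror [0-9][0-9]*:'
}

# check_mirror MIRROR EXPECTED: read metastat output on stdin; print a
# warning and return 1 if fewer than EXPECTED submirrors are attached.
# Reading from stdin lets the function be tried without a live metadevice.
check_mirror()
{
        mirror=$1
        expected=$2
        found=`count_submirrors`
        if [ "$found" -lt "$expected" ]
        then
                echo "WARNING: $mirror has $found of $expected submirrors attached"
                return 1
        fi
        return 0
}

# Possible use from cron (d10 is the example root mirror):
#       /usr/sbin/metastat d10 | check_mirror d10 3
```

Anything the function prints can be left for cron to mail to the crontab owner,
or piped to mailx.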

Once any updates have been verified the 3rd submirror is simply reattached.
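
The detach and reattach themselves come down to one command each. The wrapper
below is a minimal sketch rather than our actual script: the function names and
the DRYRUN switch are hypothetical, and it leaves out the configuration changes
that make the detached submirror bootable. By default it only prints the
commands it would run.

```sh
#!/bin/sh
# Sketch of a drop/reattach wrapper around metadetach(1M) and metattach(1M).
# DRYRUN=1 (the default here) prints the commands instead of running them.
DRYRUN=${DRYRUN:-1}

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run()
{
        if [ "$DRYRUN" -eq 1 ]
        then
                echo "$@"
        else
                "$@"
        fi
}

# drop_submirror MIRROR SUBMIRROR: detach SUBMIRROR before a system change.
drop_submirror()
{
        run /usr/sbin/metadetach "$1" "$2"
}

# reattach_submirror MIRROR SUBMIRROR: reattach SUBMIRROR once the change
# has been verified; it is then resynced from the mirror.
reattach_submirror()
{
        run /usr/sbin/metattach "$1" "$2"
}

# Example, using d10 as the mirror and d13 as the third submirror:
#       drop_submirror d10 d13
#       reattach_submirror d10 d13
```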

The typical uses of this are;

        OS upgrades using pfinstall
        Patch updates - protecting against potentially bad patches
        Package updates
        Configuration changes - eg. DHCP to EDHCP migration

To make the detached submirror bootable the following steps need to be taken;

1. Check, and where required repair the detached submirror filesystem.
Simply run fsck(1M) against the detached submirror, answering yes to any
questions (in this example d13 is the detached submirror):

        /usr/sbin/fsck -y /dev/md/rdsk/d13

Then mount the detached submirror on a temporary mount point, eg:

        /usr/sbin/mount -F ufs /dev/md/dsk/d13 /mnt/root

2. Update /etc/system with new rootdev.

In the copy of /etc/system on the detached submirror (/mnt/root/etc/system in
this example), change the md rootdev from d10 to d13 (d13 being the detached
submirror). Change:

        rootdev:/pseudo/md@0:0,10,blk

to:

        rootdev:/pseudo/md@0:0,13,blk

3. Update /etc/vfstab with the new / md.

Using the same example as above, in the submirror's copy of /etc/vfstab change:

        /dev/md/dsk/d10 /dev/md/rdsk/d10 / ufs 1 no logging

to:

        /dev/md/dsk/d13 /dev/md/rdsk/d13 / ufs 1 no logging

It is important to remember that if any directories are split off from root
onto separate filesystems (eg. /var) then these need to be treated in the same
way as /, with /etc/vfstab updated accordingly.

4. Setting default boot device.

Set the default OBP boot device using eeprom(1M):

        /usr/sbin/eeprom boot-device=mirror3

Where mirror3 is an OBP device alias for the underlying boot device:

        devalias mirror3 /ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@6,0
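
The two file edits in steps 2 and 3 lend themselves to automation. The
following is a minimal sketch, not our production script: the function names
are hypothetical, and it assumes the submirror is mounted on /mnt/root and
that the installed sed understands the [[:space:]] character class. Each
function keeps the previous version of the file with a .orig suffix.

```sh
#!/bin/sh
# set_rootdev FILE OLD NEW: rewrite the md rootdev line in a copy of
# /etc/system, eg. from metadevice 10 to metadevice 13.
set_rootdev()
{
        file=$1
        old=$2
        new=$3
        cp -p "$file" "$file.orig" &&
        sed "s|^rootdev:/pseudo/md@0:0,$old,blk|rootdev:/pseudo/md@0:0,$new,blk|" \
                "$file.orig" > "$file"
}

# set_vfstab_md FILE OLD NEW: in a copy of /etc/vfstab, point the entry that
# uses metadevice dOLD at dNEW (block and raw device columns). The trailing
# whitespace match stops d10 from also matching d100.
set_vfstab_md()
{
        file=$1
        old=$2
        new=$3
        cp -p "$file" "$file.orig" &&
        sed -e "s|/dev/md/dsk/d$old\([[:space:]]\)|/dev/md/dsk/d$new\1|" \
            -e "s|/dev/md/rdsk/d$old\([[:space:]]\)|/dev/md/rdsk/d$new\1|" \
                "$file.orig" > "$file"
}

# Example, with the detached submirror d13 mounted on /mnt/root:
#       set_rootdev   /mnt/root/etc/system 10 13
#       set_vfstab_md /mnt/root/etc/vfstab 10 13
```

Both functions take the file as an argument so that they can be tried on
scratch copies before being pointed at a real submirror.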

See the following man pages for references to Solstice LVM commands;

metadetach(1M), metattach(1M), metadb(1M), metaclear(1M).

The key benefit of this method is that the cost is just a couple of additional
disks for mirroring.

Tracking system changes

To identify quickly who has made what changes to a system, simply maintain a
change log that the person carrying out the work updates whenever a
modification is made. SCCS makes this quick and simple, with the added benefit
that only one person can update any given file at a time.
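
A change log entry needs little more than a date, a name and a description.
The helper below is a hypothetical sketch (the function name and the entry
layout are our own illustration, not a fixed format); the comments show where
the SCCS check-out and check-in would go.

```sh
#!/bin/sh
# log_change LOGFILE DESCRIPTION...: append a dated, attributed entry to the
# change log. LOGNAME is used for attribution, falling back to "unknown".
log_change()
{
        logfile=$1
        shift
        who=${LOGNAME:-unknown}
        echo "`date '+%Y-%m-%d %H:%M'` $who: $*" >> "$logfile"
}

# With the log under SCCS control the update is bracketed by a check-out and
# a check-in, for example:
#       sccs edit changelog
#       log_change changelog "Reattached d13 after patch verification"
#       sccs delget changelog
```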

Another benefit of system change logs is that the system configuration can be
found easily if a reinstall is required, which helps ensure that on complicated
systems no services are missed or misconfigured.

Root shell timeout

To ensure that unused root shells aren't left around it is a simple matter to
set an inactivity timeout when using ksh;

        [[ "$(who -r)" = *run-level\ 3* ]] && typeset -xir TMOUT=600

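This sets TMOUT to 600 seconds of inactivity, exports it, and marks it
read-only so that it cannot be changed from within the shell. Because of the
run-level test the timeout is not set in single-user mode, where a root shell
timing out in the middle of maintenance would be unwelcome.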

Automated Explorer

Explorer is a tool that gathers extensive system information, and can be run in
interactive or batch mode. Output can be automatically emailed to the
explorer-database.

It can be run from cron (via a script, automated_explore) to gather
automatically the system information that is crucial to rebuilding after a
serious outage. The number of weeks for which output is kept can be modified,
and the output can be kept as tar files in a location specified within the
automated_explore script;

#!/bin/sh

EXPLORER_DIR=/home/explorer
PARAMETERS="-keepdata -nofile"
STORE_DIR=/home/explorer/servers/`uname -n`
CYCLE_LENGTH=1
DEBUG=1
GZIP=/usr/local/bin/gzip
<snip>
        $EXPLORER_DIR/explorer $PARAMETERS
        mv $EXPLORER_DIR/explorer.`hostid`.`hostname`* $STORE_DIR
<snip>

The standard information gathered details the system configuration, ranging from
network to disk output. Also included is detailed information on the patches and
packages loaded, plus device hierarchy and disk formatting.

Here is an overview of information gathered from a small(ish) system;

montgomery 63 # ls ./*
./README

./disks:
dev-lL.out         diskinfo           ls-l_dev_rdsk.out  prtvtoc
df-a.out           diskinfo.err       ls-l_dev_rmt.out   swap-l.out
df-e.out           format.err         ls-l_dev_rst.out   swap-s.out
df-g.out           format.out         ls-ld_tmp.out
df-k.out           ls-l_dev_nrst.out  maj_min_dev#.out

./etc:
TIMEZONE       hostname.hme0  inetd.conf     nodename       services
default        hostname.qfe1  init.d         path_to_inst   system
defaultdomain  hostname.qfe2  inittab        release        vfstab
dfstab         hosts          mnttab         resolv.conf
dumpadm.conf   inet           name_to_major  rpc

./messages:
dmesg.out   messages    messages.0 messages.1 messages.2 messages.3

./netinfo:
arp-a.out           netstat-k.out       netstat-r.out       nisshowcache-v.out
netstat-a.out       netstat-m.out       netstat-rn.out      rpcinfo-m.out
netstat-an.out      netstat-p.out       netstat-s.out       rpcinfo.out
netstat-i.out       netstat-pn.out      nfsstat.out

./patch+pkg:
modinfo.out     patch_date.out  pkginfo-l.out   showrev-p.out
patch-list      pkginfo-i.out   pkginfo-p.out   showrev.out

./sysconfig:
dumpadm.out                            prtconf-v.out
eeprom.out                             prtconf-vp.out
ifconfig-a.out                         prtdiag-v.out
ipcs-a.out                             ps-ef.out
kernel_ls-l.out                        psrinfo-v.out
last-100-login.out                     strings-var_crash_hostname_vmcore.out
last-20-reboot.out                     sysdef-d.out
lockstat-sleep-5.out                   sysdef.out
ls-al_var_crash_hostname.out           uptime.out
prtconf-V.out

./var:
INST_RELEASE         crontabs             sadm-ld.out
INST_RELEASE.err     log                  yp_binding_ls-l.out
montgomery 64 #

Run from cron on a regular basis this tool can provide invaluable information on
system configurations;

montgomery 66 # crontab -l | grep explore
        # explore stuff
        0 0 * * * /home/explorer/automated_explore > /dev/null 2>&1
montgomery 67 #
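
With a nightly run the stored output soon builds up, so old archives need to be
expired. The part of automated_explore that does this is not shown in the
excerpt above, so the following is only a sketch of one way to do it. The
function name is hypothetical, and it assumes the archives are plain files
whose names begin with "explorer." and which live directly in the storage
directory.

```sh
#!/bin/sh
# prune_explorer_output DIR WEEKS: delete explorer archives in DIR that were
# last modified more than WEEKS weeks ago.
prune_explorer_output()
{
        dir=$1
        weeks=$2
        days=`expr "$weeks" \* 7`
        find "$dir" -type f -name 'explorer.*' -mtime +"$days" -exec rm -f {} \;
}

# Example: keep roughly four weeks of output for this host
#       prune_explorer_output /home/explorer/servers/`uname -n` 4
```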

SUNWexplo can be downloaded from the following site;

        http://sunsolve.Ebay.Sun.COM/cgi/show.pl?target=resources/explorer

File backups

In order to recover particular files quickly when time is of the essence, we
copy certain information to a globally rsync'd filesystem, so that we have
access to any production server's files from any site.

The script we use simply reads in a list of files which are then copied to a
local filesystem, which in turn is copied across to at least 3 other sites.

The script is just 54 lines, plus whatever you list for copying;

#!/bin/sh

# Copy certain files to a backup holding area for all geo's
# At present /home/explorer is rsync'd to all geos.
# File copies go to /home/explorer // config file under /usr/local/share/sh

HOST=`/bin/uname -n`
DATE=`/bin/date`

TRANS_LIST=/usr/local/share/sh/backup-file-list
DEST_LOC=/home/explorer/servers/local/recovery-files/$HOST

LOG=/home/explorer/servers/local/recovery-files/backup-log

do_backup()
{
if [ ! -d $DEST_LOC ]
then
        mkdir -p $DEST_LOC
        if [ $? -ne 0 ]
        then
                echo "Could not make dir: $DEST_LOC" >> $LOG 2>&1
                exit 1
        fi
fi

echo "\nArchive of backup files for $HOST, $DATE;\n\n" > $LOG
echo " SOURCE FILES:\n" >> $LOG 2>&1

/usr/bin/crontab -l > $DEST_LOC/crontab_l.txt
echo "crontab -l" >> $LOG 2>&1

# Copy each readable file named in the list; unreadable entries are skipped.
while read NextFilesystem
do
        if [ -r "$NextFilesystem" ]
        then
          echo "$NextFilesystem" >> $LOG 2>&1
          cp -p "$NextFilesystem" $DEST_LOC/ 2>&1
        fi
done < $TRANS_LIST
echo "\n\n\nAll files backed up to $DEST_LOC" >> $LOG 2>&1
}

mail_result()
{
# Mail the log to the administrators, then remove it.
mailx -s "Archive of backup files for $HOST" any-old-address@anywhere < $LOG
rm $LOG > /dev/null 2>&1
}

do_backup
mail_result

The types of file we want quick access to are Solstice LVM configuration
(md.tab/mddb.cf), filesystem and boot configuration (vfstab/mnttab/system),
plus nsswitch.conf and NIS+ files (.rootkey/NIS_COLD_START). Retrieving these
files via ftp/NFS is far quicker than resorting to tape backups.