Critical system disaster prevention

Synopsis

To achieve a higher level of reliability and availability on our main production systems we have rolled out a number of initiatives that either help prevent system failure or help us determine root cause more quickly so that the issue can be rectified. The main areas we have concentrated on are filesystem redundancy on the root filesystem, system change logs to help track down possible root causes, and a timeout on root shells to limit the time a user spends as root. We also use the Explorer tool to gather system configuration information, and back up particular system files to an NFS filesystem for global access.

Solstice LVM

Through SLVM we maintain detached / filesystem images that are preserved between major system changes, allowing a fall back to a known 'good' boot image after a major system failure. Although this image is out of date, it gives a very quick way of recovering a critical system, so services can be restored much sooner than if tape recovery were required. The image is left detached in a bootable state (its /etc/vfstab and /etc/system are modified to boot from the submirror rather than the main mirror), and is usually only sync'd to the main mirror prior to a major update, e.g. an OS upgrade.

We also use 3-way mirrors across the root filesystem, which gives the option of detaching one of the submirrors prior to any system update. This image is then made bootable in the same way as the long-term detached image (the 4th mirror). This option is used whenever system changes are made, such as patch updates, configuration changes and OS upgrades, and allows us to recover to the very latest 'good' system configuration. A simple script is used to drop and reattach the 3rd mirror, automating all the configuration changes needed to make a submirror bootable.
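The drop and reattach script itself is not reproduced in this article. As a rough sketch only, the core of such a script could wrap metadetach(1M) and metattach(1M) along the following lines. The metadevice names d10 (mirror) and d13 (3rd submirror) match the worked example used in this article; the function names and the RUN switch are hypothetical, and by default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch only: detach or reattach the 3rd root submirror.
# MIRROR, SUBMIRROR and RUN are assumptions made for illustration.
MIRROR=${MIRROR-d10}
SUBMIRROR=${SUBMIRROR-d13}
RUN=${RUN-echo}            # export RUN="" to really execute the commands

drop_submirror()
{
    # metadetach <mirror> <submirror> takes the submirror out of the mirror
    $RUN metadetach "$MIRROR" "$SUBMIRROR"
}

reattach_submirror()
{
    # metattach <mirror> <submirror> puts it back and starts a resync
    $RUN metattach "$MIRROR" "$SUBMIRROR"
}

case "${1:-}" in
    drop)     drop_submirror ;;
    reattach) reattach_submirror ;;
    "")       ;;                                   # no argument: define functions only
    *)        echo "usage: $0 {drop|reattach}" >&2; exit 1 ;;
esac
```

Printing the commands by default makes it safe to try the script on a system without metadevices; a real version would also carry out the configuration changes needed to make the detached submirror bootable.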
We also run check scripts from cron to ensure administrators are informed when a filesystem is running with fewer than 3 submirrors, after allowing a grace period for system maintenance. Once any updates have been verified the 3rd submirror is simply reattached.

Typical uses of this are:

    OS upgrades using pfinstall
    Patch updates - protecting against potentially bad patches
    Package updates
    Configuration changes - e.g. DHCP to EDHCP migration

To make the detached submirror bootable the following steps need to be taken:

1. Check, and where required repair, the detached submirror filesystem.

Simply run fsck(1M) against the detached submirror, answering yes to any questions (in this example d13 is the detached submirror):

    /usr/sbin/fsck -y /dev/md/rdsk/d13

Then mount the detached submirror on a temporary mount point, e.g.:

    /usr/sbin/mount -F ufs /dev/md/dsk/d13 /mnt/root

2. Update /etc/system with the new rootdev.

The file to edit is the copy on the detached submirror (/mnt/root/etc/system in this example), not the one on the running system. To change the md rootdev from d10 to d13 (using d13 as the detached submirror) change:

    rootdev:/pseudo/md@0:0,10,blk

to:

    rootdev:/pseudo/md@0:0,13,blk

3. Update /etc/vfstab with the new / md.

Again this is the copy on the detached submirror. Using the same example as above, change:

    /dev/md/dsk/d10 /dev/md/rdsk/d10 / ufs 1 no logging

to:

    /dev/md/dsk/d13 /dev/md/rdsk/d13 / ufs 1 no logging

It is important to remember that if any directories from root are mounted on separate mount points (e.g. /var) then these need to be treated in the same way as /, with /etc/vfstab updated accordingly.

4. Set the default boot device.

Set the default OBP boot device using eeprom(1M):

    /usr/sbin/eeprom boot-device=mirror3

Where mirror3 is an alias set for the underlying boot device:

    devalias mirror3 /ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@6,0

See the following man pages for details of the Solstice LVM commands: metadetach(1M), metattach(1M), metadb(1M), metaclear(1M).

The key benefit of this method is that the cost is just a couple of additional disks for mirroring.
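Steps 2 and 3 are easy to get wrong by hand, so they are good candidates for scripting. The following is a minimal sketch, not the script used on our systems: make_bootable is a hypothetical function name, and it assumes the detached submirror is already checked and mounted (step 1) and that the rootdev and vfstab entries have exactly the form shown in the example above.

```shell
#!/bin/sh
# Sketch only: point a mounted, detached submirror's config files at itself.
# Usage: make_bootable ROOT OLD NEW      e.g. make_bootable /mnt/root 10 13
make_bootable()
{
    root=$1 old=$2 new=$3

    # Keep the unmodified files so the change can be backed out
    for f in "$root/etc/system" "$root/etc/vfstab"
    do
        cp -p "$f" "$f.orig" || return 1
    done

    # /etc/system: rootdev:/pseudo/md@0:0,OLD,blk -> rootdev:/pseudo/md@0:0,NEW,blk
    sed "s|^rootdev:/pseudo/md@0:0,$old,blk|rootdev:/pseudo/md@0:0,$new,blk|" \
        "$root/etc/system.orig" > "$root/etc/system" || return 1

    # /etc/vfstab: change dOLD to dNEW in the block and raw device columns.
    # The trailing whitespace match stops d10 from also matching d100.
    sed -e "s|^/dev/md/dsk/d$old\([[:space:]]\)|/dev/md/dsk/d$new\1|" \
        -e "s|\([[:space:]]\)/dev/md/rdsk/d$old\([[:space:]]\)|\1/dev/md/rdsk/d$new\2|" \
        "$root/etc/vfstab.orig" > "$root/etc/vfstab" || return 1
}
```

With the submirror mounted as in step 1 this would be called as make_bootable /mnt/root 10 13. Filesystems split out of root, such as /var, would need their own metadevice numbers passed through the same function.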
Tracking system changes

To quickly identify who has changed what on a system, maintain a change log that the person making the change updates whenever work is carried out. SCCS makes this quick and simple, with the added benefit of allowing just one person to update any given file at a time. Another benefit of system change logs is that the system configuration is easy to find if a reinstall is required, and on complicated system configurations they help ensure that services are not missed or mis-configured.

Root shell timeout

To ensure that unused root shells aren't left lying around it is a simple matter to set an inactivity timeout when using ksh:

    [[ "$(who -r)" = *run-level\ 3* ]] && typeset -xir TMOUT=600

This sets TMOUT to 600 seconds and marks it as an exported, read-only integer, so the timeout cannot be changed from within the shell. Because of the test on the output of who -r it is only set at run level 3, and not in single user mode, for obvious reasons.

Automated Explorer

Explorer is a tool that gathers extensive system information, and can be executed in interactive or batch mode. Output can be automatically emailed to the explorer-database. It can be run from cron (via a script, automated_explore) to automatically gather system information that is crucial when rebuilding after a serious outage. The number of weeks that output is kept can be modified, and output can be kept as tar files saved in a location specified within the automated_explore script:

    #!/bin/sh
    EXPLORER_DIR=/home/explorer
    PARAMETERS="-keepdata -nofile"
    STORE_DIR=/home/explorer/servers/`uname -n`
    CYCLE_LENGTH=1
    DEBUG=1
    GZIP=/usr/local/bin/gzip
    <snip>
    $EXPLORER_DIR/explorer $PARAMETERS
    mv $EXPLORER_DIR/explorer.`hostid`.`hostname`* $STORE_DIR
    <snip>

The standard information gathered details the system configuration, ranging from network to disk output. Also included is detailed information on the patches and packages loaded, plus the device hierarchy and disk formatting.
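The retention logic is elided from the automated_explore script (the <snip> markers), so the following is not the real code. It is only a sketch of one way stored output could be expired, assuming the retention period is a whole number of weeks and that saved files keep the explorer.<hostid>.<hostname> prefix used by the mv command. The function name prune_explorer is hypothetical.

```shell
#!/bin/sh
# Sketch only: remove saved explorer output older than a number of weeks.
# Usage: prune_explorer DIR WEEKS
prune_explorer()
{
    dir=$1 weeks=$2
    days=`expr "$weeks" \* 7`

    # -mtime +N selects files last modified more than N days ago
    find "$dir" -type f -name 'explorer.*' -mtime +"$days" -exec rm -f {} \;
}
```

Called as, for example, prune_explorer /home/explorer/servers/`uname -n` 4, this would keep roughly the last four weeks of output for the local host.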
Here is an overview of the information gathered from a small(ish) system:

    montgomery 63 # ls ./*
    ./README

    ./disks:
    dev-lL.out  diskinfo           ls-l_dev_rdsk.out  prtvtoc
    df-a.out    diskinfo.err       ls-l_dev_rmt.out   swap-l.out
    df-e.out    format.err         ls-l_dev_rst.out   swap-s.out
    df-g.out    format.out         ls-ld_tmp.out
    df-k.out    ls-l_dev_nrst.out  maj_min_dev#.out

    ./etc:
    TIMEZONE       hostname.hme0  inetd.conf     nodename      services
    default        hostname.qfe1  init.d         path_to_inst  system
    defaultdomain  hostname.qfe2  inittab        release       vfstab
    dfstab         hosts          mnttab         resolv.conf
    dumpadm.conf   inet           name_to_major  rpc

    ./messages:
    dmesg.out  messages  messages.0  messages.1  messages.2  messages.3

    ./netinfo:
    arp-a.out       netstat-k.out   netstat-r.out   nisshowcache-v.out
    netstat-a.out   netstat-m.out   netstat-rn.out  rpcinfo-m.out
    netstat-an.out  netstat-p.out   netstat-s.out   rpcinfo.out
    netstat-i.out   netstat-pn.out  nfsstat.out

    ./patch+pkg:
    modinfo.out  patch_date.out  pkginfo-l.out  showrev-p.out
    patch-list   pkginfo-i.out   pkginfo-p.out  showrev.out

    ./sysconfig:
    dumpadm.out                   prtconf-v.out
    eeprom.out                    prtconf-vp.out
    ifconfig-a.out                prtdiag-v.out
    ipcs-a.out                    ps-ef.out
    kernel_ls-l.out               psrinfo-v.out
    last-100-login.out            strings-var_crash_hostname_vmcore.out
    last-20-reboot.out            sysdef-d.out
    lockstat-sleep-5.out          sysdef.out
    ls-al_var_crash_hostname.out  uptime.out
    prtconf-V.out

    ./var:
    INST_RELEASE      crontabs  sadm-ld.out
    INST_RELEASE.err  log       yp_binding_ls-l.out
    montgomery 64 #

Run from cron on a regular basis, this tool can provide invaluable information on system configurations:

    montgomery 66 # crontab -l | grep explore
    # explore stuff
    0 0 * * * /home/explorer/automated_explore > /dev/null 2>&1
    montgomery 67 #

SUNWexplo can be downloaded from the following site:

    http://sunsolve.Ebay.Sun.COM/cgi/show.pl?target=resources/explorer

File backups

To quickly recover particular files when time is of the essence we copy certain information to a globally rsync'd filesystem, so that we have access to any production server's files from any site.
The script we use simply reads in a list of files which are then copied to a local filesystem, which in turn is copied across to at least 3 other sites. The script is only around 50 lines, plus whatever you list for copying:

    #!/bin/sh
    # Copy certain files to a backup holding area for all geo's
    # At present /home/explorer is rsync'd to all geos.
    # File copies go to /home/explorer // config file under /usr/local/share/sh

    HOST=`/bin/uname -n`
    DATE=`/bin/date`
    TRANS_LIST=/usr/local/share/sh/backup-file-list
    DEST_LOC=/home/explorer/servers/local/recovery-files/$HOST
    LOG=/home/explorer/servers/local/recovery-files/backup-log

    do_backup()
    {
        # Create the per-host destination directory on first use
        if [ ! -d "$DEST_LOC" ]
        then
            mkdir -p "$DEST_LOC"
            if [ $? -ne 0 ]
            then
                echo "Could not make dir: $DEST_LOC" >> "$LOG" 2>&1
                exit 1
            fi
        fi

        echo "\nArchive of backup files for $HOST, $DATE;\n\n" > "$LOG"
        echo " SOURCE FILES:\n" >> "$LOG" 2>&1

        /usr/bin/crontab -l > "$DEST_LOC/crontab_l.txt"
        echo "crontab -l" >> "$LOG" 2>&1

        # Copy every readable file named in the list, one path per line
        while read NextFilesystem
        do
            if [ -r "$NextFilesystem" ]
            then
                echo "$NextFilesystem" >> "$LOG" 2>&1
                cp -p "$NextFilesystem" "$DEST_LOC/" 2>&1
            fi
        done < "$TRANS_LIST"

        echo "\n\n\nAll files backed up to $DEST_LOC" >> "$LOG" 2>&1
    }

    mail_result()
    {
        cat "$LOG" | mailx -s "Archive of backup files for $HOST" any-old-address@anywhere
        rm "$LOG" > /dev/null 2>&1
    }

    do_backup
    mail_result

The type of files we want quick access to are Solstice LVM config info (md.tab/mddb.cf), filesystem info (vfstab/mnttab/system), plus nsswitch.conf and NIS+ files (.rootkey/NIS_COLD_START). Retrieving these files via ftp/NFS is far quicker than resorting to tape backups.
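The file list is simply one path per line. Because the backup loop silently skips anything it cannot read, it can be worth checking the list now and again. The sketch below is an illustration only: check_list is a hypothetical helper, and the sample paths in the comment are the usual Solaris locations for the files named above, which should be treated as assumptions and checked against your own release.

```shell
#!/bin/sh
# Sketch only: report entries in a backup file list that cannot be read.
# Usage: check_list LISTFILE      prints "MISSING: <path>" for each bad entry
check_list()
{
    while read path
    do
        [ -z "$path" ] && continue              # ignore blank lines
        [ -r "$path" ] || echo "MISSING: $path"
    done < "$1"
}

# Example list contents (assumed locations, adjust for your release):
#   /etc/lvm/md.tab
#   /etc/lvm/mddb.cf
#   /etc/vfstab
#   /etc/mnttab
#   /etc/system
#   /etc/nsswitch.conf
#   /etc/.rootkey
#   /var/nis/NIS_COLD_START
```

Running check_list /usr/local/share/sh/backup-file-list before relying on the recovery area shows at a glance which files are not actually being captured.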