[Summary] SunRay server failure

From: Chris Hoogendyk (choogend@library.umass.edu)
Date: Mon Mar 08 2004 - 17:29:53 EST


Original message at end.

Bottom line. The combination:

  SunRay Server Software 1.3
  Solaris 8 Release 10/00
  Solaris 8 Patch 109077 (patches included in recommended cluster)
  Solaris 8 Patch 111302

Is not compatible. Won't work. Will fail. SunRay Server Software 1.3
apparently requires Solaris 8 Release 4/01 or later, and preferably
Release 7/01 or later (I was running 10/00). Patch 109077 updates
dhcpd and its configuration, on which SunRay Server Software depends,
and this patch precipitated the failure. Furthermore, the patch has a
bunch of dependencies, and its instructions recommend that you NOT try
to uninstall it. So I was basically stuck as far as simple solutions
were concerned.
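
For anyone wanting to check a server for this combination, here is a
minimal sketch. It assumes that "showrev -p" prints one line per
installed patch beginning "Patch: NNNNNN-RR", which is the usual
Solaris 8 format. The function name patch_revisions is made up for
illustration.

```shell
#!/bin/sh
# Print the installed revisions of one patch ID, reading `showrev -p`
# output on stdin. Assumes lines of the form "Patch: NNNNNN-RR ...".
patch_revisions() {
    # $1 = six-digit patch ID, e.g. 109077
    sed -n "s/^Patch: \($1-[0-9]*\).*/\1/p"
}

# On the server itself (not executed here):
#   cat /etc/release                      # which Solaris 8 release is installed
#   showrev -p | patch_revisions 109077   # installed revisions of the dhcp patch
#   showrev -p | patch_revisions 111302
```

An empty result means that patch ID is not installed at all.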

I tried doing an upgrade install of Solaris 8 Release 2/02. Freakishly,
I had a disk drive failure during the install. So, find an unused disk
drive, partition it to match, go to backup tapes for recovery, and punt
on the upgrade for now. Unfortunately, when I went back to the January
full backup, I found it had the same failure. On checking the patches, I
found it had an earlier version of 109077. On checking my records for
patching, rebooting, and backups, I found I had not rebooted since
before that patch cluster. If I had, the system would have failed back
then. So, go back even further on my full backup tapes and recover
again. This worked, but then I had a couple of months of fixes and
changes on that server that I had to repeat. Fortunately, it wasn't too
much.

Anyway, I'm back up and running, and next time there is a break on
campus and I can schedule some official down time, I'll try the upgrade
to Solaris 8 Release 2/02 and SunRay Server Software 2.0 (that
combination works).

--------------- Bloody Details for those who care ---------------

After having gone through this "bare metal" recovery, I now have some
changes I will make in my backup procedures. More on that after these
details.

Since this server has no tape drive, I do my backups to a tape drive on
another server, which added to my difficulties a little. I had to go
through:

  reboot/shutdown/init and then "stop-a" to get to ok prompt

  insert CD 1 of 2 of Solaris 8 software

  ifconfig hme0 129.117.162.215 netmask 255.255.255.0 \
    broadcast 129.117.162.254 up

  ping 129.117.162.133

Since I'm booted from CD, I don't have my user accounts and profiles, so
I have to get the machine at 133 to let me in as root. I already have
rshd open on that machine, covered by tcp_wrappers so that it allows in
only my servers that have no tape drives. Now I had to add a /.rhosts
file for root on it. It had to have DNS names for reverse lookup; when I
started by trying just the IP address, it didn't work.

  129.117.162.215 +
  sunrayserver +
  sunrayserver.mydomain.edu +

Then I'm all set to do my recovery from the other server's tape drive.

  newfs /dev/rdsk/c0t0d0s0

  mount /dev/dsk/c0t0d0s0 /a

  cd /a

  ufsrestore rvf 129.117.162.133:/dev/rmt/0n

  ls

  cd ..

  umount /a

Repeat the above for each partition required, in the order they are on
the tape. Then do a:

  installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk \
    /dev/rdsk/c0t0d0s0

The documentation said to use "pboot", but I did a "uname -i" to see
what it returned (SUNW,Ultra-4), looked down through the directories,
and found that the only file in that directory was "bootblk". When I
rebooted, it worked.
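
The restore sequence above can be sketched as a small helper that only
prints the commands for each slice, so they can be reviewed before
anything destructive is typed. The function name restore_plan is
hypothetical. Only c0t0d0s0 and the tape device come from this post;
any other slice names are placeholders that must match the order of the
dumps on the tape.

```shell
#!/bin/sh
# Print (do not execute) the per-slice restore commands, in tape order.
# The no-rewind device (0n) leaves the tape positioned at the next dump.
TAPE="129.117.162.133:/dev/rmt/0n"

restore_plan() {
    # Arguments: slices in the same order the dumps were written to tape.
    for slice in "$@"; do
        echo "newfs /dev/rdsk/$slice"
        echo "mount /dev/dsk/$slice /a"
        echo "(cd /a && ufsrestore rvf $TAPE)"
        echo "umount /a"
    done
}

# Example (c0t0d0s5 is a placeholder for a second slice):
#   restore_plan c0t0d0s0 c0t0d0s5
```

The subshell around "cd /a" stands in for the "cd /a ... cd .." pair, so
the umount is never attempted from inside the mounted filesystem.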

--------------- Changes to my procedures ---------------

I use a script to generate an informational file that I call a label and
then write it out as the first item on the tape when I do backups. Thus,
when I pick up a tape, I can pull off that first file with an
interactive ufsrestore and see what I put on the tape and what the
machine it came from was like.

My label looks like:

<label>
Amen-ra-02Dec2003-t1
Tue Dec 2 09:36:17 EST 2003
Library Information Systems & Technology Services
W.E.B. Du Bois Library
University of Massachusetts
(413) 545-0074

------------

Filesystem             kbytes    used    avail capacity  Mounted on
/dev/dsk/c0t0d0s0    15344171  539601 14651129       4%  /
/proc                       0       0        0       0%  /proc
fd                          0       0        0       0%  /dev/fd
mnttab                      0       0        0       0%  /etc/mnttab
/dev/dsk/c0t2d0s3    15346527 1063952 14129110       8%  /var
swap                  4511248      16  4511232       1%  /var/run
swap                  4614696  103464  4511232       3%  /tmp
/dev/dsk/c0t0d0s5     1018191  230495   726605      25%  /opt
/dev/dsk/c0t3d0s7    11214644 4813822  6288676      44%  /usr/local
/dev/dsk/c0t3d0s6     4131866 1719578  2370970      43%  /export/home
/dev/dsk/c0t0d0s1     1018191  247358   709742      26%  /usr/openwin
/proc                       0       0        0       0%  /var/opt/SUNWbb/root/proc
localhost:(cifsBrowse)browser
                           10      10        0     100%  /CIFS
/tmp/SUNWut/sessions  4614696  103464  4511232       3%  /var/opt/SUNWbb/root/tmp/SUNWut/sessions
/tmp/SUNWut/units     4614696  103464  4511232       3%  /var/opt/SUNWbb/root/tmp/SUNWut/units
</label>

I thought that would be adequate. However, it was not as easy as it
should have been to get the information I needed. I am changing my
script to make this label more informative by including prtvtoc output
for each of the drives included in the backup, and a "cat" of
/etc/vfstab and of the backup script, as well as the "df -k" that I have
been putting there. That will give me all the information I need to
replace and repartition a failed drive, as well as to recover from hacks
or software failures when I have intact partitions to recover to.
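
A rough sketch of what such a label generator might look like follows.
The function name make_label, its argument order, and the section
separators are all assumptions. Only the tools themselves (df -k, cat of
/etc/vfstab and the backup script, prtvtoc) come from the description
above.

```shell
#!/bin/sh
# Sketch of a fuller tape label. Function name, argument order and section
# separators are assumptions; df, cat and prtvtoc are the tools named above.
VFSTAB=${VFSTAB:-/etc/vfstab}

make_label() {
    # $1 = label name, $2 = backup script path, remaining args = disks (c0t0d0 ...)
    name=$1
    script=$2
    shift 2
    echo "<label>"
    echo "$name"
    date
    echo "------------ df -k"
    df -k
    echo "------------ $VFSTAB"
    if [ -r "$VFSTAB" ]; then cat "$VFSTAB"; fi
    echo "------------ $script"
    if [ -r "$script" ]; then cat "$script"; fi
    for disk in "$@"; do
        echo "------------ prtvtoc $disk"
        # Slice 2 conventionally covers the whole disk on SPARC Solaris.
        prtvtoc "/dev/rdsk/${disk}s2"
    done
    echo "</label>"
}

# Example (the script path is a placeholder):
#   make_label Amen-ra-02Dec2003-t1 /path/to/backup-script c0t0d0 c0t2d0 c0t3d0
```

The output would be redirected to a file and written as the first item
on the tape, exactly as the current label is.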

---------------

Chris Hoogendyk

-
    O__ ---- Network Specialist & Unix Systems Administrator
   c/ /'_ --- Library Information Systems & Technology Services
  (*) \(*) -- W.E.B. Du Bois Library
~~~~~~~~~~ - University of Massachusetts, Amherst

<choogend@library.umass.edu>

---------------

-------- Original Message --------
Subject: SunRay server failure
Date: Mon, 01 Mar 2004 22:39:38 -0500
From: Chris Hoogendyk <choogend@library.umass.edu>
To: Sun Managers <sunmanagers@sunmanagers.org>

E450, Solaris 8, SunRay Server Software 1.3, 20 SunRay1's in Restricted
Access Mode.

Last Friday I applied the latest Recommended and Security patches; the
previous round was a little over a month ago. This morning I rebooted
the server around 7am.

Mid afternoon today, my SunRays started failing. First two were hung
waiting for DHCP. I tested by recycling my SunRay (logging out) before
going down to look. It started a new session just fine.

I used a Fluke to test the connection for the failed SunRays and could
not get a DHCP lease. I went straight to the switch port with the Fluke
to eliminate any intervening wiring questions. Still no DHCP. I thought
perhaps it was a problem with the switch VLANs (Cisco).

This evening I got a call that all the SunRays were failing.

I connected to the server. Looking at the messages file from the SunRay
web admin interface, I found the following error sequence repeated over
and over for one SunRay or another:

Mar 1 17:17:22 amen-ra-01 utauthd: [ID 639584 user.info] Worker2
NOTICE: whichServer pseudo.080020c0c454:
Mar 1 17:17:22 amen-ra-01 utauthd: [ID 641787 user.info] Worker2
NOTICE: CLAIMED by StartSession.m3 NAME: pseudo.080020c0c454 PARAMETERS:
{_=1, rawId=080020c0c454, terminalIPA=192.168.128.61, startRes=1152x900,
state=disconnected, initState=0,
fw=1.3_12.c_111891-05,REV=2002.05.10.11.53,Boot:1.3;
1999.11.29-09:58:55-GMT, pn=34583, rawType=pseudo, sn=080020c0c454,
tokenSeq=1, event=insert, id=080020c0c454, cause=insert, hw=SunRayP1,
type=pseudo, namespace=IEEE802}
Mar 1 17:17:22 amen-ra-01 utauthd: [ID 388005 user.info] Worker2
NOTICE: CONNECT IEEE802.080020c0c454, pseudo.080020c0c454, all
connections allowed
Mar 1 17:17:22 amen-ra-01 utauthd: [ID 475121 user.info] Worker2
NOTICE: SESSION_OK pseudo.080020c0c454
Mar 1 17:52:49 amen-ra-01 utauthd: [ID 794400 user.info]
SessionManager0 NOTICE: EMPTY: ACTIVE session
Mar 1 17:52:49 amen-ra-01 utauthd: [ID 716730 user.info] Terminator
NOTICE: DISCONNECT IEEE802.080020c0c454, pseudo.080020c0c454 session
terminated
Mar 1 17:52:49 amen-ra-01 utauthd: [ID 190098 user.info] Terminator
NOTICE: DESTROY pseudo.080020c0c454 lifetime=2127277
Mar 1 17:52:49 amen-ra-01 utauthd: [ID 927710 user.info]
SessionManager0 NOTICE: TERMINATE: inactive session

followed by this:

Mar 1 18:55:25 amen-ra-01 utauthd: [ID 699394 user.info] Worker3
NOTICE: SESSION_OK pseudo.080020c0c454
Mar 1 19:26:34 [192.168.128.176.2.2] 0x0.0x42e1b9 8:0:20:f9:68:97
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:40 [192.168.128.174.2.2] 0x0.0x42ee63 8:0:20:c0:c5:ea
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:40 [192.168.128.61.2.2] 0x0.0x42ee60 8:0:20:c0:c4:54
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:40 [192.168.128.160.2.2] 0x0.0x42ee2e 8:0:20:b9:66:d7
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:40 [192.168.128.56.2.2] 0x0.0x42ee42 8:0:20:c1:c:44
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:41 [192.168.128.62.2.2] 0x0.0x42ee68 8:0:20:e7:b5:8c
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:42 [192.168.128.171.2.2] 0x0.0x42ef11 8:0:20:c0:bd:f2
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:43 [192.168.128.175.2.2] 0x0.0x42efda 8:0:20:f2:47:7a
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:45 [192.168.128.177.2.2] 0x0.0x42ee4b 8:0:20:f5:76:76
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:45 [192.168.128.177.2.2] 0x0.0x42ee4b 8:0:20:f5:76:76
Kernel: 0x665830-0x6658b7: 0x4a1c8 backtrace_me+0x24(...)
Mar 1 19:26:45 [192.168.128.177.2.2] 0x0.0x42ee4b 8:0:20:f5:76:76
Kernel: 0x6658b8-0x66592f: 0x493fc panic+0x4c(...)
Mar 1 19:26:45 [192.168.128.177.2.2] 0x0.0x42ee4b 8:0:20:f5:76:76
Kernel: 0x665930-0x6659ef: 0x55334 AutoRenewDHCP+0x18c(...)
Mar 1 19:26:45 [192.168.128.177.2.2] 0x0.0x42ee4b 8:0:20:f5:76:76
Kernel: Top: 0x44cc4 proc_spawn_pid+0x3cc(...)
Mar 1 19:26:46 [192.168.128.179.2.2] 0x0.0x42f0de 8:0:20:f0:fd:60
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:46 [192.168.128.68.2.2] 0x0.0x42ee7d 8:0:20:f5:73:4
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:47 [192.168.128.33.2.2] 0x0.0x42ee82 8:0:20:f9:69:aa
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:50 [192.168.128.162.2.2] 0x0.0x42f238 8:0:20:b6:1:69
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:26:57 [192.168.128.69.2.2] 0x0.0x42f253 0:3:ba:d:99:f4
Kernel: panic: AutoRenewDHCP: IPA lease expired -- must restart
Mar 1 19:30:16 amen-ra-01 utauthd: [ID 607465 user.info] Worker3
UNEXPECTED: Terminal.readMesages: java.net.SocketException: Connection
reset by peer
Mar 1 19:30:16 amen-ra-01 utauthd: [ID 181342 user.info] Worker3
NOTICE: DISCONNECT IEEE802.080020f96897, pseudo.080020f96897 destroy
Mar 1 19:30:16 amen-ra-01 utauthd: [ID 791169 user.info] Worker3
UNEXPECTED: during send to: java.net.SocketOutputStream@1bca4f
error=java.io.IOException: Broken pipe
Mar 1 19:30:16 amen-ra-01 utauthd: [ID 151315 user.info] Worker3
NOTICE: DESTROY pseudo.080020f96897 lifetime=43693338
Mar 1 19:30:24 amen-ra-01 utauthd: [ID 607465 user.info] Worker3
UNEXPECTED: Terminal.readMesages: java.net.SocketException: Connection
reset by peer
Mar 1 19:30:24 amen-ra-01 utauthd: [ID 667050 user.info] Worker3
NOTICE: DISCONNECT IEEE802.080020c0c454, pseudo.080020c0c454 destroy
Mar 1 19:30:24 amen-ra-01 utauthd: [ID 118975 user.info] Worker3
UNEXPECTED: during send to: java.net.SocketOutputStream@11bee50
error=java.io.IOException: Broken pipe
Mar 1 19:30:24 amen-ra-01 utauthd: [ID 669981 user.info] Worker3
NOTICE: DESTROY pseudo.080020c0c454 lifetime=2099658
Mar 1 19:30:28 amen-ra-01 utauthd: [ID 607465 user.info] Worker3
UNEXPECTED: Terminal.readMesages: java.net.SocketException: Connection
reset by peer

Rebooting the server accomplished nothing.

From the web admin interface for the SunRay Server Software, restarting
the service gave the following in the messages file:

Mar 1 22:11:07 amen-ra-01 UTPOLICY: [ID 702911 user.info] Restarting
SunRay services
Mar 1 22:11:07 amen-ra-01 UTPOLICY: [ID 702911 user.info] stopping
authentication manager
Mar 1 22:11:07 amen-ra-01 UTPOLICY: [ID 702911 user.info] starting
session manager
Mar 1 22:11:07 amen-ra-01 UTPOLICY: [ID 702911 user.info] starting
device manager
Mar 1 22:11:07 amen-ra-01 UTPOLICY: [ID 702911 user.info] starting
printer service
Mar 1 22:11:07 amen-ra-01 UTPOLICY: [ID 702911 user.info] starting
serial service
Mar 1 22:11:07 amen-ra-01 UTPOLICY: [ID 702911 user.info] # Using local
policy
Mar 1 22:11:07 amen-ra-01 UTPOLICY: [ID 702911 user.info] starting
authentication manager
Mar 1 22:11:07 amen-ra-01 utauthd: [ID 396523 user.info] main NOTICE:
SmartCardConfigData: LDAP contains no smartcard configuration files
Mar 1 22:11:07 amen-ra-01 utauthd: [ID 253120 user.info] main NOTICE:
SmartCardConfigData: read 17 smartcard configuration files from
directory file: /etc/opt/SUNWut/smartcard/probe_order.conf
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 353254 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/Payflex-All.cfg: 237
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 762100 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/MondexMM2.cfg: 89 tokens
processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 192469 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/JavaBadge.cfg: 144 tokens
processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 582636 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/OpenPlatform.cfg: 144
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 462772 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/CyberflexAccess.cfg: 104
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 240283 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/ActivCardGold.cfg: 100
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 783214 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/GEMPLUS-MPCOS.cfg: 145
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 522253 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/GEMPLUS-MPCOS-3DES.cfg:
124 tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 658487 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/GEMPLUS-GPK4000.cfg: 138
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 293412 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/PKCS15.cfg: 106 tokens
processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 313178 user.info] main NOTICE:
SmartCardConfigData:
/etc/opt/SUNWut/smartcard/SpanishUniversity-TIBC.cfg: 98 tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 486291 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/GD-SMARTCAFE.cfg: 74
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 858185 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/GD-STARCOS.cfg: 74 tokens
processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 884807 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/BullTB.cfg: 114 tokens
processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 163784 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/MondexUNU.cfg: 67 tokens
processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 524863 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/Cryptoflex.cfg: 144
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 651628 user.info] main NOTICE:
SmartCardConfigData: /etc/opt/SUNWut/smartcard/UnknownCard.cfg: 63
tokens processed
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 723974 user.info] main NOTICE:
Loaded module /opt/SUNWut/lib/modules/StartSession.m0
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 612231 user.info] main NOTICE:
Loaded module /opt/SUNWut/lib/modules/Authxlation.m1
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 709793 user.info] main NOTICE:
Loaded module /opt/SUNWut/lib/modules/ServerSelect.m2
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 723977 user.info] main NOTICE:
Loaded module /opt/SUNWut/lib/modules/StartSession.m3
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 723978 user.info] main NOTICE:
Loaded module /opt/SUNWut/lib/modules/StartSession.m4
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 745985 user.info] main NOTICE: 5
authentication modules loaded.
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 826448 user.info] deviceManager0
NOTICE: DeviceManager.getDeviceManager: Initiate callback to utdevMgrd
at localhost:7011
Mar 1 22:11:08 amen-ra-01 utauthd: [ID 914482 user.info] deviceManager0
NOTICE: DeviceManager.initiateCallback localhost:7010 established
communication
Mar 1 22:11:29 amen-ra-01 policy[1484]: [ID 702911 user.info] TIMEOUT!!!
Mar 1 22:11:33 amen-ra-01 admincgi[5534]: [ID 702911 user.info]
Mar 1 22:11:33 amen-ra-01 admincgi[5534]: [ID 702911 user.info]
amen-ra-01: Restarting servers... messages will be logged to
/var/opt/SUNWut/log/messages.
Mar 1 22:11:33 amen-ra-01 admincgi[5534]: [ID 702911 user.info]
amen-ra-01: ERROR: Service reset failed. Host unreachable.
Mar 1 22:11:33 amen-ra-01 admincgi[5534]: [ID 702911 user.info]

I'm really at a loss and I have some critical service points down.

Any help would be greatly appreciated.

My server having panicked, myself having panicked, I'm now going to crash.
I'll look at this again and any replies at 7am EST.

TIA

---------------

Chris Hoogendyk

-
    O__ ---- Network Specialist & Unix Systems Administrator
   c/ /'_ --- Library Information Systems & Technology Services
  (*) \(*) -- W.E.B. Du Bois Library
~~~~~~~~~~ - University of Massachusetts, Amherst

<choogend@library.umass.edu>

---------------
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers


