NIS+ intermittent server timeout

From: colin@ccsisupport.com
Date: Fri Feb 03 2006 - 19:22:17 EST


Hi all;

We've got an ugly NIS+ problem and don't even know where to look for a
solution. This exact issue occurred about six months ago, and Sun was less
than useless, telling us to upgrade. Eventually the problem just
disappeared, until now.

The short story is that our NIS+ master (E3000/Solaris 2.6) and slave
(Ultra60/Solaris 2.6) occasionally stop serving requests for about 30
seconds at a time. Oh yes: NIS+ is running in compatibility mode, and
our environment is mostly Solaris (2.6 through 10), with a bit of HP-UX and
some NetApps.

At irregular intervals, the number of context switches on the NIS+ slave
jumps from ~200 to >50000 (as measured with vmstat). The server then becomes
unresponsive, and the NIS+ clients hang until they time out and switch to the
other server. The same problem then hits that server, and the failure bounces
back and forth a few times until it suddenly clears up and NIS+ starts
serving data again. While this is happening, the CPU that rpc.nisd is
running on is pegged.
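
For anyone who wants to catch these events as they happen, here is a minimal
sketch of how the context-switch spike could be flagged from saved vmstat
output. The function name find_cs_spikes and the default threshold of 50000
are my own choices for illustration, and the code assumes the usual vmstat
header line that contains a "cs" column.

```python
def find_cs_spikes(vmstat_lines, threshold=50000):
    """Return (sample_index, cs_value) pairs for samples whose context-switch
    count exceeds threshold. Assumes a vmstat-style header with a 'cs' column."""
    cs_col = None
    spikes = []
    sample = 0
    for line in vmstat_lines:
        fields = line.split()
        if "cs" in fields:
            # Header line; vmstat repeats it periodically, so re-read it each time.
            cs_col = fields.index("cs")
            continue
        if cs_col is None or len(fields) <= cs_col:
            # Group-title lines ("kthr memory page ...") and blanks land here.
            continue
        try:
            cs = int(fields[cs_col])
        except ValueError:
            continue
        if cs > threshold:
            spikes.append((sample, cs))
        sample += 1
    return spikes
```

Feeding it the lines captured from something like "vmstat 5" would give the
sample numbers at which the spike started, which could then be lined up with
the truss and pstack output.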

A truss of the rpc.nisd process during an event shows endless errors:
write(258, "\0\0\0\v a u t o _ d i r".., 240) Err#11 EAGAIN
(the actual NIS+ map being accessed varies from event to event).
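
As background on what Err#11 EAGAIN means in general: a write() on a
non-blocking descriptor returns EAGAIN when the kernel buffer is full because
the peer is not reading. The sketch below reproduces that on a local socket
pair. It is NOT rpc.nisd's code and I don't know that rpc.nisd retries this
way; count_eagain_retries is a hypothetical name, and the 240-byte chunk just
mirrors the size in the truss line.

```python
import errno
import socket

def count_eagain_retries(attempts=1000, chunk=240):
    """Fill a non-blocking socket whose peer never reads, then count how many
    further write attempts fail with EAGAIN/EWOULDBLOCK."""
    reader, writer = socket.socketpair()
    writer.setblocking(False)
    payload = b"\0" * chunk
    try:
        # Fill the send buffer; the reader side is deliberately never drained.
        while True:
            try:
                writer.send(payload)
            except BlockingIOError:
                break
        failures = 0
        # Retrying immediately, without waiting for writability, fails every time.
        for _ in range(attempts):
            try:
                writer.send(payload)
            except BlockingIOError as exc:
                if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                    failures += 1
        return failures
    finally:
        reader.close()
        writer.close()
```

A loop like the second one burns CPU without making progress until the peer
starts reading again, which at least looks similar to what truss shows.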

pstack of rpc.nisd shows the following. (LWP#1 is the big change.)

125: /usr/sbin/rpc.nisd -Y -B
lwp#1 ----------
 ef539fd8 _libc_write (102, 0, f0, 0, 1, 38d328) + c
 ef6cb354 write_vc (3901e8, 4174e0, 2328, 2328, f0, 417541) + 68
 ef6a28ec flush_out (350090, 2328, 0, ef707630, 63, 4175c0) + 40
 ef6a27b0 xdrrec_putbytes (1, 38b4b8, 3, 350090, 3, ff00) + 68
 ef6940e0 xdr_opaque (39298c, 38b4a8, 13, ef707630, 1, ff00) + b4
 ef695744 xdr_string (39298c, 456dfc, ffffffff, ffffffff, c, ef6a0338) + 194
 ef69bb24 xdr_nis_name (39298c, 456dfc, 4152d4, 13, 10000, 398) + c
 ef69fb5c xdr_nis_object (39298c, 456de8, 4152ac, fffffff8, 0, 293845) + 78
 ef69bd2c xdr_array (39298c, 396198, 66, ffffffff, 40, 4708c) + 124
 ef6a01e8 xdr_nis_result (39298c, 396190, ef707630, 396190, ef6c2b84, efffb9b0) + 4c
 ef6cb7e4 svc_vc_reply (3901e8, ef713470, ef707630, 396190, 47140, 392978) + dc
 ef6c36a0 svc_sendreply (3901e8, 47140, 396190, 47800, 1, 65c60) + 40
 0002e15c nis_prog_svc (4fc00, 3901e8, 396190, 1aab4, efffbe8c, 47140) + 440
 ef6c3c48 _svc_prog_dispatch (2dd1c, 3, 365b68, ef707630, 0, 6b3f8) + 19c
 ef6c3a18 svc_getreq_common (3901e8, ef70b680, ef713b84, ef713aa8, 408, 365b68) + b0
 ef6c3948 svc_getreq_poll (efffde5c, 1, ef707630, 1, 0, efffc0e4) + 68
 00017c88 actual_main (6, 477dc, 477d8, 477d4, 43e39dba, 43e38ade) + 143c
 00015810 main (3, effffefc, efffff0c, 47400, 0, 0) + 8
 000157f0 _start (0, 0, 0, 0, 0, 0) + dc

When behaving normally it looks like this:
125: /usr/sbin/rpc.nisd -Y -B
lwp#1 ----------
 ef538148 poll (efffde34, 5, 1d4c0)
 00017b94 actual_main (5, 477dc, 477d8, 477d4, 43e39ee6, 43e38ade) + 1348
 00015810 main (3, effffefc, efffff0c, 47400, 0, 0) + 8
 000157f0 _start (0, 0, 0, 0, 0, 0) + dc
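
To tell the two states apart automatically when sampling pstack repeatedly,
something like the following could pull the top frame of lwp#1 out of saved
pstack output. top_frame_of_lwp1 is a hypothetical helper, and the regular
expression assumes the frame layout shown in the dumps above (an 8-digit hex
address, the function name, then the argument list).

```python
import re

# Matches a frame line such as " ef538148 poll (efffde34, 5, 1d4c0)".
FRAME_RE = re.compile(r"^\s*[0-9a-f]{8}\s+(\S+)\s*\(")

def top_frame_of_lwp1(pstack_text):
    """Return the function name of the first frame listed under lwp#1,
    or None if no such frame is found."""
    in_lwp1 = False
    for line in pstack_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("lwp#"):
            # Only the "lwp#1 ----------" section is of interest here.
            in_lwp1 = stripped.startswith("lwp#1 ")
            continue
        if in_lwp1:
            match = FRAME_RE.match(line)
            if match:
                return match.group(1)
    return None
```

On the two dumps above this would report _libc_write during an event and poll
when the server is behaving normally.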

Sorry for the rambling post and random bits of data, but I don't know where
to go from here.

Anyone? Any suggestions?

Thanks,
Colin