V880 ge0 timeout/breaks connections

From: Grzegorz Bakalarski (G.Bakalarski@icm.edu.pl)
Date: Fri Sep 15 2006 - 09:25:51 EDT


Dear Gurus,

I know, what you are thinking: "Again this guy? It's becoming boring ..."

Well, the problem is:

I have a V880 with 6x900MHz UltraSPARC III CPUs and 12GB of memory, running Solaris 9,
patched with the latest cluster patch and the SAN Foundation Kit patches.
The machine is 4 years old (not on contract).
It has two on-board Ethernet interfaces:
100Mbit eri0 BaseT and 1000Mbit ge0 BaseSX (optical).
For years I did not have access to a proper switch to connect
the optical interface to (instead I used an old, unsupported 100Mbit
BaseT rh0 Realtek PCI card, which is now out of order).
Now my institute has installed a new HP 5300cl gigabit switch and I have also
got an optical transceiver. So I connected the SC port to the switch
with the proper cable, configured it, and it worked for a few months.
From time to time I observed some problems transferring
files between my servers, but they were temporary.
Now my transfer scripts have stopped working.
After investigating I discovered that a transfer from the master server
(this V880) to the slave servers (V440s with ce gigabit interfaces) hangs at around 835 MBytes
(it rarely happens that I need to transfer such a big chunk of data in one transfer).
The size at which the transfer breaks is almost independent of the
transfer method I choose:
rcp
tar -cf- |rsh "tar -xf"
ftp
(Only scp does the trick, but as I learned scp can resend data automatically;
it is also very slow and consumes a lot of CPU.)
A reboot does not change anything either: the transfer stops at around 876554220 bytes
(+- 86000 bytes depending on the transfer method). This behaviour is fully repeatable.
Even stranger: this happens only when I "put" data onto the
slave servers. If I "get" it from the slaves I can transfer
everything at speeds of up to 430Mbits/s, with no errors.
I googled around and searched SunSolve, and found similar
cases for Solaris 2.6 in 2001-2002 and for Solaris 8 in 2004, for exactly
this driver (gem). I also found the latest patch for the driver, 117119-05
(dated Jul 1st, 2005), installed it and rebooted the machine. But it is no better.
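
A minimal sketch of how the stall described above could be reproduced without involving real data: push a known number of zero bytes through a remote shell and let the far end count what arrives. The function name push_zeros and the host name slave1 are made up for illustration; only dd, rsh and wc are assumed to be available.

```shell
# push_zeros MB CMD [ARGS...]
# Write MB mebibytes of zeros to the standard input of CMD.
# CMD is whatever carries the data to the other side, so the same
# helper can be pointed at rsh, ssh, or a local command for a dry run.
push_zeros() {
    mb=$1
    shift
    dd if=/dev/zero bs=1048576 count="$mb" 2>/dev/null | "$@"
}

# Hypothetical usage (slave1 is a placeholder host name):
#   push_zeros 1024 rsh slave1 'wc -c'
# If the link is healthy the remote wc prints 1073741824; if the
# transfer hangs the way rcp and ftp do, the command never returns.
```

Since the stall is reported at roughly 835 MBytes, sending 1024 MBytes should be enough to hit it; running the same command with a smaller count would show whether shorter transfers always complete.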

Searching SunSolve further, today I found another workaround (for a V490 server
failing during a heavy sync-TTCP test suite):

"Edit the /etc/system file with following lines
set ge:ge_put_cfg=0
set ge:ge_nos_tmds=8192
and reboot".

But will it help? I can't reboot the machine every night ...
Why was this not stated in 117119-05.README?
Is this a common, well-known problem with the V880's ge driver?
What other settings for ge in /etc/system should I consider?
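
Before and after trying the /etc/system workaround, it may help to record which ge tunables are actually set in the file. The following is only a sketch: the function name ge_tunables is invented here, and it assumes each setting sits on its own line beginning exactly with "set ge:" (comment lines in /etc/system start with "*" and are therefore skipped).

```shell
# ge_tunables [FILE]
# Print every "set ge:..." line from FILE (default: /etc/system).
ge_tunables() {
    file=${1:-/etc/system}
    grep '^set ge:' "$file"
}

# Hypothetical usage:
#   ge_tunables                    # inspect the live /etc/system
#   ge_tunables /etc/system.orig   # inspect a saved copy (placeholder path)
```

Changes to /etc/system are read at boot, so a line listed here only describes the running kernel if the machine has been rebooted since the edit.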

uname -a
SunOS 5.9 Generic_118558-30 sun4u sparc SUNW,Sun-Fire-880
modinfo|grep GEM
101 7829c000 185ae 108 1 ge (GEM Ethernet Driver (B) v2.50 )
showrev -p|grep 117119
Patch: 117119-05 Obsoletes: Requires: 112233-12 Incompatibles: Packages: SUNWged, SUNWgedu, SUNWgedx

Please help!

GB


