RIS client boot problem

From: LHERCAUD@bouyguestelecom.fr
Date: Tue Jun 18 2002 - 12:53:27 EDT


Tru64 Managers,

I would like to report a problem I found when trying to install Tru64
version 5.1A through RIS.
Actually, we found 2 problems, both related to the reset of the Ethernet
controller in conjunction with the topology of the Catalyst switches.

Symptoms: the RIS kernel booted on the clients fails to access the root
file system (on the server) over the network. Both the RIS server and the
clients are on the same subnet.

It appears clear to me that the cause is the RESET of the Ethernet
controller done inside "vmunix1", which triggers a recalculation of the
Catalyst spanning tree. The port stays blocked for a period of time longer
than the one covered by the retries hard coded in the kernel. As a result,
the booted kernel cannot reach the network in time and systematically hangs
or crashes, which prevents the installation from continuing.
Inserting a hub between the Catalyst port and the RIS client "hides" the
reset from the switch, so the installation succeeds. Of course, the
additional hub is not an acceptable solution and was put in place only for
testing and to prove the diagnosis.
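
To put rough numbers on this: with the IEEE 802.1D default forward delay of
15 seconds, a switch port spends about 30 seconds in the listening and
learning states after link-up before it forwards traffic. The sketch below
(Python, illustrative only) compares that window with the time covered by a
fixed number of retries. The 4-second per-attempt timeout is an assumption
made for the example; the real timeout in the RIS kernel is not known here,
and the Catalyst timers may be configured differently.

```python
# IEEE 802.1D default forward delay in seconds; a switch may be tuned differently.
FORWARD_DELAY_S = 15

# Hypothetical per-attempt timeout; the real kernel value is unknown.
ASSUMED_TIMEOUT_S = 4


def stp_blackout_s(forward_delay_s=FORWARD_DELAY_S):
    """Seconds a port stays non-forwarding after link-up (listening + learning)."""
    return 2 * forward_delay_s


def retry_window_s(retries, per_attempt_timeout_s):
    """Total time covered by a fixed number of equally long attempts."""
    return retries * per_attempt_timeout_s


def survives_reset(retries, per_attempt_timeout_s, blackout_s=None):
    """True if the retries outlast the period during which the port is blocked."""
    if blackout_s is None:
        blackout_s = stp_blackout_s()
    return retry_window_s(retries, per_attempt_timeout_s) > blackout_s


if __name__ == "__main__":
    for retries in (2, 5, 30, 31):
        window = retry_window_s(retries, ASSUMED_TIMEOUT_S)
        verdict = "ok" if survives_reset(retries, ASSUMED_TIMEOUT_S) else "too short"
        print(f"{retries:2d} retries cover {window:3d} s: {verdict}")
```

Under these assumptions the original counts (2 and 5) end well inside the
30-second window, while 30 or 31 retries outlast it.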

I do not have access to the source code, but I managed to find 2 subroutines
that access the network using bootp. Both have some retry code, but not
enough to give the Catalyst time to "calm" down:
======================== bootp_info()
/* OLD: retrys=5 */
/* while (retrys--) { */

  [bootp_info:3464, 0xfffffc000067d24c] bis zero, 0x4, s4 /* old */
  [bootp_info:3467, 0xfffffc000067d250] bis zero, 0x5, s6 /* old */
/* !!! NEW: retrys=31 !!! */
  [bootp_info:3464, 0xfffffc000067d24c] bis zero, 0x1e, s4 /* new */
  [bootp_info:3467, 0xfffffc000067d250] bis zero, 0x1f, s6 /* new */
======================== netboot_bootp_open()
/* OLD tries 2 times */
  [netboot_bootp_open:467, 0xfffffc000023ab1c] bis zero, 0x2, s5 /* old */
/* NEW tries 30 times */
  [netboot_bootp_open:467, 0xfffffc000023ab1c] bis zero, 0x1e, s5 /* new */
========================
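
For readers who want to see what changes in the binary: "bis zero, lit, Rc"
is an Alpha operate-format instruction in literal form (opcode 0x11,
function 0x20, 8-bit literal in bits 20..13, bit 12 set). The Python sketch
below encodes the old and new instructions from the listing. The function
names are mine, the register numbers (s4=R13, s5=R14, s6=R15) follow the
standard Alpha register usage, and the little-endian byte order is an
assumption about how the words sit in the kernel file.

```python
OPCODE_LOGICAL = 0x11   # opcode shared by AND, BIC, BIS, XOR, ...
FUNC_BIS = 0x20         # function code selecting BIS (logical OR)
REG_ZERO = 31           # R31 always reads as zero

# Register numbers for the names used in the listing.
REGS = {"s4": 13, "s5": 14, "s6": 15}


def encode_bis_literal(literal, rc):
    """Encode `bis zero, literal, rc` as a 32-bit instruction word."""
    if not 0 <= literal <= 0xFF:
        raise ValueError("literal must fit in 8 bits")
    if not 0 <= rc <= 31:
        raise ValueError("register number must be 0..31")
    return (
        (OPCODE_LOGICAL << 26)
        | (REG_ZERO << 21)     # Ra = zero
        | (literal << 13)      # 8-bit literal operand
        | (1 << 12)            # literal-form flag
        | (FUNC_BIS << 5)
        | rc                   # destination register
    )


def as_file_bytes(word):
    """Bytes of the instruction word, assuming little-endian storage."""
    return word.to_bytes(4, "little")


if __name__ == "__main__":
    patches = [("s4", 0x04, 0x1E), ("s6", 0x05, 0x1F), ("s5", 0x02, 0x1E)]
    for reg, old_lit, new_lit in patches:
        old = encode_bis_literal(old_lit, REGS[reg])
        new = encode_bis_literal(new_lit, REGS[reg])
        print(f"{reg}: {old:#010x} -> {new:#010x}  "
              f"({as_file_bytes(old).hex()} -> {as_file_bytes(new).hex()})")
```

Only the literal field differs between each old and new word, so each patch
touches at most two bytes of the 4-byte instruction.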
Given that, I patched the Alpha binary code and increased the retry counts,
which allowed the RIS clients to boot with no problem (see above for the
assembler lines of the original and the patched code).
I attach to this mail the Patch.ksh script and the "zapb" program
I used to modify the above instructions in the RIS binary kernel.
 <<Patch.ksh>> <<zapb.c>>
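
Since the attachments are not reproduced in this archive, here is a minimal
Python sketch of the general technique: overwrite a few bytes at a known
file offset, but only after checking that the bytes currently there are the
expected ones. This is not the zapb program; patch_bytes() and demo() are
hypothetical names, and the offset used in the demo is arbitrary. Mapping a
kernel virtual address such as 0xfffffc000067d24c to a file offset depends
on the object file headers and is not covered here.

```python
import os
import tempfile


def patch_bytes(path, offset, expected_old, new):
    """Overwrite bytes at `offset`, but only if the current bytes match."""
    if len(expected_old) != len(new):
        raise ValueError("old and new byte strings must have the same length")
    with open(path, "r+b") as f:
        f.seek(offset)
        found = f.read(len(expected_old))
        if found != expected_old:
            raise ValueError(
                f"refusing to patch: found {found.hex()} at offset {offset:#x}"
            )
        f.seek(offset)
        f.write(new)


def demo():
    """Patch a scratch file standing in for the kernel; return its new content."""
    # Assumed little-endian encodings of the instruction words.
    old = bytes.fromhex("0d94e047")   # bis zero, 0x4, s4
    new = bytes.fromhex("0dd4e347")   # bis zero, 0x1e, s4
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\x00" * 8 + old + b"\x00" * 4)
        patch_bytes(path, 8, old, new)   # offset 8 is arbitrary for the demo
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo().hex())
```

Checking the old bytes first means that running the patch against a
different kernel build fails loudly instead of corrupting the file. Work on
a copy of the kernel in any case.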
One more thing: I detected the same problem (an insufficient number of
retries in bootp_info()) in the RIS kernel that comes with Tru64 4.0F.
I developed a similar fix/patch for that kernel and submitted it to
the local Compaq support organisation. Unfortunately, it seems that the
fix/patch I proposed did not make it through to engineering.

So you may ask why I am still sending this mail.
Here are my 2 reasons:
1. I really think this workaround may help some people on this list fix
their RIS problems.
2. I hope that someone in HP/Compaq Tru64 engineering (maybe Dr. Blinn) will
have a look at the fix/patch I propose, then at the actual "C" code, and
decide to endorse the 2 minor changes to the retry counts that allow the RIS
kernel to boot despite the Catalyst spanning tree effect. That way I can
hope to get a "supported" solution and satisfy my customer.

Best regards to all of you.

Lucien HERCAUD
Freelance / UNIX Consulting
former DEC OSF/1 Digital Support & UNIX Partner
* Tel. Fixe +33 1 3945 4260 * BOUYGUES Telecom
* Tel. Mobile +33 6 0944 2880 * 24 Avenue de l'Europe
* Fax +33 1 3945 4322 * 78944 VELIZY Cedex
* lhercaud@bouyguestelecom.fr
mailto:lhercaud@mail.dotcom.fr
