[SUMMARY] emx Error

From: Regis Carlier (regis.carlier@apx.fr)
Date: Thu Jul 17 2003 - 08:58:58 EDT


Thanks to all who replied.

The problem reappeared, so the faulty card was replaced.

Reg/

-----

As with any complex hardware, a transient or intermittent problem can
be detected, reported, and then corrected in some way. You have to
assume the error was real, but it may have been a one-time event. The
"emx" driver makes heroic efforts to keep the hardware working, and it
probably reset the adapter to a good state (sometimes this is possible,
sometimes not, depending on the nature of the fault) and kept going.
But keep an eye on the logs, and if you get another instance, you
probably should get the hardware repaired or replaced; if you have a
support contract, file a formal support call.

Tom

   Dr. Thomas P. Blinn + Tru64 UNIX Software + Hewlett-Packard Company
 Internet: tpb@zk3.dec.com, thomas.blinn@compaq.com, thomas.blinn@hp.com
  110 Spit Brook Road, MS ZKO3-2/W17 Nashua, New Hampshire 03062-2698
   Alpha Hardware Platforms and I/O Phone: (603) 884-0646
     ACM Member: tpblinn@acm.org PC@Home: tom@felines.mv.net

  Worry kills more people than work because more people worry than work.

      Keep your stick on the ice. -- Steve Smith ("Red Green")

     My favorite palindrome is: Satan, oscillate my metallic sonatas.
                                -- Phil Agre, pagre@alpha.oac.ucla.edu

     Yesterday it worked / Today it is not working / UNIX is like that
                        -- apologies to Margaret Segall

  Opinions expressed herein are my own, and do not necessarily represent
  those of my employer or anyone else, living or dead, real or imagined.

-----
EMX Adapter Hardware Errors

If the adapter reported a h/w error prior to version V2.00 of
the emx driver, the system would panic.

Starting with V2.00 of the emx driver, adapter reset functionality was
added. Instead of panicking, the driver resets the adapter in an attempt
to recover it, since most reported h/w errors are transient events. If
the adapter hangs in reset or fails to complete the reset correctly, it
is marked dead and removed from the running configuration. The adapter
will return to use at the next boot.
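
The policy described above can be sketched as a small decision function.
This is only an illustrative model in Python, not driver code: the names
AdapterState, SystemPanic, handle_hw_error and the reset_adapter callback
are all hypothetical, and the version is assumed to be given as a
(major, minor) tuple.

```python
from enum import Enum


class AdapterState(Enum):
    RUNNING = "running"
    DEAD = "dead"  # removed from the running configuration until next boot


class SystemPanic(Exception):
    """Stand-in for the system panic of pre-V2.00 drivers (hypothetical)."""


def handle_hw_error(driver_version, reset_adapter):
    """Model what happens when the adapter reports one hardware error.

    driver_version: (major, minor) tuple, e.g. (2, 0) for V2.00 (assumed form)
    reset_adapter:  hypothetical callable; returns True if the reset
                    completed correctly, False if it hung or failed
    """
    if driver_version < (2, 0):
        # Before V2.00 a reported h/w error caused a panic.
        raise SystemPanic("adapter reported a hardware error")
    if reset_adapter():
        # Reset succeeded; the adapter keeps running and I/O is retried.
        return AdapterState.RUNNING
    # Reset hung or failed: adapter is marked dead until the next boot.
    return AdapterState.DEAD
```

For example, handle_hw_error((2, 0), lambda: False) models a reset that
fails and yields AdapterState.DEAD.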

So with newer versions of the driver, hardware parity errors become just
a nuisance: they force I/O retries after the adapter is reset. Cases have
been seen, however, where the adapter becomes wedged and hangs, causing a
system hang. More than a few parity or other errors reported over a few
days to a few weeks typically indicates a board going bad.
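
The "more than a few errors over a few days to a few weeks" rule of
thumb can be turned into a rough check. The sketch below is in Python
and rests on assumptions: the function name board_looks_suspect is
hypothetical, and the defaults (more than 3 errors inside any 14-day
span) are placeholder numbers, not thresholds given by HP. The error
timestamps are assumed to have been collected from the logs already.

```python
from datetime import datetime, timedelta


def board_looks_suspect(error_times, max_errors=3, window=timedelta(days=14)):
    """Return True if more than max_errors errors fall within any
    span of length `window`. Both defaults are assumed placeholders."""
    times = sorted(error_times)
    start = 0
    for end, t in enumerate(times):
        # Slide the left edge forward until the span fits the window.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > max_errors:
            return True
    return False
```

A single isolated error never trips this check; a cluster of errors
close together does, which matches the advice to watch the logs and
act on a repeat.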

The h/w errors which will cause an adapter reset include:

HW ERR:EBUS Parity Error
HW ERR:BBUS Parity Error
HW ERR:Host Bus(PCI) Error
HW ERR:Sequence Manager Fatal Error
HW ERR:BIU Fatal Error
HW ERR:ENDEC Fatal Error
HW ERR:Context SRAM Fatal Error
HW ERR:Buffer SRAM Fatal Error

The most common error is the BBUS Parity Error.
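
To see which of the errors listed above is actually turning up, one
could tally them from saved log text. The Python sketch below matches
only the "HW ERR:" strings from the list; the rest of the log line
format is not shown in this thread, so the code makes no assumption
about it beyond the string appearing somewhere on the line. The names
RESET_ERRORS and count_reset_errors are hypothetical.

```python
import re
from collections import Counter

# Error strings taken from the list above.
RESET_ERRORS = (
    "EBUS Parity Error",
    "BBUS Parity Error",
    "Host Bus(PCI) Error",
    "Sequence Manager Fatal Error",
    "BIU Fatal Error",
    "ENDEC Fatal Error",
    "Context SRAM Fatal Error",
    "Buffer SRAM Fatal Error",
)

# re.escape protects the parentheses in "Host Bus(PCI) Error".
_PATTERN = re.compile(
    r"HW ERR:\s*(" + "|".join(re.escape(e) for e in RESET_ERRORS) + r")"
)


def count_reset_errors(lines):
    """Count occurrences of each known reset-causing error in `lines`."""
    counts = Counter()
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of a saved log (for example, an open file object)
gives a per-error count, which makes a run of BBUS parity errors easy
to spot.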

BBUS Parity Error
The adapter sets this error to indicate a parity error on its
internal BBus.

Robert Mclean
HPTC Support
HP Services Americas

Office 352-726-9087
Pager 352-268-0030
E-mail robert.mclean2@hp.com

-----

        Call your service vendor and see if a single parity error
        on the particular model HBA is grounds to have it replaced.
        Systems, Storage subsystems and even I/O adapters are
        designed to tolerate certain types of errors. If an
        error is detected there are often recovery procedures
        that the hardware uses to allow it to continue running.
        A small number of correctable errors during the lifetime
        of a device are to be expected.

        If the problem is not correctable or the frequency of
        correctable errors is too high, then the part should
        be replaced.

        Particular to this problem: if the system didn't crash,
        the domain didn't panic, and the data in transit at the
        time wasn't corrupted, then the hardware and software
        error recovery worked as expected. If errors such as
        this continue, then there is a risk that a more serious
        non-correctable problem will eventually occur. If this
        error is uncommon, then you can probably trust that the
        error recovery will handle it if it happens again.

        The system error log may have more information about the
        error. DECevent may still work on V5.1B, but uerf(8)
        isn't likely to. Compaq Analyze may have bit-to-text
        translation for the error, if something made it into the
        binary error log.

---
ESC:wq
--
Régis Carlier, APX Computer, 31 rue Denis Papin,
Parc Club des Prés, 59650 Villeneuve d'Ascq
Tel: +33320190018 , Fax: +33320190010 ,
Gsm: +33686943971 , Mail: Regis.Carlier@apx.fr

