SUMMARY: More tape woes!

From: Andrew Raine (Andrew.Raine@mrc-dunn.cam.ac.uk)
Date: Wed Nov 13 2002 - 05:27:44 EST



(Original question at the end)

Many thanks to Dr Tom, Jim Fitzmaurice and Selden Ball for their
advice. Jim said reboot, Tom said try to delete and re-create the
device files using hwmgr and dsfmgr, and both Tom and Selden suggested
that the HSG might be causing the problem. To quote Tom:

    "I have no idea why the device name got changed, but it sure looks like
    something out on the bus was very confused, and that apparently
    confused the system software, or it may have been a software bug plain
    and simple, there are known to be some interesting problems in some of
    the error recovery paths and your HSG80 might be making them surface."

Anyway, after messing around with hwmgr and dsfmgr to try to remove the
rogue device files and re-create them, I resorted to just deleting them
and rebooting. Problem solved, for the moment.
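
Since the problem may come back, a check for it could be scripted and run
between savesets. What follows is only a rough sketch, not something from
the replies: misnamed_tapes and TAPE_MODEL_HINTS are hypothetical names,
the column layout is assumed to match the "hwmgr -v d" output quoted in
the original question, and treating any model string containing "DLT" as
a tape drive is an assumption that would need adjusting for other hardware.

```python
import re

# Assumption: model strings containing one of these mark a tape drive.
TAPE_MODEL_HINTS = ("DLT",)

# Assumed layout of one "hwmgr -v d" row: HWID, device name, mfg/model, location.
ENTRY = re.compile(
    r"^\s*(\d+):\s+(/dev/\S+)\s+(.*?)\s+(bus-\d+-targ-\d+-lun-\d+)\s*$"
)


def misnamed_tapes(hwmgr_output):
    """Return (hwid, device, location) for tape-like models named as disks."""
    suspects = []
    for line in hwmgr_output.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue  # header, separator, or a row in a layout we do not know
        hwid, device, model, location = match.groups()
        looks_like_tape = any(hint in model.upper() for hint in TAPE_MODEL_HINTS)
        if looks_like_tape and device.startswith("/dev/disk/"):
            suspects.append((int(hwid), device, location))
    return suspects


# The two rows quoted in the original question, without the "> " prefix.
SAMPLE = """\
 HWID: Device Name          Mfg      Model            Location
   74: /dev/changer/mc1              TL820            bus-2-targ-6-lun-0
   75: /dev/disk/dsk30c     QUANTUM  SuperDLT1        bus-2-targ-1-lun-0
"""

if __name__ == "__main__":
    for hwid, device, location in misnamed_tapes(SAMPLE):
        print(f"HWID {hwid}: {device} at {location} looks like a tape drive")
```

In practice the text would come from capturing the output of hwmgr on the
node that owns the drive rather than from a pasted sample, and a non-empty
result would be a reason to stop the backup before the next saveset.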

Thanks again,

Andrew

--
Dr. Andrew Raine, Head of IT, MRC Dunn Human Nutrition Unit, 
Wellcome Trust/MRC Building, Hills Road, Cambridge, CB2 2XY, UK
phone: +44 (0)1223 252830   fax: +44 (0)1223 252835
web: www.mrc-dunn.cam.ac.uk email: Andrew.Raine@mrc-dunn.cam.ac.uk
> Dear Managers,
> 
> Thanks to all the help I received here, I fixed the backup/tape/changer
> troubles that I was having.  However .......
> 
> I have a 2-node cluster, running 5.1PK3, with a Neo Overland SDLT
> drive/changer attached to one node (ES40, 4 procs).
> 
> Everything was fine, and my backups were running well enough for me to
> feel confident to replace my HSZ80 with an HSG80.  Things carried on
> working until half-way through Sunday morning's backup, when the device file
> for my SDLT drive vanished between vdump savesets!
> 
> I was using /dev/ntape/tape2_d1, but now:
> 
> # mt -f /dev/ntape/tape2_d1 status
> /dev/ntape/tape2_d1: No such device or address
> 
> Also, if I do:
> 
> # hwmgr -v d
>  HWID: Device Name          Mfg      Model            Location
>  ------------------------------------------------------------------------------
> <lines deleted>
>    74: /dev/changer/mc1              TL820            bus-2-targ-6-lun-0
>    75: /dev/disk/dsk30c     QUANTUM  SuperDLT1        bus-2-targ-1-lun-0
>             ^^^^^^^^^^^
> i.e. the system thinks that the SDLT drive is now a *disk*!
> 
> However, the robot utility and xrobot can both see the drive and
> successfully load and unload tapes.
> 
> Can anyone explain what might have happened?  Is there a way of
> re-building the appropriate device files without bringing the system
> down?  (Actually, I don't know how to do this even if I do bring the
> system down - could someone point me in the right direction?)
> 
> Further diagnostics are appended.....
> 
> Regards,
> 
> Andrew
> 
> --
> Dr. Andrew Raine, Head of IT, MRC Dunn Human Nutrition Unit, 
> Wellcome Trust/MRC Building, Hills Road, Cambridge, CB2 2XY, UK
> phone: +44 (0)1223 252830   fax: +44 (0)1223 252835
> web: www.mrc-dunn.cam.ac.uk email: Andrew.Raine@mrc-dunn.cam.ac.uk
> 
> Further information:
> 
> The system logged:
> 
>     Sequence number of error: 148505507
>     Time of error entry: 10-Nov-2002 01:04:37
>     Host name: beta
>     
>     SCSI CAM ERROR PACKET
>     Controller type: DISK
>     SCSI device class: DEC SIM
>     Bus Number: 2   
>     Target number: 1
>     Lun Number: 3
>     
>     Name of routine that logged the event: ss_perform_timeout
>     Event information: timeout on disconnected request
>                     
>                     ############### Entry End ###############
>     
>     Event information: Active CCB at time of error
>                     
>                     ############### Entry End ###############
> 
> Bus 2 only has the SDLT Drive and the changer on it.  Nothing on bus 2
> has (Target,Lun) = (1,3) though.  Not even the two DLT drives and their
> TL891 changer that I removed at the time of the HSG upgrade.
> 
> There also seemed to be some funny NFS stuff going on at the same time.
> From /var/adm/syslog.dated/09-Nov-14:46/kern.log:
> 
> Nov 10 00:40:24 beta vmunix: NFS3 RFS3_WRITE failed for server ftp-bioinf.mrc-dunn.cam.ac.uk: RPC: Timed out
> Nov 10 00:40:24 beta vmunix: NFS3 RFS3_WRITE failed for server ftp-bioinf.mrc-dunn.cam.ac.uk: RPC: Timed out
> Nov 10 00:40:24 beta vmunix: NFS3 write error 60 on host ftp-bioinf.mrc-dunn.cam.ac.uk
> Nov 10 00:40:24 beta vmunix: NFS3 write error 60 on host ftp-bioinf.mrc-dunn.cam.ac.uk
> <repeated about 2,500 times>

