More tape woes!

From: Andrew Raine (Andrew.Raine@mrc-dunn.cam.ac.uk)
Date: Mon Nov 11 2002 - 08:33:00 EST


Dear Managers,

Thanks to all the help I received here, I fixed the backup/tape/changer
troubles that I was having. However .......

I have a 2-node cluster, running 5.1PK3, with a Neo Overland SDLT
drive/changer attached to one node (ES40, 4 procs).

Everything was fine, and my backups were running well enough for me to
feel confident to replace my HSZ80 with an HSG80. Things carried on
working until half-way through Sunday morning's backup, when the device
file for my SDLT drive vanished between vdump savesets!

I was using /dev/ntape/tape2_d1, but now:

# mt -f /dev/ntape/tape2_d1 status
/dev/ntape/tape2_d1: No such device or address

Also, if I do:

# hwmgr -v d
 HWID: Device Name Mfg Model Location
 ------------------------------------------------------------------------------
<lines deleted>
   74: /dev/changer/mc1 TL820 bus-2-targ-6-lun-0
   75: /dev/disk/dsk30c QUANTUM SuperDLT1 bus-2-targ-1-lun-0
            ^^^^^^^^^^^
i.e. the system thinks that the SDLT drive is now a *disk*!
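
A check like the one above can be automated. The following is only a minimal sketch: the function names find_misclassified_tapes and sample_listing are hypothetical, the model patterns (DLT, TAPE) are assumptions chosen for illustration, and the sample input is just the two lines quoted from the hwmgr listing. It flags any entry whose device name is under /dev/disk while its model string looks like a tape drive.

```shell
# Sketch: flag hwmgr entries whose device special file is under /dev/disk
# but whose model string looks like a tape drive. The model patterns
# (DLT, TAPE) are assumptions for illustration, not an exhaustive list.
find_misclassified_tapes() {
    awk '
        # Device lines start with "<HWID>:" followed by the device name.
        $1 ~ /^[0-9]+:$/ && $2 ~ /^\/dev\/disk\// {
            line = toupper($0)
            if (line ~ /DLT|TAPE/) {
                id = $1
                sub(/:$/, "", id)   # strip the trailing colon from the HWID
                print id, $2
            }
        }
    '
}

# Sample input copied from the listing in this message; on a live system
# one would pipe the output of "hwmgr -v d" into the function instead.
sample_listing() {
    cat <<'EOF'
   74: /dev/changer/mc1 TL820 bus-2-targ-6-lun-0
   75: /dev/disk/dsk30c QUANTUM SuperDLT1 bus-2-targ-1-lun-0
EOF
}

sample_listing | find_misclassified_tapes
```

Each flagged entry is printed as the HWID followed by the device name, so the result could be fed to a monitoring script that warns before the next backup run.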

However, the robot utility and xrobot can both see the drive and
successfully load and unload tapes.

Can anyone explain what might have happened? Is there a way of
re-building the appropriate device files without bringing the system
down? (Actually, I don't know how to do this even if I do bring the
system down - could someone point me in the right direction?)

Further diagnostics are appended.....

Regards,

Andrew

--
Dr. Andrew Raine, Head of IT, MRC Dunn Human Nutrition Unit, 
Wellcome Trust/MRC Building, Hills Road, Cambridge, CB2 2XY, UK
phone: +44 (0)1223 252830   fax: +44 (0)1223 252835
web: www.mrc-dunn.cam.ac.uk email: Andrew.Raine@mrc-dunn.cam.ac.uk
Further information:
The system logged:
    Sequence number of error: 148505507
    Time of error entry: 10-Nov-2002 01:04:37
    Host name: beta
    
    SCSI CAM ERROR PACKET
    Controller type: DISK
    SCSI device class: DEC SIM
    Bus Number: 2   
    Target number: 1
    Lun Number: 3
    
    Name of routine that logged the event: ss_perform_timeout
    Event information: timeout on disconnected request
                    
                    ############### Entry End ###############
    
    Event information: Active CCB at time of error
                    
                    ############### Entry End ###############
Bus 2 only has the SDLT drive and the changer on it.  Nothing on bus 2
has (Target,Lun) = (1,3) though, not even the two DLT drives and their
TL891 changer that I removed at the time of the HSG upgrade.

There also seemed to be some funny NFS stuff going on at the same time.
From /var/adm/syslog.dated/09-Nov-14:46/kern.log:
Nov 10 00:40:24 beta vmunix: NFS3 RFS3_WRITE failed for server ftp-bioinf.mrc-dunn.cam.ac.uk: RPC: Timed out
Nov 10 00:40:24 beta vmunix: NFS3 RFS3_WRITE failed for server ftp-bioinf.mrc-dunn.cam.ac.uk: RPC: Timed out
Nov 10 00:40:24 beta vmunix: NFS3 write error 60 on host ftp-bioinf.mrc-dunn.cam.ac.uk
Nov 10 00:40:24 beta vmunix: NFS3 write error 60 on host ftp-bioinf.mrc-dunn.cam.ac.uk
<repeated about 2,500 times>

