SUMMARY: Tape drive problem.

From: Jim Fitzmaurice (jpfitz@fnal.gov)
Date: Wed Oct 02 2002 - 11:18:46 EDT


    Sorry it took so long to get the summary out, but I just got it fixed
today.

    With no answers from the list, I was once again at the mercy of
HP/Compaq service. I knew it was a software problem and asked nicely for my
call to be sent to software, but the call taker ignored me and sent it to
hardware anyway. Of course the field engineer agreed with my assessment,
but after numerous attempts he was unable to transfer the problem to
software. Out of frustration I opened another call and got it sent to
software. (I didn't ask nicely this time; it was more of a demand.) After
I talked with the first person and went through his "cookbook" solutions,
some of which I'd already tried, he couldn't transfer the call to the
experts. I ended up opening a third call to finally get to speak with a
UNIX expert. (Did you ever get the feeling this "merger" wasn't going as
smoothly as planned?) After numerous attempts with "hwmgr" and "dsfmgr"
with an alphabet of options, she determined it would take a system reboot
to get the drive back. The problem was that the drive's "DRIVER OWNER" was
set to 4 (as shown by "hwmgr -show scsi"), which would not allow us to
remove and recreate the device. I informed her this was a cluster and asked
if I could just reboot the member the drive was attached to, or if the
entire cluster had to be rebooted. She consulted with the cluster experts
and they determined an entire cluster reboot would be required.
Unfortunately I was unable to get the entire cluster for a reboot until
early this morning. The reboot went well, with no problems, and she was
right: the reboot did fix the tape drive, and it is now accessible.

    Imagine that: reboot the system to fix a problem. I needed an expert to
tell me that?

    (In case anybody's interested, the original problem is below.)

James Fitzmaurice
D0 Online Systems Manager
Fermi National Accelerator Laboratory
(630) 840-4011
jpfitz@fnal.gov

UNIX is very user friendly, It's just very particular about who it makes
friends with.

----- Original Problem -----
Date: Thursday, September 26, 2002 9:52 AM

> Managers,
>
> I have a 3-system cluster, consisting of one GS80 (Member1) and two
> 4100s (Member2 and Member3), running Tru64 v5.1 and TruCluster v5.1
> PatchKit 5. I have one DLT4000 on each machine in the cluster, and Member2
> (a 4100) has an additional DLT8000 in a tape library which I use for
> backups. This morning I came in and my backups had completely failed.
> The reason:
>
> /dev/tape/tape3_d1: No such device or address
>
> The only unusual activity on the cluster occurred yesterday morning.
> HP/Compaq support had determined our Memory Channel Adapter was bad on
> Member3 (a 4100). Backups were still running on Member2 when the Field
> Engineer arrived to replace the board. I took down Member3 and we replaced
> the board; backups continued to run normally on Member2, and the cluster
> remained up. After restoring Member3, I noticed a GB Ethernet Adapter was
> no longer working on that machine. The FE ordered a replacement, and about
> 90 minutes later it arrived; I brought Member3 down again and we replaced
> the GB Ethernet Adapter. Again the cluster continued to function and
> backups continued to run normally. Member3 came back 100% this time, and
> shortly after that backups ran to their normal conclusion on Member2.
>
> This morning however, backups failed and /dev/tape/tape3 is acting
> weird.
>
> I can run "scu> show edt" and it sees the device:
>
> 4 1 0 Sequential SCSI-2 QUANTUM DLT8000 0250 W
>
> I can run "hwmgr -view devices" and it shows up there too:
>
> 493: /dev/ntape/tape3 QUANTUM DLT8000 bus-4-targ-1-lun-0
>
> And the device files exist as well:
>
> crw-rw-rw- 1 root system 19,2002 Jul 30 17:00 /dev/tape/tape3
> crw-rw-rw- 1 root system 19,2006 Jul 30 17:00 /dev/tape/tape3_d0
> crw-rw-rw- 1 root system 19,2008 Sep 26 09:28 /dev/tape/tape3_d1
> crw-rw-rw- 1 root system 19,2010 Jul 30 17:01 /dev/tape/tape3_d2
> crw-rw-rw- 1 root system 19,2012 Jul 30 17:01 /dev/tape/tape3_d3
> crw-rw-rw- 1 root system 19,2014 Jul 30 17:01 /dev/tape/tape3_d4
> crw-rw-rw- 1 root system 19,2016 Jul 30 17:01 /dev/tape/tape3_d5
> crw-rw-rw- 1 root system 19,2018 Jul 30 17:01 /dev/tape/tape3_d6
> crw-rw-rw- 1 root system 19,2020 Jul 30 17:01 /dev/tape/tape3_d7
> crw-rw-rw- 1 root system 19,2004 Jul 30 17:00 /dev/tape/tape3c
>
> However, when I try to run any other command to actually access the
> drive, it's not there:
>
> > mt -f /dev/tape/tape3 status
> /dev/tape/tape3: No such device or address
> > tar -cvf /dev/tape/tape3_d1 /etc/brutab
> tar: cannot open /dev/tape/tape3_d1: No such device or address
> > dd if=/etc/brutab of=/dev/tape/tape3_d1
> /dev/tape/tape3_d1: No such device or address
>
> All cables have been reseated, and the bus is properly terminated. I
> tried rebooting the library and the drive. Rebooting Member2 or the entire
> cluster will not be feasible for at least a week. There is nothing unusual
> in the messages log, and the last error in the binary error log was
> yesterday afternoon; it was a disk error (wrong target, different LUN),
> not this device.
>
> I realize that something done to one member of a cluster can affect the
> other members, but I wouldn't think a couple of reboots to replace bad
> cards could cause a device to behave like that. What could have happened?
> But more importantly, does anybody know how I can fix it?
>
> Any help would be greatly appreciated.
>
> James Fitzmaurice
> D0 Online Systems Manager
> Fermi National Accelerator Laboratory
> (630) 840-4011
> jpfitz@fnal.gov
>
> UNIX is very user friendly, It's just very particular about who it makes
> friends with.
>
>
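
For anyone who hits the same symptom: the "No such device or address"
text in the failed mt/tar/dd commands quoted above is the standard message
for the ENXIO error code, which is different from ENOENT ("No such file or
directory"). In other words, the device special file was found, but nothing
answered behind it. The following is only a rough sketch of how one might
check that distinction for several device files at once. It is written in
Python for illustration and was not part of the original troubleshooting;
the helper names probe() and describe() are made up for this example, and
the device names are simply the ones from the listing above.

```python
import errno
import os


def probe(path):
    """Try to open path read-only.

    Returns None if the open succeeds, otherwise the symbolic errno
    name (for example "ENXIO" or "ENOENT").
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return None


def describe(path):
    """Turn the probe result into a one-line, human-readable summary."""
    result = probe(path)
    if result is None:
        return f"{path}: opened OK"
    if result == "ENXIO":
        # Device file exists, but no device/driver responded to the open.
        return f"{path}: ENXIO (No such device or address)"
    if result == "ENOENT":
        # The device special file itself is missing.
        return f"{path}: ENOENT (No such file or directory)"
    return f"{path}: {result}"


if __name__ == "__main__":
    # Device names taken from the listing in the original problem.
    for name in ("tape3", "tape3_d0", "tape3_d1", "tape3c"):
        print(describe("/dev/tape/" + name))
```

An ENXIO result on every file for one drive, while the files themselves
are present, would match what I saw; it does not by itself say why the
device stopped answering.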


