NFS issue

From: Michael DeSimone (michael@desimone.net)
Date: Mon Sep 24 2007 - 20:48:53 EDT


Hello Gurus!

Background
We have a group of Solaris 8 servers that share a NetApp filer for shared
storage. These servers are the development servers for a bunch of batch
processes. Since they are just development, they don't use cron or any other
scheduler to run their jobs. Instead they rely on a 'trigger file' placed in a
specific directory by a 'master' server. The 'slaves' have a process (just a
shell script) that runs and checks for the existence of that file. If it is
there, they fire off the batch processes that handle whatever data is
waiting for them.
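
For anyone who wants a concrete picture, the slave-side check looks roughly
like the sketch below. This is not our actual script: the directory, the file
name, the polling interval, and the run_batch and poll_for_trigger functions
are all placeholder names made up for illustration.

```shell
#!/bin/sh
# Sketch of a slave-side poller. Every name and path here is hypothetical.
TRIGGER_DIR=${TRIGGER_DIR:-/shared/triggers}   # NFS-mounted directory (assumed path)
TRIGGER_FILE=${TRIGGER_FILE:-run.trigger}      # file dropped by the master (assumed name)
POLL_SECS=${POLL_SECS:-60}                     # seconds between checks (assumed value)

# Placeholder for whatever the batch jobs really are.
run_batch() {
    echo "trigger found, starting batch jobs"
}

# Check up to $1 times, sleeping between attempts.
# Returns 0 and runs the jobs on the first hit, 1 if the file never shows up.
poll_for_trigger() {
    max=$1
    i=0
    while [ "$i" -lt "$max" ]; do
        if [ -f "$TRIGGER_DIR/$TRIGGER_FILE" ]; then
            run_batch
            return 0
        fi
        i=`expr "$i" + 1`
        if [ "$i" -lt "$max" ]; then
            sleep "$POLL_SECS"
        fi
    done
    return 1
}
```

Calling poll_for_trigger 10 would check ten times, one POLL_SECS apart, and
give up if the file never appears.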

Problem
The issue that we have run into is that, since we patched the servers with
the latest jumbo patch, the job no longer picks up the trigger file when it
is written to the directory. We noticed that if you log into a slave, cd to
the directory and do an ls, the process will find the file. Of course, doing
that by hand defeats the purpose of the whole process. So we placed
( cd $DIR; ls > /dev/null 2>&1 ) into the script, and it provides a valid
workaround.
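
In script form, the workaround amounts to listing the directory right before
each existence test. The sketch below shows the idea. Only the subshell with
cd and ls comes from our script. The names refresh_dir, check_trigger, DIR's
default path, and TRIGGER are made up, and the sketch uses && where our script
uses ; so that ls is skipped if the cd fails.

```shell
#!/bin/sh
# Workaround sketch: list the directory before testing for the file.
# The default path and the file name are placeholders, not our real ones.
DIR=${DIR:-/shared/triggers}
TRIGGER=${TRIGGER:-run.trigger}

# List the directory and throw the output away. In our case this was enough
# for the client to notice the new file. The subshell keeps the caller's
# working directory unchanged.
refresh_dir() {
    ( cd "$DIR" && ls > /dev/null 2>&1 )
}

# Returns 0 if the trigger file is visible after a refresh, 1 otherwise.
check_trigger() {
    refresh_dir || return 1
    [ -f "$DIR/$TRIGGER" ]
}
```

The cost is one extra directory listing per poll, which is negligible for a
directory that only ever holds a handful of files.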

Impact
This is fairly high priority for us because we are behind in our patch
schedule and, since we use a lot of NFS in our environment, we need to make
sure we know exactly what caused the problem so we do not propagate it to
our higher environments.

Troubleshooting to date
My theory on why the problem started is related to patch 116959-17(*1). This
patch is included in the bundle we downloaded in late August
(Generic_117350-47 is our kernel revision). The patch notes for that patch
indicate that revision -14 fixed a bug (6342430) described as "NFS client
doesn't notice the change of the file in the server because of fix for
bug#4407669."(*2)
A showrev -p shows that we had -13 of that patch before the bundle was
applied.
Patch: 116959-09 Obsoletes: 108727-26 Requires: 108528-29 Incompatibles:
Packages: SUNWcarx, SUNWcsr, SUNWhea
Patch: 116959-13 Obsoletes: 108727-26 Requires: 108528-29 Incompatibles:
Packages: SUNWcarx, SUNWcsr, SUNWhea
Patch: 116959-17 Obsoletes: 108727-26 Requires: 108528-29 Incompatibles:
Packages: SUNWcarx, SUNWcsr, SUNWhea
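
To pull just the installed revisions of one patch out of showrev -p output,
something like the following can be used. It is a sketch: patch_revs is a
made-up helper name, and it assumes each entry starts with
"Patch: <id>-<rev>" as in the listing above.

```shell
#!/bin/sh
# Print the installed revisions of one patch ID, one per line, lowest first.
# Reads showrev -p style output on stdin. patch_revs is a hypothetical helper.
patch_revs() {
    id=$1
    # Keep only lines that start an entry for this patch ID, then strip
    # everything except the revision number after the dash.
    grep "^Patch: $id-" |
        sed "s/^Patch: $id-\([0-9]*\).*/\1/" |
        sort -n
}
```

Usage would be along the lines of showrev -p | patch_revs 116959, with
tail -1 appended to get only the highest installed revision.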

We tried to disable the following NFS kernel parameters, but the problem
still persists: nfs_disable_rddir_cache, nfs_lookup_neg_cache,
nfs3_lookup_neg_cache.

We ended up backing out -17 and our process was able to see the trigger file
as intended. We reapplied -13 and are still fine.

We also have a NetApp guy onsite and have engaged him to look into the issue,
for two reasons: (1) in case it is specific to NetApp, and (2) on the theory
that NetApp has better access to the specific engineers at Sun who can work
the issue. So far we have not gotten anything back from them.

I have strongly recommended opening our own ticket with Sun, which should
happen tomorrow.

My question to my fellow Sun Managers

Has anyone else run into this problem? Has anyone been in contact with Sun
about it? Does anyone have the time to do a little peer review and see
if I am correct?

I can't check my Sun Managers mail from the office so if you do reply please
reply to mdesimone *at* gmail <dot> com.

Thank you and you can expect a Summary.
 
-michael


