Is it a UFS problem?

From: Grzegorz Bakalarski (G.Bakalarski@icm.edu.pl)
Date: Tue Sep 12 2006 - 09:28:36 EDT


Dear All,

I have a problem that has left me at a loss ...
I have an application (from an external vendor): a
database UI and admin utilities.
Each week I run an update using the admin utilities
(Perl scripts) to add data to the existing indexes.
The application ran excellently for weeks ...
But it suddenly started to fail during the update process.
The error message says that the "run" (i.e. update) file
is not synchronized (i.e. the expected, calculated
offset of the pointer in the "run" file differs
from the real offset of the pointer in the run file as returned
by the C function ftell).
The very, very, very strange thing is that the error
message appears at random. Starting from the same
backup copy (ufsdump/ufsrestore), the update can either proceed
without error or crash ...
E.g. one time I can load the data for weeks 21, 22, 23 (one week
at a time) and another time I can load the data for weeks 21 and 22
but fail to load the data for week 23 ...
I contacted the vendor's support team and they claim the problem
may be a disk hardware error or UFS filesystem corruption ...
While I can agree with them in principle, the point is that I tried to
run the updates on different devices (Sun StorEdge 3310 SCSI array,
NetApp FAS 3050 FC or SATA arrays - two different FC-connected NetApp
arrays), starting from a new UFS filesystem ... And starting from the
same backup copy led to different results: the same input data & the
same application = different results ...
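
For readers unfamiliar with this kind of check, the "not synchronized"
test described above can be sketched in C roughly as follows. This is
only a minimal illustration, not the vendor's code: the function names
append_and_check and write_run_file, the record contents and the file
layout are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append one record and keep our own running
 * ("expected") offset. Returns 0 if ftell() agrees with that
 * bookkeeping, -1 on a short write or a mismatch. */
static int append_and_check(FILE *fp, const void *rec, size_t len,
                            long *expected)
{
    assert(fp != NULL && rec != NULL && expected != NULL);

    if (fwrite(rec, 1, len, fp) != len)
        return -1;                      /* short write */

    *expected += (long)len;             /* offset we calculate ourselves */

    long actual = ftell(fp);            /* offset the C library reports */
    if (actual < 0 || actual != *expected) {
        fprintf(stderr,
                "run file not synchronized: expected %ld, ftell %ld\n",
                *expected, actual);
        return -1;
    }
    return 0;
}

/* Hypothetical driver: write `count` copies of a fixed record to `path`,
 * checking the offset after each one. Returns the number of records
 * written in sync, or -1 if the file cannot be opened. */
static int write_run_file(const char *path, int count)
{
    static const char rec[] = "week-record\n";
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    long expected = 0;
    int ok = 0;
    for (int i = 0; i < count; i++) {
        if (append_and_check(fp, rec, strlen(rec), &expected) != 0)
            break;
        ok++;
    }
    fclose(fp);
    return ok;
}
```

On a healthy system write_run_file("run.tmp", n) should return n every
time; a return value that varies between runs on identical input would
be the kind of random failure described above.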

Now I have started thinking about a serious UFS bug ...
The database update process is a very disk-intensive task.
Maybe not as extreme as in big Oracle installations, however
I can see transfer rates of about 120-160 MBytes/s, which is a lot
for my V440 server ... So with the default mount options there may be
a lot of cached/dirty data in memory ...
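
One way to narrow down whether cached/dirty data plays a role is to
force the data out explicitly and read it back. The sketch below is an
assumption about how such a test could look, not something taken from
the application: write_sync_verify is a hypothetical name, and it uses
only the standard fflush and POSIX fsync calls.

```c
#define _POSIX_C_SOURCE 200112L        /* for fileno() and fsync() */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical test helper: write `len` bytes to `path`, push them
 * through the stdio buffer (fflush) and the OS cache (fsync), then
 * reopen the file and compare it byte by byte with the original.
 * Returns 0 if the file matches, -1 on any error or mismatch. */
static int write_sync_verify(const char *path, const char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    int rc = -1;
    if (fwrite(buf, 1, len, fp) == len &&
        fflush(fp) == 0 &&              /* stdio buffer -> kernel */
        fsync(fileno(fp)) == 0)         /* kernel cache -> device */
        rc = 0;
    if (fclose(fp) != 0)
        rc = -1;
    if (rc != 0)
        return -1;

    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    for (size_t i = 0; i < len; i++) {
        int c = fgetc(fp);
        if (c == EOF || (unsigned char)c != (unsigned char)buf[i]) {
            rc = -1;                    /* file is short or differs */
            break;
        }
    }
    if (rc == 0 && fgetc(fp) != EOF)
        rc = -1;                        /* unexpected trailing data */
    fclose(fp);
    return rc;
}
```

Note that the read-back may still be served from the page cache, so a
match does not prove what is on disk; it only rules out the simplest
buffering problems.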

What is also strange is that no changes to the server have been
made since May 4th ...
The only application installed during that time was QLogic SANsurfer
(we have a QLogic 2342 FC HBA in this server).
During May & June all was fine.
The first error appeared in the last week of June (and was
manually corrected by support), and the second error appeared
in the last week of July, and since then I could not advance the
updates anymore (however I could load previous data on backup copies).

Currently I'm testing the following mount options:
-o forcedirectio,nologging,noatime and I have successfully
loaded 3 weeks of data.
The vendor uses VxFS on its production servers (I do not have a license).
Has any of you heard of a similar problem or seen a bug report or
workaround?

My system is:

Sun Fire V440,
SunOS 5.9 Generic_118558-26 sun4u sparc SUNW,Sun-Fire-V440
8GB memory
4x 1.062GHz UltraSPARC IIIi processors
I've done a full hardware test during start-up
I've run extensive memory and processor tests from SunVTS ...
Not a single sign of problems

Please help,

GB
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers