SUMMARY: Troubleshooting NTP time synchronization

From: Chris Los (clos@trentu.ca)
Date: Fri May 02 2003 - 14:49:28 EDT


Thanks to all who replied.

Here are the replies I received:

From: "Medaglia, Chris" <cmedagli@Lifelinesys.com>
To: 'Chris Los' <clos@trentu.ca>
Date: 4/30/03 1:47PM
Subject: RE: Troubleshooting NTP time synchronization

Do you use ntpq to check whether they are actually syncing to the time
servers?
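
A minimal sketch of that check, assuming a POSIX-compatible shell and awk,
and assuming the usual peer listing printed by "ntpq -p": two header lines,
then one line per peer, with a leading '*' marking the peer the daemon has
selected for synchronization and the offset in milliseconds in the ninth
column. The function names are made up for illustration.

```shell
#!/bin/sh
# has_sync_peer: read "ntpq -p" output on stdin; succeed (exit 0) only if
# some peer line starts with '*', i.e. the daemon has selected a sync peer.
has_sync_peer() {
    awk 'NR > 2 && /^\*/ { found = 1 } END { exit !found }'
}

# sync_peer_offset: print "peer offset_ms" for the selected peer, if any.
# The tally character is glued to the peer name, so strip it from field 1.
sync_peer_offset() {
    awk 'NR > 2 && /^\*/ { sub(/^\*/, "", $1); print $1, $9 }'
}

# Usage on a live host (assumes ntpq is on PATH and the daemon is running):
#   ntpq -p | has_sync_peer || echo "no sync peer selected"
#   ntpq -p | sync_peer_offset
```

A host that never shows a '*' line is not synchronized to either server,
which would be worth ruling out before looking at anything else.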

From: Mike Iglesias <iglesias@draco.acs.uci.edu>
To: Chris Los <clos@trentu.ca>
Date: 4/30/03 2:02PM
Subject: Re: Troubleshooting NTP time synchronization

We've never seen problems with ntp on our 4.0x systems, but there are
problems with RH 7.2/7.3 with the newer kernels. RH changed the
HZ kernel parameter from 100 to 512, and the newer kernels cannot
keep time worth a damn. A bug report was filed with RH, although
I cannot remember the bug number right now. As far as I know they
never did anything about it.

I ended up rebuilding the RH 7.2 kernel with HZ set to 100 and that
fixed my problems.

Mike

From: "Deiss, Mark" <Mark.Deiss@acs-inc.com>
To: 'Chris Los' <clos@trentu.ca>
Date: 5/1/03 5:14AM
Subject: RE: Troubleshooting NTP time synchronization

4.0F... ack... no longer an actively supported version. Point patches are
hit-and-miss and there are no more cluster patches from Compaq. Hope you are
running at least kit 7. You will also want to keep an eye on patches in the
5.1 tree to see what they are fixing. Security corrections in the supported
versions are not always made available for the older Tru64 versions, so you
may have to come up with your own corrections.

I am going to guess that you may have some servers that are treating the 4
minute discrepancy as being too long to sync up against their stratum
resource (even though Tru64's default is supposed to be 16 minutes before
xntp coughs). The ntp client daemon will not permit time synchronization if
the time differential is too large. In cases like this, you may need to
force the ntp sync by temporarily turning off your ntp/xntp daemon and
manually running something like:

/usr/bin/ntp -s -f reference_clock_hostname

Once you have the errant system forced to the correct time, turn your
ntp/xntp daemon back on. Then you can run a job every N minutes off your
cron scheduler that reports to a log file how badly things are drifting away
from your time standard:

00 * * * * /usr/bin/ntp -v reference_clock_hostname >> /some_log_file
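
If the report itself carries no timestamp, the hourly entries are hard to
tell apart later. Below is a small sketch of a filter that prefixes each line
with the current UTC time, assuming a POSIX-compatible shell and awk.
stamp_lines is a hypothetical name, and nothing here depends on what the
report actually prints.

```shell
#!/bin/sh
# stamp_lines: copy stdin to stdout, prefixing every line with a single
# UTC timestamp taken when the function starts.
stamp_lines() {
    ts=`date -u '+%Y-%m-%dT%H:%M:%SZ'`
    awk -v ts="$ts" '{ print ts, $0 }'
}
```

Saved as a script that ends by calling stamp_lines, it could sit between the
report command and the ">>" redirection in the cron entry. Keeping the date
format inside a script also avoids having to escape '%' characters, which
cron treats specially in the command field.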

I cannot stress enough the importance of installing the Compaq point patches.
We had a recent situation with a bunch of 5.1A servers running at kit 3 with
some point patches. One box was consistently running its scheduler jobs off
by one hour. A lot of blank staring at the scheduler and a lot of "what the
....". We installed kit 4 along with point patches, and the problem has gone
away.

As far as overall monitoring goes, you may want to check
http://ftp.deadcat.net for ntp-related checks. These widgets are designed to
work with the Big Brother package (www.bb4.com), which you may also want to
consider. Even if you do not want to pursue the Big Brother package, these
widgets tend to support multiple vendor platforms and would provide a
template for you to create your own variants. The widgets are predominantly
shell scripts and sometimes Perl procedures. Probably the best part of the
Big Brother package is the support list community, which is geared
specifically to system monitoring issues.
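
For a home-grown variant of such a check, a minimal sketch follows, assuming
a POSIX-compatible shell and awk. flag_drift is a hypothetical helper that
reads "host offset_ms" pairs and prints the hosts whose absolute offset
exceeds a limit given in milliseconds. The commented loop assumes each host's
daemon answers remote ntpq queries, which its access restrictions may not
allow, and host1/host2 are placeholders.

```shell
#!/bin/sh
# flag_drift LIMIT_MS: read "host offset_ms" lines on stdin and print the
# ones whose absolute offset is larger than LIMIT_MS.
flag_drift() {
    awk -v limit="$1" '{
        off = $2 + 0                 # force numeric interpretation
        if (off < 0) off = -off      # absolute value
        if (off > limit) print $1, $2
    }'
}

# Hypothetical collection loop, run from one management host:
#   for h in host1 host2; do
#       ntpq -p "$h" | awk -v h="$h" 'NR > 2 && /^\*/ { print h, $9 }'
#   done | flag_drift 1000
```

With a limit of 1000, a box that is 4 minutes (240000 ms) out would be
printed, while one that is a few milliseconds out would not. A host with no
selected peer produces no line at all in the loop, so it would need a
separate check.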

Here's my original question:

We're seeing time discrepancies of up to 4 minutes in a few of our Unix
boxes and are trying to troubleshoot the cause. All our Unix servers get
their time from the same 2 local primary NTP servers. I'm wondering about
the best way to go about troubleshooting the discrepancy and also if there
are any free tools out there that might assist in this exercise and
ongoing monitoring of time for all our servers from a single point of
management. We are currently running a mix of Tru64 v4.0F and RH Linux 7.x.

TIA,
clos@trentu.ca


