SEMI-SUMMARY: System running out of memory periodically, can't find "missing memory"

From: Reed, Judith (JReed@NaviSite.com)
Date: Wed Feb 26 2003 - 14:36:35 EST


Thanks to all the people who replied. I described a problem with an ES40
(8GB memory/8GB swap/4 CPUs) where the system periodically ran out of
memory for periods of about 5 minutes, even though only about 4GB could
be easily accounted for. Most pointed to the UBC as the probable
location of the "missing memory" and gave me good info on how to look
at things and what might be going on. I have been watching more closely
and understand what's happening, but still can't really pin down where
the memory is being used.

I've been using "collect -sm" and "vmstat -P" to watch usage.
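
In case it's useful to anyone else, the snapshot loop I've been running
looks roughly like this (a sketch; the 30-second interval and the log
file names are arbitrary, and collect just runs alongside writing to
its own log):

        # sketch: periodic snapshots of physical memory and swap use
        while :; do
            date                >> memwatch.log
            vmstat -P           >> memwatch.log   # physical memory view
            /usr/sbin/swapon -s >> memwatch.log   # swap utilization
            echo "----"         >> memwatch.log
            sleep 30
        done &

        # collect samples memory/UBC on its own schedule
        /usr/sbin/collect -sm > collect-mem.log &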

"collect" shows UBC remaining nearly constant, HIT/PP/ALL always 0.
During a cycle of memory depletion:
        1. Active memory quickly rises while Free memory goes to 0, and
           Inactive stays constant. Active + Inactive + Wired + UBC =
           approx. 8GB.
        2. After some period, Active drops, Inactive goes up high, and
           Free stays at 0. It remains this way for an extended period.
        3. Next, Inactive drops, Free briefly goes high (by the amount
           removed from Inactive), then Active quickly goes high, with
           memory apparently pulled in from Free (which goes down again).
        4. Lastly, Active drops, Inactive stays down, Free goes up, and
           normal operation resumes.
During this whole period, the UBC value never rises, and "swapon -s"
shows swap utilization never changes. This seems to say that UBC is not
a factor. However, watching the processes and their utilization with
"top" never shows our big app (Java code) going above about 4GB RSS. I
talked with a developer, and he observed that when memory goes way down
the Java code is working on freeing up unneeded memory - the low memory
periods roughly coincide with this.

I'm guessing that when Java is working on freeing up unused memory it
somehow never actually "owns" the memory it is recovering? That's the
only explanation I can see for why its RSS never goes beyond 4GB, even
though Active plus Inactive during this period is almost 8GB and UBC is
unchanging. Does this make sense? Regardless, the fact that the system
stays in these low-memory states (no Free, low Active, high Inactive,
system very unresponsive) for about 2 minutes is very problematic - why
would it do this?
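
One thing I plan to try next (a sketch only - I haven't confirmed the
option on our particular JVM) is timestamping the JVM's GC output so it
can be lined up against the memory snapshots above; "-verbose:gc" is a
standard flag on most JVMs of this vintage, and "ourapp.jar" below is
just a placeholder for our real launch command:

        # prefix each GC log line with a timestamp so it can be
        # correlated with the memwatch.log snapshots
        java -verbose:gc -jar ourapp.jar 2>&1 | while read line; do
            echo "`date '+%H:%M:%S'` $line"
        done > gc-times.log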

Parameters are very generous:
        max_per_proc_address_space = nearly 10GB
        max_per_proc_data_size = about 4GB
        per_proc_data_size = about 1GB
        ubc_minpercent = 10 (so it could use 800MB; never exceeds 400)
        ubc_maxpercent = 100
        ubc_borrow_percent = 20
Nothing unusual elsewhere. System basically runs one large Java app
(typically around 4GB memory use) and a few small processes that consume
little memory or cpu.
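
For completeness, the values above can be double-checked against the
running kernel with sysconfig. This is just a sketch - the exact
attribute names and which subsystem they live in can vary a bit by
release, hence the grep:

        # dump the live vm and proc subsystem attributes and pick out
        # the UBC and per-process limits from the output
        sysconfig -q vm   | grep -i ubc
        sysconfig -q proc | grep -i per_proc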

Insights? Comments?

Here is a brief summary of relevant replies:
Tom Smith:
Active, inactive, wired, and free should always add up to 100% of
physical memory. Adding up RSS is not a very reliable measure. It does
not include the memory used by UBC, which is usually a substantial
fraction of total memory. That memory is accounted for in "active"
memory and certain AdvFS access structures are accounted for in "wired"
memory.

Dr. Tom Blinn:
"vmstat -P" (the "physical" memory view) will help you see just where
memory is being used... the UBC (unified buffer cache) can grow to the
point where it's competing with active processes for memory.

Jasper Frank Nemholt (lots of good UBC explanations, but his last
example is much like what I'm seeing):
The missing memory is likely allocated as UBC (Unified Buffer Cache). If
you run "/usr/sbin/collect -sm" you will see how much is UBC.
UBC is dynamic and the kernel will adjust it according to a few rules
set in the vm subsystem in sysconfigtab. Normally, the kernel will try
to cache as much as possible, or as much as it thinks it will gain
from. This often leads to systems running with very little free memory,
which isn't necessarily a bad thing. If some process needs more memory,
the kernel will release some memory from the UBC rather than swapping
out. This is probably what you see... when it runs close to 0% free and
some process needs more memory, it releases a chunk of UBC and suddenly
you have free memory again.

Normally the kernel takes care of the UBC just fine, but in some cases
you may want to force some "laws" upon the UBC behavior. This is
especially true if you run Oracle or some other database system on the
machine, and especially if that database isn't using Direct I/O or raw
devices (and thus doesn't bypass the UBC). Databases normally have
their own "UBC"-style cache, such as Oracle's SGA/PGA, and an OS-based
cache won't help anything... in fact it'll often lead to worse
performance. Normally, in such situations it's a good idea to decrease
ubc_maxpercent from the default 100 to something lower to keep the UBC
from getting too big. Somewhere between 50 and 80 is fine. Don't make
it too small, as normal Unix I/O outside the database usually benefits
from the UBC cache.
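
A minimal sysconfigtab fragment along these lines (my sketch, not
something from his reply - 70 is just an example within the 50-80 range
he suggests, and it only takes effect at the next reboot unless also
applied with sysconfig -r) would be:

        # /etc/sysconfigtab fragment (example value only)
        vm:
                ubc_maxpercent = 70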

Secondly, Oracle or other databases may perform badly during backups,
creation of datafiles, or massive inserts, where there is heavy
sequential I/O activity. To keep this activity from being cached in the
UBC (where it isn't useful) you can lower vm_ubcseqpercent to 5-10 and
vm_ubcseqstartpercent to 10-20. This way, the kernel will more or less
bypass the UBC when it detects that a massive sequential operation is
taking place.
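
To experiment with those two attributes without a reboot, something
like this should work (again my sketch - sysconfig -r reconfigures a
subsystem at run time, but verify the attribute names and that they are
run-time settable with "sysconfig -q vm" first):

        # apply the sequential-I/O thresholds to the running kernel
        # (example values from the suggested ranges)
        sysconfig -r vm vm_ubcseqpercent=10 vm_ubcseqstartpercent=20
        # confirm what the kernel is now using
        sysconfig -q vm | grep -i ubcseq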

Another thing that could be what you're seeing is when a system has
lots of active memory. At the moment the system reaches a state where
it runs out of free memory, can't lower the UBC, and thus normally
would swap, it will instead reclassify active memory as inactive. I've
seen this in a few cases where there are lots of users, each using a
fairly big chunk of memory. [on this kind of system]... the interesting
thing is that when the machine hits 0% free and can't lower the UBC, it
(according to Collect) converts several GBs from active to inactive
memory in a few seconds and ends up with free memory again. As you can
see in the graph, the UBC on this machine is untouched in this case, as
it's always at its minimum (ubc_minpercent).

TIA!
Judith Reed


