SUMMARY: File I/O performance question

From: fred@wiley.gsm.ucdavis.edu
Date: Fri May 17 2002 - 13:35:41 EDT


Thanks for all the suggestions; there isn't really one answer that
fits all such situations, but I learned a lot about
AdvFS performance and tuning.

I ended up revising my algorithm as suggested by several writers and
created 1000 files, each holding a subset of the data; this took only 36
hours. By keeping all of the files open for append I avoided the repeated
open/close operations. The files range from 60 to 200 MB and are easily
sorted by the built-in sort utility (this takes about 48 hours running
two sorts on different CPUs simultaneously). All of the processes involved
run close to 100% CPU, so you couldn't ask for much more.
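
For the record, the bucketing loop boiled down to something like the
sketch below (the buckets/ directory, the tab-delimited record layout
and the key-to-bucket hash are simplifications of what I actually ran):

    #!/usr/bin/perl -w
    use strict;

    # Open all 1000 bucket files once and keep the handles for appending,
    # so each record costs one print instead of an open/write/close.
    # (The buckets/ directory is assumed to exist already.)
    my $NBUCKETS = 1000;
    my @bucket;
    for my $i (0 .. $NBUCKETS - 1) {
        open($bucket[$i], '>>', sprintf("buckets/%04d.dat", $i))
            or die "bucket $i: $!";
    }

    while (my $line = <>) {                 # daily files given on the command line
        my ($key) = split /\t/, $line;      # assumes tab-delimited, key first
        my $i = unpack("%32C*", $key) % $NBUCKETS;   # cheap, stable hash of the key
        my $fh = $bucket[$i];
        print $fh $line or die "write: $!";
    }

    close $_ for @bucket;

The per-process open-file limit obviously has to be higher than 1000
for this to work.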

I know RAID-5 isn't really appropriate for creating lots of files, as
more than one person mentioned, but, again, this is a one-time
operation.

Thanks; I lost a few of the names, but all of the answers are below.

My original question:

I have an 85 GB data set in 1400 (daily) files. The individual files are
sorted by the primary key. I need to create either a single sorted
file or, more likely, sorted individual files (there are about 4 million
distinct keys). Creating the individual files is done by reading each daily
file and appending each record to the appropriate file (done with Perl).
The number of distinct keys per daily file runs in the range of 700,000
to 1.5 million. This, of course, leads to a large number of open/close
file operations.
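
In outline, the inner loop looks something like this (key extraction and
file naming simplified):

    # One open/append/close per record -- this is where the time goes.
    while (my $line = <DAILY>) {
        my ($key) = split /\t/, $line;      # assumes tab-delimited, key first
        open(my $out, '>>', "keys/$key") or die "keys/$key: $!";
        print $out $line;
        close $out;                         # repeated 700,000+ times per daily file
    }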

In searching the archives I confirmed that it wouldn't be very effective
to put all 4 million files in a single directory. I tested breaking
them up into both 100 and 1000 different directories; using 1000
directories increased the run time by about 30% over using 100.
However, even using 100 directories isn't very effective; right now the
estimated run time is on the order of 5+ weeks. I suspect this is due to
the 700,000+ open/write/close file operations per daily file, as I can
read through the entire data set in a day or so.

I would appreciate any suggestions on a better way to deal with the
creation/updating of 4 million files :-)

This is, luckily, a one-time operation.

The disk is a 500 GB RAID-5 set, using AdvFS.

===============================================================

From: alan@nabeth.cxo.cpqcorp.net

        Suggestion #1, useless as it may be... Don't try to create/update
        4 million files. Find an algorithm that will get the job done
        without having to deal with lots of files.

        If you're stuck with lots of files take a careful look at the
        I/O load and see if the RAID-5 is really appropriate for it.
        Random, relatively small writes to a RAID-5 can take a lot
        more I/O than other methods of getting redundancy. If the
        particular RAID subsystem supports a write-back cache, see
        if you're getting good use of it. A sufficiently heavy load
        will eventually saturate the cache and it won't be very
        useful.

        For large numbers of files in a single directory, AdvFS
        is probably more suitable. If you spread those files out
        over multiple directories in some useful organization,
        UFS may have some advantages. A large namei cache can
        help in either case. A domain created on V5 has advantages
        over domains created on older versions when handling
        large numbers of files.

        If the load is dominated by operations that require log
        writes, move the log elsewhere. I read recently of someone
        who created a multi-volume domain, moved the log to one of the
        volumes, and then filled the rest of that volume with a file
        that wasn't going to be used. This ensured that about the
        only I/O on that volume was for the log. A small, very
        fast device (mirrored for protection) may help keep the log
        from becoming a bottleneck.

        On the "don't use 4 million files" side of things... How big
        is the key? How big is the record? In the right application
        a simple relative file can be very fast. If the primary key
        is unique, then it may be possible to use it to generate an
        offset into a single large file. Even if such a file were
        larger than the array, if the data is sufficiently sparse,
        you won't have blocks allocated for all the data and it may
        fit.
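
        [A rough Perl sketch of the relative-file idea -- the fixed
        record length, the file name and the numeric key index below
        are my assumptions, not the poster's:]

            use Fcntl qw(SEEK_SET);

            # Write fixed-length records at offset key_index * RECLEN.
            # Slots that are never written stay sparse, so no disk
            # blocks are allocated for them.
            my $RECLEN = 64;                        # assumed record size
            open(my $rel, '+>', 'relative.dat') or die "relative.dat: $!";

            sub put_record {
                my ($key_index, $record) = @_;      # index derived from the key
                sysseek($rel, $key_index * $RECLEN, SEEK_SET) or die "seek: $!";
                syswrite($rel, pack("A$RECLEN", $record)) == $RECLEN
                    or die "write: $!";
            }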

        Finally, there's the ultimate enemy... Time. A better
        solution can always be found, given enough time. If the
        brute force approach takes 5+ weeks and the total time
        to find a better solution, implement and run it, takes that
        or longer, then the brute force approach wasn't that slow.
        If a better solution can be found and run in a few days
        or even hours, it is obviously time well spent.

From: "Degerness, Mandell MSER:EX" <Mandell.Degerness@gems2.gov.bc.ca>

The guys at Syncsort (www.syncsort.com) are always asking me to try their
products.

You might want to ask if you can demo something from them.

That said, you should open as many source files as you can (up to the
open-file limit, allowing for the result file(s)) and keep them open while
you write out your result file(s) (a merge sort). This minimizes the
open/close cycles.
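
[A bare-bones k-way merge along those lines might look like the sketch
below; tab-delimited records with the key first are an assumption, and a
real run would open as many inputs as the open-file limit allows:]

    # merge.pl -- merge any number of key-sorted files into one sorted stream.
    # Usage (hypothetical file names):  perl merge.pl chunk*.sorted > all.sorted
    my (@fh, @head);
    for my $i (0 .. $#ARGV) {
        open($fh[$i], '<', $ARGV[$i]) or die "$ARGV[$i]: $!";
        $head[$i] = readline($fh[$i]);
    }

    while (1) {
        my ($min_i, $min_key);
        for my $i (0 .. $#ARGV) {               # linear scan for the smallest key
            next unless defined $head[$i];
            my ($k) = split /\t/, $head[$i];
            if (!defined $min_key or $k lt $min_key) {
                ($min_i, $min_key) = ($i, $k);
            }
        }
        last unless defined $min_i;
        print $head[$min_i];                    # emit the winning record
        $head[$min_i] = readline($fh[$min_i]);  # and advance that input
    }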

Another writer:

Well, I wouldn't use RAID-5 for this - use disk striping for
performance (I assume you could get back your data in the event
of a disk failure); the performance hit with RAID-5 is just too great
for this. As for Perl, it might go faster as a compiled program, e.g.
in C (you might even be able to use a Perl-to-C converter on the
source).

Another writer:

This may not be cost-effective, but you'll get better performance if
you spread the directories out among multiple disk volumes, with each
volume on a separate SCSI bus.

From: alan@nabeth.cxo.cpqcorp.net

        The observations in this particular message are quite a
        few years old and may not be entirely applicable to
        modern versions of AdvFS or UFS. They almost certainly
        do not apply to the new on-disk structure used by AdvFS
        in Tru64 UNIX V5.

        Some years ago, I experimented with how long it took to
        create 1,000,000 files. I used UFS and AdvFS. For most
        of the experiments the files were in a single directory.
        UFS was I/O bound, having to scan the directory over and
        over. I'd have thought caching would take care of some
        of that, but it didn't seem to, or the writes, which had
        to be synchronous, were eating me alive. If I spread the
        files out over more directories, each individual directory
        became small enough that UFS didn't have a problem. This
        particular arrangement is common on news servers.

        When I tested on AdvFS, it often didn't matter whether the
        files were all in one directory or spread over many.
        As it processed more and more files it became CPU
        bound. What I've learned since then suggests it may
        have been related to the size of the namei cache. If
        the cache is small, then the lookup path is longer and
        probably more CPU intensive. If AdvFS caches the
        metadata better than UFS, then this makes some sense.

        The tuning guide may have suggestions on how to change
        the size of the namei cache for interesting file system
        loads. If you don't have a paper copy, there are HTML
        and PDF copies on the documentation CD-ROM.

        p.s. Lots of customers are still running V4 as well as V5.
        Some are still running V3. Please provide the operating
        system version when asking a question. Sometimes answers
        are very version dependent.

From: James Sainsbury <J.Sainsbury@chem.usyd.edu.au>

If you had a database on your system I would consider dumping the data
into that - databases are good at this sort of manipulation.

But as I understand it, you have 1400 files, each of which contains
<key,value> records sorted by key. I am not clear whether the key is
unique within each file (i.e. is really a primary key), and you wish
to aggregate all the <value>s for a particular key into one file.

One approach I would try is to open a large number of the input files
(say 1000), read one record from each, and then open the key file named
in each input record and append the record to the correct key file.
The sorted input means you will know when you can close a key file:
when all the keys at the head of the input files are lexically greater
than the key of an (output) key file, you can close that key file.
Note that at any time you have no more than 2 x the number of input
files open. Repeat 14 times.
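
[A small Perl sketch of that close-when-safe rule; the hash of output
handles and the file naming are made up for illustration:]

    # %out holds open output handles, one per key seen so far.
    my %out;

    sub append_record {
        my ($key, $line) = @_;
        unless ($out{$key}) {
            open($out{$key}, '>>', "keys/$key") or die "keys/$key: $!";
        }
        my $fh = $out{$key};
        print $fh $line;
    }

    # Because every input is sorted, once the smallest key still waiting
    # at the head of any input file is greater than an output key, that
    # output file will never be written again and can be closed.
    sub close_finished {
        my ($low) = @_;                     # smallest key currently unread
        for my $key (keys %out) {
            next unless $key lt $low;
            close $out{$key};
            delete $out{$key};
        }
    }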

This reduces the number of opens and closes, but the number of reads and
writes will be similar. Another approach is to partition your keys into
classes and process each record based on its class; i.e. if you split your
4M keys into 4000 key classes, you could process your 1400 files into
class files corresponding to the 4000 classes and then process each class
file into the final per-key files. (This could be multilevel, e.g. split
all keys into 2 classes, process, then split each of those 2 classes into
a further 2 classes and process, etc.)

From: "Colin Bull" <c.bull@videonetworks.com>

This is of course an application for a proper database, but failing that,
why not use a combination of sort and/or grep to pre-sort the data into
intermediate files?

Awk allows associative arrays, and I would be very surprised if Perl did
not also give you this facility. In "The AWK Programming Language" by
Aho, Weinberger & Kernighan (Addison-Wesley), they describe a
pseudo-database built with awk, which might be worth reading if you can
get to see it.
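
[Perl does indeed have the same facility -- hashes. A minimal example of
grouping values under their key, practical only when the grouped data
fits in memory:]

    my %by_key;
    while (my $line = <>) {
        my ($key, $value) = split /\t/, $line;   # assumed tab-delimited input
        push @{ $by_key{$key} }, $value;         # Perl's associative array (hash)
    }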

From: tmark@zk3.dec.com
AdvFS Development

If the files are small, you may find better performance
by turning off AdvFS fragging. AdvFS will store small
files (7k or less) in one monster hidden file we call
"the frags file". This saves space but incurs a performance
penalty. In 5.1A and above you can turn off this fragging
on a per-fileset basis via the chfsets and mkfset utilities.
 


