Summary: Question concerning large number of files and transfer to another directory.

From: Brewer, Edward (BREWERE@OD.NIH.GOV)
Date: Thu Apr 25 2002 - 15:31:32 EDT


Admins,

        Thanks to all of the helpful responses, I copied all of the files
over quickly and effectively.

Most of the responses pointed out that having that many files (282,137) in
one directory is a bad idea. I concur, but I am forced to live with it for
now. Our database points to these files, and a change to the database isn't
going to happen anytime soon. So the task at hand was to copy all of these
files from our production system to development. My first step was to
NFS-mount the production system on the development system. Then I issued the
following commands:

cd <old directory>
(find . -type f -name "[a-dA-D]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[e-hE-H]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[i-lI-L]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[m-rM-R]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[s-wS-W]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[x-zX-Z]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[1-3]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[4-6]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[7-9]*" -print | xargs -i cp -pf {} <new dir>) &
(find . -type f -name "[._]*" -print | xargs -i cp -pf {} <new dir>) &

The prevailing ideas were to use cpio, tar, vdump, or xargs with cp. Comparing
them with the time command, I found the fastest approach was spawning several
cp commands with xargs, as above. It achieved 6 MB/sec of output over the
network and a transfer rate of about 120,000 files/hour.
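
For anyone repeating the comparison, a single one of those background
pipelines can be timed roughly like this (a sketch only; /old-directory and
/new-directory stand in for the real paths):

cd /old-directory
time sh -c 'find . -type f -name "[a-dA-D]*" -print | xargs -i cp -pf {} /new-directory'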

Thanks to all!!! Here are the responses:

The Tuning Guide says NEVER to increase maxusers beyond 2048!
   I don't understand why, but there you have it.

   -- mahendra

IMVHO, this is a design error, regardless of any tuning etc.,
and you should (try to) change to a more hierarchical structure ASAP.

I would try
  cd <old directory> ; find . -print | cpio -pdmav <new directory>
(add more options to the "find" if applicable) and wait / see.

  -- Joerg Bruehe

Use

find . | cpio -pdmvl "dest-dir"

Please check the cpio options (p is pass-through mode, d creates parent
directories, l links files where possible, ...), since it will work even over
an NFS mount.

  -- Jay Nash

Is this something that you'll need to do more than once? Meaning, will the
NFS files change so that you need to recopy them? If so, the rsync program
could be a good choice ( https://rsync.samba.org ). Otherwise, I would use
tar/pax/cpio in the same manner you're using vdump. I don't really know which
is the most efficient.

  -- Paul
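
For the repeat-copy case Paul describes, a minimal rsync run might look like
this (a sketch; the paths and the host name prodhost are placeholders, not
from the original setup):

# first run copies everything; later runs transfer only changed files
rsync -a /old-directory/ /new-directory/

# or pull straight from the production host and skip the NFS mount
rsync -a prodhost:/old-directory/ /new-directory/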

When I want to copy a group of files I use find and cpio.
Find allows me to select the files if I need to and cpio will preserve
timestamps and so on. You might replace find with your own script that
selects just the files you need.

My typical usage is like this:

mkdir /new/directory/area
cd /old/directory
find . -print | cpio -pdumv /new/directory/area

The cpio options are all important, except v (verbose), though use it while
testing at least. Build a sample directory structure to play with so you can
see the effect of the directory you start find in, the directory argument
given to find, and the directory argument passed to cpio. Once you get all
the paths set properly, the command becomes second nature.

Good luck,
   -- Steve Herber
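
A small throwaway test along the lines Steve suggests might look like this
(hypothetical paths; it just shows how the starting directory and the cpio
destination interact):

mkdir -p /tmp/src/sub /tmp/dst
touch /tmp/src/a /tmp/src/sub/b
cd /tmp/src
find . -print | cpio -pdumv /tmp/dst
ls -R /tmp/dst

The listing of /tmp/dst should mirror /tmp/src, with timestamps preserved by
the m option.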

maxusers is typically increased by powers of 2. Our documentation states
that you should never increase maxusers to 4096 except in a direct response
from our Unix engineering.

For #2, I think I'd create a tar file if you have the room, then extract it
where you want it. But your vdump command looks like it would work.

  -- Lavelle, Bryan

> BTW: For now I can not break up the directory into smaller directories.

Why not? It's a bad idea to have that many files in a single directory.
UNIX filesystems are generally very bad at it. You'll find that NFS
doesn't support that many files at all (limit of about 128,000 I think),
and performance will be poor (and I believe the performance degradation
is worse than linear with respect to the number of files). Use some
sort of hashing algorithm to put the files in a hierarchy of smaller
directories. There's a reason netscape does this in your
.netscape/cache directory! It'll cause you a lot less pain in the long
run, I promise you.

Tim.
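
A minimal sketch of the hashing idea Tim describes, assuming the first two
characters of each filename are enough to spread the files (the paths are
placeholders):

cd /old-directory
find . -type f -print | while read f
do
    # bucket each file into a subdirectory named after its first two characters
    name=`basename "$f"`
    sub=`echo "$name" | cut -c1-2`
    mkdir -p /new-directory/"$sub"
    cp -p "$f" /new-directory/"$sub"/
done

The database would then need to know the two-character prefix to locate each
file, which is why this has to wait for a database change.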

If the directory is an entire file system, then the vdump
        example is as good a way as any. As far as I know, vdump
        is limited to backing up file systems as opposed to directory
        trees. If the directory is just part of a file system, then
        you can also use tar and probably cpio. The tar syntax
        is much the same:

                cd /old-directory
                tar cf - . | ( cd /new-directory ; tar xpf - . )

        As for the performance, it depends on whether there are
        250,000 files at one level of a directory or 250,000 files in
        the directory tree. Both UFS and AdvFS prior to the on-
        disk format changes in V5 are sensitive to a large number
        of files in one single directory. UFS is much less sensitive
        to a large number spread over many directories. My limited
        experience testing AdvFS in this environment suggested it
        had a comparable problem whether it was many files in a
        single directory or spread over multiple directories.

        If you're running on V5 and using AdvFS, it may be worth-
        while to create a domain for these files and migrate them
        to the new file system. There were changes made to the
        on-disk structure that apparently better support a lot of
        files in a single directory.

        For UFS there isn't much you can do except spread the
        files out over more directories.

  --alan
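
For reference, creating a new AdvFS domain and fileset on V5, as Alan
suggests, generally follows this pattern (a sketch from memory; the partition,
domain, and fileset names are made up, so check the mkfdmn and mkfset man
pages before using):

mkfdmn /dev/disk/dsk2c files_domain          # create the new domain on a spare partition
mkfset files_domain files_fs                 # create a fileset within the domain
mkdir /newfiles
mount -t advfs files_domain#files_fs /newfiles

The files could then be copied into /newfiles with any of the methods above.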

General remarks:
In most Unixes, directory lookups in such a directory are an absolute
disaster. (There are supposed to be improvements in the 5.x code for
directory structures, so you may survive.)

Your life, and the behaviour of any applications that use this directory,
will be much easier if you organize the files hierarchically; say by a key
based on the first two letters, if that is a sufficient distinguishing
characteristic.

vdump/vrestore should certainly work. The other standard way to make mv or
cp or other utilities behave well with large numbers of matching file names
is to use xargs in conjunction with find.

E.g., to copy all files matching a pattern to a new directory
find . -type f -name 'pattern' -print | xargs -i cp {} new-directory

You might also find cpio useful:
find . -type f -print | cpio -p new-directory
(See example 8 in the man page for cpio for how to copy subdirectories too.)

Good luck!

  -- Oisin McGuinness

Why not simply use "cp -pR /dir1 /dir2"?

Lee: The argument list is too long.

If I left anyone out, please forgive me. Thanks again for all of the
responses.

Lee


