SUMMARY: large data repository sync over WAN

From: Rob Windsor (windsor@warthog.com)
Date: Thu Feb 22 2007 - 12:47:24 EST


I received many responses. Some pointed at tools (which is what I was
looking for, honestly), but most shared a common theme. :)

Original Post:
> We need to sync 10TB of data in small files from one North American
> coast to the other.
>
> Our tentative plans are to sneakernet the data and then use some form of
> sync to catch up the delta.
>
> Aside from bandwidth constraints, we found that rsync quickly craps out
> with large numbers of files.
>
> What tools have you used to do this?

Most popular question:
> Does all 10TB of it change daily?

No. The data comes in two flavors:
* Oracle DBF files (yes, changes daily), less than a TB here
* Small static files: the files themselves don't change, but their count
   keeps increasing
   - side note: These files are about 16 subdirs deep and heavily
     scattered (er.. I mean.. "distributed")

Other common questions/comments:
> You didn't specify how rsync craps out, but I'm guessing

I forget the specifics, but it was basically "out of memory" due to the
number of files and subdirs it has to dig through.

> what version of rsync you're using

2.6.8 (looking at 2.6.9 now to see if it addresses any of the problems
we've had)

> but you can often throw ram at the issue.

Not in this case, unfortunately.

> In addition, you can fire off rsync on a subtree so it has less work to
> do.

That's certainly a consideration. It won't be easy (cf. "about 16
subdirs deep" above).
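
A rough sketch of the per-subtree idea, for anyone who wants to try it.
It only splits at the first directory level, and it assumes the
destination parent directory already exists. The function names
(subtrees, sync_subtrees) are made up for illustration, and I have not
verified that this keeps rsync within memory on a tree like ours.

```shell
#!/bin/sh
# subtrees ROOT
#   Print the name of each immediate subdirectory of ROOT, one per line.
#   Hidden (dot) directories are not matched by the glob.
subtrees() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue   # glob stays literal when ROOT has no subdirs
        basename "$d"
    done
}

# sync_subtrees ROOT DEST
#   Run one rsync per immediate subdirectory, so each run only has to
#   build a file list for that subtree. Files sitting directly in ROOT
#   are NOT copied by this loop.
sync_subtrees() {
    subtrees "$1" | while IFS= read -r name; do
        rsync -a "$1/$name/" "$2/$name/" ||
            echo "rsync failed for subtree: $name" >&2
    done
}
```

Usage would be something like "sync_subtrees /data dest:/data", where
both paths are placeholders.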

Then there were these:
> (Deborah Santomauro) Have you tried "rdist"?
and
> (Anthony D'Atri) rdist 6 from www.magnicomp.com with SSH as the transport works great for managing files.

Holy cow, now that's oldschool love! I'll look into that.

Brad Morrison mentioned:
> I think cpio has a flag to skip files with equal or newer mod dates,

Yeah, we've also considered something like:
    "rsync -av `find . -newer <somefile> -print` dest:/path"
just to limit the volume of files that rsync has to consider.
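
With a large number of changed files, the backtick form above may
exceed the shell's argument-length limit. A hedged alternative is to
feed the list to rsync on stdin with --files-from. The sketch below
assumes the stamp file is given as an absolute path and that no
filename contains a newline; list_newer and sync_newer are
hypothetical helper names.

```shell
#!/bin/sh
# list_newer ROOT STAMP
#   Print regular files under ROOT modified more recently than STAMP,
#   as paths relative to ROOT. STAMP must be an absolute path because
#   find runs after changing into ROOT.
list_newer() {
    ( cd "$1" && find . -type f -newer "$2" -print )
}

# sync_newer ROOT STAMP DEST
#   Send only the listed files. With --files-from the names are taken
#   relative to the source directory argument (ROOT here), and "-" means
#   read the list from standard input.
sync_newer() {
    list_newer "$1" "$2" | rsync -a --files-from=- "$1" "$3"
}
```

For example: "sync_newer /data /var/tmp/last-sync dest:/path", then
touch the stamp file after a successful run. All three paths are
placeholders.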

There was mention of NetApp, zfs, VxFS/VxVM, which aren't options in
this situation. As much as I tried to get to zfs, it wasn't available
at the time we upgraded the DB/file servers to Sol10.

Hutin Bertrand mentioned an app called "aide", which is an Intrusion
Detection tool (think tripwire) that you can use to spot files/subdirs
that have changed. Interesting find.
(http://sourceforge.net/projects/aide)

Gedaliah Wolosh pointed me at http://www.openafs.org
Karl Rossing mentioned http://opensolaris.org/os/project/avs/

AFS is quite an endeavor; we're not quite prepared to go that route.

AVS looks interesting; we might be able to do something with that if
heavy rsync-frobulation doesn't work out.

Thanks all!

Rob++

-- 
Internet: windsor@warthog.com                             __o
Life: Rob@Carrollton.Texas.USA.Earth                    _`\<,_
                                                        (_)/ (_)
"They couldn't hit an elephant at this distance."
   -- Major General John Sedgwick
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers

