SUMMARY: fdupes / how to find dupe files

From: Christian Wessely (christian.wessely@uni-graz.at)
Date: Fri Jun 14 2002 - 02:50:17 EDT


Hi Admin Wizards,

A big thanx to the list - once again, helpful responses within hours (or
even minutes).
The best solution for my problem was (1), but the other responses are
highly appreciated - I learned much from them.

Credits go to (in order of appearance):
James Sainsbury, Joerg Bruehe, Alan@Nabeth, Charles Ballowe

Solutions:
-----------------------------------------------------------------------
1) make fdupes "makeable" (by James Sainsbury):

The reason the compile doesn't proceed is that the header and object files
for getopt_long() do not exist on tru64.
One solution:
        go to your source/compile directory for gnu tar (1.13)
        look in lib for
                getopt.h
                getopt.o
                getopt1.o
        copy these to your fdupes compile directory
        edit the fdupes Makefile
....
#EXPERIMENTAL_RBTREE = -DEXPERIMENTAL_RBTREE
INCLUDES=-I.
LIBES=getopt.o getopt1.o
DEBUG=-O2
CFLAGS=$(INCLUDES) $(LIBES) $(DEBUG)
....
fdupes: fdupes.c md5/md5.c
        $(CC) fdupes.c md5/md5.c $(CFLAGS) -o fdupes -DVERSION=\"$(VERSION)\" $(EXTERNAL_MD5) $(EXPERIMENTAL_RBTREE)
-----------------------------------------------------------------------
2) Shellscript that could do the job (Charles Ballowe):

something like:
find /directory-of-big-storage -type f -exec cksum {} \; > /tmp/output
sort -n /tmp/output > /tmp/output.sorted

and then something to go over and compare the first two fields of
each line with the previous line to determine if the files are likely
to be the same. It's going to take a while and beat on the disk a bit,
but it will get the job done. (and since it's sorted output - you only
have to worry about the lines next to the current line.)
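
The comparison pass over the sorted output could look something like the
awk wrapper below. This is only a sketch: the function name
adjacent_dupes is made up for illustration, and it assumes the input is
cksum output ("checksum size name" per line) that has already been
sorted as above. A match on checksum and size only means the files are
likely the same, so a final cmp is still advisable.

```shell
# Print groups of files whose checksum and size match the previous line.
# Reads sorted "checksum size name" lines (cksum output) on stdin.
adjacent_dupes() {
    awk '
    {
        key = $1 " " $2
        # Everything after the first two fields is the file name
        # (this keeps names that contain spaces intact).
        name = $0
        sub(/^ *[^ ]+ +[^ ]+ +/, "", name)
        if (key == prev_key) {
            if (!in_group) {
                print "possible duplicates (checksum " $1 ", size " $2 "):"
                print "  " prev_name
                in_group = 1
            }
            print "  " name
        } else {
            in_group = 0
        }
        prev_key = key
        prev_name = name
    }'
}
```

It would be run as: adjacent_dupes < /tmp/output.sorted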

-----------------------------------------------------------------------
3) Joerg Bruehe offered a script he has:

I have a tool (shell script) that traverses two trees, "old"
and "new", gets the file names from "old", looks for a file
with identical name in "new", compares them, and (if equal
contents) replaces the file in "new" by a hardlink to that
in "old".
(Mail me if you want that script.)

Once things were even worse: We had changed several names.

At that point I did (unchecked - from memory):
  cd common_tree_root
  find . -type f -print | \
    xargs ls -ldi | sort -k 6,6n -k 1,1n > list
where the "sort" is on the size field (column 6) first, inode
(column 1) second.
From there I proceeded manually ("cmp" on files with same
size but different inodes, possibly followed by "ln -f"),
but this might be a start point for automation.
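
A possible first step towards that automation is sketched below. It is
not Joerg's script: the function name report_identical is hypothetical,
and it assumes that file names contain no whitespace and that
"ls -ldi" prints the inode in column 1 and the size in column 6. It
only compares each file with its immediate neighbour in the sorted
list, so a group of three or more same-size files may need a closer
look by hand. It just reports; nothing is linked.

```shell
# Report files under a tree with equal contents but different inodes.
# Assumptions: no whitespace in file names; "ls -ldi" columns are
# inode (1) ... size (6) ... name (last).
report_identical() {
    find "$1" -type f -print |
        xargs ls -ldi |
        sort -k 6,6n -k 1,1n |
        awk '{ print $1, $6, $NF }' |
        {
            prev_inode= prev_size= prev_name=
            while read -r inode size name; do
                # cmp is only needed for neighbours with the same size
                # that are not already hardlinks to the same inode.
                if [ "$size" = "$prev_size" ] &&
                   [ "$inode" != "$prev_inode" ] &&
                   cmp -s "$prev_name" "$name"; then
                    echo "identical: $prev_name $name"
                fi
                prev_inode=$inode prev_size=$size prev_name=$name
            done
        }
}
```

Called as: report_identical common_tree_root
Once the report looks right, the echo could be replaced by
"ln -f" on the pair.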
-----------------------------------------------------------------------
4) and Alan@Nabeth suggested to:

        1. Get a list of all the regular file names on the
            target file system:
                find /file-system -type f -print > list

        2. Run the checksum program against each file. This
            will give you tuples of checksum, size and file
            name. Files with the same size and checksum may be
            identical.
                xargs < list sum > checksums

            The next substep is a bit more complicated. You want
            to compare the checksums and sizes. You can almost do
            this with sort and uniq, but since the name is part of
            the line everything will be different. A custom awk
            or perl script might do the job.

        3. For files that had the same size (non-zero) and checksum,
            compare them with diff or cmp. The checksum calculation
            with sum(1) is only 16 bits. You can have files with the
            same checksum that are different. cksum(1) uses a 32 bit
            checksum, which should give a better first cut separation
            before having to do detailed compares.
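
The "custom awk" substep and the final compare might be sketched as
follows. The two function names (candidate_pairs, confirm_pairs) are
invented for this example, and the input is assumed to be lines of
"checksum size name" such as cksum prints. Each later file is paired
only with the first file seen for the same checksum and size, which
should be enough for a first cut.

```shell
# Step 2 substep: pair up names that share checksum and size.
# Prints "first-name<TAB>later-name" for every candidate pair.
candidate_pairs() {
    awk '
    $2 > 0 {                          # ignore empty files
        key = $1 SUBSEP $2
        name = $0
        sub(/^ *[^ ]+ +[^ ]+ +/, "", name)
        if (key in first)
            print first[key] "\t" name
        else
            first[key] = name
    }'
}

# Step 3: confirm each candidate pair byte for byte with cmp.
confirm_pairs() {
    tab=$(printf '\t')
    while IFS=$tab read -r a b; do
        if cmp -s "$a" "$b"; then
            echo "identical: $a $b"
        fi
    done
}
```

For example: xargs < list cksum | candidate_pairs | confirm_pairs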

-----------------------------------------------------------------------

YS, CW

------------------------------------------
Dr. Christian Wessely
christian.wessely@uni-graz.at
url: www-theol.uni-graz.at


