Saturday, October 18, 2008

Tarring up lots of duplicate files

Lately I've been building tar archives with lots of directories that have almost exactly the same contents (and it'd be impractical to reduce them to forests of symlinks). I was disappointed at how poorly the tarfiles compress, considering how many copies of the exact same data are there. Both gzip and bzip2 do poorly. (And zip is even worse, since it only compresses one file at a time.)

So I figured I'd give the compression algorithms a little help by feeding in the identical files in groups. This works fine, since tar doesn't care what order the files are added in (even when they're in different subdirectories).

1. I ran find -not -type d on the root directory of what I wanted to tar up, in order to get a list of files. The -not -type d is important: if you tell tar to tar up foo and foo/bar, it'll recurse into foo/ and pick up bar, then pack up foo/bar again, and you'll end up with two copies of foo/bar in your tarball. Weird, eh? (Beware of symlinks and other non-file types when you do things like -not -type d, of course.)

2. I wrote a little perl script to sort the list of files by their basenames (the "baz.txt" in "/foo/bar/baz.txt"), and put the sorted list in /tmp/files. (The whole pipeline is sketched just below.)

3. I ran tar -jcvf foo.tar.bz2 -T /tmp/files
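Putting the three steps together, the whole thing looks roughly like this (a sketch: ./tar-me-up/ stands in for the directory being packed up, and the perl one-liner is just a stand-in for the little sort script rather than the exact script):

# 1. list everything that isn't a directory
find ./tar-me-up/ -not -type d > /tmp/files.unsorted

# 2. sort the list by basename, so identical files end up next to each other
perl -MFile::Basename -e 'print sort { basename($a) cmp basename($b) } <>' \
    /tmp/files.unsorted > /tmp/files

# 3. tar them up in that order
tar -jcvf foo.tar.bz2 -T /tmp/files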

So all the files with the same basename got tarred up together, regardless of what directory they were in, and having the duplicates sit next to each other was enough of a hint for bzip2 that they were very compressible.

How well did it work? Pretty well, at least for my present dataset:

In all I had 6100 files totalling 177MB (so an average of 30k/file), with each file duplicated about 100 times (so, 60 or so unique files). Doing an ordinary tar -jcvf produced a 94MB .tar.bz2 file, so about 2:1 compression. gzip fared almost as well, at 96MB.

After sorting the file list, gzip got down to 83MB, but bzip2 did much better: 17MB! So, it went from 2:1 compression to 10:1, just by sorting the file list.

For comparison, I also built a tarball with essentially one copy of each of the 60 or so unique files, and came up with about a megabyte. So in principle we should have been able to do almost 200:1, but 10:1 is still a big improvement over 2:1. (The gap isn't too surprising: gzip can only look back 32KB, and bzip2 works on blocks of at most 900KB, so copies that don't land close enough together in the stream don't help.)

Of course, this trick only works if the identical files have the same filenames. But here's another trick: instead of sorting the files by base filename, sort by md5sum:

find ./tar-me-up/ -type f -exec md5sum \{\} \; | sort | cut -f3 -d' ' >/tmp/files

It's slower, since it has to read and hash all the files, but now all the identical files will be together regardless of filename.
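Hooked up to tar, the md5sum version looks something like this (same stand-in names as above). One caveat: cut -f3 -d' ' relies on md5sum putting two spaces between the hash and the filename, so a filename with a space in it gets truncated; a sed that strips everything up to those two spaces is a little more forgiving.

# hash every file, sort so identical contents end up adjacent, strip the hashes off
find ./tar-me-up/ -type f -exec md5sum \{\} \; | sort | sed 's/^[^ ]*  //' > /tmp/files

# tar in that order
tar -jcvf foo.tar.bz2 -T /tmp/files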

Now, what'd be really great for me is if tar could run those checksums internally, and simply point to the contents of an earlier file when it encounters a duplicate. That would be pretty simple, but would break tar's streaming nature, and get complicated when it comes to modifying existing tarballs. Really it's the compression algorithm's problem to notice redundancy like that; maybe later I'll play around with bzip2 and see if it can be persuaded to do a better job of noticing patterns.

2 comments:

Anonymous said...

7z already does that sort of scanning, sorting and sometimes deduplicating. If you're not constrained to a "standard" tar.gz/tar.xz/tar.bz2 format, try it.

Unknown said...

I would suggest you try DuplicateFilesDeleter; it can help resolve the duplicate files issue.