This page contains descriptions of, and links to, various scripts that I've written to manage files and directories quickly and easily.
Duplicate File Scanner (fast-dupscan.pl)
This script does exactly what the title says: it scans a directory for duplicate files. The script is optimized for time and reduced disk I/O, at the expense of memory usage. When run on a directory with 264,410 files, it uses about 70MB of RAM.
It works by adding all of the files in the directories passed to it into a hash. From there, it groups files that are the same size and compares the files within each group using MD5. Two files are treated as duplicates when they have the same size and the same MD5 hash, so only files that share a size with at least one other file need to be hashed. This saves a large amount of I/O compared with computing the MD5 hash of every file in the directories.
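The size-then-MD5 approach can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in fast-dupscan.pl; the function names find_duplicates and md5_of are made up for the example.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots):
    """Return a list of duplicate groups; each group is a sorted list of paths."""
    # Pass 1: group every regular file by size (metadata only, no file reads).
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling find_duplicates(["/some/dir"]) returns one list per group of identical files. Files with a unique size are never opened, which is where the I/O saving comes from.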
I have also added a feature that logs the MD5 sums calculated during a run, so that the log can be read in future runs. The logged MD5 sum for a given file is only 'trusted' if the mtime and size in the log match those of the file at the time of scanning. This way, routine scans of a large directory run much faster. On one tree with many files totalling over 750GB, the script's run time went from over 4 hours to under 20 minutes.
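The 'trust' rule for logged sums might look something like the following Python sketch. The log is modelled here as a plain dictionary keyed by path; the real log format used by fast-dupscan.pl is not described on this page, so the dictionary layout and the name cached_md5 are assumptions made for illustration.

```python
import hashlib
import os


def cached_md5(path, log, chunk_size=1 << 20):
    """Return the MD5 of path, reusing the logged sum if size and mtime match.

    `log` maps a path to {"size": ..., "mtime": ..., "md5": ...}
    (a hypothetical layout, not the script's own log format).
    """
    info = os.stat(path)
    entry = log.get(path)
    if (entry is not None
            and entry["size"] == info.st_size
            and entry["mtime"] == info.st_mtime_ns):
        return entry["md5"]  # trusted: the file looks unchanged, so skip reading it

    # Missing or stale entry: hash the file and refresh the log.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    md5 = digest.hexdigest()
    log[path] = {"size": info.st_size, "mtime": info.st_mtime_ns, "md5": md5}
    return md5
```

A file that has been edited gets a new mtime (and usually a new size), so its stale entry is ignored and replaced.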
Another feature is the "whitelist". Files whose names appear in the whitelist file are excluded from the duplicate comparison.
The output of the script is as follows. For the purposes of this example, the file FOO is identical to the file BAR:
##Duplicate: s=1048576 md5=b6d81b360a5672d80c27430f39153e2c
#/fs/public/some/other/dir/BAR
/fs/public/some/dir1/FOO
The first file in a group of duplicates is always commented out, so that the output of the script can be piped to another program to remove all of the duplicates found while keeping one copy. The output of each duplicate group can be sorted, so that the commented one can be the oldest of the group, the first alphabetically, or any of the other implemented sorting methods. All sorting methods are reversible as well.
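A program on the receiving end of that pipe only has to skip the commented lines. Here is a minimal Python sketch, assuming each path is printed on its own line and that every line starting with # is either a group header or the file to keep; the name deletable_paths is made up for the example.

```python
import sys


def deletable_paths(lines):
    """Yield the uncommented paths from the scanner's output."""
    for line in lines:
        line = line.rstrip("\n")
        if line and not line.startswith("#"):
            yield line


if __name__ == "__main__":
    # Print the candidates rather than deleting them, so they can be reviewed.
    for path in deletable_paths(sys.stdin):
        print(path)
```

Printing the list first and deleting only after a look through it is the safer habit.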
Make a Tree from a List (mktreefromlist.pl)
This script is one of the smallest, yet most useful ones I've ever made. It takes a list of files from standard input, and moves them into a directory tree. It might be easiest to explain using an example:
cd "$MUSICFOLDER"
find . -type f -not -iname \*.mp3 -print \
| mktreefromlist.pl \
-a move -b "$MUSICFOLDER" -d "$SOMEOUTFOLDER" -v
The above set of commands would extract any file that does not end in .mp3 from the music directory and move it into a new directory, while preserving the subdirectory structure. For instance:
/music/Linkin Park/Meteora/Picture.jpg ->
/somedir/Linkin Park/Meteora/Picture.jpg
The ability to preserve the subdirectory structure is the real party piece.
Otherwise, it would just be a riced-out version of /bin/mv.
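The path mapping itself is simple: take the part of the path below the base directory and re-root it under the destination. A rough Python sketch of that idea follows. It is not the code in mktreefromlist.pl, and the names map_into_tree and move_into_tree are invented for the example.

```python
import os
import shutil


def map_into_tree(path, base, dest):
    """Return where `path` lands under `dest`, keeping its position relative to `base`."""
    rel = os.path.relpath(path, base)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(f"{path!r} is not inside {base!r}")
    return os.path.join(dest, rel)


def move_into_tree(path, base, dest):
    """Move `path` to its mapped location, creating parent directories as needed."""
    target = map_into_tree(path, base, dest)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.move(path, target)
    return target
```

With base /music and dest /somedir, map_into_tree gives the same result as the Picture.jpg example above.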
Splitting a Large Directory (dvd-fit.pl)
Also uses mktreefromlist.pl
I frequently want to put a large directory onto a DVD-R, CD-R, or similar, but the directory is larger than the medium. Manually splitting a directory into media-sized volumes is often a tedious task, especially on directories with many files, such as my images directory, which has over 230,000 entries.
To remedy this, I have written two Perl scripts. The first one takes a directory and generates lists of files and directories from it, with each list containing entries whose total size does not exceed the specified volume size. These lists can then be sent to the second script, which extracts (copies or moves) those files into a new directory, preserving the tree structure. From there, I can burn each of these new directories onto the chosen media.
The first script, dvd-fit.pl, makes the lists in a manner that attempts to use the fewest volumes. It uses the "Sorted Packing" bin packing algorithm, as described at the bottom of this page. The result is not always optimal, so I always check the size of each directory when I'm finished, to ensure that there aren't two or more small directories which could be merged. I have not yet had this happen.
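To give a feel for this kind of packing, here is a small Python sketch of first-fit decreasing: sort the entries from largest to smallest, then drop each one into the first volume that still has room. I am assuming this is close to what "Sorted Packing" means. It is not the code in dvd-fit.pl, and the name pack_sorted is invented for the example.

```python
def pack_sorted(sizes, capacity):
    """Pack named items into as few volumes as a first-fit decreasing pass finds.

    sizes:    dict mapping an entry name to its size
    capacity: size of one volume, in the same units
    Returns a list of volumes, each a list of entry names.
    """
    volumes = []  # each element is [remaining_space, [names]]
    # Largest first; ties are broken by name so the result is repeatable.
    for name, size in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
        if size > capacity:
            raise ValueError(f"{name!r} does not fit on a single volume")
        for volume in volumes:
            if volume[0] >= size:
                volume[0] -= size
                volume[1].append(name)
                break
        else:
            # No existing volume has room, so start a new one.
            volumes.append([capacity - size, [name]])
    return [names for _remaining, names in volumes]
```

For example, pack_sorted({"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}, 10) returns [["a", "d"], ["b", "c", "e"]]. A greedy pass like this is fast but can occasionally use more volumes than strictly necessary, which is why a check of the final sizes is still worthwhile.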
The /bin/sh script below does the whole shebang without any intervention. Simply edit the variables at the top.
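#!/bin/sh
# Settings: edit these to suit.
SRC=/home/craig/
DEST=/fs/tmp/dvd/
SIZE=dvd
DEPTH=2
LISTPREFIX=/tmp/mylist
FIT_SCRIPT=/path/to/dvd-fit.pl
MKTREE_SCRIPT=/path/to/mktreefromlist.pl

# The loop expects dvd-fit.pl to print the name of each list file it wrote.
for F in $(perl "${FIT_SCRIPT}" \
    -s "${SIZE}" -l "${DEPTH}" -o "${LISTPREFIX}" "${SRC}")
do
    # One output directory per list, named after the list file.
    VOLUME="${DEST}/$(basename "${F}")"
    mkdir -p "${VOLUME}" \
        && perl "${MKTREE_SCRIPT}" \
            -a move -b "${SRC}" -d "${VOLUME}" -v < "${F}"
    rm "${F}"
done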
Notes for All Scripts
You can run the scripts without any options to see what options are available. If there are none, nothing will be displayed.
Installing Perl Modules
If a script fails to run because of missing Perl modules, look at the top few lines of the script for lines of the form use Foo::Bar; and install each missing module with the following command:
perl -MCPAN -e 'install Foo::Bar'
On FreeBSD, Perl modules can often be installed via the ports tree:
portinstall p5-Foo-Bar
No Warranty
NOTE THAT WHILE I HAVE TAKEN CARE TO MAKE SURE THAT THESE SCRIPTS ARE SAFE, I CANNOT GUARANTEE THAT THEY WON'T CAUSE PROBLEMS, INCLUDING DATA LOSS. USE AT YOUR OWN RISK. If you see errors about uninitialised variables, or similar, that is a bug. It may or may not be harmless. It is best to only run these scripts on directories whose contents are "pristine," that is, ones whose contents all have the necessary permissions, and no special files such as pipes or character devices.
Please contact me if you have problems or patches.