Remove Duplicate Files in Subdirectories Using MD5sum

Sometimes I end up combining tons of directories and files into one parent directory, and I want to delete every duplicate copy. For example, I may have merged thousands of MP3s into one directory. Even though the names differ, some of the files may be identical. This script keeps only the first file it finds for each distinct content, judged by MD5 checksum. In other words, if you have three songs that are named the same thing or even different things, but they are in fact the exact same file, this script will leave you with only one.
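To make the "same file, different name" idea concrete, here is a small sketch that runs in a throwaway directory. It assumes md5sum and mktemp are available; the file names and the sum_of helper are made up for the illustration.

```shell
#!/bin/bash
# Build a scratch directory with two identical files and one different file.
demo=$(mktemp -d)
printf 'same bytes\n' > "$demo/song.mp3"
cp "$demo/song.mp3" "$demo/renamed copy.mp3"
printf 'other bytes\n' > "$demo/different.mp3"

# sum_of FILE (hypothetical helper): print only the checksum column.
sum_of() { md5sum "$1" | cut -d' ' -f1; }

sum_a=$(sum_of "$demo/song.mp3")
sum_b=$(sum_of "$demo/renamed copy.mp3")
sum_c=$(sum_of "$demo/different.mp3")

# Identical content gives identical sums, whatever the file is called.
[ "$sum_a" = "$sum_b" ] && echo "song.mp3 and 'renamed copy.mp3' are duplicates"
[ "$sum_a" != "$sum_c" ] && echo "different.mp3 is unique"
```

The comparison looks only at the checksum, which is why the name of the file never matters to the script.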

Warning! This script does not ask you any questions and it tells no lies. It will systematically destroy all matching files without a second thought. It will also follow symlinks, so you’ve been warned.
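Since nothing is confirmed before deletion, it may help to preview what would go first. The following is only a sketch of one way to do that, not part of the script itself: list_duplicates is a hypothetical helper, and it assumes GNU-style md5sum output (a 32-character sum, two separator characters, then the file name) and file names without newlines or backslashes. It sorts the names, so "first" here means first alphabetically, which may differ from the order find visits them.

```shell
#!/bin/bash
# list_duplicates DIR PATTERN (hypothetical helper)
# Prints every file whose MD5 sum already appeared earlier in the sorted
# listing. It only reads files; nothing is deleted.
list_duplicates() {
    find "$1" -type f -name "$2" | sort | while IFS= read -r f
    do
        md5sum "$f"
    done | awk 'seen[$1]++ { print substr($0, 35) }'
    # seen[$1]++ is true from the second occurrence of a sum onwards;
    # substr($0, 35) strips the sum and separator, leaving the file name.
}

# Example call (commented out so the sketch has no side effects):
# list_duplicates . "*.gz"
```

Running list_duplicates . "*.gz" in the parent directory prints one line per file that would be treated as a copy.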

#!/bin/bash
# file that collects the md5sums seen so far (writing to / usually requires root, change the path if needed)
md5s=/md5s

# clear out previous md5sums
: > "$md5s"

# This finds all gz files. I wrote this script on Solaris, so some things are a bit more generic. For example, on Linux you can use -iname instead of -name to also match .Gz, .GZ, .gZ and .gz. However, the Solaris version I was on did not support this. Also, you can usually leave out the '.' on Linux.
# To search for other files, simply change the value inside the quotation marks, for example "*.mp3". On many versions of find you can use more advanced syntax with boolean operators as well.
# Reading the names line by line keeps file names with spaces intact.
find . -name "*.gz" | while IFS= read -r x
     do
          sum=$( md5sum "$x" | cut -d' ' -f1 )
          # skip anything md5sum could not read, otherwise an empty sum would match every line
          [ -n "$sum" ] || continue
          echo "trying $sum"
          if grep "$sum" "$md5s" > /dev/null
               then
                    echo "removing duplicate: $x"
                    # remove this rm line to do a dry run
                    rm "$x"
               else
                    echo "$sum" >> "$md5s"
               fi
     done
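
The comment in the script mentions boolean operators in find. As a rough sketch of what that can look like, the parentheses and -o ("or") below match more than one extension in a single pass. find_multi is a hypothetical helper name, and the two patterns are only examples.

```shell
#!/bin/bash
# find_multi DIR (hypothetical helper): list regular files that end in
# .mp3 or .gz. The escaped parentheses group the two -name tests so that
# -type f applies to both of them.
find_multi() {
    find "$1" -type f \( -name "*.mp3" -o -name "*.gz" \)
}

# Example call (commented out so the sketch has no side effects):
# find_multi .
```

The same grouped expression can replace the single -name test in the script if several file types should be checked together.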
