Find Duplicate Files - A Comparison of fdupes and fslint

This article compares fdupes and fslint for finding duplicate files.

Application Info

fdupes - a command-line utility that finds duplicate files by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

fslint - a toolkit for finding filesystem "lint" (duplicates, broken symlinks, empty directories, and more), offering both a GUI and command-line scripts such as findup.

Install Application

fdupes

sudo apt-get install fdupes (installed size roughly 49.2 kB)

fslint

sudo apt-get install fslint (installed size roughly 868 kB)

The Test

The test consists of running each application on the same machine against the same set of files, to determine which finds the more complete set of duplicates and how long each takes to do so.

I have put together a simple script that performs the following for the test:

  • show how large the directory being checked is
  • run the deduplication test (a check only, no action taken)
  • count the number of files versus the number of duplicates found
  • calculate how long the run takes and report it as "Total Runtime" in hh:mm:ss format

fdupes

dupdir=/media/share/archive
du -hs ${dupdir}
startdt=`date +%s`
fdupes --recurse ${dupdir} > fdups.out
enddt=`date +%s`
((diff_sec=enddt-startdt))
runtime=(`echo - | awk '{printf "  %d:%d:%d","'"$diff_sec"'"/(60*60),"'"$diff_sec"'"%(60*60)/60,"'"$diff_sec"'"%60}'`)
echo "Total Runtime: ${runtime}"
cntOfDupes=`grep ${dupdir} fdups.out | wc -l`
((cntOfFD=`find ${dupdir} | wc -l`-1))   #subtract 1 as this counts the current dir
echo "Count of Duplicates: ${cntOfDupes}"
echo "Count of Files/Directories: ${cntOfFD}"
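The awk one-liner above converts elapsed seconds into hh:mm:ss. If you would rather avoid awk entirely, plain shell arithmetic with printf does the same job (a sketch, using the same diff_sec variable and an example value):

```shell
# Format an elapsed-seconds value as h:mm:ss using only shell arithmetic.
diff_sec=1275   # example value: 21 minutes 15 seconds
runtime=$(printf '%d:%02d:%02d' $((diff_sec/3600)) $((diff_sec%3600/60)) $((diff_sec%60)))
echo "Total Runtime: ${runtime}"
```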

fslint

PATH=$PATH:/usr/share/fslint/fslint #note: findup (the CLI form of fslint) is not on the PATH by default, so add it
dupdir=/media/share/archive
du -hs ${dupdir}
startdt=`date +%s`
findup ${dupdir} > fslint.out
enddt=`date +%s`
((diff_sec=enddt-startdt))
runtime=(`echo - | awk '{printf "  %d:%d:%d","'"$diff_sec"'"/(60*60),"'"$diff_sec"'"%(60*60)/60,"'"$diff_sec"'"%60}'`)
echo "Total Runtime: ${runtime}"
cntOfDupes=`wc -l < fslint.out`   #read from stdin so wc reports only the count, not the filename
((cntOfFD=`find ${dupdir} | wc -l`-1))   #subtract 1 as this counts the current dir
echo "Count of Duplicates: ${cntOfDupes}"
echo "Count of Files/Directories: ${cntOfFD}"
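If you want a quick sanity check that needs neither fdupes nor fslint, a checksum-based one-liner will surface likely duplicates. This is a sketch against a throwaway demo directory, not the archive used in the test; note that a checksum match is a strong hint rather than proof, which is exactly why fdupes follows up with a byte-by-byte comparison:

```shell
# Group files by MD5 checksum and print every line whose checksum repeats.
# uniq -w32 compares only the first 32 characters (the md5 hex digest);
# -D prints all lines belonging to a repeated group (GNU uniq options).
dupdir=$(mktemp -d)                  # demo directory; substitute your own path
printf 'same\n'  > "${dupdir}/a.txt"
printf 'same\n'  > "${dupdir}/b.txt" # duplicate of a.txt
printf 'other\n' > "${dupdir}/c.txt"
find "${dupdir}" -type f -exec md5sum {} + | sort | uniq -w32 -D
```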

The Test Results

The results of the above tests.

fdupes

  • Size: 43G    /media/share/archive
  • Total Runtime: 0:21:15
  • Count of Duplicates: 3521
  • Count of Files/Directories: 16507

Analysis of Results

I've used fdupes for a few years and like the cli interface (full disclosure). It took just over 21 minutes and found 3,521 duplicate files. A lot of them are license.txt-type files bundled with software, identical copies of the same file. I could delete all but one copy and symlink the rest to it, but that is more effort than it's worth (at the risk of breaking something).
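For the record, the symlink approach mentioned above would look something like this (a sketch with throwaway directory and file names, not something I ran against the archive):

```shell
# Keep one canonical copy of a duplicated file and point the other at it.
mkdir -p demo/app1 demo/app2
printf 'GPL text\n' > demo/app1/license.txt
printf 'GPL text\n' > demo/app2/license.txt      # identical duplicate
ln -sf ../app1/license.txt demo/app2/license.txt # replace the dup with a symlink
cat demo/app2/license.txt                        # content still readable via the link
```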

fslint

  • Size: 43G    /media/share/archive
  • Total Runtime: 0:19:10
  • Count of Duplicates: 4759
  • Count of Files/Directories: 16507

Analysis of Results

fslint is a newcomer to me. To be honest, I was leery simply because it is GUI focused rather than CLI focused. I know the GUI is a simple wrapper for the CLI, but the CLI is not well documented. After using fslint I quickly found that it is faster, and the GUI can be beneficial in some circumstances (say, sorting through 100 duplicates quickly and easily).

Summary

A summary of the test and results yields a few key items:

  • fslint's command-line interface is lacking in documentation
  • fslint has a few additional comparison options (partial md5 and sha1)
  • fslint is quicker
  • fdupes does a byte-by-byte comparison
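The byte-by-byte point deserves a concrete illustration: cmp is the standard tool for exactly that comparison, and is what you would reach for to confirm that a checksum match really is a duplicate (a sketch with throwaway files):

```shell
# cmp compares two files byte by byte; with -s it prints nothing and the
# exit status alone says whether the files are identical (0) or differ (1).
printf 'abc\n' > one.txt
printf 'abc\n' > two.txt
printf 'abd\n' > three.txt
cmp -s one.txt two.txt   && echo "one/two identical"
cmp -s one.txt three.txt || echo "one/three differ"
```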

Both tools are adequate at checking for duplicate files. fslint's speed and GUI will probably put it atop the list for most users; as for me, I will use both for a while...