How to find identical files on Linux using a Python script

If you have two directories (A and B) with a lot of files, and some files in B are the same as some files in A, you can use this Python script to find the identical files.

#!/usr/bin/env python
# Python 2 script: it relies on the commands module and the print statement.
import sys
import commands

if len(sys.argv) != 3:
    print """wrong number of arguments!
usage: find_identical.py source destination"""
    sys.exit(1)

src = sys.argv[1]
dest = sys.argv[2]

#print "src: %s, dest: %s" % (src, dest)
print "#!/bin/sh"

# md5sum prints one "<checksum>  <path>" line per file; its error messages
# (for example about subdirectories) are discarded.
srcText = commands.getoutput("md5sum %s/* 2>/dev/null" % src)
destText = commands.getoutput("md5sum %s/* 2>/dev/null" % dest)

def listFromString(val):
    # Split md5sum output into [checksum, path] pairs.
    result = []
    for line in val.split("\n"):
        parts = line.split(None, 1)  # split once so paths may contain spaces
        if len(parts) == 2:
            result.append(parts)
    return result

def dictFromString(val):
    # Map each checksum to the path of a file that has it.
    result = {}
    for checksum, path in listFromString(val):
        result[checksum] = path
    return result

srcDict = dictFromString(srcText)
destList = listFromString(destText)
filesFound = False
for file in destList:
    if srcDict.has_key(file[0]):
        print "rm \"%s\" #identical to %s" % (file[1], srcDict[file[0]])
        filesFound = True

if not filesFound:
    print "#no identical files found"
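
The core idea of the script is a dictionary lookup keyed on the checksum. Here is a minimal sketch of just that step, run on made-up md5sum output so it can be tried without any files. The names parse_md5sum and find_identical, the sample paths and the sample checksums are assumptions for illustration, not part of the script above, and the code is written so that it runs on both Python 2 and Python 3.

```python
# Made-up md5sum output for two hypothetical directories A and B.
SRC_TEXT = (
    "d41d8cd98f00b204e9800998ecf8427e  A/empty.txt\n"
    "5d41402abc4b2a76b9719d911017c592  A/hello.txt"
)
DEST_TEXT = (
    "5d41402abc4b2a76b9719d911017c592  B/copy of hello.txt\n"
    "7d793037a0760186574b0282f2f435e7  B/other.txt"
)

def parse_md5sum(text):
    """Return (checksum, path) pairs from md5sum-style output."""
    pairs = []
    for line in text.splitlines():
        # Split only once, so a path may contain spaces.
        parts = line.split(None, 1)
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs

def find_identical(src_text, dest_text):
    """Return (dest_path, src_path) pairs whose checksums match."""
    by_checksum = dict(parse_md5sum(src_text))
    return [(path, by_checksum[checksum])
            for checksum, path in parse_md5sum(dest_text)
            if checksum in by_checksum]

matches = find_identical(SRC_TEXT, DEST_TEXT)
```

With this sample input, matches holds a single pair: the B file with the shared checksum and the A file it duplicates.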

The script needs two arguments: the names of the first and second directories. By default it prints (to standard output) a shell script that would remove the files from directory B that are also present in directory A.
So, to remove the files from B that are already present in A, you'll need to run:
find_identical.py A B >applychanges.sh
bash applychanges.sh

It is done this way so that you can examine the list of files that will be removed before removing them. This behavior can be easily changed by modifying the line:
 print "rm \"%s\" #identical to %s" % (file[1], srcDict[file[0]])

to something you need.
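
As one possible example of such a change, here is a minimal sketch of a helper that builds the output line and can move duplicates into a backup directory instead of removing them. The function name format_action and the backup_dir parameter are hypothetical, they are not part of the original script. The sketch also quotes paths with the standard library's quote function, which the original print line does not do.

```python
try:
    from shlex import quote      # Python 3
except ImportError:
    from pipes import quote      # Python 2

def format_action(dest_path, src_path, backup_dir=None):
    """Build one line of the generated shell script.

    With backup_dir=None the duplicate is removed; otherwise it is moved
    into backup_dir (a hypothetical directory that must already exist).
    """
    if backup_dir is None:
        command = "rm %s" % quote(dest_path)
    else:
        command = "mv %s %s" % (quote(dest_path), quote(backup_dir + "/"))
    # The shell treats everything after " #" as a comment.
    return "%s #identical to %s" % (command, src_path)

remove_line = format_action("B/a b.txt", "A/a.txt")
move_line = format_action("B/x.txt", "A/x.txt", "dupes")
```

Here remove_line is "rm 'B/a b.txt' #identical to A/a.txt" and move_line is "mv B/x.txt dupes/ #identical to A/x.txt". In the script, the print line would then print the result of such a helper.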

The script needs md5sum to work.
Note that Linux software that does the same thing (finds identical files) is already available, for example fdupes. So this script is mostly useful for learning Python, or as a base for making another script.
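
If md5sum is not available, the checksums can also be computed in Python itself with the standard hashlib module. What follows is only a sketch under some assumptions: the function names md5_of_file, checksums and identical_files are made up for this example, it is a different approach from the script above (no external command is run), and unlike the shell pattern dir/* it also looks at hidden files.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksums(directory):
    """Return (checksum, path) pairs for the regular files in directory."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip subdirectories
            pairs.append((md5_of_file(path), path))
    return pairs

def identical_files(src, dest):
    """Return (dest_path, src_path) pairs of files with equal checksums."""
    by_checksum = dict(checksums(src))
    return [(path, by_checksum[checksum])
            for checksum, path in checksums(dest)
            if checksum in by_checksum]
```

Calling identical_files("A", "B") would give the same kind of pairs that the script turns into rm lines, so the printing part can stay as it is.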
