I'm writing a file comparison function. I know of filecmp.cmp, but in my dataset a lot of the files are expected to be the same, so I thought that rather than comparing each potential match pairwise it would be better to implement a multi-file comparison that can compare them all at once. (Also, since I'm new to Python, I thought it was a good learning exercise.) It seems to be going OK so far; in fact, with some inputs it is faster than Unix's cmp, which actually has me a little worried, because I don't quite believe that's possible and therefore think there may be something wrong with my implementation!
So, I have the code written, but I'm now trying to determine what the ideal chunk size for each read would be. Part of me thinks that the data retrieved will all have to be compared anyway, so the more I can get into memory at one time the better, but I wonder if there are limitations of Python's data structures that might influence this beyond that. For example, I'm maintaining potentially large lists of chunks and using dictionaries where the keys are the read chunks.
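Roughly, the idea is something like the following sketch (this is a simplified illustration of the approach, not my actual code, which is longer; the names and chunk size are just for the example):

    CHUNK_SIZE = 2 ** 20  # 1 MiB; this is the value I'm trying to tune

    def group_identical(paths, chunk_size=CHUNK_SIZE):
        """Split paths into groups of files whose contents are identical."""
        # Start with a single candidate group containing every file.
        pending = [[open(p, 'rb') for p in paths]]
        finished = []
        while pending:
            group = pending.pop()
            if len(group) < 2:
                # A group of one can't match anything else.
                for f in group:
                    f.close()
                finished.append(group)
                continue
            # Read one chunk from each file and regroup by chunk value:
            # the chunk itself is the dictionary key.
            buckets = {}
            for f in group:
                buckets.setdefault(f.read(chunk_size), []).append(f)
            for chunk, files in buckets.items():
                if chunk == b'':
                    # These files hit EOF together with identical content so far,
                    # so they are identical.
                    for f in files:
                        f.close()
                    finished.append(files)
                else:
                    # Still candidates for matching; keep reading.
                    pending.append(files)
        return [[f.name for f in g] for g in finished]

The dictionary keys are the chunks themselves, which is why I'm wondering whether very large chunk sizes interact badly with dict hashing or with holding many such byte strings in memory at once.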
So, what should I be aware of in Python's built-in data structures that may affect this, or is this something that will only be determined by hardware and should be found by profiling on a particular machine?
Reading that back, I realise it's not the clearest question, but (despite attempts) I'm not sure how to clarify it. I'm happy to post my code if that will make things clearer, but it's a bit longer than your average code sample (not too bad, though). Please comment if further clarification is needed.
Thanks.
Update re. SHA1: I have tested my algorithm against an SHA1-based comparison on just two identical input files (more are expected in the real data), running each 100 times. I realise this is not a thorough test, but the results are different enough to be worth commenting on.
(The computer wasn't under any other load during either test, and despite what I said in the comments, this wasn't running on a target machine; it was running on one with quite reasonable specs. Both tests had the potential to run in two threads; that is, the SHA1 version ran in two threads, and two threads were started for mine, but due to the implementation only one would have been used. The single-threaded SHA1 version took much longer. Both tests read the same size of chunk at a time. Three sets of results are given below.)
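For reference, the SHA1 side of the test was essentially the standard chunked-hashing pattern; this sketch is a reconstruction of that idea, not the exact benchmark code:

    import hashlib
    from collections import defaultdict

    def group_by_sha1(paths, chunk_size=2 ** 20):
        """Group files by the SHA1 digest of their contents."""
        groups = defaultdict(list)
        for path in paths:
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                # Hash the file in chunks of the same size as my version reads.
                for chunk in iter(lambda: f.read(chunk_size), b''):
                    h.update(chunk)
            groups[h.digest()].append(path)
        return list(groups.values())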
Now I'm confused. Are the comments (re. SHA1) right? If so, is this indicative of a faulty implementation, or is something else going on?
SHA1:

    real    5m35.865s    6m17.737s    5m57.010s
    user    10m18.963s   11m34.178s   10m58.760s
    sys     0m47.030s    0m52.707s    0m47.807s

Mine:

    real    3m47.185s    4m31.548s    4m40.628s
    user    2m47.849s    3m26.207s    3m36.013s
    sys     0m59.193s    1m5.139s     1m4.406s