In Python/NumPy, I have a 10,000x10,000 array named random_matrix. I use md5 to compute the hash of str(random_matrix) and of random_matrix itself. It takes 0.00754404067993 seconds for the string version and 1.6968960762 seconds for the numpy array version. When I make it a 20,000x20,000 array, it takes 0.0778470039368 seconds for the string version and 60.641119957 seconds for the numpy array version. Why is this? Do numpy arrays take up a lot more memory than strings? Also, if I want to make filenames identified by these matrices, is converting to a string before computing the hash a good idea, or are there drawbacks?
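A minimal sketch of the comparison being described (the array size here is smaller than in the question, and the timing harness is an assumption, not the asker's original code):

```python
import hashlib
import time

import numpy as np

# Smaller than the 10,000x10,000 in the question, but the same idea.
random_matrix = np.random.rand(1000, 1000)

t0 = time.time()
h_str = hashlib.md5(str(random_matrix).encode()).hexdigest()
t_str = time.time() - t0

t0 = time.time()
h_arr = hashlib.md5(random_matrix.tobytes()).hexdigest()
t_arr = time.time() - t0

print("string version: %.6f s -> %s" % (t_str, h_str))
print("array version:  %.6f s -> %s" % (t_arr, h_arr))
```

The two digests differ because the inputs differ: the string version hashes only the short printed representation, while the array version hashes every byte of data.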
Viewed 1,116 times
        mlstudent
        
1 Answer
7
str(random_matrix) will not include the whole matrix, because numpy elides the middle of large arrays with "...":
>>> import numpy as np
>>> x = np.ones((1000, 1000))
>>> print str(x)
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ..., 
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]
So when you hash str(random_matrix), you aren't really hashing all the data, only the short elided representation.
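To make the size difference concrete, one can compare the length of the printed string with the raw data size (a small sketch, not from the original answer):

```python
import numpy as np

x = np.ones((1000, 1000))

# The printed form elides the middle, so the string stays tiny
# (a few hundred characters) no matter how big the array is.
print(len(str(x)))

# ...while the raw data is 1000 * 1000 * 8 bytes of float64.
print(x.nbytes)  # 8000000
```

This is why the string hash stays fast as the matrix grows: its input barely grows at all.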
See this previous question and this one about how to hash numpy arrays.
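For the filename use case in the question, one option (a sketch under assumptions, not the linked answers' exact code; the helper name is made up) is to hash the array's raw bytes directly. Note that dtype, shape, and memory layout all affect the digest, so normalize the layout first:

```python
import hashlib

import numpy as np

def array_digest(arr):
    """MD5 hex digest of the array's raw bytes.

    np.ascontiguousarray gives a canonical C-contiguous layout so that
    views/transposes of identical data don't hash differently by accident.
    """
    arr = np.ascontiguousarray(arr)
    return hashlib.md5(arr.tobytes()).hexdigest()

a = np.ones((4, 4))
b = np.ones((4, 4))
c = np.zeros((4, 4))

print(array_digest(a) == array_digest(b))  # True: same data, same digest
print(array_digest(a) == array_digest(c))  # False: different data

filename = "matrix_%s.npy" % array_digest(a)
```

This hashes the full data, so it is slower than hashing str(arr), but unlike the string version it actually distinguishes matrices that differ only in the elided middle.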
 
    