64

I have 2 excel documents and I want to check if they are exactly the same, apart from the file name.

For example, the files are called fileone.xls and filetwo.xls. Apart from the file names, their contents are presumed to be identical, but this is what I want to check.

I've been looking for ways to check this without installing a bunch of plugins. There doesn't seem to be a straightforward way.

I've tried generating MD5 hashes for both files. When the hashes are identical, does this mean that the file contents are 1:1 the same?

sam
  • 4,359

17 Answers

95

When the hashes are identical, does this mean that the file contents are 1:1 the same?

All files are a collection of bytes (values 0-255). If two files' MD5 hashes match, it is extremely likely that both those collections of bytes are exactly the same (same order, same values).

There's a very small chance that two different files can generate the same MD5, which is a 128-bit hash. The probability is:

Probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456. (From an answer on Stack Overflow.)

Hashes are meant to work in "one direction only" - i.e. you take a collection of bytes and get a hash, but you can't take a hash and get back a collection of bytes.

Cryptography depends on this (it's one way two things can be compared without knowing what those things are.)

Around the year 2005, methods were discovered to create two documents that have the same MD5 hash (a collision attack); see @user2357112's comment below. This means an attacker can create two executables, for example, that have the same MD5, and if you are depending on MD5 to determine which to trust, you'll be fooled.

Thus MD5 should not be used for cryptography or security. Publishing an MD5 hash on a download site to ensure download integrity is a bad idea, for example. Depending on an MD5 hash you did not generate yourself to verify file or data contents is exactly what you want to avoid.

If you generate your own, you know you're not being malicious to yourself (hopefully). So for your use, it's OK, but if you want someone else to be able to reproduce it, and you want to publicly publish the MD5 hash, a better hash should be used.
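
As a sketch of how you might generate and compare the hashes yourself (the file names are the ones from the question; swap "md5" for "sha256" if you want a stronger hash), Python's standard hashlib is enough:

    import hashlib

    def file_hash(path, algorithm="md5", chunk_size=65536):
        """Hash a file in chunks so large workbooks don't have to fit in memory."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    hash_a = file_hash("fileone.xls")
    hash_b = file_hash("filetwo.xls")
    print("identical bytes" if hash_a == hash_b else "files differ")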


Note that it's possible for two Excel files to contain the same values in the same rows and columns, but for the bytestream of the file to be completely different due to different formatting, styles, settings, etc.

If you want to compare the data in the files, export both to CSV with the same rows and columns first, to strip out all formatting, and then hash or compare the CSVs.
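
If the CSV export route is too manual, one alternative sketch (my own suggestion, and it does mean installing pandas plus an Excel reader such as xlrd for .xls files, which goes against the "no plugins" preference) is to load the cell values of every sheet and compare them directly, ignoring formatting:

    import pandas as pd

    # Read every sheet into a dict of DataFrames; header=None keeps raw cell values.
    sheets_a = pd.read_excel("fileone.xls", sheet_name=None, header=None)
    sheets_b = pd.read_excel("filetwo.xls", sheet_name=None, header=None)

    same_data = sheets_a.keys() == sheets_b.keys() and all(
        sheets_a[name].equals(sheets_b[name]) for name in sheets_a
    )
    print("cell values match" if same_data else "cell values differ")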

Stevoisiak
  • 16,075
LawrenceC
  • 75,182
38

In practice, yes, an identical cryptographic hash means the files are the same, as long as the files were not crafted by an attacker or other malicious entity. The odds of random collisions with any well-designed cryptographic hash function are so small as to be negligible in practice and in the absence of an active attacker.

In general, however, no, we cannot say that two arbitrary files having the same hash definitely means that they are identical.

The way a cryptographic hash function works is to take an arbitrary-length input, and output a fixed-length value computed from the input. Some hash functions have multiple output lengths to choose from, but the output is still to some degree a fixed-length value. This value will be up to a few dozen bytes long; the hash algorithms with the longest output value in common use today have a 512-bit output, and a 512-bit output is 64 bytes.

If an input to a hash function is longer than the output of the hash function, some fidelity must be removed to make the input fit in the output. Consequently, there must exist multiple inputs of lengths greater than the length of the output, which generate the same output.

Let's take the current workhorse, SHA-256, as an example. It outputs a hash of 256 bits, or 32 bytes. If you have two files which are each exactly 32 bytes long, but different, these should (assuming no flaw in the algorithm) hash to different values, no matter the content of the files; in mathematical terms, the hash is a function mapping a 2^256 input space onto a 2^256 output space, which should be possible to do without collisions. However, if you have two files that are each 33 bytes long, there must exist some combination of inputs that give the same 32-byte output hash value for both files, because we're now mapping a 2^264 input space onto a 2^256 output space; here, we can readily see that there should, on average, exist 2^8 inputs for every single output. Take this further, and with 64-byte files there should exist 2^256 inputs for every single output!

Cryptographic hash functions are designed such that it's computationally difficult to compose an input that gives a particular output, or compose two inputs that give the same output. This is known as preimage attack resistance or collision attack resistance. It's not impossible to find these collisions; it's just intended to be really, really, really, really hard. (A bit of a special case of a collision attack is a birthday attack.)

Some algorithms are better than others at resisting attackers. MD5 is generally considered completely broken these days, but last I looked, it still sported pretty good first-preimage resistance. SHA-1 is likewise effectively broken; practical collision attacks have been demonstrated, though they still require significant effort, and there's no reason to believe the situation will improve; as the saying goes, attacks always get better, they never get worse. SHA-256/384/512 are currently still believed safe for most purposes. However, if you're just interested in seeing whether two non-maliciously-crafted, valid files are the same, then any of these should be sufficient, because the input space is constrained enough that you'd mostly be concerned about random collisions. If you have any reason to believe that the files were crafted maliciously, then you need to at the very least use a cryptographic hash function that is currently believed safe, which puts the lower bar at SHA-256.

First preimage is to find an input that yields a specific output hash value; second preimage is to find one input that gives the same output as another, specified input; collision is to find two inputs that yield the same output, without regard to what that is and sometimes without regard to what the inputs are.

All that said, it's important to keep in mind that the files may have very different data representations and still display exactly the same. So they can appear to be the same even though their cryptographic hashes don't match, but if the hashes match then they are extremely likely to appear the same.

user
  • 30,336
10

It's a probability game... hashes are able to represent a finite number of values.

If we consider a hypothetical (and very weak) 8-bit hashing algorithm, then this can represent 256 distinct values. As you start to run files through the algorithm, you will start to get hashes out... but before long you will start to see "hash collisions". This means that two different files were fed into the algorithm, and it produced the same hash value as its output. Clearly here, the hash is not strong enough, and we cannot assert that "files with matching hashes have the same content".
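
To make this concrete, here is a toy sketch (purely illustrative; the 8-bit "hash" is just the last byte of an MD5 digest, which is my own stand-in for the hypothetical algorithm) showing how quickly collisions appear when the output space holds only 256 values:

    import hashlib

    def hash8(data: bytes) -> int:
        """A deliberately weak 8-bit hash: the last byte of an MD5 digest."""
        return hashlib.md5(data).digest()[-1]

    seen = {}
    for i in range(1000):
        data = f"file contents #{i}".encode()
        h = hash8(data)
        if h in seen:
            print(f"collision: {seen[h]!r} and {data!r} both hash to {h}")
            break
        seen[h] = data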

Extending the size of the hash, and using stronger cryptographic hashing algorithms can significantly help to reduce collisions, and raise our confidence that two files with the same hash have the same content.

This said, we can never reach 100% certainty - we can never claim for sure that two files with the same hash truly have the same content.

In most / many situations this is fine, and comparing hashes is "good enough", but this depends on your threat model.

Ultimately, if you need to raise the certainty levels, then I would recommend that you do the following:

  1. Use strong hashing algorithms (MD5 is no longer considered adequate if you need to protect against potentially malicious users)
  2. Use multiple hashing algorithms
  3. Compare the size of the files - an extra data point can help to identify potential collisions, but note that the demonstrated MD5 collision did not need to alter the data's length.

If you need to be 100% sure, then by all means start with a hash, but if the hashes match, follow it up with a byte-by-byte comparison of the two files.
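
For that final byte-by-byte check, a minimal sketch using Python's standard filecmp module (file names taken from the question):

    import filecmp

    # shallow=False forces an actual byte-by-byte comparison instead of
    # trusting file size and modification time.
    if filecmp.cmp("fileone.xls", "filetwo.xls", shallow=False):
        print("byte-for-byte identical")
    else:
        print("files differ")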


Additionally, as pointed out by others... the complexity of documents produced by applications such as Word and Excel means that the text, numbers, and visible layout can be the same, but the data stored in the file can be different.

Excel is particularly bad at this - simply opening a spreadsheet and saving it (having done nothing) can produce a new file with different content.

Attie
  • 20,734
6

Short answer: A cryptographic hash is supposed to help you be reasonably confident that files with matching hashes are the same. Unless deliberately crafted, the chances of two slightly different files having similar hash values are ridiculously small. But when it comes to comparing and verifying files that could be deliberately tampered with, MD5 is a poor choice. (Use another hash function like SHA3 or BLAKE2.)

Long answer: An ideal hash function is one that creates an almost unique cryptographic hash for every unique piece of data. In other words, even though we know for certain that there exist two files in this universe whose hash values collide, the chance of these two files naturally coming together is ridiculously small.

Ten years ago, I decided I must stay as far as I can from MD5. (Of course, until yesterday, I remembered the wrong reason for doing so; ten years is a long time, you see. I revisited my past memos to remember why and edited this answer.) You see, in 1996, MD5 was found to be susceptible to collision attacks. 9 years later, researchers were able to create pairs of PostScript documents and (ouch!) X.509 certificates with the same hash! MD5 was clearly broken. (Megaupload.com was also using MD5, and there was a lot of hanky-panky around hash collisions that gave me trouble at the time.)

So, I concluded that while MD5 was (and still is) reliable for comparing benign files, one must stop using it altogether. I reasoned that reliance on it risks turning into indulgence and false confidence: once you start comparing files using their MD5 hashes, one day you forget the security fine print and compare two files that are deliberately crafted to have the same hash. In addition, CPUs and cryptoprocessors were unlikely to add support for it.

The original poster, however, has even less reason to use MD5, because:

  1. As long as one is comparing two files only, byte-for-byte comparison is actually faster than generating one's own MD5 hashes. For comparing three or more files... well, now you have a legitimate cause.
  2. The OP specified "ways to check this without installing a bunch of plugins". Windows PowerShell's Get-FileHash command can generate SHA1, SHA256, SHA384, SHA512 and MD5 hashes, and on modern computers with hardware support for the SHA functions, generating them is fast. (For comparing many files at once, see the sketch below.)
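
If you do end up comparing many spreadsheets rather than just two, a scripted sketch (the "spreadsheets" directory name is a placeholder of mine) can group files by hash so duplicates stand out without comparing every pair directly:

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    groups = defaultdict(list)
    for path in Path("spreadsheets").glob("*.xls"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)

    for digest, names in groups.items():
        if len(names) > 1:
            print("identical contents:", ", ".join(names))
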
6

If two files have the same MD5 hash, and they haven't both been specially crafted, then they're identical. How hard it is to craft files with the same MD5 hash depends on the file format; I don't know how easy it is with Excel files.

So if you have files of your own that are just lying around and want to find duplicates, MD5 is safe. If you wrote one of the files, and the other file is of dubious origin, MD5 is still safe (the only way to get different files with the same MD5 checksum is to craft both files). If someone you don't trust sends you a budget proposal, and later sends another file which they claim is the same, then MD5 may not be enough.

To avoid any risk, use SHA-256 or SHA-512 instead of MD5. If two files have the same SHA-256 hash, then they're identical. The same goes for SHA-512. (There's a theoretical possibility that they could be different, but the probability of this happening accidentally is so much smaller than the probability of your computer flipping a bit during the verification that it just isn't relevant. As for someone deliberately crafting two files with the same hash, nobody knows how to do this for SHA-256 or SHA-512.)

If two Excel files have different hashes, then they're different, but there's no way to know by how much they differ. They could have identical data but different formatting, or they could just differ in the properties, or they might have been saved by different versions. In fact if Excel is anything like Word then merely saving a file updates its metadata. If you only want to compare the numerical and text data and ignore formatting and properties, you can export the spreadsheets to CSV to compare them.

If you have Unix/Linux tools available, then you can use cmp to compare two files. To compare two files on the same machine, checksums only make things more complicated.

5

Hashes such as MD5 or SHA have a fixed length; let's say it's 300 alphanumeric characters (in reality they are shorter and don't use the whole set of alphanumeric characters).

Let's say that files are made of alphanumeric characters and are up to 2 GB in size.

You can easily see that there are way more files (with size of up to 2GB) than possible hash values. The pigeonhole principle says that some (different) files must have the same hash values.

Also, as demonstrated on shattered.io,[1] you can have two different files, shattered.io/static/shattered-1.pdf and shattered.io/static/shattered-2.pdf, that have the same SHA-1 hash value while being completely different.

[1] SHA-1 is a "stronger" hashing algorithm than MD5.

5

I have 2 excel documents and I want to check if they are exactly the same, apart from the file name.

From a practical perspective, directly comparing the files to find out if they're different will be faster than computing a hash for each file and then comparing that hash.

To compute the hashes you have to read the entirety of the contents of both files.

To determine if they're identical through a direct comparison, you just need to read the contents of both files until they don't match. Once you find a difference, you know the files aren't identical and you don't have to read any more data from either file.

And before you do either, you can simply compare the sizes of the two files. If the sizes differ, then the contents can't be the same.
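
A minimal sketch of that strategy (the helper name and file names are just illustrative): check sizes first, then read both files in chunks and stop at the first difference:

    import os

    def same_file_contents(path_a, path_b, chunk_size=65536):
        """Return True only if both files contain identical bytes."""
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False                  # different sizes: nothing needs to be read
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                a = fa.read(chunk_size)
                b = fb.read(chunk_size)
                if a != b:
                    return False          # first mismatching chunk ends the comparison
                if not a:                 # both files exhausted without a mismatch
                    return True

    print(same_file_contents("fileone.xls", "filetwo.xls"))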

4

NO. Different values guarantee the files are different. The same values are not a guarantee the files are the same. It is relatively easy to find examples using CRC16.

On the balance of probability, with contemporary hashing schemes, they are the same.

mckenzm
  • 946
3

Your question is backwards, though. Let's assume that matching hashes mean the files contain the same data (which isn't 100% guaranteed, but is good enough that you could compare files every second for a lifetime and not hit a collision). It doesn't necessarily follow that having the same data means the files will have the same hash. So no - you can't compare the data in one Excel file with the data in another Excel file by hashing the files, because there are many ways two files can differ without the underlying data being different.

One obvious way: the data is stored as XML, and each cell has its own XML node. If those nodes are stored in different orders, then the data is the same but the file is different.
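
As an illustration of where such byte-level differences hide, here is a sketch that applies to the zip-based .xlsx format (not the legacy binary .xls in the question; the file names are my own) and reports which internal XML parts differ between two workbooks:

    import hashlib
    import zipfile

    def member_hashes(path):
        """Hash every part inside the workbook's zip container."""
        with zipfile.ZipFile(path) as z:
            return {name: hashlib.sha256(z.read(name)).hexdigest()
                    for name in z.namelist()}

    a = member_hashes("fileone.xlsx")
    b = member_hashes("filetwo.xlsx")
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            # e.g. xl/worksheets/sheet1.xml or docProps/core.xml
            print("differs:", name)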

3

To add to the other answers: there are many published examples of pairs of files with the same MD5 hash but different content.

2

The answer for this OP has been given but might benefit from a summary.

If you want to check whether two files are the same, a lot depends on whether or not the files and hashes are under your control.

If you generate the hashes yourself from the files, and you are pretty sure nobody else had the opportunity/skill/motivation to deliberately try to make you reach the wrong conclusion, then almost any hash - even "known broken" hashes like MD5 and SHA1 - is almost certain to be sufficient. By that, I mean you could generate files at high speed for millions of years and you'd still be unlikely to end up with any two files that are actually different but have the same hash. It's almost certainly safe.

This is the scenario you have, when you want to quickly check if two directories on your PC or file server have the same content, if any files in a directory are exact duplicates, etc, and you're pretty sure the files haven't been engineered/illicitly modified, and you trust your hashing app/utility to give correct results.

If you are in a scenario where one of the files - or a precalculated hash - might have been manipulated or engineered to fool you into a wrong conclusion, then you need a stronger (unbroken) hash, and/or other security. For example, if you download a file and check if it's valid by examining a hash, then an attacker might be able to engineer a bad file with the correct hash, or attack the website to place an incorrect hash when you look for the "right" (expected) value. This comes down to wider security issues.

Stilez
  • 1,825
2

On the Windows command line, you can use the comp utility to determine whether two files are exactly the same. For example:

comp fileone.xls filetwo.xls
Chad
  • 1,649
1

When the hashes are identical, does this mean that the file contents are 1:1 the same?

No. If the hashes are different, it does mean that the contents are different. Equal hashcodes, however, do not imply equal content. A hashcode is, by definition, a reduction of a large domain to a smaller range; the implication is that hashcodes over unequal content can be equal. Otherwise there would be no point in computing them.

1

This answer is intended to be a handy map of scenarios that can or cannot happen, and reasonings you can apply. Refer to other answers to learn why hash functions work this way.


After you choose a hash function and stick to it, these are all combinations to consider:

                     identical hash values      different hash values
    identical files  can happen, common         cannot happen, impossible
    different files  can happen, rare*          can happen, common

* rare, unless whoever generates (at least one of) the files purposely aims at this scenario

The scenario in which identical files generate different hash values is the only one that is strictly impossible.


Two reasonings that always apply:

  • If files are identical then hash values are identical for sure.
  • If hash values are different then files are different for sure.

Two reasonings that are not strict:

  • If files are different then hash values are probably different.
  • If hash values are identical then files are probably identical.
0

For your purposes, yes, identical hashes means identical files.

As other answers make clear, it's possible to construct 2 different files which result in the same hash and MD5 is not particularly robust in this regard.

So use a stronger hashing algorithm if you plan on comparing a large number of excel documents or if you think someone might want to manipulate the comparison. SHA1 is better than MD5. SHA256 is better again and should give you complete confidence for your particular usage.

jah
  • 243
-1

The files are probably identical if their hashes are identical. You can increase confidence by modifying both files in an identical way (e.g. put the same value in the same unused cell) then comparing hashes of the modified files. It is hard to create a deliberate collision for a file which is changed in a way not known in advance.

ibft2
  • 7
-2

Let's look at this in a practical way. Instead of saying "the hashes are identical" I'll say "I wrote a computer program that calculates the hashes of two files and prints out whether they are the same or not", and I run the program with two files, and it says "identical". There are several reasons why it might do that:

  • The files may be identical.
  • My code may have bugs. (One that has actually happened in practice was comparing two long (256-byte) hashes not with memcmp but with strcmp: that comparison returns "same" as soon as the first byte in each hash is zero, and the chance of that is 1 in 65,536.)
  • There may be a hardware fault (a cosmic ray hitting a memory cell and flipping it).
  • Or you may have the rare case of two different files with an identical hash (a hash collision).

I would say that for non-identical files, by far the most likely cause is programmer error, then comes the cosmic ray that changed a boolean variable with the result of comparing the hashes from "false" to "true", and much later comes the coincidence of a hash collision.

There are enterprise backup systems that avoid backing up identical files from 10,000 users by hashing each file and checking for a file with an identical hash already stored on the server. So in case of a collision a file won't get backed up, possibly leading to data loss. Someone calculated that it is much more likely that a meteorite hits your server and destroys all backups than losing a file because its checksum matched a different file.

gnasher729
  • 443