We have an internal web application that accepts files in various formats from users in order to import large amounts of data into our systems.
One of the more recent upgrades that we implemented was to add a way to detect if a file was previously uploaded, and if so, to present the user with a warning and an option to resubmit the file, or to cancel the upload.
To accomplish this, we compute the MD5 hash of the uploaded file and compare it against a database table of previously uploaded file information to determine whether it is a duplicate. If the MD5 matches, the warning is displayed; otherwise, the new file information is inserted into the table and file processing continues.
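To make the flow concrete, here is a minimal sketch of the check. The `UploadRegistry` class and its `IsDuplicate` method are hypothetical names used only for illustration, and the in-memory `HashSet` stands in for our database table; the real lookup is a query against that table.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical stand-in for the database table of previously uploaded files.
public sealed class UploadRegistry
{
    // Assumption: hashes are stored as hex strings and compared case-insensitively.
    private readonly HashSet<string> _knownHashes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns true if these exact bytes were seen before;
    // otherwise records the hash and returns false.
    public bool IsDuplicate(byte[] fileBytes)
    {
        string hash = ComputeMd5Hex(fileBytes);
        // HashSet.Add returns false when the value is already present.
        return !_knownHashes.Add(hash);
    }

    private static string ComputeMd5Hex(byte[] input)
    {
        using (MD5 md5 = MD5.Create())
        {
            // BitConverter.ToString yields "AB-CD-..."; removing the dashes
            // leaves an uppercase hex string.
            return BitConverter.ToString(md5.ComputeHash(input)).Replace("-", "");
        }
    }
}
```

In this sketch, the first call to `IsDuplicate` for a given byte array returns `false` and records the hash; a second call with the same bytes returns `true`, which is the point where the warning would be shown.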
The following is the C# code used to generate the MD5 hash:
private static string GetHash(byte[] input)
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] data = md5.ComputeHash(input);
        StringBuilder bob = new StringBuilder();
        for (int i = 0; i < data.Length; i++)
            bob.Append(data[i].ToString("X2"));
        return bob.ToString();
    }
}
Everything is working fine and well... with one exception.
Users are allowed to upload .xlsx files for this process, and unfortunately this file type also stores the metadata of the file within the file contents. (This can easily be seen by changing the extension of the .xlsx file to a .zip and extracting the contents [see below].)
Because of this, the MD5 hash of the .xlsx files will change with each subsequent save, even if the contents of the file are identical (simply opening and saving the file with no modifications will refresh the metadata and lead to a different MD5 hash).
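The same inspection can be done in code instead of renaming the file. The following is only a sketch: `XlsxInspector` and `ListParts` are hypothetical names, and it assumes the `System.IO.Compression` assembly is referenced. It lists the parts inside the package; in a typical workbook the document properties (author, created and modified timestamps) sit in `docProps/core.xml`, separate from the sheet data under `xl/worksheets/`.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class XlsxInspector
{
    // Returns the full name of every part (ZIP entry) in the .xlsx package.
    public static List<string> ListParts(Stream xlsxStream)
    {
        var parts = new List<string>();
        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var archive = new ZipArchive(xlsxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                parts.Add(entry.FullName);
            }
        }
        return parts;
    }
}
```

Calling `XlsxInspector.ListParts(File.OpenRead(path))` on two saves of the same workbook should show the same set of parts, even though the bytes of the metadata parts differ between saves.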
In this situation, a file with identical records that was created at a different time or by a different user will slip past the duplicate file detection and be processed.
My question: is there a way to determine if the content of an .xlsx file matches that of a previous file without storing the file content? In other words: is there a way to generate an MD5 hash of just the contents of an .xlsx file?
