Automatically 'brute force' a few bytes to recover a corrupt file

Question

Does anyone out there know of a way to brute force values at a particular offset in a file? It's 4 consecutive bytes which would need to be brute forced. I know the correct SHA-1 of the corrupt file. So, what I'd like to do is compare the complete file SHA-1, each time it changes the byte value.

I know the exact 4 bytes which were changed, because the file was given to me by a data recovery expert, as a recovery challenge. For those who are interested in knowing, the rar file has 4 bytes which were intentionally changed. I was told the offsets of the changed 4 bytes and the original SHA-1. The person said it's IMPOSSIBLE to recover the exact file in the archive once the 4 bytes were changed. Even if it was only a few bytes and you knew exactly where the corruption was located. Since it doesn't have a recovery record. I'm trying to see if there is a way for those particular 4 bytes to be filled in correctly so the file will decompress without error. The file size is around 5mb.

Example:

I uploaded photos so it's more clearly defined of exactly what I'm looking to do. I believe someone can post them here for me with more rep.

The example offset I'm focusing on is 0x78 where the first pic shows the value as CA I want the script to the take the value up by 1 so it becomes CB as shown in the second pic. I want it to keep increasing the value by 1 and then compare the whole file SHA-1 each time. Only making changes to those 4 bytes at the specified offset.

It will try CAC5C58A and compare the SHA-1. If doesn't match, then it will try CBC5C58A.Then once the first value reaches FF it will then go to 00C6C58A and so on. Basically, I would like it to be able to go from 00000000-FFFFFFFF but to also have the option to choose where you want it start and end. I know it could take some time but I would still like to try it. Keep in mind I know the exact offset of the bytes which are corrupt. I just need the correct values.

If you search on Google: "How to fix a corrupted file by brute force" There's a person that wrote a Linux program. However, it only works against the files included with the program. I'm looking for some way to use the same process with my file.

tripleee · Accepted Answer · 2018-04-20T08:44:43.573

Here's a small Python program which does what you seem to be describing.

#!/usr/bin/env python3
from hashlib import sha1

with open('binaryfile', 'rb') as bin:
    binary = bin.read()

base = 0x0078
# ... is not valid Python; add more sequences, or take it out (or see below)
for seq in [[0xCA, 0xC5, 0xC5, 0x8A], [0xCB, 0xC5, 0xC5, 0x8A], ...]:
    copy = binary[0:base]
    copy += bytes(seq)
    copy += binary[base+len(seq):]
    if sha1(copy).hexdigest() == '9968733ce3ff0893bbb0a19e75faaf2fb0000e19':
        print('success with bytes {0}'.format(seq))
        break
else:
    print('no success')

UnOnly briefly tested; please ping me if you find typos.

The base specifies where to try to apply the four bytes, and the long string '996873... is the hex representation of the expected SHA1. The line for seq in... defines the bytes to try; and of course replace 'binaryfile' with the path to the file you want to attempt to salvage.

You can replace the literal list [[0xCA, 0xC5,...]] with something to actually loop over all possible values but it's basically just a placeholder for something more useful because I'm not really sure what exactly you want there.

Something like for seq in itertools.product(range(256), repeat=4)): will loop over all possible values from 0 to 2³²-1. (You will need to add import itertools near the top then.) Or perhaps you could simply add an offset; update the script to replace the current for seq in with the following (where again the import needs to go before the main program);

import struct

for n in range(2**32):
    val=(n+0x8AC5C5CA) % 2**32  # notice reverse order
    seq=list(reversed(struct.pack(">I", val)))
    copy = ...

I reversed the order of the bytes so that it naturally increments from 0x8AC5C5CA to 0x8AC5C5CB but then the next increment will be 0x8AC5C5CC etc. The struct magic is to convert this to a sequence of bytes (had to look it up from https://stackoverflow.com/a/26920983/874188). This will start at 0x8AC5C5CA and go to 0xFFFFFFFF, then wrap around to 0x00000000 and climb back up to 0x8AC5C5C9.

If you have multiple candidate ranges you would like to examine in a particular order, maybe something like

for rge in [(0x8AC5C5CA, 0x8AFFFFFF), (0x00C6C58A, 0x00FFFFFF),
        (0x00000000, 0x00C6C589), (0x01000000, 0x8AC5C5C9)]:
    for val in range(*rge):
        seq=list(reversed(struct.pack(">I", val)))
        copy = ...

but then you'll need to make sure yourself that the (start, end) pairs in rge cover all of the space between 0x00000000 and 0xFFFFFFFF if you really want to examine all of it. (And again, notice that the range increments the last byte and that seq applies the bytes of the value in reverse, in accordance with your stated requirements.)

If you wanted to use two different base addresses, you quickly run up against the limits of what's feasible to do in your lifetime with brute force; but you could, for example, split the 4-byte number into two 2-byte parts and apply those at different offsets.

base1 = 0x1234
base2 = 0x2345

for seq in range(whatever):
    copy = binary[0:base1]
    copy += bytes(seq[0:1])
    copy += binary[base1+2:base1+base2]
    copy += bytes(seq[2:3])
    copy += binary[base2+2:]

score 4 · Answer 2 · edited Jun 12 '20 at 13:48

No, no, no and again NO!

Seldom the answer you get is not what you expect.

Some questions for you:

Is it possible that an expert doesn't know that it is possible to brute force a string of for bytes and iteratively try the SHA-1 until it converges? No
Is it possible he forget it ? No
Is it possible that you cannot do it on a rar file? No
Is the other answer wrong? absolutely NO

So what? ... Time.

The point is that you have to change so few bytes... only 4!

What does it means? 256⁴ that is 256x256x256x256 possibilities, a really really big number.
If your computer were able to process 1 operation per second (substitution in the file + sha1)...
you should wait more then 136 years, or if you prefer more then 49710 days.

You're enough lucky, a 5MB pre-cached file (already loaded in ram and in the cache) asks only about 0.03 seconds (min 0.025s), on an old computer. That shrinks your expecting time down to 1242-1492 days (something more then 3 years).

It is true, BTW, that statistically you should have a positive answer in half of the time. Nonetheless you should wait till you will have tried all the possibilities to be sure that there is only 1 substitution that will give you the same SHA-1 checksum...

Now that IMPOSSIBLE sounds as "not possible in a WORTHWHILE amount of time".

How to proceed

A more proper answer to your technical question: when you speak about brute force it have not to be necessary blind brute force.

It is just stated in a comment in the other answer that you do not need to calculate the sha1 checksum on the part before the corruption. You do the 1st time and you save time for each successive iteration (maybe a factor 2 it depends from the position).
Something that can change the worthless of the effort is to write a parallel code that will run on the GPU. If you have a good graphic card you may have around 1000 cores that can compute for you in parallel (even more but they have a frequency lower than the cpu, but still they are a lot). If you are able to decrease the time from 1400 to 1.4 days maybe you can even do it.
A different approach can lead you to a faster solution.
You said it is a rar file. The rar file structure is divided into blocks. If you take count of it you can see where the corruption falls. If it is on the part of the data, on the part of the headers or on both. Then you can act consequently. For sake of simplicity let's suppose it is over the data:
you can do the brute force of your offset, check for each positive CRC of that block if it is even positive the SHA1 on the whole file. Again you can do a parallel code.

Final note

If they were 6 bytes instead of 4 you were out of the game with the present technology.

Automatically 'brute force' a few bytes to recover a corrupt file

2 Answers2

No, no, no and again NO!

So what? ... Time.

How to proceed

Final note