
I have a huge log file (several GB), but somehow there is binary data in there (grep complains about it), which of course shouldn't be there.

I know how to read the file anyway.

What I don't know is how to find the bad binary data in it, so that I can maybe pinpoint where it's being logged by looking at the text around it.

keiki

4 Answers


I've just hit the exact same problem (although only with a multi-megabyte log file). As with a lot of problems, it just takes a couple of commands strung together.

cmp /path/to/file.log <(strings /path/to/file.log)

cmp compares files and tells you where they differ (unlike diff, which tells you how they differ). strings extracts the printable text strings from a binary file. <(…) is process substitution: it lets you pass the output of one command to another command as if it were a file.

Basically, you compare the log file with the text strings in the log file so that you find where they first differ.
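To see it in action, here's a quick sketch with a throwaway file (hypothetical path; the /dev/fd number in cmp's output will vary by shell and system):

$ printf 'good line\n\0\0\0bad line\n' > /tmp/demo.log
$ cmp /tmp/demo.log <(strings /tmp/demo.log)
/tmp/demo.log /dev/fd/63 differ: byte 11, line 2

Byte 11 is the first NUL: the two inputs match up to the end of the first line and diverge exactly where the binary junk starts.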

For example, I get A and B differ: byte 1450315, line 6390. Running tail -n +6390 /path/to/file.log | less shows the log starting at the "bad" line, or you can pipe through | hexdump -C | less to see the hex. (Piping through head -n 1 didn't work for me because the binary was \x00 characters, which only showed up when there was a pager.)

(Note: this may not work well with multi-gigabyte logs on a machine without sufficient memory; I don't know how memory-efficient strings and cmp are.)

IBBoard

Building on @ibboard's great idea: find-non-printable.sh:

#!/bin/sh
usage="$0 FILE - Locates first non-printable byte, as in 'FILE - differ: byte 21881, line 507'"
[ $# -eq 1 ] || { echo "$usage" >&2; exit 1; }
# File size in bytes; limiting cmp to this avoids a spurious EOF message
# when the strings output runs longer than the file (strings terminates
# the last string with a newline even if the file has none).
n_bytes=$(stat --printf='%s' "$1")
# -w: --include-all-whitespace, so blank lines are not reported as hits
strings -w "$1" | cmp -n "$n_bytes" "$1"
  • No output when no non-printable characters are found, instead of a message like cmp: EOF on FILE after byte 1677, line 47.
  • A simple pipe instead of <(), so it works with a POSIX shell instead of requiring Bash or Zsh. The output shows - as the file name instead of /dev/....
  • Works as expected when the file contains blank lines. This requires a strings implementation that supports -w, such as GNU strings; without it, the result just points at the first blank line, which is usually not what you want.
  • The file name only needs to be typed once.
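For example, a run might look like this (hypothetical file name; the output format is the one shown in the script's usage string):

$ ./find-non-printable.sh FILE
FILE - differ: byte 21881, line 507

The - is cmp's name for the strings output arriving on standard input, and the byte and line numbers point at the first non-printable character.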

I have solved the same problem. What worked for me was simply to head the file incrementally and grep the output to see at which line the binary characters occur.

To begin with, I ran head -n 1; there was no binary character. Then head -n 2, then head -n 3, and so on. Soon I found the line where the binary characters were present.
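If your grep is GNU grep built with PCRE support, you can skip the manual loop and search for the first NUL byte directly (a sketch assuming the binary junk is NUL bytes, which it usually is). -a treats the binary file as text, -P lets \x00 match a NUL byte (the shell itself cannot pass a literal NUL as an argument, so the escape must be interpreted by grep), -n prints the line number, and -m 1 stops at the first match:

$ grep -naP -m 1 '\x00' /path/to/file.log | cut -d: -f1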

AleKi

If you have my_file that contains (in vim):

test data 1
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@test data 2
test data 3

grep is bothered by the null byte (hex value 00, shown as ^@ in vim). If you search for a regular string such as 'data' with $ grep 'data' my_file, you get Binary file my_file matches, which is not the expected result. If you want to examine or remove the null bytes by hand, you can find the offending bytes with:

$ < my_file hexdump -C | grep -C2 ' 00'
00000000  74 65 73 74 20 64 61 74  61 20 31 0a 00 00 00 00  |test data 1.....|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 74 65  |..............te|
00000020  73 74 20 64 61 74 61 20  32 0a 74 65 73 74 20 64  |st data 2.test d|
00000030  61 74 61 20 33 0a                                 |ata 3.|

and see the regular strings near the null bytes, which you can then search for in vim and edit. (Don't search for the periods in the right-hand column: hexdump prints . as a placeholder for any non-printable byte, such as the newline (0a) or the null byte (00), rather than a literal ..) If you want to remove the null bytes programmatically:

$ < my_file sed 's/\x0//g' > my_file_without_nulls
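Since tr handles NUL bytes natively, an equivalent alternative (same hypothetical file names) is:

$ < my_file tr -d '\000' > my_file_without_nulls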