
I'm working on a Windows 10 computer, using WSL.

I'm investigating a log file produced by NLog in a C# application. I expect log entries to appear throughout the whole file, but I see the following:

Linux prompt> grep "geen mengcontainer" logfile.log
2023-03-07 07:25:08.7971 | Warn | ... | geen mengcontainer.
2023-03-07 07:25:09.8285 | Warn | ... | geen mengcontainer.
2023-03-07 07:25:10.8754 | Warn | ... | geen mengcontainer.
Binary file logfile.log matches

As you see, grep stops after 07:25:10, even though the file continues for the rest of the day. There seems to be some character telling grep that the file is not a text file but a binary file, causing grep to stop working.

Some more information about the file:

Linux prompt> file logfile.log
logfile.log: ASCII text, with CRLF line terminators

Some more information about my Linux WSL installation:

Linux prompt> uname -a
Linux ComputerName 4.4.0-19041-Microsoft
  #2311-Microsoft Tue Nov 08 17:09:00 PST 2022 
  x86_64 x86_64 x86_64 GNU/Linux

Linux prompt> cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
...
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Some more information about my grep installation:

Linux prompt> grep --version
grep (GNU grep) 3.4

What can I do?

  • Does anybody know how to find and replace the character that causes grep to stop filtering?
  • Does anybody know which extra parameter or switch I can add to grep so that it does not stop filtering?
  • Does anybody know of a grep version which does not behave like this? (Please take into account that apt update and the like don't work in my environment.)

Thanks in advance

Peter Cordes
Dominique

1 Answer


Use grep -a to force a file to always be treated as text.

The "binary file" detection is codepage-sensitive – if grep expects UTF-8 input as usual on Linux, it will actually end up detecting "ANSI" (Windows-125x, ISO 8859-x) encoded text files as binary files. Running grep under the "C" locale with LC_CTYPE=C grep or LC_ALL=C grep may also avoid this problem.
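A minimal sketch of both workarounds, using a throwaway file (`/tmp/sample.log` is an assumed path) that reproduces the issue with a stray Windows-1252 byte (0xE9, "é"):

```shell
# Create a sample log containing one Windows-1252 byte (0xE9), which is
# invalid as UTF-8 and may trip grep's binary-file detection.
printf 'ok line 1\nm\xe9nge line 2\nok line 3\n' > /tmp/sample.log

# -a / --text forces grep to treat the file as text:
grep -a 'ok' /tmp/sample.log

# Running grep under the C locale has a similar effect, since every byte
# is then a valid character:
LC_ALL=C grep 'ok' /tmp/sample.log
```

Both invocations print the two matching "ok" lines instead of stopping with "Binary file ... matches".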

(Also, what 'file' says about the input being "ASCII" is based entirely on a quick look at the initial bytes within the file; it doesn't actually scan the entire thing, whereas 'grep' of course does.)

Usually the entire file is in the same encoding (i.e. all of it is likely to be non-UTF-8), so an easy way to find the problematic characters is to search for non-ASCII bytes (LC_ALL=C may be needed):

grep -a -P -n --color '[^\x00-\x7F]' logfile.log
perl -ne 'print "Line $.:\t$_" if /[^\0-\177]/' < logfile.log

This would also highlight the bytes in question:

perl -ne 'print "Line $.:\t$_" if s/[^\0-\177]/sprintf"\e[41m<%02X>\e[m",ord$&/ge' < logfile.log
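Once the bytes are located, they can also be replaced. A sketch of two options, again on a throwaway sample file (`/tmp/mixed.log` is an assumed path): deleting the non-ASCII bytes outright with tr, or, if they are really Windows-1252 text, converting the whole file to UTF-8 with iconv so nothing is lost:

```shell
# Reproduce with a sample file containing a stray Windows-1252 byte (0xE9).
printf 'ok 1\ncaf\xe9 2\nok 3\n' > /tmp/mixed.log

# Destructive fix: delete every byte outside the ASCII range \0-\177
# (writes a cleaned copy, leaves the original untouched).
tr -cd '\0-\177' < /tmp/mixed.log > /tmp/mixed.ascii.log

# Lossless fix: reinterpret the file as Windows-1252 and convert to UTF-8,
# so the stray bytes become valid multi-byte characters instead.
iconv -f WINDOWS-1252 -t UTF-8 /tmp/mixed.log > /tmp/mixed.utf8.log
```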

If the file is valid UTF-8 except with some odd lines, use a similar approach to print lines that fail UTF-8 decoding:

perl -MEncode -ne 'print "Line $.:\t$_" if !eval{decode("UTF-8", $_, Encode::FB_CROAK)}' < logfile.log
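Without perl, iconv can serve as a quick validity check: decoding the file as UTF-8 to /dev/null fails with a non-zero exit status at the first invalid byte sequence (the `/tmp/check.log` path below is an assumed example):

```shell
# Sample file with one byte (0xE9) that is not valid UTF-8.
printf 'good line\nbad \xe9 line\n' > /tmp/check.log

# iconv exits non-zero and reports the offset of the first invalid sequence.
if ! iconv -f UTF-8 -t UTF-8 /tmp/check.log > /dev/null 2>&1; then
    echo "not valid UTF-8"
fi
```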
grawity