
I'm working on a Windows 10 computer, using WSL.

I'm investigating a log file produced by NLog in a C# application. I expect log entries to appear throughout the whole file, but I see the following:

Linux prompt> grep "geen mengcontainer" logfile.log
2023-03-07 07:25:08.7971 | Warn | ... | geen mengcontainer.
2023-03-07 07:25:09.8285 | Warn | ... | geen mengcontainer.
2023-03-07 07:25:10.8754 | Warn | ... | geen mengcontainer.
Binary file logfile.log matches

As you see, grep stops after 07:25:10, even though the file continues for the rest of the day. There seems to be some character telling grep that the file is not a text file but a binary file, causing grep to stop working.

Some more information about the file:

Linux prompt> file logfile.log
logfile.log: ASCII text, with CRLF line terminators

Some more information about my Linux WSL installation:

Linux prompt> uname -a
Linux ComputerName 4.4.0-19041-Microsoft
  #2311-Microsoft Tue Nov 08 17:09:00 PST 2022 
  x86_64 x86_64 x86_64 GNU/Linux

Linux prompt> cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
...
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Some more information about my grep installation:

Linux prompt> grep --version
grep (GNU grep) 3.4

What can I do?

  • Does anybody know how to find and replace the character that causes grep to stop filtering?
  • Does anybody know which extra parameter or switch I can add to grep so that it does not stop filtering?
  • Does anybody know of a grep version which does not behave like this? (Please take into account that apt update and the like don't work in my environment.)

Thanks in advance

Peter Cordes
Dominique

1 Answer


Use grep -a to force a file to always be treated as text.

The "binary file" detection is codepage-sensitive – if grep expects UTF-8 input as usual on Linux, it will actually end up detecting "ANSI" (Windows-125x, ISO 8859-x) encoded text files as binary files. Running grep under the "C" locale with LC_CTYPE=C grep or LC_ALL=C grep may also avoid this problem.
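A minimal sketch of both workarounds, using a throwaway file (`/tmp/sample.log` is an assumed path) that reproduces the issue with a stray Windows-1252 byte (0xE9, "é"):

```shell
# Create a sample log containing one Windows-1252 byte (0xE9), which is
# invalid as UTF-8 and may trip grep's binary-file detection.
printf 'ok line 1\nm\xe9nge line 2\nok line 3\n' > /tmp/sample.log

# -a / --text forces grep to treat the file as text:
grep -a 'ok' /tmp/sample.log

# Running grep under the C locale has a similar effect, since every byte
# is then a valid character:
LC_ALL=C grep 'ok' /tmp/sample.log
```

Both invocations print the two matching "ok" lines instead of stopping with "Binary file ... matches".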

(Also, what 'file' says about the input being "ASCII" is based entirely on a quick look at the initial bytes within the file; it doesn't actually scan the entire thing, whereas 'grep' of course does.)

Usually the entire file is in the same encoding (i.e. all of it is likely to be non-UTF-8), so an easy way to find the problematic characters is to search for non-ASCII bytes (LC_ALL=C may be needed):

grep -a -P -n --color '[^\x00-\x7F]' logfile.log
perl -ne 'print "Line $.:\t$_" if /[^\0-\177]/' < logfile.log

This would also highlight the bytes in question:

perl -ne 'print "Line $.:\t$_" if s/[^\0-\177]/sprintf"\e[41m<%02X>\e[m",ord$&/ge' < logfile.log
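Once the bytes are located, they can also be replaced. A sketch of two options, again on a throwaway sample file (`/tmp/mixed.log` is an assumed path): deleting the non-ASCII bytes outright with tr, or, if they are really Windows-1252 text, converting the whole file to UTF-8 with iconv so nothing is lost:

```shell
# Reproduce with a sample file containing a stray Windows-1252 byte (0xE9).
printf 'ok 1\ncaf\xe9 2\nok 3\n' > /tmp/mixed.log

# Destructive fix: delete every byte outside the ASCII range \0-\177
# (writes a cleaned copy, leaves the original untouched).
tr -cd '\0-\177' < /tmp/mixed.log > /tmp/mixed.ascii.log

# Lossless fix: reinterpret the file as Windows-1252 and convert to UTF-8,
# so the stray bytes become valid multi-byte characters instead.
iconv -f WINDOWS-1252 -t UTF-8 /tmp/mixed.log > /tmp/mixed.utf8.log
```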

If the file is valid UTF-8 except with some odd lines, use a similar approach to print lines that fail UTF-8 decoding:

perl -MEncode -ne 'print "Line $.:\t$_" if !eval{decode("UTF-8", $_, Encode::FB_CROAK)}' < logfile.log
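Without perl, iconv can serve as a quick validity check: decoding the file as UTF-8 to /dev/null fails with a non-zero exit status at the first invalid byte sequence (the `/tmp/check.log` path below is an assumed example):

```shell
# Sample file with one byte (0xE9) that is not valid UTF-8.
printf 'good line\nbad \xe9 line\n' > /tmp/check.log

# iconv exits non-zero and reports the offset of the first invalid sequence.
if ! iconv -f UTF-8 -t UTF-8 /tmp/check.log > /dev/null 2>&1; then
    echo "not valid UTF-8"
fi
```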
grawity