1

Is there any software that can be used to scrutinize all visible or invisible characters in a text file (characters like BOM, direction mark, line feed ...)?

Showing the Unicode name of each character would also be a useful feature.

I want to use such an app to analyze text files before parsing them with a programming language.

Real Dreams
  • 5,408

5 Answers

3

A good hex editor is probably your best bet. Try FrHed (http://frhed.sourceforge.net/en/) if you're on Windows or Bless (http://home.gna.org/bless/) on Linux.
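
If you'd rather script the same check, here is a minimal Python sketch of what a hex editor lets you see: the raw bytes and any BOM at the start of the file. The function names `detect_bom` and `hex_dump` are my own, not taken from any of the tools above, and the BOM table only covers UTF-8, UTF-16 and UTF-32.

```python
import codecs

# Known BOM byte sequences. UTF-32 comes first because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(data: bytes):
    """Return the name of the BOM at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex values, and printable ASCII."""
    pad = width * 3 - 1  # width of a full row of "xx " groups
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as "." like most hex editors do
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(rows)


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + "caf\u00e9\r\n".encode("utf-8")
    print(detect_bom(sample))
    print(hex_dump(sample))
```

To inspect a real file, read it in binary mode (`open(path, "rb").read()`) so that Python does not decode or strip anything before you look at it.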

2

The BabelPad editor is great: when you place the cursor after a character, it shows you the Unicode number and the Unicode name. And it has a built-in Unicode information viewer, which shows many Unicode properties for characters. Unfortunately, it consumes the BOM instead of showing it, and it also interprets line break characters instead of displaying them. There might be a way to change this; its documentation is... well, not the best part of it. But it will show invisible controls like LRM and can distinguish between a space and a no-break space, etc.
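
For a quick check without an editor, the same idea (spotting LRM, or telling a space from a no-break space) can be sketched in Python with the standard `unicodedata` module. The helper name `find_invisible` and the choice of categories to flag are my assumptions; adjust them to whatever you consider "invisible".

```python
import unicodedata

# General categories that usually render as nothing or as blank space:
# Cf = format (LRM, RLM, ZWJ, BOM), Cc = control (tab, CR, LF),
# Zl / Zp = line and paragraph separators.
INVISIBLE_CATEGORIES = {"Cf", "Cc", "Zl", "Zp"}


def find_invisible(text: str):
    """Yield (index, code point, name) for characters that are easy to miss."""
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Zs covers all space separators; only flag the ones that are
        # not the plain ASCII space.
        odd_space = category == "Zs" and ch != " "
        if category in INVISIBLE_CATEGORIES or odd_space:
            # Control characters such as LF have no Unicode name,
            # so a fallback label is supplied.
            name = unicodedata.name(ch, "<unnamed control>")
            yield index, f"U+{ord(ch):04X}", name


if __name__ == "__main__":
    for hit in find_invisible("a\u200eb\u00a0c\n"):
        print(hit)
```

Note that the text has to be decoded first, so a BOM will only show up here (as U+FEFF) if you decode with a codec that keeps it, for example `utf-8` rather than `utf-8-sig`.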

1

Maybe this is helpful, though this answer would be a better fit for Stack Overflow. I built a small parser in Perl which does what you want. Shame there's no highlighting here.

#!/usr/bin/perl
use strict; use warnings;
use feature qw(say);
use Data::Dumper;
use Unicode::String;
use utf8;

my $line_no = 1;
# Read stuff from the __DATA__ section as if it were a file,
# one line at a time
while (my $line = <DATA>) {
  # Create a Unicode::String object
  my $us = Unicode::String->new($line);

  # Iterate over the length of the string
  for (my $i = 0; $i < $us->length; $i++) {
    # Get the next char
    my $char = $us->substr($i, 1);
    # Output a description, one line per character
    printf "Line %i, column %i, 0x%x '%s' (%s)\n",
      $line_no,         # line number
      $i,               # column number (0-based)
      $char->ord,       # the ordinal of the char, in hex
      $char->as_string, # the stringified char (as in the input)
      $char->name;      # the character's Unicode name
  }
  # increment line number
  $line_no++;
}

# Below is the DATA section, which can be used as a file handle
__DATA__
This is some very strange unicode stuff right here:
٩(-̮̮̃-̃)۶ ٩(●̮̮̃•̃)۶ ٩(͡๏̯͡๏)۶ ٩(-̮̮̃•̃).

Let's see what this does:

  • Read from a file handle (the DATA section can be used like that) line by line.
  • Create an object that represents a Unicode string from the line.
  • Iterate over the characters in that string.
  • Output the line, column, code point and Unicode name of each character.

It's really very straightforward. Maybe you can adapt it to PHP, though I don't know if there's a handy library around for the names.
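
For comparison, here is a rough Python adaptation of the Perl script above. It is only a sketch: `describe` is a name I made up, and it uses the standard `unicodedata` module for the character names instead of Unicode::String. Like the Perl version, it starts a new line only at a line feed and counts columns from 0.

```python
import unicodedata


def describe(text: str):
    """Return one description per character: line, column, code point, name."""
    rows = []
    line_no, column = 1, 0
    for ch in text:
        # Control characters (LF, CR, tab) have no Unicode name,
        # so a placeholder is used for them.
        name = unicodedata.name(ch, "<no name>")
        # repr() makes invisible characters such as "\n" visible as escapes
        rows.append(
            f"Line {line_no}, column {column}, 0x{ord(ch):x} {ch!r} ({name})"
        )
        if ch == "\n":
            line_no += 1
            column = 0
        else:
            column += 1
    return rows


if __name__ == "__main__":
    sample = "A\u0303 \u0669\n\u200f!"
    for row in describe(sample):
        print(row)
```

Each combining mark gets its own row, just as in the Perl output, because the loop walks code points rather than user-perceived characters.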

Hope it helps.


I lifted the smiley thingies from here: Which Unicode characters do smilies like ٩(•̮̮̃•̃)۶ consist of?

simbabque
  • 481
1

UltraEdit is a multi-platform text editor with Unicode support and a Hex mode that will show you the hex codes for everything side-by-side with the characters they generate. It even has a Hex find/replace dialog (at least on the Mac version, which is what I'm using at the moment). It's a bit pricy, but it does a lot of other stuff as well.

adv12
  • 160
1

I'd recommend Notepad++. If you go under View->Show Symbol and select "Show All Symbols", it will show any invisible characters with their names. For example, it will show newlines as LF, CRLF, or CR depending on the newline format you're using.
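
If you want the same newline information from a script before parsing, a small Python sketch could count each line-ending style in the raw bytes. The function name `count_line_endings` is hypothetical, and the sketch assumes an ASCII-compatible encoding such as UTF-8 (it would miscount UTF-16 or UTF-32 data).

```python
import re
from collections import Counter

# Labels matching the ones Notepad++ displays
NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}

# CRLF is listed first so it is matched as one unit and not counted
# as a CR followed by a separate LF.
LINE_ENDING = re.compile(rb"\r\n|\r|\n")


def count_line_endings(data: bytes) -> Counter:
    """Count how often each line-ending style occurs in data."""
    return Counter(NAMES[m.group()] for m in LINE_ENDING.finditer(data))


if __name__ == "__main__":
    sample = b"first\r\nsecond\nthird\rfourth\r\n"
    print(count_line_endings(sample))
```

A file that reports more than one style has mixed line endings, which is often worth normalizing before parsing.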