I have some text files containing a lot of Unicode Hebrew and Greek, and each run of Hebrew or Greek needs to be enclosed in an HTML <span class ="hebrew">...</span> (or <span class ="greek">...</span>) element. These files belong to a project which has been running for some years.
Around eight years ago we successfully used this Perl script to do the job.
#!/usr/bin/perl
use utf8;
my $table = [
  {
    FROM  => "\\x{0590}",
    TO    => "\\x{05ff}",
    REGEX => "[\\x{0590}-\\x{05ff}]",
    OPEN  => "<span class =\"hebrew\">",
    CLOSE => "</span>",
  },
  {
    FROM  => "\\x{0370}",
    TO    => "\\x{03E1}",
    REGEX => "[\\x{0370}-\\x{03E1}]|[\\x{1F00}-\\x{1FFF}]",
    OPEN  => "<span class =\"greek\">",
    CLOSE => "</span>",
  },
];
binmode(STDIN,":utf8");
binmode(STDIN,"encoding(utf8)");
binmode(STDOUT,":utf8");
binmode(STDOUT,"encoding(utf8)");
while (<>) {
  my $line = $_;
  foreach my $l (@$table) {
    my $regex          = $l->{REGEX},
    my ($from, $to)    = ($l->{FROM},$l->{TO});
    my ($open, $close) = ($l->{OPEN},$l->{CLOSE});
    $line =~ s/(($regex)+(\s+($regex)+)*)/$open\1$close/g;
  }
  print $line;
}
It scans each line of the text file for characters in the defined Unicode ranges and wraps each run of them in the appropriate span.
I haven't used this script for some time, and I now need to process some more text files. But somehow the Unicode is not being preserved: the Unicode text is being corrupted instead of being wrapped in <span> tags.
I need help with a fix before I can proceed.
Here's some sample input:
Mary had a little כֶּבֶשׂ, its fleece was white as χιών. And πάντα that Mary went, the כֶּבֶשׂ was sure to go.
And here's what I'm getting as output:
Mary had a little ×Ö¼Ö¶×ֶש×, its fleece was white as ÏιÏν. And ÏάνÏα that Mary went, the ×Ö¼Ö¶×Ö¶×©× was sure to go.
At the moment I'm on a machine running Linux Mint 13 LTS. My other OS is Ubuntu 14.04. The Perl version is reported as v5.14.2. I'm running the script like this:
perl uconv.pl infile.txt > outfile.txt
I'm not sure what's happening, and in spite of looking at quite a few Stack Overflow questions and answers (this one for example), I'm none the wiser. Perhaps I need to set some environment variable? Or is something in that script now deprecated? Or...?
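One guess I can sketch, without being sure it is the actual cause: the binmode calls only touch STDIN and STDOUT, whereas <> reads from the file named on the command line, so the input may never be decoded as UTF-8. The stripped-down version below assumes that is the problem and uses the open pragma to set the layers instead. The names @table and wrap_scripts are my own for this sketch, and I have merged the two Greek ranges into one character class.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the corruption comes from the input not being decoded.
# This pragma applies a UTF-8 layer to STDIN/STDOUT/STDERR and to files
# opened in this scope, which should include files that <> opens from @ARGV.
use open qw(:std :encoding(UTF-8));

# Each entry pairs a character class with the class name used for its span.
my @table = (
    { class => 'hebrew', regex => qr/[\x{0590}-\x{05FF}]/ },
    { class => 'greek',  regex => qr/[\x{0370}-\x{03E1}\x{1F00}-\x{1FFF}]/ },
);

# Wrap each run of matching characters in a span. Whitespace between
# consecutive words of the same script stays inside a single span.
sub wrap_scripts {
    my ($line) = @_;
    for my $entry (@table) {
        my $re = $entry->{regex};
        $line =~ s{((?:$re)+(?:\s+(?:$re)+)*)}{<span class ="$entry->{class}">$1</span>}g;
    }
    return $line;
}

# Process every line from the files named on the command line (or STDIN).
print wrap_scripts($_) while <>;
```

It would be run the same way as before (perl uconv.pl infile.txt > outfile.txt). The sketch also uses $1 rather than \1 in the replacement, since \1 on the right-hand side of s/// draws a warning under use warnings.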