kafsemo.org

Perl Unicode notes

2005-01-08

(Presented herein, a few notes about Unicode in Perl, written up for XML::Writer support and recorded for posterity.) Perl has fairly good Unicode support, but the default behaviour is a little problematic and has also changed between releases. Being explicit about how you want to deal with Unicode avoids many problems.

Firstly, non-ASCII characters in Perl source should probably be avoided in favour of \x{...} escapes.

my $currency = "£";

This can break if the code ever moves to an environment with a different default encoding, whereas:

my $currency = "\x{A3}"; # U+00A3 POUND SIGN

is totally safe. As with HTML and XML numeric references, it’s unambiguously a Unicode character, rather than a sequence of bytes. (Is there a module to automate this conversion? Java has native2ascii.) This is particularly a problem with distributed development – if different developers use different native encodings, things will start to fail when code is combined.
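As a rough answer to the question above, the conversion itself is only a few lines with the core Encode module. This is a minimal sketch, not an existing tool: escape_non_ascii is a hypothetical helper, and it assumes you know which encoding the source was saved in.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw source bytes plus the encoding they were
# saved in, and returns pure-ASCII text with \x{...} escapes.
sub escape_non_ascii {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);    # bytes -> characters
    # Replace every character outside ASCII with its code point in hex.
    $text =~ s/([^\x00-\x7F])/sprintf('\\x{%X}', ord($1))/ge;
    return $text;
}

# The pound sign here arrives as the two UTF-8 octets c2 a3.
print escape_non_ascii("my \$currency = \"\xC2\xA3\";\n", 'UTF-8');
```

That prints my $currency = "\x{A3}"; as plain ASCII. The filter is naive: it rewrites every non-ASCII character, including those in comments, POD and single-quoted strings, where the escape is not interpreted, so the result needs reviewing by hand.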

Output is a case where Perl’s DWIMmery gets it into trouble.

print "\$\n";       # Dollar sign
print "\x{A3}\n";   # Pound sign
print "\x{20AC}\n"; # Euro sign

This comes out as 24 0a, a3 0a, e2 82 ac 0a. The dollar sign fits into seven bits; no problem. The pound sign fits into eight bits, so it gets written as a single octet. The Euro sign doesn’t, so it gets written as UTF-8 (with a “Wide character in print” warning). The result is an output stream that mixes encodings, which is not only wrong but even breaks recovery heuristics on the reading side. Choose an encoding, declare it explicitly, and stick to it. Just set the encoding before printing any output:

binmode(STDOUT, ':encoding(iso-8859-1)');

or

binmode(STDOUT, ':encoding(utf-8)');

(or ':encoding(windows-1252)' if you’re more of a pragmatist).
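To see what each of those choices means for the three test characters, the same conversions can be done in memory with Encode::encode, which accepts the same encoding names. This is only an illustrative sketch; hex_bytes is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of $string as encoded in $encoding.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);    # characters -> bytes
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Dollar, pound and Euro under each of the encodings mentioned above.
for my $encoding ('iso-8859-1', 'utf-8', 'windows-1252') {
    printf "%-12s %s\n", $encoding, hex_bytes($encoding, "\$\x{A3}\x{20AC}");
}
```

UTF-8 gives 24 c2 a3 e2 82 ac and Windows-1252 gives 24 a3 80. ISO 8859-1 has no Euro sign at all, so encode() falls back to a substitute ‘?’ by default, which is one more reason to pick the encoding deliberately.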

Input is much the same:

$ { echo $; echo £; echo €; } |
  perl -ne 'chomp; print length($_),"\n";'
1
2
3

Here Perl is counting bytes, rather than characters; in a UTF-8 environment that’s not ideal. For a case with an 8-bit Euro sign, the same input was run through this code in an ISO 8859-15 locale:

$ { echo $; echo £; echo €; } |
  perl -ne 'chomp; print sprintf("U+%04X", ord($_)),"\n";'
U+0024
U+00A3
U+00A4

It’s subtle, but failure to declare the encoding has turned the Euro sign into U+00A4, or ‘¤’ – the generic “currency sign”: the byte a4 was taken as ISO 8859-1 rather than ISO 8859-15. It’s the same problem that turns Windows-1252 “smart” quotes into the control characters U+0093/U+0094.
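The ambiguity is easy to reproduce without touching the locale, by decoding the same bytes under different assumptions. A small sketch using Encode::decode; code_point is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the code point that a single $byte stands for
# when it is interpreted as $encoding.
sub code_point {
    my ($encoding, $byte) = @_;
    return sprintf 'U+%04X', ord(decode($encoding, $byte));
}

# a4 is the Euro sign in ISO 8859-15; 93 and 94 are the Windows-1252 quotes.
for my $encoding ('iso-8859-1', 'iso-8859-15', 'windows-1252') {
    my @points = map { code_point($encoding, $_) } "\xA4", "\x93", "\x94";
    print "$encoding: @points\n";
}
```

Only ISO 8859-15 turns a4 into U+20AC, and only Windows-1252 turns 93 and 94 into U+201C and U+201D; read as ISO 8859-1, all three bytes map straight to the code points of the same number.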

Again, explicit declaration of encoding makes it all okay:

$ { echo $; echo £; echo €; } |
  perl -e 'binmode(STDIN, ":encoding(utf-8)"); while (<>) {
   chomp; print length($_),"\n";}'
1
1
1

and

$ { echo $; echo £; echo €; } |
  perl -e 'binmode(STDIN, ":encoding(iso-8859-15)"); while (<>) {
   chomp; print sprintf("U+%04X", ord($_)),"\n";}'
U+0024
U+00A3
U+20AC

If you’re in an all-UTF-8 environment, giving the command-line switch ‘-CSDA’ to perl will make UTF-8 the default for the standard streams, for other filehandles and for command-line arguments; setting PERL_UNICODE to ‘SDA’ does the same. (‘-CDA’ alone covers files you open and the arguments, but not STDIN, STDOUT and STDERR.) ‘-C’ on its own tries to do the right thing according to your locale; however, this only works for UTF-8 – there’s no special handling for other encodings.
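Where passing switches is awkward, a script can make a similar declaration for itself with the core open pragma. The following is a sketch of that alternative rather than an exact equivalent: the pragma is lexically scoped and does nothing for command-line arguments. round_trip and the scratch file name are hypothetical.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));    # standard streams plus open() defaults
use File::Temp qw(tempdir);

# Hypothetical helper: write $text to a scratch file and read it back,
# relying only on the default layers set by the pragma above.
sub round_trip {
    my ($text) = @_;
    my $path = tempdir(CLEANUP => 1) . '/sample.txt';
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my $copy = do { local $/; <$in> };    # slurp the whole file
    close($in);
    return ($copy, -s $path);             # characters read, bytes on disk
}

my ($copy, $size) = round_trip("\$\x{A3}\x{20AC}");
print length($copy), " characters, $size bytes\n";
```

Neither open() call names an encoding, yet the three characters come back as three characters from a six-byte file.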

As soon as you use anything outside ASCII, you need to think about encoding; be wary of default behaviour. The currency symbols are great test characters, too: there’s something reassuringly commercial about them, in case anyone tries to spin Unicode as something that only comes into play when you start blogging about maths.

FileCache::Handle

Seems like FileCache doesn’t work with XML::Writer; I took the opportunity to learn some more about PerlIO and wrote FileCache::Handle. It’s a module that provides an unlimited number of writeable filehandles, opening and closing the underlying files as necessary to avoid OS-imposed limits on numbers of open files. Rather than doing clever things with symbols, it uses instances of IO::Handle.
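The underlying idea, opening files on demand and closing the least recently used one when a limit is reached, can be sketched as below. This is not FileCache::Handle’s interface or implementation (it does not even subclass IO::Handle); HandlePool and its methods are hypothetical names, and the UTF-8 layer is an assumption chosen to match the notes above.

```perl
use strict;
use warnings;

package HandlePool;    # hypothetical name, not part of FileCache::Handle

sub new {
    my ($class, %args) = @_;
    return bless {
        max   => $args{max} || 16,    # most files kept open at once
        open  => {},                  # path => filehandle
        order => [],                  # open paths, least recently used first
        seen  => {},                  # paths already written during this run
    }, $class;
}

# Write @data to $path, opening the file (and evicting another) if needed.
sub write_to {
    my ($self, $path, @data) = @_;
    my $fh = $self->{open}{$path};
    if ($fh) {
        # Already open: drop the path from the queue; it is re-added below.
        @{ $self->{order} } = grep { $_ ne $path } @{ $self->{order} };
    }
    else {
        if (@{ $self->{order} } >= $self->{max}) {
            my $oldest = shift @{ $self->{order} };
            my $old_fh = delete $self->{open}{$oldest};
            close($old_fh) or die "Cannot close $oldest: $!";
        }
        # Truncate on first use, append when reopening after an eviction.
        my $mode = $self->{seen}{$path}++ ? '>>' : '>';
        open($fh, $mode . ':encoding(UTF-8)', $path)
            or die "Cannot open $path: $!";
        $self->{open}{$path} = $fh;
    }
    push @{ $self->{order} }, $path;
    return print {$fh} @data;
}

# Number of files currently held open.
sub open_count {
    my ($self) = @_;
    return scalar keys %{ $self->{open} };
}

# Flush and close everything that is still open.
sub close_all {
    my ($self) = @_;
    for my $path (keys %{ $self->{open} }) {
        close($self->{open}{$path}) or die "Cannot close $path: $!";
    }
    $self->{open}  = {};
    $self->{order} = [];
    return;
}

package main;
```

A caller would create one pool with HandlePool->new(max => 2), call write_to for as many paths as it likes, and finish with close_all; however many files are written, no more than max are open at once.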

(Music: Whipping Boy, “We Don’t Need Nobody Else”)