fun with character encoding
Posted by mop Tue, 29 Jun 2004 03:45:00 GMT
More fun with character encoding... this time with Perl. I’ve been down the encoding path with Java, discovering on the way some of the flaws in the IO libraries (see esp. FileWriter and FileReader).
Similar to Java, Perl does support Unicode in its string representation, UTF-8 actually. It’s nice, ’cause Perl regular expressions can act on Unicode strings, as long as you’re consistent about treating your data as UTF-8 characters rather than raw bytes. The problems start when using modules that don’t handle anything other than 7-bit ASCII. Same mistakes as you’ll find in Java libraries. To be fair, Perl modules are centrally organized via CPAN but, unlike the Java libraries, they aren’t designed/written/QA’d by the owners of the language, so we should cut the Perl module authors some slack.
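Here's a small sketch of what "consistent" means in practice. The sample string and variable names are made up for illustration; the only assumption is that the input arrives as UTF-8 encoded bytes and gets decoded with the core Encode module before any regex touches it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 bytes for "café" (the é takes two bytes).
my $bytes = "caf\xC3\xA9";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";   # 5 (bytes)
print length($chars), "\n";   # 4 (characters)

# \w sees the decoded é as a word character, but not the raw bytes.
print "decoded: ", ($chars =~ /^\w+$/ ? "match" : "no match"), "\n";   # match
print "raw:     ", ($bytes =~ /^\w+$/ ? "match" : "no match"), "\n";   # no match
```

Mixing decoded and undecoded strings in the same program is where the surprises tend to come from.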
But you’d think that a Perl module related to XML would be different. The XML::CSV module is a handy tool that uses the Text::CSV_XS library to parse comma-delimited (whatever that is) records and spit out proper XML. Unfortunately the CSV_XS module silently fails when it encounters non-ASCII data.
Perl itself does a decent job of explicit handling of character encoding when doing IO. Perl started as a text processing engine, and text-based IO is still the bread and butter for the Perl sandwich. There is even an extensive description of Perl’s Unicode and encoding support if you go looking. Specify the encoding for any file or stream, before or after it’s been opened, and Perl transparently converts as required.
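A minimal sketch of both styles, assuming a Perl with PerlIO encoding layers (5.8 or later). The temp file and the sample string are just for illustration: the first open names the encoding up front, the second sets it afterwards with binmode.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demo (hypothetical; any path would do).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Encoding given at open time: characters are written out as ISO-8859-1.
open(my $out, '>:encoding(iso-8859-1)', $path) or die "open for write: $!";
print {$out} "caf\x{E9}\n";
close $out;

# Encoding set after the handle is already open.
open(my $in, '<', $path) or die "open for read: $!";
binmode($in, ':encoding(iso-8859-1)');
my $line = <$in>;
close $in;
chomp $line;

# Same trick for an existing stream: re-encode STDOUT as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "$line\n";
```

The file on disk holds ISO-8859-1, the terminal gets UTF-8, and the string in between is just characters.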
Ok, so we know that Perl uses Unicode internally, and that some modules expect ASCII only. My immediate problem involves XML, so I’m quite happy to use 7-bit ASCII plus character entities for the extended characters. Turns out this is easily done with the Encode module. Problem solved, I think.
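One way to get there, as a sketch: decode the input, then encode to ASCII with Encode's FB_XMLCREF fallback, which swaps anything outside ASCII for a numeric character reference. The ISO-8859-1 input and the sample string are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: ISO-8859-1 bytes for "Café Müller".
my $latin1 = "Caf\xE9 M\xFCller";

# Bytes -> Perl characters.
my $text = decode('iso-8859-1', $latin1);

# Characters -> 7-bit ASCII; unmappable characters become &#x...; references.
# LEAVE_SRC keeps encode() from modifying $text in place.
my $ascii = encode('ascii', $text, Encode::FB_XMLCREF | Encode::LEAVE_SRC);

print "$ascii\n";   # Caf&#xe9; M&#xfc;ller
```

This only deals with the non-ASCII characters. Markup characters such as & and < already in the data still need their own XML escaping.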
Update: of course, all this applies only to recent versions of Java/Perl. In particular, the clever conversion from ISO-8859-1 to ASCII doesn’t work with the version of Perl in Debian woody. Yuck.
