utf8::encode $_[0] if utf8::is_utf8 $_[0];
`utf8::encode if utf8::is_utf8` is a bug. Don't do it.
There are reasons URI::Escape provides two functions, uri_escape and uri_escape_utf8. The former handles arbitrary byte strings, whether they are UTF-8 or not, and the latter treats its argument as a (possibly wide) character string.
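To make the difference concrete, here is a minimal sketch of the two functions side by side (the output comments are what I'd expect from URI::Escape's documented defaults):

use URI::Escape qw(uri_escape uri_escape_utf8);

# uri_escape treats its argument as raw bytes
my $bytes = "\xE3\x81\x82";            # the three UTF-8 bytes for HIRAGANA A
print uri_escape($bytes), "\n";        # %E3%81%82

# uri_escape_utf8 encodes the characters to UTF-8 first, then escapes
my $chars = "\x{3042}";                # the single character HIRAGANA A
print uri_escape_utf8($chars), "\n";   # %E3%81%82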
Doing utf8::encode based on the utf8 flag is just wrong. The flag only tells you the internal representation of a scalar, and latin-1 range characters may be encoded in a bogus way unless you explicitly call utf8::upgrade on them before passing them to the function.
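To see the problem, consider this sketch; buggy_escape is a hypothetical reduction of the pattern above, and the output comments show the behavior I'd expect:

use URI::Escape ();

# a hypothetical function built on the buggy pattern
sub buggy_escape {
    my ($s) = @_;
    utf8::encode($s) if utf8::is_utf8($s);
    return URI::Escape::uri_escape($s);
}

my $a = "caf\xE9";   # "café" stored as latin-1 bytes, utf8 flag off
my $b = "caf\xE9";
utf8::upgrade($b);   # the same characters, internally upgraded, flag on

print buggy_escape($a), "\n";   # caf%E9    -- the é escaped as a raw latin-1 byte
print buggy_escape($b), "\n";   # caf%C3%A9 -- the é encoded to UTF-8 first

Two logically identical strings yield two different URLs, purely because of their internal representation.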
Nothing's wrong with URI::Escape providing uri_escape to handle arbitrary encodings. While I agree most web pages should just use UTF-8 for everything in 2010, putting other text encodings such as EUC-JP, or even arbitrary binary data (such as JPEG data), in a URL is not invalid either.
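For example, escaping EUC-JP encoded bytes with uri_escape is perfectly reasonable; a sketch, assuming the usual EUC-JP byte values for these characters:

use Encode qw(encode);
use URI::Escape qw(uri_escape);

my $chars = "\x{65E5}\x{672C}";         # the two characters of "Nihon" (Japan)
my $bytes = encode('euc-jp', $chars);   # now an EUC-JP byte string
print uri_escape($bytes), "\n";         # %C6%FC%CB%DC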
Mark's quote from RFC 3986 is taken out of context. It says "When a new URI scheme defines a component that represents textual data consisting of characters from [UCS] ..." which doesn't apply when we encode parameters for web URLs. It's not a new URI scheme, and the data doesn't necessarily represent "textual data" either.
Don't rely on utf8 flags of the strings. See perlunitut and perlunifaq for more details.
Thanks for the feedback. I updated my post to note which modules use the code you suggest is buggy. (They are CGI::Util and Mojo::Util).
The W3C once clarified that the first step of URI percent-encoding should be converting the data to UTF-8. They spell out that step here:
http://www.w3.org/International/O-URL-code.html
But, judging by the warning at the top of the page, that advice may not be completely current.
Would you say then that the current URI::Escape approach is best: providing one function which encodes arbitrary data, and one which first encodes the input as UTF-8 (regardless of the state of the UTF-8 flag)?
Posted by: Mark | 2010.12.17 at 21:00
Also, what is your opinion on the appropriateness of including handling of UTF-16 surrogate pairs in a URI percent-encoding solution? CGI.pm and URI::Escape::XS do this (using code from the same author).
Posted by: Mark | 2010.12.17 at 21:05
I want to further clarify what the bug is here. According to the Perl docs, it is against best practices to check the UTF-8 flag. Instead the programmer should keep track of the encodings of her strings and explicitly encode if necessary. Since this code sample does check the UTF-8 flag, not following the recommended best practices could be considered a bug of sorts. I get that.
However, in practical terms, what is wrong with encoding something as UTF-8 that is marked by the language as being UTF-8? In your comment you mentioned EUC-JP and JPEG data. I can't see that either of these would have the UTF-8 flag set, so neither would be UTF-8 encoded by the code snippet.
In practice, both CGI.pm and Catalyst have been using this method for years without any bug reports that I can see in the bug queues.
Posted by: Mark | 2010.12.18 at 11:47
> However, in practical terms what is wrong with encoding something as UTF-8 that is marked by the language as being UTF-8?
Well, the pedantic answer is that the utf8 flag is just a mark *for perl* (not for programmers) to figure out whether the string contained in a scalar has a wide character (> 255), and it does not necessarily mean the string is actually a decoded Unicode string.
My pragmatic answer, though, is that there's less risk in handling strings *with* the utf8 flag as decoded character strings. That part is okay. The problem is handling everything else as a byte string, which causes a bug with latin-1 characters.
Think of this code:
use HTML::Entities;
my $s1 = "H&eacute;llo";          # entities in the latin-1 range only
my $s2 = "H&eacute;llo &#x1234;"; # includes an entity for a wide character
my $t1 = decode_entities($s1);
my $t2 = decode_entities($s2);
$t1 and $t2 are both generated from an HTML snippet, using the exact same function, decode_entities. Now $t1 does not have the utf8 flag, while $t2 does. In this case, whether a scalar variable has the utf8 flag or not depends entirely on the *data*, not the *code*.
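You can verify this with utf8::is_utf8, continuing the snippet above:

printf "t1: %d\n", utf8::is_utf8($t1) ? 1 : 0;   # 0 -- everything fits in latin-1
printf "t2: %d\n", utf8::is_utf8($t2) ? 1 : 0;   # 1 -- the wide character forced an upgrade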
Imagine the case where we read the data for $s1 and $s2 from a file or a database. You should be aware of how dangerous and risky it is to rely on the input data to change the behavior of the code.
The practical workaround for this is to clearly document it by saying "If you want your strings to be handled as character strings rather than byte strings, your strings always have to be upgraded, i.e. with utf8::upgrade". A couple of my modules from around 2005-6 were doing things like this, and I admit that was based on my misunderstandings.
Actually, perl 5's regular expressions have the same sort of bug: matches like \w don't match latin-1 characters unless the string has the utf8 flag. It's being worked on in perl 5.13 (and soon to be in 5.14). See Unicode::Semantics for more on this bug.
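A minimal demonstration of that historical behavior, on perls from before the fix:

my $s = "caf\xE9";   # é stored as a latin-1 byte, utf8 flag off
print $s =~ /\w$/ ? "match" : "no match", "\n";   # no match
utf8::upgrade($s);   # the exact same string, flag now on
print $s =~ /\w$/ ? "match" : "no match", "\n";   # match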
Posted by: miyagawa | 2010.12.18 at 12:12
I have continued this conversation through a follow-up blog post on my own site:
http://mark.stosberg.com/blog/2010/12/best-practice-for-handling-utf-8-when-percent-encoding.html
Continued feedback there (or here) is welcome.
Posted by: Mark Stosberg | 2010.12.20 at 17:41
Mark: I posted my comment to your blog and it's in the moderation queue; maybe your MT installation is b0rked or lost my comment, so here you go:
I'm not sure what you refer to as a best practice employed by Catalyst because Catalyst in its core doesn't automatically handle unicode/utf8 stuff. It does it with the Unicode plugin. And look at what the Unicode plugin doc says:
http://search.cpan.org/~bobtfish/Catalyst-Plugin-Unicode-0.93/lib/Catalyst/Plugin/Unicode.pm
Note that this plugin tries to autodetect if your response is encoded into characters before trying to encode it into a byte stream. This is bad as sometimes it can guess wrongly and cause problems.
As an example, latin-1 characters such as é (e-acute) will not actually cause the output to be encoded as utf8.
So it admits what it is doing is wrong, and suggests using the other module, Catalyst::Plugin::Unicode::Encoding - which doesn't do this utf8::encode if utf8::is_utf8.
My best advice is to provide two functions: one that assumes the input is already encoded (in UTF-8 or any other encoding), and one that assumes the input is character strings and encodes everything into UTF-8. The latter keeps your application Unicode-centric, with all I/O in UTF-8.
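Something like this sketch, where escape_bytes and escape_chars are just illustrative names, not an existing API:

use Encode qw(encode);
use URI::Escape qw(uri_escape);

# for callers who already encoded their data (UTF-8, EUC-JP, whatever)
sub escape_bytes {
    my ($bytes) = @_;
    return uri_escape($bytes);
}

# for callers who keep character strings throughout their application
sub escape_chars {
    my ($chars) = @_;
    return uri_escape(encode('UTF-8', $chars));
}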
Posted by: miyagawa | 2010.12.21 at 11:34