If your application (script) is located at /foo and a request is made to /foo/bar%2fbaz (where %2f is a URI-encoded forward slash "/"), what should the PATH_INFO value be: /bar%2fbaz (undecoded) or /bar/baz (decoded)?
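To make the two candidates concrete, here is a minimal Python sketch (Python is used purely for illustration, and the helper name is hypothetical). It also hints at why the choice matters: once decoded, a request for /foo/bar%2fbaz can no longer be told apart from one for /foo/bar/baz.

```python
from urllib.parse import unquote

def candidate_path_infos(request_path, script_name):
    """Return (undecoded, decoded) PATH_INFO candidates for a request path.

    Hypothetical helper for illustration only; assumes request_path starts
    with script_name and carries no query string.
    """
    raw = request_path[len(script_name):]
    return raw, unquote(raw)

undecoded, decoded = candidate_path_infos("/foo/bar%2fbaz", "/foo")
print(undecoded)  # /bar%2fbaz
print(decoded)    # /bar/baz

# After decoding, a plain /foo/bar/baz request yields the same PATH_INFO.
print(decoded == candidate_path_infos("/foo/bar/baz", "/foo")[1])  # True
```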
First of all, Apache has a problem with %2F in the URL anyway: it returns 404 by default, and you have to add AllowEncodedSlashes On to accept those requests. More annoyingly, even though the documentation says:
Allowing encoded slashes does not imply decoding. Occurrences of %2F or %5C (only on according systems) will be left as such in the otherwise decoded URL string.
This is not actually true: mod_cgi and other Apache handlers do decode those characters. It has been reported as a bug, but we're still seeing it today on Apache 2.x.
Back to the original question, I *think* PATH_INFO should be left undecoded, so I added a test to see how our Plack server implementations behave, and interestingly our CGI server and HTTP::Server::Simple backend failed the automated unit tests. I also confirmed it fails on FCGI with a lighttpd frontend as well as on the Apache2 mod_perl handler.
I looked at the code that handles this thing, and HTTP::Request::AsCGI and HTTP::Server::Simple both decode PATH_INFO intentionally, with a note saying "we do this because Apache and lighttpd do this".
UPDATE: HTTP::Request::AsCGI leaves URI reserved characters such as %2F encoded, because decoding them made Catalyst tests fail. This is actually an incompatibility with Apache, and I confirmed their TestApp tests fail when tested with Apache 2.x in CGI mode: here's a patch for Catalyst and HTTP::Request::AsCGI so the app should work correctly under Apache CGI as well.
Python's WSGI 2.0 wiki page also complains about this issue, linking to a detailed analysis of lots of different web servers, and suggests including RAW_PATH_INFO in addition to PATH_INFO to avoid potential issues like this. Apache's mod_cgi and lighttpd both provide a REQUEST_URI environment variable, which is undecoded, so it's possible to construct those RAW variables from it (otherwise we can't tell whether the path was encoded in the first place).
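Here is a rough sketch of that reconstruction in Python. It assumes REQUEST_URI holds the undecoded request target (path plus optional query string) and SCRIPT_NAME holds the decoded mount point. The function name and the prefix-matching strategy are my own assumptions, not something defined by CGI, WSGI or PSGI.

```python
from urllib.parse import unquote

def raw_path_info(request_uri, script_name):
    """Derive an undecoded PATH_INFO from REQUEST_URI (hypothetical helper).

    Assumes request_uri is the undecoded request target and script_name
    is the decoded SCRIPT_NAME. Returns None if they do not line up.
    """
    raw_path = request_uri.split("?", 1)[0]  # drop the query string
    # The script part may itself have been sent percent-encoded, so look
    # for the shortest raw prefix that decodes to SCRIPT_NAME.
    for i in range(len(raw_path) + 1):
        rest = raw_path[i:]
        if unquote(raw_path[:i]) == script_name and (rest == "" or rest.startswith("/")):
            return rest
    return None

print(raw_path_info("/foo/bar%2fbaz?x=1", "/foo"))  # /bar%2fbaz
```

Requiring the remainder to be empty or to start with "/" keeps an application mounted at /foo from matching a request for /foobar.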
I'm also interested in how Rack deals with this issue. The spec says "the value MAY be % encoded", so it doesn't actually state a requirement either way.
So I think this is an Apache bug, but unfortunately most software has been living with this bug, so changing the meaning of PATH_INFO might cause confusion, even if PSGI is a new spec that can be free from the existing CGI spec (or in this case, implementations). So adding an undecoded RAW_PATH_INFO, or REQUEST_URI, which is currently not in the spec, might make more sense.
Thoughts?
Indeed, adding RAW_PATH_INFO could be a good idea. I also think that changing the meaning of PATH_INFO could be worse than living with that bug.
Posted by: twitter.com/sukria | 2009.09.27 at 04:17
"So adding RAW_PATH_INFO, or REQUEST_URI which is currently not in the spec, to be undecoded might make more sense."
Probably a bad idea to duplicate that kind of information in the environment. I don't understand what the big deal is about specifying that PATH_INFO (and SCRIPT_NAME) remain undecoded. Middleware authors SHOULD read the spec and understand that -- everyone else can just use a nicer request object.
Posted by: Dean Landolt | 2009.09.27 at 11:27
As a spec author, I really like the idea of specifying PATH_INFO and SCRIPT_NAME strictly (at least stricter than WSGI and Rack) so that they should be undecoded. And I agree with you that middleware authors should read it.
The only problem is that app-side developers, say those of a web application framework that uses CGI.pm, need to be aware of whether PATH_INFO is decoded or undecoded when they switch to PSGI. PSGI clones most of the CGI environment variables, so switching is easy enough if they have a CGI implementation. On the library side I could probably just handle them in CGI::PSGI, but there's still a lot to cover, like PSGI adapters for Jifty, Catalyst, Squatting.
Posted by: miyagawa | 2009.09.27 at 23:07
I reread the RFC 3875 CGI spec, and it actually says PATH_INFO is not URI encoded (though it doesn't say it SHOULD be URI decoded!), and that Apache's default behavior of rejecting requests with %2F is allowed.
Unlike a URI path, the PATH_INFO is not URL-encoded, and cannot contain path-segment parameters. ... The server MAY impose restrictions and limitations on what values it permits for PATH_INFO, and MAY reject the request with an error if it encounters any values considered objectionable. That MAY include any requests that would result in an encoded "/" being decoded into PATH_INFO, as this might represent a loss of information to the script.
Alas.
Posted by: miyagawa | 2009.09.28 at 14:36
Also, the exact same discussion is happening on the Rack mailing list:
http://groups.google.com/group/rack-devel/browse_thread/thread/ddf4622e69bea53f
Posted by: miyagawa | 2009.09.28 at 15:21