Raising URI::InvalidURIError from a perfectly valid URI
I was recently puzzled by URI.parse raising a URI::InvalidURIError on a perfectly well-formed URI.
    URI::InvalidURIError: bad URI(is not URI?): http://practicalguile.com/articles?query=latest
        from /opt/local/lib/ruby/1.8/uri/common.rb:436:in `split'
        from /opt/local/lib/ruby/1.8/uri/common.rb:485:in `parse'
        from (irb):2
        from :0
What's not apparent from this exception message is that the URL contained a trailing space, and that this is what caused URI.parse to fail. The following specifications demonstrate how leading or trailing whitespace triggers this particular exception.
uri.spec.rb
    require 'rubygems'
    require 'spec'
    require 'uri'

    describe URI do
      it "should raise an InvalidURIError with leading whitespace in url" do
        lambda { URI.parse(' http://www.ruby-lang.org') }.should raise_error(URI::InvalidURIError)
      end

      it "should raise an InvalidURIError with trailing whitespace in url" do
        lambda { URI.parse('http://www.ruby-lang.org ') }.should raise_error(URI::InvalidURIError)
      end
    end
Running the spec will get you the result below.
    ruby uri.spec.rb
    ..
    Finished in 0.030051 seconds
    2 examples, 0 failures
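Outside of RSpec, the hidden whitespace is easiest to spot by printing the offending string with inspect, which wraps it in quotes. The following is only a minimal sketch; `explain_parse_failure` is a hypothetical helper name made up for this example, not part of the URI library.

```ruby
require 'uri'

# Hypothetical helper (an assumption for illustration): returns nil when the
# string parses, otherwise a message that shows the input via String#inspect
# so that leading or trailing whitespace is visible inside the quotes.
def explain_parse_failure(url)
  URI.parse(url)
  nil
rescue URI::InvalidURIError
  "could not parse #{url.inspect}"
end

puts explain_parse_failure('http://www.ruby-lang.org ')
p explain_parse_failure('http://www.ruby-lang.org')
```

With the quotes in place, the space before the closing quote is hard to miss.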
The stack trace shows that the exception is raised by URI.split, which URI.parse invokes with the offending URL.
RUBY_INSTALL/1.8/uri/common.rb
    def self.parse(uri)
      scheme, userinfo, host, port,
        registry, path, opaque, query, fragment = self.split(uri)

      if scheme && @@schemes.include?(scheme.upcase)
        @@schemes[scheme.upcase].new(scheme, userinfo, host, port,
                                     registry, path, opaque, query,
                                     fragment)
      else
        Generic.new(scheme, userinfo, host, port,
                    registry, path, opaque, query,
                    fragment)
      end
    end
Nothing weird is happening in URI.parse; it's a straightforward call to URI.split. So I'll go into URI.split, with comments removed for brevity.
    def self.split(uri)
      case uri
      when ''
      when ABS_URI
        scheme, opaque, userinfo, host, port,
          registry, path, query, fragment = $~[1..-1]

        if !scheme
          raise InvalidURIError,
            "bad URI(absolute but no scheme): #{uri}"
        end
        if !opaque && (!path && (!host && !registry))
          raise InvalidURIError,
            "bad URI(absolute but no path): #{uri}"
        end
      when REL_URI
        scheme = nil
        opaque = nil

        userinfo, host, port, registry,
          rel_segment, abs_path, query, fragment = $~[1..-1]
        if rel_segment && abs_path
          path = rel_segment + abs_path
        elsif rel_segment
          path = rel_segment
        elsif abs_path
          path = abs_path
        end
      else
        raise InvalidURIError, "bad URI(is not URI?): #{uri}"
      end

      path = '' if !path && !opaque # (see RFC2396 Section 5.2)
      ret = [
        scheme,
        userinfo, host, port, # X
        registry,             # X
        path,                 # Y
        opaque,               # Y
        query,
        fragment
      ]
      return ret
    end
URI.split matches the incoming URL against an empty string and against the regular expressions for absolute and relative URIs. As the specifications earlier show, URLs with leading or trailing whitespace match none of these, so the case statement falls through to the else branch and raises InvalidURIError with its rather misleading message.
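The dispatch can be seen in miniature with two deliberately simplified patterns. `TOY_ABS`, `TOY_REL` and `classify` are names invented for this sketch, and the patterns are far looser than the library's ABS_URI and REL_URI. The point is only that case tries each when clause in turn and falls through to else when the anchored patterns leave no room for whitespace.

```ruby
# Simplified stand-ins for ABS_URI and REL_URI (assumptions for illustration,
# not the library's patterns). Neither pattern admits whitespace anywhere
# between its start and end anchors.
TOY_ABS = /\A[a-zA-Z][-+.a-zA-Z\d]*:\S*\z/ # scheme, colon, non-whitespace rest
TOY_REL = /\A[^\s:]+\z/                    # no scheme, no whitespace

# Mirrors the shape of the case statement in URI.split: each when clause is
# tested with ===, and anything that matches nothing lands in else.
def classify(uri)
  case uri
  when ''      then :empty
  when TOY_ABS then :absolute
  when TOY_REL then :relative
  else              :invalid
  end
end

p classify('http://www.ruby-lang.org')
p classify('/articles?query=latest')
p classify(' http://www.ruby-lang.org')
p classify('http://www.ruby-lang.org ')
```

The last two calls differ from the first only by a single space, yet they end up in the else branch, which is where the real method raises.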
The regexes used for matching absolute and relative URIs are shown below, if you really want to know.
    require 'uri'
    include URI::REGEXP

    ABS_URI
    /^
      ([a-zA-Z][-+.a-zA-Z\d]*): (?# 1: scheme)
      (?:
      ((?:[-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*) (?# 2: opaque)
      |
      (?:(?:
      \/\/(?:
      (?:(?:((?:[-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)? (?# 3: userinfo)
      (?:((?:(?:(?:[a-zA-Z\d](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.?|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))?(?# 4: host, 5: port) |
      ((?:[-_.!~*'()a-zA-Z\d$,;+@&=+]|%[a-fA-F\d]{2})+) (?# 6: registry)
      )
      |
      (?!\/\/)) (?# XXX: '\/\/' is the mark for hostport)
      (\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)? (?# 7: path)
      )(?:\?((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 8: query)
      )
      (?:\#((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 9: fragment)
    $/xn

    REL_URI
    /^
      (?:
      (?:
      \/\/
      (?:
      (?:((?:[-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)? (?# 1: userinfo)
      ((?:(?:(?:[a-zA-Z\d](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.?|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))?(?::(\d*))? (?# 2: host, 3: port)
      |
      ((?:[-_.!~*'()a-zA-Z\d$,;+@&=+]|%[a-fA-F\d]{2})+) (?# 4: registry)
      )
      )
      |
      ((?:[-_.!~*'()a-zA-Z\d;@&=+$,]|%[a-fA-F\d]{2})+) (?# 5: rel_segment)
      )?
      (\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)? (?# 6: abs_path)
      (?:\?((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 7: query)
      (?:\#((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 8: fragment)
    $/xn
Looks rather intimidating, doesn't it? However, we're only interested in the beginning and end of the regular expressions, so it's safe to ignore all the stuff in between. Narrowing our focus to the regex anchors (^ and $), we can see that nothing between an anchor and the first or last URI component matches whitespace, which prevents an otherwise valid URI with surrounding whitespace from being matched in URI.split.
This all means that URI.split has an undocumented pre-condition: the uri parameter must be stripped of any whitespace around it.
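In the meantime, the caller can satisfy the pre-condition by stripping the string before parsing. This is a minimal sketch rather than a recommended fix; `parse_stripped` is a hypothetical helper name, not something the URI library provides.

```ruby
require 'uri'

# Hypothetical wrapper (an assumption for illustration): removes leading and
# trailing whitespace with String#strip, then hands the result to URI.parse.
def parse_stripped(url)
  URI.parse(url.strip)
end

uri = parse_stripped(' http://practicalguile.com/articles?query=latest ')
puts uri.host
puts uri.query
```

Since strip only touches the ends of the string, whitespace in the middle of a URL would still be rejected.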
Andy Croll said,
September 15th, 2007 at 2:05 pm
Submit a patch?
Doug said,
September 15th, 2007 at 11:46 pm
I've posted a thread on the ruby core mailing list, I'll see what the maintainers have to say about it first.