Raising URI::InvalidURIError from a perfectly valid URI

I was recently puzzled by URI.parse raising a URI::InvalidURIError on a perfectly well-formed URI.

RUBY:
  URI::InvalidURIError: bad URI(is not URI?): http://practicalguile.com/articles?query=latest
    from /opt/local/lib/ruby/1.8/uri/common.rb:436:in `split'
    from /opt/local/lib/ruby/1.8/uri/common.rb:485:in `parse'
    from (irb):2
    from :0

What's not apparent from this exception message is that the URL contained a trailing space, and this was causing URI.parse to fail. The following specifications demonstrate how leading or trailing whitespace triggers this particular exception.

uri.spec.rb

RUBY:
  require 'rubygems'
  require 'spec'
  require 'uri'

  describe URI do
    it "should raise an InvalidURIError with leading whitespace in url" do
      lambda { URI.parse(' http://www.ruby-lang.org') }.should raise_error(URI::InvalidURIError)
    end

    it "should raise an InvalidURIError with trailing whitespace in url" do
      lambda { URI.parse('http://www.ruby-lang.org ') }.should raise_error(URI::InvalidURIError)
    end
  end

Running the spec will get you the result below.

ruby uri.spec.rb

..Finished in 0.030051 seconds

2 examples, 0 failures
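Since the exception message gives no visual hint of the stray whitespace, one way to surface it while debugging is to print the string with inspect, which wraps it in quotes. This is only a small sketch; the variable names are mine.

```ruby
# The URL from the exception above, with its trailing space restored.
url = 'http://practicalguile.com/articles?query=latest '

plain    = url          # printed as-is, the trailing space is invisible
revealed = url.inspect  # quoted form, the trailing space shows before the closing quote

puts plain
puts revealed
```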

Looking at the stacktrace in the exception, it's being raised by URI.split after URI.parse is invoked with the offending URL.

RUBY_INSTALL/1.8/uri/common.rb

RUBY:
  def self.parse(uri)
    scheme, userinfo, host, port,
      registry, path, opaque, query, fragment = self.split(uri)

    if scheme && @@schemes.include?(scheme.upcase)
      @@schemes[scheme.upcase].new(scheme, userinfo, host, port,
                                   registry, path, opaque, query,
                                   fragment)
    else
      Generic.new(scheme, userinfo, host, port,
                  registry, path, opaque, query,
                  fragment)
    end
  end

Nothing weird is happening in URI.parse; it's a straightforward call to URI.split. So I'll go into URI.split, comments removed for brevity.

RUBY:
  def self.split(uri)
    case uri
    when ''
    when ABS_URI
      scheme, opaque, userinfo, host, port,
        registry, path, query, fragment = $~[1..-1]

      if !scheme
        raise InvalidURIError,
          "bad URI(absolute but no scheme): #{uri}"
      end
      if !opaque && (!path && (!host && !registry))
        raise InvalidURIError,
          "bad URI(absolute but no path): #{uri}"
      end
    when REL_URI
      scheme = nil
      opaque = nil

      userinfo, host, port, registry,
        rel_segment, abs_path, query, fragment = $~[1..-1]
      if rel_segment && abs_path
        path = rel_segment + abs_path
      elsif rel_segment
        path = rel_segment
      elsif abs_path
        path = abs_path
      end
    else
      raise InvalidURIError, "bad URI(is not URI?): #{uri}"
    end

    path = '' if !path && !opaque # (see RFC2396 Section 5.2)
    ret = [
      scheme,
      userinfo, host, port,         # X
      registry,                     # X
      path,                         # Y
      opaque,                       # Y
      query,
      fragment
    ]
    return ret
  end

URI.split matches the incoming URL against an empty string and against the regular expressions for absolute and relative URIs. As the specifications earlier show, URLs with leading or trailing whitespace match none of these, so the case statement falls through to its else branch and raises InvalidURIError with the rather misleading message.
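To make the fall-through concrete, here is a minimal sketch of the same case/when dispatch. SIMPLE_ABS, SIMPLE_REL and classify are hypothetical names, and the two patterns are heavily simplified stand-ins for the real ABS_URI and REL_URI, so treat this as an illustration rather than what the library does.

```ruby
# Hypothetical, heavily simplified stand-ins for ABS_URI and REL_URI.
# Like the real ones, they are anchored with ^ and $.
SIMPLE_ABS = /^[a-zA-Z][-+.a-zA-Z\d]*:\/\/[-.a-zA-Z\d]+(?:\/[-.\/a-zA-Z\d]*)?$/
SIMPLE_REL = /^\/[-.\/a-zA-Z\d]*$/

# Mirrors the shape of the case statement in URI.split: each `when`
# is tried in order (using ===), and anything unmatched hits `else`.
def classify(uri)
  case uri
  when ''         then :empty
  when SIMPLE_ABS then :absolute
  when SIMPLE_REL then :relative
  else :no_match  # this is where URI.split raises InvalidURIError
  end
end

classify('http://www.ruby-lang.org')   # => :absolute
classify('/articles')                  # => :relative
classify(' http://www.ruby-lang.org')  # => :no_match
classify('http://www.ruby-lang.org ')  # => :no_match
```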

The regexes used for matching absolute and relative URIs are shown below, if you really want to know.

RUBY:
  require 'uri'
  include URI::REGEXP

  ABS_URI
  /^
  ([a-zA-Z][-+.a-zA-Z\d]*):                     (?# 1: scheme)
  (?:
  ((?:[-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)              (?# 2: opaque)
  |
  (?:(?:
  \/\/(?:
  (?:(?:((?:[-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)?  (?# 3: userinfo)
  (?:((?:(?:(?:[a-zA-Z\d](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.?|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))?(?# 4: host, 5: port)               |
  ((?:[-_.!~*'()a-zA-Z\d$,;+@&=+]|%[a-fA-F\d]{2})+)           (?# 6: registry)
  )
  |
  (?!\/\/))                              (?# XXX: '\/\/' is the mark for hostport)
  (\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?              (?# 7: path)
  )(?:\?((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?           (?# 8: query)
  )
  (?:\#((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?            (?# 9: fragment)
  $/xn

  REL_URI
  /^
  (?:
  (?:
  \/\/
  (?:
  (?:((?:[-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)?       (?# 1: userinfo)
  ((?:(?:(?:[a-zA-Z\d](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.?|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))?(?::(\d*))?  (?# 2: host, 3: port)
  |
  ((?:[-_.!~*'()a-zA-Z\d$,;+@&=+]|%[a-fA-F\d]{2})+)             (?# 4: registry)
  )
  )
  |
  ((?:[-_.!~*'()a-zA-Z\d;@&=+$,]|%[a-fA-F\d]{2})+)              (?# 5: rel_segment)
  )?
  (\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?                  (?# 6: abs_path)
  (?:\?((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?              (?# 7: query)
  (?:\#((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?           (?# 8: fragment)
  $/xn

Looks rather intimidating, doesn't it? However, we're mostly interested in the beginning and end of the regular expressions, so it's safe to ignore all the stuff in between. Narrowing our focus to the regex anchors (^ and $), we can see that nothing next to them allows for surrounding whitespace, which prevents an otherwise valid URI from being matched in URI.split.

This all means that URI.split has an undocumented precondition: the uri parameter must be stripped of any surrounding whitespace.
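Given that precondition, the obvious workaround is to strip the string yourself before handing it over. A minimal sketch, where parse_stripped is a hypothetical helper of my own and not part of the URI library:

```ruby
require 'uri'

# Hypothetical helper: removes leading and trailing whitespace
# before delegating to URI.parse.
def parse_stripped(url)
  URI.parse(url.strip)
end

uri = parse_stripped(' http://www.ruby-lang.org ')
uri.host  # => "www.ruby-lang.org"
```

Stripping at the point where the URL enters your code (form input, a config file, a scraped page) keeps the rest of the code free of this concern.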

2 Comments »

  1. Andy Croll said,

    September 15th, 2007 at 2:05 pm

    Submit a patch?

  2. Doug said,

    September 15th, 2007 at 11:46 pm

    I've posted a thread on the ruby core mailing list, I'll see what the maintainers have to say about it first.
