Raising URI::InvalidURIError from a perfectly valid URI

I was recently puzzled by URI.parse raising a URI::InvalidURIError on a perfectly well-formed URI.

RUBY:
  URI::InvalidURIError: bad URI(is not URI?): http://practicalguile.com/articles?query=latest
  from /opt/local/lib/ruby/1.8/uri/common.rb:436:in `split'
  from /opt/local/lib/ruby/1.8/uri/common.rb:485:in `parse'
  from (irb):2
  from :0

What's not apparent from this exception message is that the URL contained a trailing space, and that was what caused URI.parse to fail. The following specifications demonstrate how leading or trailing whitespace triggers this particular exception.

uri.spec.rb

RUBY:
  require 'rubygems'
  require 'spec'
  require 'uri'

  describe URI do
    it "should raise an InvalidURIError with leading whitespace in url" do
      lambda { URI.parse(' http://www.ruby-lang.org') }.should raise_error(URI::InvalidURIError)
    end

    it "should raise an InvalidURIError with trailing whitespace in url" do
      lambda { URI.parse('http://www.ruby-lang.org ') }.should raise_error(URI::InvalidURIError)
    end
  end

Running the spec will get you the result below.

ruby uri.spec.rb

..

Finished in 0.030051 seconds

2 examples, 0 failures
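
Since stray whitespace is invisible in the exception message, one option is to wrap the call and re-raise with the input passed through inspect, so the quotes show exactly where the string ends. This is only a sketch: parse_with_visible_input is a hypothetical helper name, not part of the URI library.

```ruby
require 'uri'

# Hypothetical helper (not part of the URI library): parses the input and,
# on failure, re-raises with the input inspected so stray whitespace shows up
# between the quotes.
def parse_with_visible_input(raw)
  URI.parse(raw)
rescue URI::InvalidURIError => e
  raise URI::InvalidURIError, "#{e.message} (input was #{raw.inspect})"
end

begin
  parse_with_visible_input('http://practicalguile.com/articles?query=latest ')
rescue URI::InvalidURIError => e
  # The message now ends with the quoted input, trailing space included.
  puts e.message
end
```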

The stack trace shows that the exception is raised by URI.split, which URI.parse calls with the offending URL.

RUBY_INSTALL/1.8/uri/common.rb

RUBY:
  def self.parse(uri)
    scheme, userinfo, host, port,
      registry, path, opaque, query, fragment = self.split(uri)

    if scheme && @@schemes.include?(scheme.upcase)
      @@schemes[scheme.upcase].new(scheme, userinfo, host, port,
                                   registry, path, opaque, query,
                                   fragment)
    else
      Generic.new(scheme, userinfo, host, port,
                  registry, path, opaque, query,
                  fragment)
    end
  end

Nothing weird is happening in URI.parse; it's a straightforward call to URI.split. So I'll go into URI.split, with comments removed for brevity.

RUBY:
  def self.split(uri)
    case uri
    when ''
    when ABS_URI
      scheme, opaque, userinfo, host, port,
        registry, path, query, fragment = $~[1..-1]

      if !scheme
        raise InvalidURIError,
          "bad URI(absolute but no scheme): #{uri}"
      end
      if !opaque && (!path && (!host && !registry))
        raise InvalidURIError,
          "bad URI(absolute but no path): #{uri}"
      end
    when REL_URI
      scheme = nil
      opaque = nil

      userinfo, host, port, registry,
        rel_segment, abs_path, query, fragment = $~[1..-1]
      if rel_segment && abs_path
        path = rel_segment + abs_path
      elsif rel_segment
        path = rel_segment
      elsif abs_path
        path = abs_path
      end
    else
      raise InvalidURIError, "bad URI(is not URI?): #{uri}"
    end

    path = '' if !path && !opaque # (see RFC2396 Section 5.2)
    ret = [
      scheme,
      userinfo, host, port,         # X
      registry,                     # X
      path,                         # Y
      opaque,                       # Y
      query,
      fragment
    ]
    return ret
  end

URI.split matches the incoming URL against an empty string and against the regular expressions for absolute and relative URIs. As the specifications earlier show, URLs with leading or trailing whitespace match none of these, so the case statement falls through to the else branch and raises InvalidURIError with a rather misleading message.
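
The dispatch can be tried in isolation with a much smaller pattern. In this sketch, SIMPLE_ABS_URI and classify are hypothetical names, and the pattern is a heavily simplified stand-in for ABS_URI that only understands scheme://host. It is meant to show how the case statement falls through, not to reproduce the library's matching rules.

```ruby
# SIMPLE_ABS_URI is a hypothetical, heavily simplified stand-in for ABS_URI.
# Like the original it is anchored with ^ and $ and allows no whitespace.
SIMPLE_ABS_URI = /^[a-zA-Z][-+.a-zA-Z\d]*:\/\/[-.a-zA-Z\d]+$/

# case/when compares each candidate using ===, so a Regexp in a when clause
# is tested with a pattern match and a String with equality.
def classify(uri)
  case uri
  when ''             then :empty
  when SIMPLE_ABS_URI then :absolute
  else                     :not_a_uri
  end
end

p classify('http://www.ruby-lang.org')   # => :absolute
p classify('http://www.ruby-lang.org ')  # => :not_a_uri
p classify(' http://www.ruby-lang.org')  # => :not_a_uri
```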

The regexes used for matching absolute and relative URIs are shown below, if you really want to know.

RUBY:
  require 'uri'
  include URI::REGEXP

  ABS_URI
  /^
  ([a-zA-Z][-+.a-zA-Z\d]*):                     (?# 1: scheme)
  (?:
  ((?:[-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*)              (?# 2: opaque)
  |
  (?:(?:
  \/\/(?:
  (?:(?:((?:[-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)?  (?# 3: userinfo)
  (?:((?:(?:(?:[a-zA-Z\d](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.?|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))?(?# 4: host, 5: port)               |
  ((?:[-_.!~*'()a-zA-Z\d$,;+@&=+]|%[a-fA-F\d]{2})+)           (?# 6: registry)
  )
  |
  (?!\/\/))                              (?# XXX: '\/\/' is the mark for hostport)
  (\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?              (?# 7: path)
  )(?:\?((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?           (?# 8: query)
  )
  (?:\#((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?            (?# 9: fragment)
  $/xn

  REL_URI
  /^
  (?:
  (?:
  \/\/
  (?:
  (?:((?:[-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)?       (?# 1: userinfo)
  ((?:(?:(?:[a-zA-Z\d](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?)\.?|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))?(?::(\d*))?  (?# 2: host, 3: port)
  |
  ((?:[-_.!~*'()a-zA-Z\d$,;+@&=+]|%[a-fA-F\d]{2})+)             (?# 4: registry)
  )
  )
  |
  ((?:[-_.!~*'()a-zA-Z\d;@&=+$,]|%[a-fA-F\d]{2})+)              (?# 5: rel_segment)
  )?
  (\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)?                  (?# 6: abs_path)
  (?:\?((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?              (?# 7: query)
  (?:\#((?:[-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))?           (?# 8: fragment)
  $/xn

Looks rather intimidating, doesn't it? However, we're only interested in the beginning and end of the regular expressions, so it's safe to ignore everything in between. Both patterns are anchored with ^ and $, and nothing next to either anchor allows whitespace, so a URI that is otherwise valid fails to match in URI.split.

This all means that URI.split has an undocumented precondition: the uri parameter must be stripped of any surrounding whitespace.
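
A minimal way to satisfy that precondition is to strip the input before parsing it. The helper name parse_stripped below is hypothetical, and the sketch only handles whitespace around the URL; whitespace inside it would still be rejected.

```ruby
require 'uri'

# Hypothetical helper: satisfies the precondition by stripping surrounding
# whitespace before handing the string to URI.parse.
def parse_stripped(raw)
  URI.parse(raw.to_s.strip)
end

uri = parse_stripped("  http://www.ruby-lang.org \n")
puts uri.host  # prints www.ruby-lang.org
```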


Listening to your tests

One of the challenges I've been trying to overcome in practicing Test First Development (TFD) has been making sense of the feedback that comes from it. It was not obvious to me until recently, after I read an excellent article (IEEE Xplore account required) by Bas Vodde and Lasse Koskela in IEEE Software. Bas and Lasse recount their experiences conducting TFD workshops at Nokia, and in particular the insights gleaned from a TFD coding exercise.

One key point made by the authors was that although the participants in the coding exercise followed the test-code-refactor cycle, their code became progressively more complex and littered with nested branching constructs, which made the software's behaviour difficult to keep track of. Bas and Lasse observed that once the initial design approach was chosen, none of the participants considered whether the design was still suitable for the current requirements.

Essentially, the test-code-refactor cycle was taking longer to complete and the code was turning into an unmaintainable mess. This feedback was lost on the participants: some decided to hide the code's complexity behind refactorings that made the code read better, while others simply added more tests and attempted to make them pass.

It should be obvious that emergent design will only occur when there is constant reflective thinking about the state of the code. This takes a bit of skill and confidence on the part of the developer. Simply going through the motions of test-code-refactor to the simplest design without this reflective thinking will lower the effectiveness of TFD as a design technique.


Programming Erlang, almost

So I've bitten the bullet and purchased the PDF of Programming Erlang from the Pragmatic Programmers. This despite the fact that I've got 6 other technical books to finish. This would probably be a good time to get used to skip reading books the first time through.


Using Factories for Rails Fixtures and Test Doubles

Chris Wanstrath has written about making Rails fixtures less painful with the FixtureScenarios plugin. Personally, I prefer the Factory approach, nicely explained by Daniel Manges.

I've been using factory methods to create in-database ActiveRecord objects for a project that I've been working on at Bezurk. Reading Daniel's article gave me a few ideas for improving the way I create fixtures and mocks. Since I've been using RSpec extensively in this project, I'll present the examples in RSpec.

As the models evolve with the design and their behaviour changes accordingly, I need to go through all the specifications that create a given model and make sure it's created in a valid state. This is more pronounced with test doubles, which also need their method stubs changed to reflect the latest state of the model they represent. I make heavy use of test doubles for test isolation, so managing all these objects became an exercise in patience. As it was getting painful, it was time to change the way I create these models and test doubles.

As always, a layer of indirection goes some way towards solving a software problem. We introduce a Factory that encapsulates the creation of ActiveRecord objects by providing creation methods.

RUBY:
  module FixtureFactory
    def create_user(attributes = {})
      User.create!(ModelAttributes.user(attributes))
    end
  end

We'll have a Factory for test doubles too.

RUBY:
  module MockFactory
    def mock_user(method_stubs = {})
      mock_model(User, ModelAttributes.user(method_stubs))
    end
  end

The attributes for this model are declared in a module that's used by both Factories.

RUBY:
  module ModelAttributes
    def self.user(attributes)
      attributes.reverse_merge({:name => 'doug'})
    end
  end

The Factory modules are then included in Spec::Runner.

RUBY:
  Spec::Runner.configure do |config|
    include FixtureFactory
    include MockFactory
  end

The objects can now be created using the factory methods available to all specifications.

RUBY:
  doug = create_user
  doppelganger = mock_user
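
To see how the defaults and overrides interact without a Rails app, here is a self-contained sketch. It makes several assumptions that are not part of the project: a plain Struct stands in for the ActiveRecord User model, the :email attribute is invented for illustration, and Hash#merge on the defaults replaces ActiveSupport's reverse_merge (caller-supplied values still win).

```ruby
# Stand-in for the ActiveRecord model (assumption: the real User is a Rails model).
User = Struct.new(:name, :email)

module ModelAttributes
  # Plain-Ruby equivalent of attributes.reverse_merge(defaults):
  # caller-supplied values win, defaults fill the gaps.
  # The :email default is a hypothetical attribute added for illustration.
  def self.user(attributes = {})
    defaults = { :name => 'doug', :email => 'doug@example.com' }
    defaults.merge(attributes)
  end
end

module FixtureFactory
  # Builds an in-memory User here; the real factory calls User.create!.
  def create_user(attributes = {})
    attrs = ModelAttributes.user(attributes)
    User.new(attrs[:name], attrs[:email])
  end
end

# Mix the factory into a throwaway object instead of the spec runner.
factory = Object.new.extend(FixtureFactory)

doug = factory.create_user
bob  = factory.create_user(:name => 'bob')

puts doug.name  # prints doug
puts bob.name   # prints bob
puts bob.email  # prints doug@example.com
```

Keeping the defaults in one place means that when a model gains a required attribute, only ModelAttributes needs to change.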

Update
Added links to Chris Wanstrath and Daniel Manges' articles on managing Rails fixtures.
