Sunday, March 15, 2009

Handling HTTP Redirection in Ruby

I have a Ruby project where I'm dumping a bunch of bookmarks from delicious.com, then fetching each bookmarked page for analysis.

One of the problems I encountered early on is that some of the bookmarked web pages redirect to another location. Simply checking for HTTP response code 200 was insufficient; I needed to check for redirection as well.

A quick Google search for "ruby follow http redirect" yields lots of results. Unfortunately, they're all very similar, and not quite right. In general, the examples you come across (even the one in the official Ruby documentation) don't handle the case where the redirected location is a path relative to the original location. So you end up doing a get on a URL that looks like "../../redirected/location/index.html", which has no host to connect to and clearly won't work.
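To make the problem concrete, here is a minimal sketch of what happens when such a Location value is parsed on its own. The header value is just the example path from above, not output from a real server.

```ruby
require 'uri'

# Hypothetical Location header value (the example path from the text).
location = "../../redirected/location/index.html"
redirect = URI.parse(location)

# A relative reference has no scheme and no host, so there is
# nothing for Net::HTTP.new(host, port) to connect to.
is_relative = redirect.relative?
host        = redirect.host

puts "relative? #{is_relative.inspect}, host: #{host.inspect}"
```

Passing that object's host and port straight to Net::HTTP is where the naive examples fall over.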

It turns out that detecting relative redirection is fairly straightforward:

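require 'net/https'  # loads net/http plus the OpenSSL support
require 'uri'

# url is a URI object for the bookmarked page (e.g. from URI.parse).
# @@MAX_ATTEMPTS and @@AGENT are class variables defined elsewhere.
found = false
attempts = 0
until found || attempts >= @@MAX_ATTEMPTS
  attempts += 1
  http = Net::HTTP.new(url.host, url.port)
  http.open_timeout = 10
  http.read_timeout = 10
  if url.instance_of? URI::HTTPS
    http.use_ssl = true
    # Skips certificate checks; fine for scraping, not for anything sensitive.
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  end

  # request_uri is the path plus any query string (url.path alone drops the query).
  path = url.request_uri
  path = "/" if path == ""

  req = Net::HTTP::Get.new(path, {'User-Agent' => @@AGENT})
  resp = http.request(req)
  break if resp.code == "200"

  if resp['location']
    newurl = URI.parse(resp['location'])
    if newurl.relative?
      puts "url was relative"
      # Resolve the relative location against the URL we just requested.
      newurl = url + resp['location']
    end
    url = newurl
  else
    found = true # resp was 404, etc.
  end # end if location
end # until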


The trick here is to ask the redirected URL object whether it is relative. If it is, resolve the redirected path against the old URL object. The URI class overrides the '+' operator (what is this, C++?) so that you can merge the new path into the old URL, rather than just gluing strings together, by doing:
newurl=url+resp.header['location']
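
As a rough illustration of what '+' does with different kinds of Location values, here is a small sketch. The base URL and the redirect targets are made up for the example; only the URI calls themselves come from the standard library.

```ruby
require 'uri'

# Hypothetical URL of the page that answered with a redirect.
base = URI.parse("http://example.com/a/b/c/page.html")

# Hypothetical Location values a server might send back.
up_two   = base + "../../redirected/location/index.html"
sibling  = base + "other.html"
rooted   = base + "/top/index.html"
absolute = base + "http://example.org/elsewhere"

puts up_two    # http://example.com/a/redirected/location/index.html
puts sibling   # http://example.com/a/b/c/other.html
puts rooted    # http://example.com/top/index.html
puts absolute  # http://example.org/elsewhere
```

Since an absolute location simply replaces the base, you could arguably drop the relative? check and always use url + location, though keeping the check makes the intent easier to see.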