uri_pathname

UriPathname eases the conversions between URIs and unique valid pathnames. This feature might be useful, for instance, when:

  • Web Spidering: You want to save webpages to files, or save all their contents within a directory, or only some scraped data, … and don’t know how to name them. UriPathname can assign easily unique valide names to files or directories from already known URIs, combinig scheme, hostname, path and vars.

  • Web Stubbing / Testing: You need to retrieve previously saved webpages by means of their URIs. UriPathname guesses the pathname from a given URI, then you can use, for instance, your File.read to read that file/page.

Installation

$ sudo gem install uri_pathname

The main source repository is github.com/syborg/uri_pathname.

Examples

(see examples directory)

First of all

Require at least …

require 'rubygems'
require 'uri_pathname'

Generate pathnames from URIs

Theses examples are useful when web spidering. You will need to generate a pathname given an URI.

# From URI with path and query
puts up.uri_to_pathname("http://www.fake.fak/path1/path2?query")
# => www.fake.fak__|path1_|_path2?query(http)

# From URI without path and query
puts up.uri_to_pathname("http://www.fake.fak")
# => www.fake.fak__|_NOPATH_(http)

# Fragments or other URI parts are silently ignored
puts up.uri_to_pathname('http://donaldfagen.com/disc_nightfly.php#rubybaby')
# => donaldfagen.com__|disc_nightfly.php(http)

# If a directory is given as a second arg, pathname will be prepended with that
# expanded directory
puts up.uri_to_pathname("http://www.fake.fak", "~/my_webdumps")
# => /home/marcel/my_webdumps/www.fake.fak__|_NOPATH_(http)

# Also a third argument can be given and will be appended as an extension
puts up.uri_to_pathname("http://www.fake.fak/path","", ".html")
# => www.fake.fak__|path(http).html

Recovering URIs from (correct) pathnames

When stubbing with tools like fakeweb, you can use reverse conversion to register fake accesses to real URIs.

# pathnames can be parsed if correctly generated as above examples (see docs)
p up.parse('/home/marcel/my_webdumps/www.fake.fak__|path1_|_path2?query(http).html.gz')

# URI can be also retrieved from correct (Uri)pathnames
puts up.pathname_to_uri('/home/marcel/my_webdumps/www.fake.fak__|_NOPATH_(http).html.gz')
# => http://www.fake.fak
puts up.pathname_to_uri('www.fake.fak__|path?query(http).html.gz')
# => http://www.fake.fak/path?query

Web Spidering / Dumping / Stubbing

This example shows how tu use UriPathname to assign names to files and also registering those files to stubb real accesses later.

require 'rubygems'
require 'fileutils'
require 'open-uri'
require 'uri_pathname'
require 'fakeweb' # gem install fakeweb

# put here whatever temporary directory name to use
MY_DIR=File.expand_path '~/my_dumps'
# put here whatever URIs u want to access
MY_URIS = [
  'http://en.wikipedia.org/wiki/Ruby_Bridges',
  'http://donaldfagen.com/disc_nightfly.php',
  'http://www.rubi.cat/ajrubi/portada/index.php',
  'http://www.google.com/cse?q=array&cx=013598269713424429640%3Ag5orptiw95w&ie=UTF-8&sa=Search'
]

# some convenient defs
def prepare_example
  File.makedirs(MY_DIR) unless (File.exist?(MY_DIR) and File.directory?(MY_DIR))
end

# preparation (comment this if you've already got your test dir)
prepare_example

up = UriPathname.new

# 1st round: Capture MY_URIS, and save them with appropiate UriPathname
puts "1- Capturing URIs"
data = nil
sizes = []
MY_URIS.each do |uri|
  open uri do |u|
    data=u.read
    pathname = up.uri_to_pathname(uri,MY_DIR,".html")
    File.open(pathname,'w') do |f|
      f.write data
      sizes << data.size
      puts "SAVED #{uri} : #{data.size} bytes"
    end
  end
end

# 2nd round: checking saved files and preparing fake web accesses
puts "\n2- CHECKING CAPTURED FILES AND PREPARING FAKE WEB ACCESSES"
FakeWeb.allow_net_connect=false
Dir[File.join(MY_DIR,"*")].each do |name|
  uri = up.pathname_to_uri name
  FakeWeb.register_uri :any, uri, :body=>name, :content_type=>"text/html"
  puts "#{name}\n\tcorresponds to #{uri}"
end

# 3nd round: Access Web without actually accessing web
puts "\n3- FAKE WEB ACCESSES"
MY_URIS.each_with_index do |uri,i|
  open uri do |u|
    data=u.read
    puts "FAKE #{uri} ACCESS #{(data.size == sizes[i]) ? 'OK' : 'KO'}: #{data.size} bytes"
  end
end

Release Notes

  • At present, UriPathname uses only some parts of an URI (scheme, hostname, path and query) to generate a valid and unique pathname that can be backconverted to URI. Port, User and other URI features are not yet used. I haven’t had the necessity to include them too ;-).

  • Only Linux pathnames have been taken into account. I don’t know if UriPathname will generate correct Windows or OSX pathnames, for instance. Test it and feel free to collaborate.

  • This is a very early release. I haven’t got the time to study and prepare tests, nonetheless, some examples will become tests in the future

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)

  • Send me a pull request. Bonus points for topic branches.

Copyright © 2011 Marcel Massana. See LICENSE for details.