SpiderMech

SpiderMech crawls a given domain and reports the pages reachable from the given URL, along with the assets each page depends on.

Installation

Add this line to your application's Gemfile:

gem 'spidermech'

And then execute:

$ bundle

Or install it yourself as:

$ gem install spidermech

Gem Usage

require 'spidermech'
spider = SpiderMech.new 'http://google.com'
spider.run # returns the sitemap data (an array of page hashes)
spider.save_json # saves the sitemap data as google.com.json

Command Line Usage

The gem also provides a command line tool. You can invoke it via

bundle exec spidermech http://google.com

It will crawl the site starting from that URL and output the resulting sitemap.

Sample Output

[{:url=>"http://localhost:8321",
  :assets=>
    {:scripts=>["https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js",
                "http://getbootstrap.com/dist/js/bootstrap.min.js"],
     :images=>[],
     :css=>["http://getbootstrap.com/dist/css/bootstrap.min.css",
            "http://getbootstrap.com/examples/starter-template/starter-template.css"]},
  :links=>["/", "/about.html", "/contact.html"]}]
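Since the sitemap is a plain array of page hashes, it can be walked with ordinary Ruby. A minimal sketch, operating directly on the sample data above (only the data structure is assumed here, not any gem API):

```ruby
# Sitemap data copied from the sample output above.
sitemap = [{
  url: "http://localhost:8321",
  assets: {
    scripts: ["https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js",
              "http://getbootstrap.com/dist/js/bootstrap.min.js"],
    images: [],
    css: ["http://getbootstrap.com/dist/css/bootstrap.min.css",
          "http://getbootstrap.com/examples/starter-template/starter-template.css"]
  },
  links: ["/", "/about.html", "/contact.html"]
}]

# Summarize each crawled page: total asset count and outgoing link count.
sitemap.each do |page|
  asset_count = page[:assets].values.map(&:size).sum
  puts "#{page[:url]}: #{asset_count} assets, #{page[:links].size} links"
end
# => http://localhost:8321: 4 assets, 3 links
```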

Contributing

  1. Fork it ( http://github.com//crawler/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request