CommonCrawlIndex
Use this gem to access the Common Crawl URL Index from Ruby. More information on the index format and the original Python implementation is available at https://github.com/trivio/common_crawl_index
Installation
Add this line to your application's Gemfile:
gem 'common-crawl-index'
And then execute:
$ bundle
Or install it yourself as:
$ gem install common-crawl-index
Usage
When using the gem with Rails, configure it in config/initializers/common_crawl_index.rb:
CommonCrawlIndex::Client.config({
  :access_key_id => "amazon aws access_key",
  :secret_access_key => "amazon aws secret_key",
  :cc_index_path => "s3://aws-publicdatasets/common-crawl/projects/url-index/url-index.1356128792" # optional
})
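To keep credentials out of source control, the settings hash can be built from environment variables. This is a minimal sketch, not part of the gem: the helper name `common_crawl_settings` and the variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are assumptions chosen for illustration.

```ruby
# Hypothetical helper (not provided by the gem): builds the settings hash
# for CommonCrawlIndex::Client.config from environment variables.
# The variable names below are an assumption; use whatever your deployment sets.
def common_crawl_settings(env = ENV)
  {
    # fetch raises KeyError when a variable is missing, so a misconfigured
    # environment fails at boot instead of on the first index request
    :access_key_id => env.fetch("AWS_ACCESS_KEY_ID"),
    :secret_access_key => env.fetch("AWS_SECRET_ACCESS_KEY")
  }
end

# In the initializer this would be passed straight to the client:
#   CommonCrawlIndex::Client.config(common_crawl_settings)
```

The optional :cc_index_path key is left out here, so the gem's default index path applies.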
To find URLs matching a given prefix, use the following syntax:
client = CommonCrawlIndex::Client.new(AMAZON_ACCESS_KEY_ID, AMAZON_SECRET_ACCESS_KEY)
# or
client = CommonCrawlIndex::Client.new() # already configured
url = "http://www.amazon.com/"
client.find_by_prefix(url) do |url_data|
  # get all URLs starting with http://www.amazon.com/
end
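Because find_by_prefix yields matches to a block, a broad prefix can produce a very large number of results. The sketch below shows one way to gather a bounded number of them into an array. The helper `collect_matches` and its `limit` parameter are hypothetical, not part of the gem, and the sketch assumes only that find_by_prefix yields one record per match, as in the example above.

```ruby
# Hypothetical helper (not provided by the gem): collects yielded records
# from find_by_prefix into an array, stopping early at `limit`.
def collect_matches(client, prefix, limit: 100)
  results = []
  client.find_by_prefix(prefix) do |url_data|
    results << url_data
    # break inside a block ends the find_by_prefix call itself,
    # so no further records are requested once the limit is reached
    break if results.size >= limit
  end
  results
end
```

With a configured client, `collect_matches(client, "http://www.amazon.com/", limit: 10)` would return at most ten records in the order the index yields them.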
See spec/basic_spec.rb for more usage examples.
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request