Tsumamigui
Tsumamigui (つまみぐい) is a simple and hassle-free Ruby web scraping library.
Requirement
Ruby 2.1+
Installation
Add this line to your application's Gemfile:
gem 'tsumamigui'
Or install it yourself as:
$ gem install tsumamigui
Usage
Pass a URL (or an array of URLs) and a hash that maps each label to the XPath of the data you want. The scraped and parsed data is returned as an array of hashes, one per URL, each including the source URL under scraped_from.
Tsumamigui.scrape('http://example.com', {h1: 'html/body/div/h1/text()'})
# Returns:
# [
# {h1: 'Example Domain', scraped_from: 'http://example.com'}
# ]
You can pass multiple URLs to scrape different pages that share the same HTML structure.
urls = ['http://example.com/page/1', 'http://example.com/page/2']
Tsumamigui.scrape(urls, {h1: 'html/body/div/h1/text()'})
# Returns:
# [
# {h1: 'Example Domain 1', scraped_from: 'http://example.com/page/1'},
# {h1: 'Example Domain 2', scraped_from: 'http://example.com/page/2'}
# ]
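Since the result is a plain array of hashes, it can be post-processed with ordinary Ruby. The following is a minimal sketch, assuming the return shape shown above; the sample data is hard-coded rather than fetched, and index_by_source is a hypothetical helper, not part of Tsumamigui.

```ruby
# Sample data in the shape documented above; nothing is fetched from the network.
results = [
  { h1: 'Example Domain 1', scraped_from: 'http://example.com/page/1' },
  { h1: 'Example Domain 2', scraped_from: 'http://example.com/page/2' }
]

# Hypothetical helper (not part of Tsumamigui):
# builds a lookup table from source URL to the value stored under `label`.
def index_by_source(rows, label)
  rows.each_with_object({}) do |row, index|
    index[row[:scraped_from]] = row[label]
  end
end

headings = index_by_source(results, :h1)
puts headings['http://example.com/page/2'] # prints "Example Domain 2"
```

Keying by scraped_from makes it easy to look up what came from which page when many URLs are scraped in one call.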
Important: Tsumamigui automatically requests each URL at an interval of 1.0 to 3.0 seconds.
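To illustrate what a randomized interval between requests can look like, here is a minimal sketch. It is not Tsumamigui's actual implementation: random_interval and each_with_interval are hypothetical names, and skipping the pause before the first URL is an assumption made for the example.

```ruby
# Hypothetical helper: picks a random delay in seconds between min and max.
def random_interval(min = 1.0, max = 3.0)
  rand(min..max)
end

# Hypothetical helper: yields each URL, pausing before every request
# except the first (an assumption; the library may behave differently).
# The sleeper is injectable so the loop can be exercised without waiting.
def each_with_interval(urls, sleeper = ->(seconds) { sleep(seconds) })
  urls.each_with_index do |url, position|
    sleeper.call(random_interval) if position > 0
    yield url
  end
end

# Record the delays instead of sleeping, to show what would happen.
waits = []
visited = []
urls = ['http://example.com/page/1', 'http://example.com/page/2']
each_with_interval(urls, ->(seconds) { waits << seconds }) do |url|
  visited << url
end
```

With the default sleeper, scraping N URLs this way takes roughly between N-1 and 3(N-1) seconds of waiting, which is worth keeping in mind for large URL lists.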
TODO
- [ ] Custom request headers.
etc...
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/obiyuta/tsumamigui. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
Guideline
- Fork it ( http://github.com/obiyuta/tsumamigui )
- Create your feature branch (git checkout -b my-new-feature)
- Write code and specs.
- Run the test suite with
bundle exec rspec
and confirm that it passes
- Run the lint checker with
bundle exec rubocop
and confirm that it passes
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
License
The gem is available as open source under the terms of the MIT License.
Copyright (c) 2017 Obi Yuta. See MIT-LICENSE for details.