Tsumamigui

Tsumamigui (つまみぐい) is a simple, hassle-free web scraping library for Ruby.

Requirement

Ruby 2.1+

Installation

Add this line to your application's Gemfile:

gem 'tsumamigui'

Or install it yourself as:

$ gem install tsumamigui

Usage

Pass a URL (or an array of URLs) and a hash that maps each label to the XPath of the data you want. The scraped and parsed data is returned as an array of hashes, one per URL.

Tsumamigui.scrape('http://example.com', {h1: 'html/body/div/h1/text()'})

# Returns:
# [
#   {h1: 'Example Domain', scraped_from: 'http://example.com'}
# ]
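To illustrate how a label-to-XPath hash turns into one result row, here is a minimal sketch using REXML from Ruby's standard distribution. It is not Tsumamigui's implementation: `extract_row` is a hypothetical helper, the HTML string is made up, and REXML only accepts well-formed markup, so real pages usually need a more tolerant parser.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of Tsumamigui): evaluates each XPath
# against the document and stores the first match under its label.
def extract_row(html, xpaths)
  doc = REXML::Document.new(html)
  xpaths.each_with_object({}) do |(label, xpath), row|
    node = REXML::XPath.first(doc, xpath)
    # A missing node yields nil instead of raising.
    row[label] = node && node.to_s
  end
end

html = '<html><body><div><h1>Example Domain</h1></div></body></html>'
row = extract_row(html, { h1: 'html/body/div/h1/text()' })
puts row.inspect
```

Each key of the hash becomes a key of the resulting row, which mirrors the shape of the return value shown above (without the scraped_from entry).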

You can pass multiple URLs to scrape different pages that share the same HTML structure.

urls = ['http://example.com/page/1', 'http://example.com/page/2']
Tsumamigui.scrape(urls, {h1: 'html/body/div/h1/text()'})

# Returns:
# [
#   {h1: 'Example Domain 1', scraped_from: 'http://example.com/page/1'},
#   {h1: 'Example Domain 2', scraped_from: 'http://example.com/page/2'}
# ]
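Because every row carries its source in scraped_from, the returned array is easy to re-index by URL. The sketch below works on a literal copy of the array shown above rather than a live request; `index_by_source` is a hypothetical helper name, not something the gem provides.

```ruby
# Hypothetical helper: builds a { url => h1 } lookup from scraped rows.
def index_by_source(rows)
  rows.each_with_object({}) do |row, index|
    index[row[:scraped_from]] = row[:h1]
  end
end

# Literal data in the shape documented above (no network access needed).
rows = [
  { h1: 'Example Domain 1', scraped_from: 'http://example.com/page/1' },
  { h1: 'Example Domain 2', scraped_from: 'http://example.com/page/2' }
]

titles = index_by_source(rows)
puts titles['http://example.com/page/2'] # prints "Example Domain 2"
```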

Important: Tsumamigui automatically waits 1.0 to 3.0 seconds between requests to each URL.
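For readers who want the same politeness in their own scripts, the waiting behaviour could look roughly like the sketch below. It assumes the delay is drawn at random from the 1.0 to 3.0 second range and is skipped before the first request; both are assumptions, and `each_with_interval` is a hypothetical helper, not the gem's internal code. The sleeper argument exists only so the example can run without actually waiting.

```ruby
# Hypothetical helper: yields each URL, pausing between requests.
def each_with_interval(urls, range = 1.0..3.0, sleeper = ->(sec) { sleep(sec) })
  urls.each_with_index.map do |url, index|
    # Wait before every request except the first one.
    sleeper.call(rand(range)) unless index.zero?
    yield url
  end
end

# Record the delays instead of sleeping, so the sketch finishes instantly.
waits = []
urls = ['http://example.com/page/1', 'http://example.com/page/2']
pages = each_with_interval(urls, 1.0..3.0, ->(sec) { waits << sec }) do |url|
  "fetched #{url}"
end

puts pages.length # prints 2
puts waits.length # prints 1
```

Dropping the third argument makes the helper call sleep for real.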

TODO

  • [ ] Custom request headers.

etc...

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/obiyuta/tsumamigui. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

Guideline

  1. Fork it ( http://github.com/obiyuta/tsumamigui )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Write code and specs.
    • Run the test suite with bundle exec rspec and confirm that it passes
    • Run the linter with bundle exec rubocop and confirm that it passes
  4. Commit your changes (git commit -am 'Add some feature')
  5. Push to the branch (git push origin my-new-feature)
  6. Create a new Pull Request

License

The gem is available as open source under the terms of the MIT License.

Copyright (c) 2017 Obi Yuta. See MIT-LICENSE for details.