Browser Crawler

Browser Crawler visits the pages available on a site and extracts useful information from them.

It can help with maintaining lists of internal and external links, creating sitemaps, visual testing using screenshots,
or preparing a list of URLs for a more sophisticated tool such as Wraith.

Browser-based crawling is performed with the help of Capybara and Chrome. JavaScript is executed before a page is analyzed, which allows dynamic content to be crawled. Browser-based crawling is essentially an alternative to Wraith's spider mode, which parses only server-side rendered HTML.

By default the crawler visits pages by following the links it extracts. No buttons are clicked other than during the optional authentication step, so the crawler does not make any changes to the site and can be treated as noninvasive.

Installation

Add this line to your application's Gemfile:

gem 'crawler', github: 'DimaSamodurov/crawler'

And then execute:

$ bundle

Or install it yourself as:

$ gem install browser_crawler

Usage from command line

Without the authentication required:

crawl http://localhost:3000

With authentication, screenshots, and the number of visited pages limited to 1:

crawl https://your.site.com/welcome -u username -p password -n 1 -s tmp/screenshots
# or
export username=dima
export password=secret
#... 
crawl https://your.site.com/welcome -n 1 -s tmp/screenshots

Generate an index from the captured screenshots. The index is saved to tmp/screenshots/index.html.

bin/crawl -s tmp/screenshots

See additional options with:

bin/crawl -h

When finished, the crawling report is saved to tmp/crawl_report.yml by default. You can specify a different file path using command line options.
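
If you want to post-process the report programmatically, it is a plain YAML file that can be loaded with Ruby's standard library. The sketch below is only a minimal example: it loads the file and prints its top-level structure, since the exact keys in the report are not documented here and may differ between browser_crawler versions.

# A minimal sketch, assuming the default report path produced by the CLI.
require 'yaml'

report = YAML.load_file('tmp/crawl_report.yml')
puts report.class
puts report.keys if report.respond_to?(:keys)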

Usage with scripting

The example script below configures the crawler, targets the github.com site, and then saves the resulting report as a YAML file.

crawler = BrowserCrawler::Engine.new({
    browser_options: {
        headless: true,
        window_size: [1200, 1600],
        timeout: 60,
        browser_options: { 'no-sandbox': nil }
    },
    max_pages: 10,
    deep_visit: true 
})

crawler.extract_links(url: 'https://github.com')
crawler.report_save

This gem uses Cuprite as an external dependency. Cuprite drives Chrome directly, without intermediaries such as chromedriver. The browser_options hash configures the headless Chrome browser through Cuprite.

  • max_pages - limits the number of pages to crawl. It defaults to nil, in which case the crawler browses all pages within the domain.
  • deep_visit - a mode in which the crawler also checks external resources, without collecting links from them.

Callback methods

All of these callbacks can use the Capybara DSL.

Callback methods: Before/After crawling

crawler = BrowserCrawler::Engine.new()

# scroll down page before scan.
crawler.before do
  page.execute_script 'window.scrollBy(0,10000)' 
end

# runs once after the whole crawl has finished.
crawler.after do
  page.body
end

crawler.extract_links(url: 'https://github.com')

Callback methods: Before/After for each crawled page

crawler = BrowserCrawler::Engine.new()

# scroll down page before scan.
crawler.before type: :each do
  page.execute_script 'window.scrollBy(0,10000)'
end

# runs after each scanned page.
crawler.after type: :each do
  page.body
end

crawler.extract_links(url: 'https://github.com')

Default behavior: by default the crawler puts all links found on a page into an unvisited_links array and then browses each of them. This callback allows you to change that behavior.

crawler = BrowserCrawler::Engine.new()

# scan_result is an array of links from the scanned page.
crawler.unvisited_links do
  @page_inspector.scan_result   
end

crawler.extract_links(url: 'https://github.com')

Changed behavior: change the default behavior so that the crawler browses only links containing /best-links.

crawler = BrowserCrawler::Engine.new()

crawler.unvisited_links do
  @page_inspector.scan_result.select { |link| link.include?('/best-links') }   
end

crawler.extract_links(url: 'https://github.com')

Callback method: change page scan rules

Default behavior: by default the crawler collects all links from a page and moves from one page to another.

crawler = BrowserCrawler::Engine.new()

crawler.change_page_scan_rules do
  page.all('a').map { |a| a['href'] }   
end

crawler.extract_links(url: 'https://github.com')

Changed behavior: change the default behavior so that the crawler collects only links matching the a.paginations selector (and only on pages under /help/).

crawler = BrowserCrawler::Engine.new()

crawler.change_page_scan_rules do
  if URI.parse(page.current_url).to_s.include?('/help/')
    page.all('a.paginations').map { |a| a['href'] }
  else
    []  
  end 
end

crawler.extract_links(url: 'https://github.com')

Set up a folder to save the report file

crawler = BrowserCrawler::Engine.new()
crawler.extract_links(url: 'https://github.com')

crawler.report_save(folder_path: './reports/')

If the folder doesn't exist, BrowserCrawler creates it before saving the report.

Save the report to a YAML file

crawler = BrowserCrawler::Engine.new()
crawler.extract_links(url: 'https://github.com')

crawler.report_save(type: :yaml)

Save the report to a CSV file

crawler = BrowserCrawler::Engine.new()
crawler.extract_links(url: 'https://github.com')

crawler.report_save(type: :csv)
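
The CSV report can be read back with Ruby's standard CSV library. This is only a sketch: the file name below is hypothetical, and the actual name and columns depend on report_save, so adjust the path to match the file produced in your project.

# A minimal sketch for inspecting a CSV report; the path is a placeholder.
require 'csv'

CSV.foreach('./reports/crawl_report.csv', headers: true) do |row|
  puts row.to_h
end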

Usage with Wraith

Browser Crawler can be useful for updating the paths: section of Wraith's configs.

Provided the Wraith config is placed in the wraith/configs/capture.yaml file, do:

crawl https://your.site.com/welcome -c wraith/configs/capture.yaml 

Or, if you already have a crawling report available, use it without the URL to skip crawling:

bin/crawl -c tmp/wraith_config.yml -r tmp/crawl_report.yml

Restrictions

The current version has the authentication process hardcoded: the path to the login form and the field names used are specific to the project the crawler was extracted from. Configuration may be added in a future version.
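
Until authentication becomes configurable, one possible workaround is to perform the login yourself in a before callback using the Capybara DSL described above. This is only a sketch: the login path, field labels, and button caption are hypothetical and must be adapted to your application.

crawler = BrowserCrawler::Engine.new()

# Hypothetical login flow, executed once before crawling starts.
crawler.before do
  page.visit 'https://your.site.com/login'        # placeholder login URL
  page.fill_in 'Email', with: ENV['username']     # placeholder field label
  page.fill_in 'Password', with: ENV['password']  # placeholder field label
  page.click_button 'Sign in'                     # placeholder button caption
end

crawler.extract_links(url: 'https://your.site.com/welcome')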

Ideas for enhancements

It should be easy to crawl the site as part of automated testing, e.g. to verify the list of pages available on the site or to generate a visual report (Wraith does the latter better).

Integration with test frameworks

By integrating browser_crawler into the application test suite, it would be possible to access pages and content not easily reachable on the real site, e.g. when performing data modifications.

Integration with the test suite would also make it possible to use all the tools/mocks/helpers created to simulate user behavior, e.g. mocking external requests with VCR.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/dimasamodurov/browser_crawler.

License

MIT