# Browser Crawler
Browser Crawler visits the pages available on a site and extracts useful information.
It can help with maintaining e.g. lists of internal and external links,
creating sitemaps, doing visual testing using screenshots,
or preparing a list of URLs for a more sophisticated tool such as Wraith.
Browser-based crawling is performed with the help of Capybara and Chrome. JavaScript is executed before the page is analyzed, which allows crawling dynamic content. Browser-based crawling is essentially an alternative to Wraith's spider mode, which parses only server-side rendered HTML.
By default the crawler visits pages by following the links it extracts. No buttons are clicked other than during the optional authentication step, so the crawler does not perform any updates to the site and can be treated as non-invasive.
## Table of contents
- Installation
- Usage from command line
- Usage with scripting
- Restrictions
- Ideas for enhancements
- Development
- Contributing
- License
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'crawler', github: 'DimaSamodurov/crawler'
```
And then execute:
```
$ bundle
```
Or install it yourself as:
```
$ gem install browser_crawler
```
## Usage from command line
Without authentication:
```
crawl http://localhost:3000
```
With authentication, screenshots, and the number of visited pages limited to 1:
```
crawl https://your.site.com/welcome -u username -p password -n 1 -s tmp/screenshots
# or
export username=dima
export password=secret
#...
crawl https://your.site.com/welcome -n 1 -s tmp/screenshots
```
Generate an index from the captured screenshots. The index is saved to `tmp/screenshots/index.html`:

```
bin/crawl -s tmp/screenshots
```
See additional options with:

```
bin/crawl -h
```
When finished, the crawling report is saved to the `tmp/crawl_report.yml` file by default.
You can specify the file path using command line options.
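The report itself is plain YAML, so it can also be inspected programmatically. A minimal sketch, assuming the default report location (`tmp/crawl_report.yml`); the report's exact structure depends on the crawl and the gem version:

```ruby
require 'yaml'

# Load the crawl report written by `crawl` (default location).
# Depending on your Ruby version you may need YAML.unsafe_load_file
# if the report contains non-basic types such as symbols.
report = YAML.load_file('tmp/crawl_report.yml')

# Peek at what was recorded.
puts report.keys.inspect if report.respond_to?(:keys)
```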
## Usage with scripting
Below is an example script that configures the crawler, points it at the github.com site, and then records the resulting report as a YAML file.
```ruby
crawler = BrowserCrawler::Engine.new({
  browser_options: {
    headless: true,
    window_size: [1200, 1600],
    timeout: 60,
    browser_options: { 'no-sandbox': nil }
  },
  max_pages: 10,
  deep_visit: true
})

crawler.extract_links(url: 'https://github.com')
crawler.report_save
```
This gem uses Cuprite as an external dependency. Cuprite allows working with the browser directly, without intermediaries such as chromedriver.

- `browser_options` - configures the headless Chrome browser through Cuprite.
- `max_pages` - sets the maximum number of pages to crawl. By default it equals `nil`, which lets the crawler browse all pages within the domain.
- `deep_visit` - a mode in which the crawler checks external resources without collecting links from them.
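As a smaller illustration of the last two options, the sketch below caps the crawl at a handful of pages while still probing external resources. The values and the `require` path are assumptions made for the example:

```ruby
# A minimal sketch: limit the crawl and enable deep visits.
require 'browser_crawler' # require path assumed from the gem name

crawler = BrowserCrawler::Engine.new({
  max_pages: 5,     # stop after 5 pages (nil means no limit)
  deep_visit: true  # check external resources without collecting their links
})

crawler.extract_links(url: 'https://github.com')
crawler.report_save(type: :yaml)
```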
## Callback methods

The Capybara DSL can be used inside all of the callbacks.
### Callback methods before/after crawling
```ruby
crawler = BrowserCrawler::Engine.new()

# Scroll the page down before the scan.
crawler.before do
  page.execute_script 'window.scrollBy(0,10000)'
end

crawler.after do
  page.body
end

crawler.extract_links(url: 'https://github.com')
```
### Callback methods before/after each crawled page
```ruby
crawler = BrowserCrawler::Engine.new()

# Scroll each page down before it is scanned.
crawler.before type: :each do
  page.execute_script 'window.scrollBy(0,10000)'
end

crawler.after type: :each do
  page.body
end

crawler.extract_links(url: 'https://github.com')
```
### Callback method for recording unvisited links

Default behavior: the crawler sends all links found on a page to an `unvisited_links` array and then browses each of them. This callback allows you to change that behavior.
```ruby
crawler = BrowserCrawler::Engine.new()

# scan_result is the array of links found on the scanned page.
crawler.unvisited_links do
  @page_inspector.scan_result
end

crawler.extract_links(url: 'https://github.com')
```
Changed behavior: the crawler browses only links that contain `/best-links`.
```ruby
crawler = BrowserCrawler::Engine.new()

crawler.unvisited_links do
  @page_inspector.scan_result.select { |link| link.include?('/best-links') }
end

crawler.extract_links(url: 'https://github.com')
```
### Callback method for changing page scan rules

Default behavior: the crawler collects all links from a page and moves from one page to the next.
```ruby
crawler = BrowserCrawler::Engine.new()

crawler.change_page_scan_rules do
  page.all('a').map { |a| a['href'] }
end

crawler.extract_links(url: 'https://github.com')
```
Changed behavior: the crawler collects only links matching the `a.paginations` selector, and only on pages under `/help/`.
```ruby
crawler = BrowserCrawler::Engine.new()

crawler.change_page_scan_rules do
  if URI.parse(page.current_url).to_s.include?('/help/')
    page.all('a.paginations').map { |a| a['href'] }
  else
    []
  end
end

crawler.extract_links(url: 'https://github.com')
```
### Set up a folder for the report file
```ruby
crawler = BrowserCrawler::Engine.new()

crawler.extract_links(url: 'https://github.com')
crawler.report_save(folder_path: './reports/')
```
If the folder doesn't exist, `BrowserCrawler` creates it before saving the report.
### Save the report to a YAML file
```ruby
crawler = BrowserCrawler::Engine.new()

crawler.extract_links(url: 'https://github.com')
crawler.report_save(type: :yaml)
```
### Save the report to a CSV file
```ruby
crawler = BrowserCrawler::Engine.new()

crawler.extract_links(url: 'https://github.com')
crawler.report_save(type: :csv)
```
## Usage with Wraith
Browser Crawler can be useful for updating the `paths:` section of Wraith's configs.
Provided the Wraith config is placed in the `wraith/configs/capture.yaml` file, do:
```
crawl https://your.site.com/welcome -c wraith/configs/capture.yaml
```
Or, if you have a crawling report available, just use it without the URL to skip crawling:
```
bin/crawl -c tmp/wraith_config.yml -r tmp/crawl_report.yml
```
## Restrictions
The current version has the authentication process hardcoded: the path to the login form and the field names used are specific to the project the crawler was extracted from. Configuration may be added in a future version.
## Ideas for enhancements
It should be easy to crawl a site as part of automated testing, e.g. to verify the list of pages available on the site, or to generate a visual report (Wraith does this better).
### Integration with test frameworks
By integrating `browser_crawler` into the application test suite, it would be possible to access pages and content that are not easily reachable on the real site, e.g. when performing data modifications.
By integrating into the test suite, it would also be possible to use all the tools/mocks/helpers created to simulate user behavior, e.g. mocking external requests with VCR.
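As an illustration only, such an integration might look roughly like the sketch below. This is a hypothetical example, not part of the gem: the spec file name, the host, the report folder, and the expectations are all assumptions.

```ruby
# spec/site_crawl_spec.rb -- hypothetical sketch, not shipped with the gem.
require 'browser_crawler'
require 'yaml'

RSpec.describe 'Site crawl smoke test' do
  it 'crawls the running app and produces a non-empty report' do
    crawler = BrowserCrawler::Engine.new({ max_pages: 20 })

    # Assumes the application is reachable locally, e.g. via a test server.
    crawler.extract_links(url: 'http://localhost:3000')

    # Combining these options and the resulting file name are assumptions;
    # adjust to the report path your setup actually produces.
    crawler.report_save(folder_path: 'tmp/', type: :yaml)

    report = YAML.load_file('tmp/crawl_report.yml')
    expect(report).not_to be_empty
  end
end
```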
## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to rubygems.org.
## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/dimasamodurov/browser_crawler.
## License
MIT