Module: Wgit::DSL

Included in:
Base
Defined in:
lib/wgit/dsl.rb

Overview

DSL methods that act as a wrapper around Wgit's underlying class methods. All instance vars/constants are prefixed to avoid conflicts when included.

Constant Summary

DSL_ERROR__NO_START_URL =

Error message shown when there's no URL to crawl.

"missing url, pass as parameter to this or \
the 'start' function".freeze

Instance Method Summary

Instance Method Details

#crawl(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl_url

Crawls one or more individual URLs using Wgit::Crawler#crawl_urls underneath. If no URLs are provided, then the start URL(s) are used.

Parameters:

  • urls (*Wgit::Url)

    The URLs to crawl. Defaults to the start URL(s).

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param. This value will be used for all urls crawled.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document); this is the only way to interact with them.

Returns:

  • (Wgit::Document)

    The last Wgit::Document crawled.

Raises:

  • (StandardError)

    If no urls are provided and no start URL has been set.



# File 'lib/wgit/dsl.rb', line 99

def crawl(*urls, follow_redirects: true, &block)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  urls.map! { |url| Wgit::Url.parse(url) }
  get_crawler.crawl_urls(*urls, follow_redirects:, &block)
end
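
Example (a hedged usage sketch; the URL is a placeholder and the block body is illustrative only):

start 'https://example.com'

# Crawl the start URL and inspect each resulting Wgit::Document.
crawl do |doc|
  puts doc.title
end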

#crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire site using Wgit::Crawler#crawl_site underneath. If no URLs are provided, then the start URL(s) are used.

Parameters:

  • urls (*String, *Wgit::Url)

    The base URL(s) of the website(s) to be crawled. It is recommended that this URL be the index page of the site to give a greater chance of finding all pages within that site/host. Defaults to the start URLs.

  • follow (String) (defaults to: @dsl_follow)

    The xpath extracting links to be followed during the crawl. This changes how a site is crawled. Only links pointing to the site domain are allowed. The :default is any <a> href returning HTML. This can also be set using follow.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by selecting them if their path matches (via File.fnmatch?) one of the allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by rejecting them if their path matches (via File.fnmatch?) one of the disallow_paths.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    Unique Array of external urls collected from all of the site's pages or nil if the given url could not be crawled successfully.

Raises:

  • (StandardError)

    If no url is provided and no start URL has been set.



# File 'lib/wgit/dsl.rb', line 130

def crawl_site(
  *urls, follow: @dsl_follow,
  allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  xpath = follow || :default
  opts  = { follow: xpath, allow_paths:, disallow_paths: }

  urls.reduce([]) do |externals, url|
    externals + get_crawler.crawl_site(Wgit::Url.parse(url), **opts, &block)
  end
end
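
Example (a hedged sketch; the URL and path glob are placeholders):

# Crawl every page of the site whose path matches 'articles/*',
# collecting the unique external links found along the way.
externals = crawl_site 'https://example.com', allow_paths: 'articles/*' do |doc|
  puts doc.url unless doc.empty?
end
puts externals&.size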

#empty_db! ⇒ Integer

Deletes everything in the urls and documents collections by calling Wgit::Database::DatabaseAdapter#empty underneath.

Returns:

  • (Integer)

    The number of deleted records.



# File 'lib/wgit/dsl.rb', line 298

def empty_db!
  get_db.empty
end
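
Example (a hedged sketch; assumes a database connection, e.g. via use_database or ENV['WGIT_CONNECTION_STRING']):

deleted = empty_db!
puts "Deleted #{deleted} records"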

#extract(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol

Defines an extractor using Wgit::Document.define_extractor underneath.

Parameters:

  • var (Symbol)

    The name of the variable to be initialised, that will contain the extracted content.

  • xpath (String, #call)

    The xpath used to find the element(s) of the webpage. Only used when initializing from HTML.

    Pass a callable object (proc etc.) if you want the xpath value to be derived on Document initialisation (instead of when the extractor is defined). The call method must return a valid xpath String.

  • opts (Hash) (defaults to: {})

    The options to define an extractor with. The options are only used when initializing from HTML, not from the database.

Options Hash (opts):

  • :singleton (Boolean)

    The singleton option determines whether or not the result(s) should be in an Array. If multiple results are found and singleton is true then the first result will be used. Defaults to true.

  • :text_content_only (Boolean)

    The text_content_only option if true will use the text content of the Nokogiri result object, otherwise the Nokogiri object itself is returned. Defaults to true.

Yields:

  • The block is executed when a Wgit::Document is initialized, regardless of the source. Use it (optionally) to process the result value.

Yield Parameters:

  • value (Object)

    The result value to be assigned to the new var.

  • source (Wgit::Document, Object)

    The source of the value.

  • type (Symbol)

    The source type, either :document or (DB) :object.

Yield Returns:

  • (Object)

    The return value of the block becomes the new var's value. Return the block's value param unchanged if you simply want to inspect it.

Returns:

  • (Symbol)

    The given var Symbol if successful.

Raises:

  • (StandardError)

    If the var param isn't valid.



# File 'lib/wgit/dsl.rb', line 43

def extract(var, xpath, opts = {}, &block)
  Wgit::Document.define_extractor(var, xpath, opts, &block)
end
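
Example (a hedged sketch; the :price var and xpath are hypothetical):

# Define a doc.price accessor extracted from the page's HTML.
extract :price, '//span[@class="price"]' do |value, source, type|
  value.is_a?(String) ? value.strip : value # the block's return becomes doc.price
end

crawl 'https://example.com/product' do |doc|
  puts doc.price
end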

#follow(xpath) ⇒ Object

Sets the xpath to be followed when crawl_site or index_site is subsequently called. Calling this method is optional as the default is to follow all <a> hrefs that point to the site's domain. You can also pass follow: to the crawl/index methods directly.

Parameters:

  • xpath (String)

    The xpath which is followed when crawling/indexing a site. Use :default to restore the default follow logic.



# File 'lib/wgit/dsl.rb', line 80

def follow(xpath)
  @dsl_follow = xpath
end
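
Example (a hedged sketch; the xpath and URL are illustrative):

# Only follow pagination links when crawling the site.
follow '//a[@class="next"]/@href'

crawl_site 'https://example.com' do |doc|
  puts doc.url
end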

#index(*urls, insert_externals: false) {|doc| ... } ⇒ Object Also known as: index_url

Indexes one or more individual webpages using Wgit::Indexer#index_urls underneath. If no URLs are provided, then the start URL(s) are used.

Parameters:

  • urls (*Wgit::Url)

    The webpage URLs to crawl. Defaults to the start URL(s).

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the website's external URLs into the database.

Yields:

  • (doc)

    Given the Wgit::Document of the crawled webpage, before it's inserted into the database allowing for prior manipulation. Return nil or false from the block to prevent the document from being saved into the database.

Raises:

  • (StandardError)

    If no urls are provided and no start URL has been set.



# File 'lib/wgit/dsl.rb', line 238

def index(*urls, insert_externals: false, &block)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  indexer = Wgit::Indexer.new(get_db, get_crawler)

  urls.map! { |url| Wgit::Url.parse(url) }
  indexer.index_urls(*urls, insert_externals:, &block)
end
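
Example (a hedged sketch; the URL is a placeholder and assumes a connected database):

index 'https://example.com/about' do |doc|
  # Returning nil or false prevents the document from being saved.
  doc.text.empty? ? false : doc
end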

#index_site(*urls, insert_externals: false, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r

Indexes one or more websites using Wgit::Indexer#index_site underneath. If no URLs are provided, then the start URL(s) are used.

Parameters:

  • urls (*String, *Wgit::Url)

    The base URL(s) of the website(s) to crawl. Can be set using start.

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the website's external URLs into the database.

  • follow (String) (defaults to: @dsl_follow)

    The xpath extracting links to be followed during the crawl. This changes how a site is crawled. Only links pointing to the site domain are allowed. The :default is any <a> href returning HTML. This can also be set using follow.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by selecting them if their path matches (via File.fnmatch?) one of the allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by rejecting them if their path matches (via File.fnmatch?) one of the disallow_paths.

Yields:

  • (doc)

    Given the Wgit::Document of each crawled webpage, before it is inserted into the database allowing for prior manipulation.

Returns:

  • (Integer)

    The total number of pages crawled within the website.

Raises:

  • (StandardError)

    If no url is provided and no start URL has been set.



# File 'lib/wgit/dsl.rb', line 208

def index_site(
  *urls, insert_externals: false, follow: @dsl_follow,
  allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  indexer    = Wgit::Indexer.new(get_db, get_crawler)
  xpath      = follow || :default
  crawl_opts = {
    insert_externals:, follow: xpath, allow_paths:, disallow_paths:
  }

  urls.reduce(0) do |total, url|
    total + indexer.index_site(Wgit::Url.parse(url), **crawl_opts, &block)
  end
end
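
Example (a hedged sketch; the URL and path glob are illustrative and assume a connected database):

pages = index_site 'https://example.com', allow_paths: 'blog/*' do |doc|
  puts "Indexing #{doc.url}"
end
puts "Indexed #{pages} pages"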

#index_www(max_sites: -1, max_data: 1_048_576_000) ⇒ Object

Indexes the World Wide Web using Wgit::Indexer#index_www underneath.

Parameters:

  • max_sites (Integer) (defaults to: -1)

    The number of separate and whole websites to be crawled before the method exits. Defaults to -1 which means the crawl will occur until manually stopped (Ctrl+C etc).

  • max_data (Integer) (defaults to: 1_048_576_000)

    The maximum amount of bytes that will be scraped from the web (default is 1GB). Note that this value is used to determine when to stop crawling; it's not a guarantee of the max data that will be obtained.



# File 'lib/wgit/dsl.rb', line 183

def index_www(max_sites: -1, max_data: 1_048_576_000)
  indexer = Wgit::Indexer.new(get_db, get_crawler)

  indexer.index_www(max_sites:, max_data:)
end
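
Example (a hedged sketch; the limits are arbitrary and assume a connected database):

# Index up to 3 whole sites or ~100MB of data, whichever comes first.
index_www max_sites: 3, max_data: 100_000_000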

#last_response ⇒ Wgit::Response

Returns the DSL's Wgit::Crawler#last_response.

Returns:

  • (Wgit::Response)

    The DSL crawler's last Wgit::Response.



# File 'lib/wgit/dsl.rb', line 148

def last_response
  get_crawler.last_response
end
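
Example (a hedged sketch; the #status accessor is assumed from Wgit::Response — check its docs):

crawl 'https://example.com'
puts last_response.status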

#reset ⇒ Object

Nilifies the DSL instance variables.



# File 'lib/wgit/dsl.rb', line 153

def reset
  @dsl_crawler = nil
  @dsl_start   = nil
  @dsl_follow  = nil
  @dsl_db      = nil
end

#search(query, stream: $stdout, top_result_only: true, include_score: false, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>

Performs a search of the database's indexed documents and pretty prints the results in a search engine-esque format. See Wgit::Database::DatabaseAdapter#search and Wgit::Document#search! for details of how the search methods work.

Parameters:

  • query (String)

    The text query to search with.

  • stream (nil, #puts) (defaults to: $stdout)

    Any object that respond_to?(:puts). It is used to output text somewhere e.g. a file or STDERR. Use nil for no output.

  • case_sensitive (Boolean) (defaults to: false)

    Whether character case must match.

  • whole_sentence (Boolean) (defaults to: true)

    Whether to search for the query as a whole sentence; if false, each word is searched for separately.

  • limit (Integer) (defaults to: 10)

    The max number of results to print.

  • skip (Integer) (defaults to: 0)

    The number of DB records to skip.

  • sentence_limit (Integer) (defaults to: 80)

    The max length of each result's text snippet.

Yields:

  • (doc)

    Given each search result (Wgit::Document) returned from the database containing only its matching #text.

Returns:

  • (Array<Wgit::Document>)

    The search results returned from the database.



# File 'lib/wgit/dsl.rb', line 268

def search(
  query, stream: $stdout,
  top_result_only: true, include_score: false,
  case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80
)
  stream ||= File.open(File::NULL, 'w')

  results = get_db.search(
    query, case_sensitive:, whole_sentence:, limit:, skip:)

  results.each do |doc|
    doc.search_text!(
      query, case_sensitive:, whole_sentence:, sentence_limit:)
    yield(doc) if block_given?
  end

  if top_result_only
    Wgit::Utils.pprint_top_search_results(results, include_score:, stream:)
  else
    Wgit::Utils.pprint_all_search_results(results, include_score:, stream:)
  end

  results
end
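
Example (a hedged sketch; the query is illustrative and assumes documents have already been indexed):

results = search 'ruby web crawler', limit: 5, stream: nil do |doc|
  puts doc.url # each result contains only its matching text
end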

#start(*urls) {|crawler| ... } ⇒ Object Also known as: start_urls

Sets the URL(s) to be crawled when a crawl* or index* method is subsequently called. Calling this is optional as the URL(s) can be passed to those methods instead. You can also omit the urls param and just use the block to configure the crawler instead.

Parameters:

  • urls (*String, *Wgit::Url)

    The URL(s) to crawl or nil (if only using the block to configure the crawler).

Yields:

  • (crawler)

    The crawler that'll be used in the subsequent crawl/index; use the block to configure.



# File 'lib/wgit/dsl.rb', line 68

def start(*urls, &block)
  use_crawler(&block) if block_given?
  @dsl_start = urls
end
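
Example (a hedged sketch; the URL is a placeholder and the crawler attribute is assumed — check Wgit::Crawler for its actual accessors):

start 'https://example.com' do |crawler|
  crawler.redirect_limit = 3 # assumed Wgit::Crawler accessor
end

crawl { |doc| puts doc.title }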

#use_crawler(crawler = nil) {|crawler| ... } ⇒ Wgit::Crawler

Sets and returns the Wgit::Crawler used in subsequent crawls, including indexing. Defaults to Wgit::Crawler.new if no crawler param is given. See the Wgit::Crawler documentation for more details.

Yields:

  • (crawler)

    Given the DSL crawler; use the block to configure.

Returns:

  • (Wgit::Crawler)

    The Wgit::Crawler used in subsequent crawls.



# File 'lib/wgit/dsl.rb', line 53

def use_crawler(crawler = nil)
  @dsl_crawler = crawler || @dsl_crawler || Wgit::Crawler.new
  yield @dsl_crawler if block_given?
  @dsl_crawler
end
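
Example (a hedged sketch; the timeout attribute is an assumption — consult the Wgit::Crawler docs):

use_crawler Wgit::Crawler.new do |crawler|
  crawler.timeout = 10 # assumed Wgit::Crawler accessor
end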

#use_database(db) ⇒ Object

Defines the connected database instance used in subsequent index and DB method calls. This method is optional, however, as a new instance of the Wgit::Database.adapter_class will be initialised otherwise. If you don't call this method, ensure ENV['WGIT_CONNECTION_STRING'] is set or the connection will fail.

Parameters:

  • db (Wgit::Database::DatabaseAdapter)

    The connected database instance to use in subsequent index and DB calls.



# File 'lib/wgit/dsl.rb', line 170

def use_database(db)
  @dsl_db = db
end
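
Example (a hedged sketch; relies on the default adapter_class and a valid ENV['WGIT_CONNECTION_STRING']):

db = Wgit::Database.adapter_class.new
use_database db

index 'https://example.com' # subsequent index and DB calls use db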