Module: Wgit::DSL
Included in: Base
Defined in: lib/wgit/dsl.rb
Overview
DSL methods that act as a wrapper around Wgit's underlying class methods. All instance vars/constants are prefixed to avoid conflicts when included.
Constant Summary

- DSL_ERROR__NO_START_URL =
  Error message shown when there's no URL to crawl.

  "missing url, pass as parameter to this or the 'start' function".freeze
Instance Method Summary

- #crawl(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document (also: #crawl_url)
  Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath.
- #crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? (also: #crawl_r)
  Crawls an entire site using Wgit::Crawler#crawl_site underneath.
- #empty_db! ⇒ Integer
  Deletes everything in the urls and documents collections by calling Wgit::Database::DatabaseAdapter#empty underneath.
- #extract(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
  Defines an extractor using Wgit::Document.define_extractor underneath.
- #follow(xpath) ⇒ Object
  Sets the xpath to be followed when crawl_site or index_site is subsequently called.
- #index(*urls, insert_externals: false) {|doc| ... } ⇒ Object (also: #index_url)
  Indexes a single webpage using Wgit::Indexer#index_url underneath.
- #index_site(*urls, insert_externals: false, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer (also: #index_r)
  Indexes a single website using Wgit::Indexer#index_site underneath.
- #index_www(max_sites: -1, max_data: 1_048_576_000) ⇒ Object
  Indexes the World Wide Web using Wgit::Indexer#index_www underneath.
- #last_response ⇒ Wgit::Response
  Returns the DSL's Wgit::Crawler#last_response.
- #reset ⇒ Object
  Nilifies the DSL instance variables.
- #search(query, stream: $stdout, top_result_only: true, include_score: false, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
  Performs a search of the database's indexed documents and pretty prints the results in a search engine-esque format.
- #start(*urls) {|crawler| ... } ⇒ Object (also: #start_urls)
  Sets the URL(s) to be crawled when a crawl* or index* method is subsequently called.
- #use_crawler(crawler = nil) {|crawler| ... } ⇒ Wgit::Crawler
  Sets and returns the Wgit::Crawler used in subsequent crawls, including indexing.
- #use_database(db) ⇒ Object
  Defines the connected database instance used in subsequent index and DB method calls.
Instance Method Details
#crawl(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl_url
Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath. If no urls are provided, then the start URL(s) are used.

```ruby
# File 'lib/wgit/dsl.rb', line 99

def crawl(*urls, follow_redirects: true, &block)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  urls.map! { |url| Wgit::Url.parse(url) }
  get_crawler.crawl_urls(*urls, follow_redirects:, &block)
end
```
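For example, a minimal sketch (the URLs are illustrative; #title comes from Wgit's default document extractors):

```ruby
require 'wgit'

include Wgit::DSL

# Crawl two pages, yielding each Wgit::Document to the block.
crawl('https://example.com', 'https://example.com/about') do |doc|
  puts doc.title
end
```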
#crawl_site(*urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r
Crawls an entire site using Wgit::Crawler#crawl_site underneath. If no urls are provided, then the start URL(s) are used.

```ruby
# File 'lib/wgit/dsl.rb', line 130

def crawl_site(
  *urls, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  xpath = follow || :default
  opts  = { follow: xpath, allow_paths:, disallow_paths: }

  urls.reduce([]) do |externals, url|
    externals + get_crawler.crawl_site(Wgit::Url.parse(url), **opts, &block)
  end
end
```
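A sketch of a whole-site crawl (the URL and path filter are illustrative):

```ruby
require 'wgit'

include Wgit::DSL

# Crawl every internal page except those under /login, collecting
# the external URLs found along the way (the return value).
externals = crawl_site('https://example.com', disallow_paths: 'login') do |doc|
  puts doc.url
end

puts "Found #{externals.size} external link(s)"
```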
#empty_db! ⇒ Integer
Deletes everything in the urls and documents collections by calling Wgit::Database::DatabaseAdapter#empty underneath.

```ruby
# File 'lib/wgit/dsl.rb', line 298

def empty_db!
  get_db.empty
end
```
#extract(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines an extractor using Wgit::Document.define_extractor underneath.

```ruby
# File 'lib/wgit/dsl.rb', line 43

def extract(var, xpath, opts = {}, &block)
  Wgit::Document.define_extractor(var, xpath, opts, &block)
end
```
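For instance, a sketch defining a hypothetical :heading extractor (the name and xpath are illustrative):

```ruby
require 'wgit'

include Wgit::DSL

# Subsequently crawled documents will respond to #heading.
extract(:heading, '//h1') do |value|
  value&.strip # optionally post-process the extracted value
end

crawl('https://example.com') { |doc| puts doc.heading }
```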
#follow(xpath) ⇒ Object
Sets the xpath to be followed when crawl_site or index_site is subsequently called. Calling this method is optional as the default is to follow all <a> href's that point to the site domain. You can also pass follow: to the crawl/index methods directly.

```ruby
# File 'lib/wgit/dsl.rb', line 80

def follow(xpath)
  @dsl_follow = xpath
end
```
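A sketch that restricts a site crawl to nav links (the xpath is illustrative):

```ruby
require 'wgit'

include Wgit::DSL

# Only follow hrefs inside the page's <nav> element when crawling the site.
follow '//nav//a/@href'

crawl_site('https://example.com') { |doc| puts doc.url }
```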
#index(*urls, insert_externals: false) {|doc| ... } ⇒ Object Also known as: index_url
Indexes a single webpage using Wgit::Indexer#index_url underneath.

```ruby
# File 'lib/wgit/dsl.rb', line 238

def index(*urls, insert_externals: false, &block)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  indexer = Wgit::Indexer.new(get_db, get_crawler)

  urls.map! { |url| Wgit::Url.parse(url) }
  indexer.index_urls(*urls, insert_externals:, &block)
end
```
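A minimal indexing sketch, assuming ENV['WGIT_CONNECTION_STRING'] points at a reachable database:

```ruby
require 'wgit'

include Wgit::DSL

# Crawl and store a single page; the block sees the crawled document.
index('https://example.com') do |doc|
  puts "Indexed #{doc.url}"
end
```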
#index_site(*urls, insert_externals: false, follow: @dsl_follow, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r
Indexes a single website using Wgit::Indexer#index_site underneath.

```ruby
# File 'lib/wgit/dsl.rb', line 208

def index_site(
  *urls, insert_externals: false, follow: @dsl_follow,
  allow_paths: nil, disallow_paths: nil, &block
)
  urls = (@dsl_start || []) if urls.empty?
  raise DSL_ERROR__NO_START_URL if urls.empty?

  indexer = Wgit::Indexer.new(get_db, get_crawler)
  xpath = follow || :default
  crawl_opts = {
    insert_externals:, follow: xpath, allow_paths:, disallow_paths:
  }

  urls.reduce(0) do |total, url|
    total + indexer.index_site(Wgit::Url.parse(url), **crawl_opts, &block)
  end
end
```
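A sketch of indexing a site, again assuming a connected database (the URL and path filter are illustrative):

```ruby
require 'wgit'

include Wgit::DSL

# Crawl and store an entire site, skipping /login paths;
# the return value is the number of pages indexed.
total = index_site('https://example.com', disallow_paths: 'login')
puts "Indexed #{total} page(s)"
```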
#index_www(max_sites: -1, max_data: 1_048_576_000) ⇒ Object
Indexes the World Wide Web using Wgit::Indexer#index_www underneath.

```ruby
# File 'lib/wgit/dsl.rb', line 183

def index_www(max_sites: -1, max_data: 1_048_576_000)
  indexer = Wgit::Indexer.new(get_db, get_crawler)
  indexer.index_www(max_sites:, max_data:)
end
```
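A sketch with illustrative caps, assuming a connected database:

```ruby
require 'wgit'

include Wgit::DSL

# Index at most 5 sites or roughly 500 MB of stored data.
index_www(max_sites: 5, max_data: 500_000_000)
```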
#last_response ⇒ Wgit::Response
Returns the DSL's Wgit::Crawler#last_response.

```ruby
# File 'lib/wgit/dsl.rb', line 148

def last_response
  get_crawler.last_response
end
```
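For example, to inspect the outcome of the most recent crawl (assuming Wgit::Response exposes its HTTP status):

```ruby
crawl('https://example.com')
puts last_response.status # e.g. 200
```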
#reset ⇒ Object
Nilifies the DSL instance variables.
```ruby
# File 'lib/wgit/dsl.rb', line 153

def reset
  @dsl_crawler = nil
  @dsl_start   = nil
  @dsl_follow  = nil
  @dsl_db      = nil
end
```
#search(query, stream: $stdout, top_result_only: true, include_score: false, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Array<Wgit::Document>
Performs a search of the database's indexed documents and pretty prints the results in a search engine-esque format. See Wgit::Database::DatabaseAdapter#search and Wgit::Document#search! for details of how the search methods work.

```ruby
# File 'lib/wgit/dsl.rb', line 268

def search(
  query, stream: $stdout, top_result_only: true, include_score: false,
  case_sensitive: false, whole_sentence: true, limit: 10, skip: 0,
  sentence_limit: 80
)
  stream ||= File.open(File::NULL, 'w')

  results = get_db.search(
    query, case_sensitive:, whole_sentence:, limit:, skip:
  )

  results.each do |doc|
    doc.search_text!(
      query, case_sensitive:, whole_sentence:, sentence_limit:
    )
    yield(doc) if block_given?
  end

  if top_result_only
    Wgit::Utils.pprint_top_search_results(results, include_score:, stream:)
  else
    Wgit::Utils.pprint_all_search_results(results, include_score:, stream:)
  end

  results
end
```
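A search sketch, assuming a previously indexed database (the query is illustrative):

```ruby
require 'wgit'

include Wgit::DSL

# Pretty prints the top results to $stdout and returns them.
results = search('ruby web crawler', limit: 5, include_score: true)
puts "#{results.size} result(s)"
```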
#start(*urls) {|crawler| ... } ⇒ Object Also known as: start_urls
Sets the URL(s) to be crawled when a crawl* or index* method is subsequently called. Calling this is optional as the URL can be passed to the method instead. You can also omit the urls param and just use the block to configure the crawler instead.

```ruby
# File 'lib/wgit/dsl.rb', line 68

def start(*urls, &block)
  use_crawler(&block) if block_given?
  @dsl_start = urls
end
```
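A sketch combining start with crawler configuration (crawler.timeout is an assumed Wgit::Crawler accessor, used here purely for illustration):

```ruby
require 'wgit'

include Wgit::DSL

start('https://example.com') do |crawler|
  crawler.timeout = 10 # assumed accessor; configures subsequent crawls
end

# No url param needed; the start URL is crawled.
crawl { |doc| puts doc.title }
```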
#use_crawler(crawler = nil) {|crawler| ... } ⇒ Wgit::Crawler
Sets and returns the Wgit::Crawler used in subsequent crawls, including indexing. Defaults to Wgit::Crawler.new if not given a param. See the Wgit::Crawler documentation for more details.

```ruby
# File 'lib/wgit/dsl.rb', line 53

def use_crawler(crawler = nil)
  @dsl_crawler = crawler || @dsl_crawler || Wgit::Crawler.new
  yield @dsl_crawler if block_given?
  @dsl_crawler
end
```
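For example (timeout: and redirect_limit are assumed Wgit::Crawler options/accessors):

```ruby
require 'wgit'

include Wgit::DSL

# Supply a preconfigured crawler...
use_crawler Wgit::Crawler.new(timeout: 10)

# ...or configure the current one via the block form.
use_crawler { |crawler| crawler.redirect_limit = 3 }
```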
#use_database(db) ⇒ Object
Defines the connected database instance used in subsequent index and DB method calls. This method is optional; if it isn't called, a new instance of the Wgit::Database.adapter_class will be initialised instead, in which case you should ensure ENV['WGIT_CONNECTION_STRING'] is set or the connection will fail.

```ruby
# File 'lib/wgit/dsl.rb', line 170

def use_database(db)
  @dsl_db = db
end
```
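A connection sketch; it assumes Wgit::Database.new builds an instance of the configured adapter class from ENV['WGIT_CONNECTION_STRING']:

```ruby
require 'wgit'

include Wgit::DSL

# Connect once up front; subsequent index/search/db calls reuse it.
use_database Wgit::Database.new

index('https://example.com')
```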