Module: Wgit
- Defined in:
- lib/wgit/version.rb,
lib/wgit/url.rb,
lib/wgit/utils.rb,
lib/wgit/logger.rb,
lib/wgit/crawler.rb,
lib/wgit/indexer.rb,
lib/wgit/document.rb,
lib/wgit/response.rb,
lib/wgit/assertable.rb,
lib/wgit/database/model.rb,
lib/wgit/database/database.rb
Overview
Wgit is a WWW indexer/scraper which crawls URLs and retrieves their page contents for later use.
Defined Under Namespace
Modules: Assertable, Model, Utils
Classes: Crawler, Database, Document, Indexer, Response, Url
Constant Summary
- VERSION = '0.7.0'
  The current gem version of Wgit.
Class Method Summary
- .default_logger ⇒ Logger
  Returns the default Logger instance.
- .index_page(url, connection_string: nil, insert_externals: true) {|doc| ... } ⇒ Object
  Convenience method to index a single webpage using Wgit::Indexer#index_page.
- .index_site(url, connection_string: nil, insert_externals: true, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer
  Convenience method to index a single website using Wgit::Indexer#index_site.
- .index_www(connection_string: nil, max_sites: -1, max_data: 1_048_576_000) ⇒ Object
  Convenience method to index the World Wide Web using Wgit::Indexer#index_www.
- .indexed_search(query, connection_string: nil, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Object
  Performs a search of the database's indexed documents and pretty prints the results.
- .logger ⇒ Logger
  Returns the current Logger instance.
- .logger=(logger) ⇒ Logger
  Sets the current Logger instance.
- .use_default_logger ⇒ Logger
  Sets the default Logger instance to be used by Wgit.
- .version ⇒ Object
  Returns the current gem version of Wgit as a String.
- .version_str ⇒ Object
  Returns the current gem version in a presentation String.
Class Method Details
.default_logger ⇒ Logger
Returns the default Logger instance.
# File 'lib/wgit/logger.rb', line 30
def self.default_logger
  logger = Logger.new(STDOUT, progname: 'wgit', level: :info)
  logger.formatter = proc do |_severity, _datetime, progname, msg|
    "[#{progname}] #{msg}\n"
  end
  logger
end
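To see what the formatter above produces, the following sketch builds a logger with the same options but writes to a `StringIO` rather than `STDOUT`. The in-memory target and the sample messages are assumptions made for illustration; they are not part of Wgit.

```ruby
require 'logger'
require 'stringio'

# Capture output in memory instead of STDOUT (an assumption for this demo).
io = StringIO.new

# Same options and formatter as the default_logger source shown above.
logger = Logger.new(io, progname: 'wgit', level: :info)
logger.formatter = proc do |_severity, _datetime, progname, msg|
  "[#{progname}] #{msg}\n"
end

logger.debug('Dropped: below the :info level')
logger.info('Crawled a page')

puts io.string # prints "[wgit] Crawled a page"
```

The formatter discards the severity and timestamp, so each log line consists only of the program name and the message.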
.index_page(url, connection_string: nil, insert_externals: true) {|doc| ... } ⇒ Object
Convenience method to index a single webpage using Wgit::Indexer#index_page.
Crawls a single webpage and stores it in the database. There is no maximum download limit, so be careful with large pages.
# File 'lib/wgit/indexer.rb', line 76
def self.index_page(
  url, connection_string: nil, insert_externals: true, &block
)
  url = Wgit::Url.parse(url)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_page(url, insert_externals: insert_externals, &block)
end
.index_site(url, connection_string: nil, insert_externals: true, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer
Convenience method to index a single website using Wgit::Indexer#index_site.
Crawls a single website's pages and stores them in the database. There is no maximum download limit, so be careful which sites you index.
# File 'lib/wgit/indexer.rb', line 50
def self.index_site(
  url, connection_string: nil, insert_externals: true,
  allow_paths: nil, disallow_paths: nil, &block
)
  url = Wgit::Url.parse(url)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_site(
    url, insert_externals: insert_externals,
    allow_paths: allow_paths, disallow_paths: disallow_paths, &block
  )
end
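A call to this method might look like the sketch below. The helper `index_one_site` is hypothetical (it is not part of Wgit); it accepts any object exposing the `index_site` signature documented above, so it can be exercised without the gem or a database. It assumes the block is yielded once per crawled document, as the `{|doc| ... }` signature suggests.

```ruby
# Hypothetical helper, not part of Wgit: indexes one site through any object
# that responds to the index_site signature documented above (e.g. Wgit).
def index_one_site(api, url, connection_string)
  pages_seen = 0

  result = api.index_site(
    url,
    connection_string: connection_string,
    insert_externals: false # keyword taken from the documented signature
  ) do |_doc|
    pages_seen += 1 # assumption: the block runs once per crawled document
  end

  # The documented return type of index_site is Integer.
  { result: result, pages_seen: pages_seen }
end
```

With the gem loaded and a database reachable, this could be invoked as `index_one_site(Wgit, 'https://example.com', connection_string)`, where both the URL and the connection string are placeholders.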
.index_www(connection_string: nil, max_sites: -1, max_data: 1_048_576_000) ⇒ Object
Convenience method to index the World Wide Web using Wgit::Indexer#index_www.
Retrieves uncrawled URLs from the database and recursively crawls each site, storing its internal pages in the database and adding its external URLs to be crawled later. Logs information about the crawl using Wgit.logger as it goes.
# File 'lib/wgit/indexer.rb', line 24
def self.index_www(
  connection_string: nil, max_sites: -1, max_data: 1_048_576_000
)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_www(max_sites: max_sites, max_data: max_data)
end
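The `max_data` default is easier to read once converted. The arithmetic below assumes the value is a byte count, which this page does not state explicitly.

```ruby
# Default taken from the index_www signature above.
max_data = 1_048_576_000

# Assumption: max_data is measured in bytes. 1 MiB = 1024 * 1024 bytes.
mebibytes = max_data / (1024 * 1024)

puts "max_data default: #{max_data} bytes = #{mebibytes} MiB"
# prints "max_data default: 1048576000 bytes = 1000 MiB"
```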
.indexed_search(query, connection_string: nil, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Object
Performs a search of the database's indexed documents and pretty prints the results. See Wgit::Database#search and Wgit::Document#search for details of how the search works.
# File 'lib/wgit/indexer.rb', line 101
def self.indexed_search(
  query, connection_string: nil,
  case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, &block
)
  db = Wgit::Database.new(connection_string)

  results = db.search(
    query,
    case_sensitive: case_sensitive,
    whole_sentence: whole_sentence,
    limit: limit,
    skip: skip,
    &block
  )

  results.each do |doc|
    doc.search!(
      query,
      case_sensitive: case_sensitive,
      whole_sentence: whole_sentence,
      sentence_limit: sentence_limit
    )
  end

  Wgit::Utils.printf_search_results(results)
end
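The `limit` and `skip` keywords lend themselves to paging through results. The helper below is a hypothetical sketch, not part of Wgit, and it assumes the conventional meaning of the two options: `skip` discards the first N matches and `limit` caps how many are returned.

```ruby
# Hypothetical helper, not part of Wgit: converts a 1-based page number into
# the limit/skip keyword arguments accepted by Wgit.indexed_search.
def paging_options(page, per_page: 10)
  raise ArgumentError, 'page must be >= 1' if page < 1

  # Assumption: skip drops the first N results, limit caps the result count.
  { limit: per_page, skip: (page - 1) * per_page }
end

third_page = paging_options(3)
puts "limit=#{third_page[:limit]} skip=#{third_page[:skip]}"
# prints "limit=10 skip=20"
```

The resulting hash could then be splatted into a call such as `Wgit.indexed_search(query, **paging_options(3))`.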
.logger ⇒ Logger
Returns the current Logger instance.
# File 'lib/wgit/logger.rb', line 15
def self.logger
  @logger
end
.logger=(logger) ⇒ Logger
Sets the current Logger instance.
# File 'lib/wgit/logger.rb', line 23
def self.logger=(logger)
  @logger = logger
end
.use_default_logger ⇒ Logger
Sets the default Logger instance to be used by Wgit.
# File 'lib/wgit/logger.rb', line 41
def self.use_default_logger
  @logger = default_logger
end
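Taken together, the logger methods allow a custom logger to be swapped in and the default restored later. The sketch below reproduces the accessors on a hypothetical stand-in module named `LoggerHost` so that it runs without the gem; the `StringIO` buffer and the `:warn` level are illustrative assumptions.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in mirroring the accessor methods documented above.
module LoggerHost
  def self.logger
    @logger
  end

  def self.logger=(logger)
    @logger = logger
  end

  def self.default_logger
    Logger.new(STDOUT, progname: 'wgit', level: :info)
  end

  def self.use_default_logger
    @logger = default_logger
  end
end

# Swap in a quieter logger that writes to memory.
buffer = StringIO.new
LoggerHost.logger = Logger.new(buffer, level: :warn)
LoggerHost.logger.info('suppressed at the :warn level')
LoggerHost.logger.warn('kept')

# Restore the default logger.
LoggerHost.use_default_logger
```

Because the logger is held in a single module-level instance variable, the assignment affects every caller that reads it through the `logger` accessor.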
.version ⇒ Object
Returns the current gem version of Wgit as a String.
# File 'lib/wgit/version.rb', line 11
def self.version
  VERSION
end
.version_str ⇒ Object
Returns the current gem version in a presentation String.
# File 'lib/wgit/version.rb', line 16
def self.version_str
  "wgit v#{VERSION}"
end
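The difference between the two version methods can be sketched with a hypothetical stand-in module, `VersionHost`, that copies the constant and both method bodies from this page. The comparison through `Gem::Version` is an added illustration of how the plain version String might be used, not something Wgit itself does.

```ruby
require 'rubygems'

# Hypothetical stand-in reproducing the constant and methods documented above.
module VersionHost
  VERSION = '0.7.0'

  def self.version
    VERSION
  end

  def self.version_str
    "wgit v#{VERSION}"
  end
end

puts VersionHost.version     # prints "0.7.0"
puts VersionHost.version_str # prints "wgit v0.7.0"

# Gem::Version compares version strings segment by segment.
recent_enough = Gem::Version.new(VersionHost.version) >= Gem::Version.new('0.6.0')
puts recent_enough # prints "true"
```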