Module: Wgit

Defined in:
lib/wgit/version.rb,
lib/wgit/url.rb,
lib/wgit/utils.rb,
lib/wgit/logger.rb,
lib/wgit/crawler.rb,
lib/wgit/indexer.rb,
lib/wgit/document.rb,
lib/wgit/response.rb,
lib/wgit/assertable.rb,
lib/wgit/database/model.rb,
lib/wgit/database/database.rb

Overview

Wgit is a WWW indexer/scraper which crawls URLs and retrieves their page contents for later use.
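
As a quick orientation, a minimal crawl of a single page might look like the sketch below (https://example.com is a placeholder URL; Crawler#crawl returns the crawled page as a Wgit::Document):

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com')

doc = crawler.crawl(url) # Returns the crawled page as a Wgit::Document.
puts doc.title           # The Document exposes extracted fields such as title.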

Author:

  • Michael Telford

Defined Under Namespace

Modules: Assertable, Model, Utils

Classes: Crawler, Database, Document, Indexer, Response, Url

Constant Summary

VERSION = '0.7.0'

The current gem version of Wgit.

Class Method Summary

Class Method Details

.default_logger ⇒ Logger

Returns the default Logger instance.

Returns:

  • The default Logger instance.



# File 'lib/wgit/logger.rb', line 30

def self.default_logger
  logger = Logger.new(STDOUT, progname: 'wgit', level: :info)
  logger.formatter = proc do |_severity, _datetime, progname, msg|
    "[#{progname}] #{msg}\n"
  end
  logger
end
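
Using the formatter above, log output is prefixed with the progname, for example:

Wgit.use_default_logger
Wgit.logger.info('Crawl complete') # Prints "[wgit] Crawl complete"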

.index_page(url, connection_string: nil, insert_externals: true) {|doc| ... } ⇒ Object

Convenience method to index a single webpage using Wgit::Indexer#index_page.

Crawls a single webpage and stores it in the database. There is no max download limit, so be careful of large pages.

Parameters:

  • url: The Url of the webpage to crawl.

  • connection_string (defaults to: nil): The database connection string. Set as nil to use ENV['WGIT_CONNECTION_STRING'].

  • insert_externals (defaults to: true): Whether or not to insert the website's external Urls into the database.

Yields:

  • (doc)

Given the Wgit::Document of the crawled webpage, before it's inserted into the database, allowing for prior manipulation.



# File 'lib/wgit/indexer.rb', line 76

def self.index_page(
  url, connection_string: nil, insert_externals: true, &block
)
  url = Wgit::Url.parse(url)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_page(url, insert_externals: insert_externals, &block)
end
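
A usage sketch (the URL is a placeholder; ENV['WGIT_CONNECTION_STRING'] is assumed to point at a running database):

require 'wgit'

# Index one page, inspecting the Document before it's stored.
Wgit.index_page('https://example.com', insert_externals: false) do |doc|
  puts "Indexing: #{doc.title}"
end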

.index_site(url, connection_string: nil, insert_externals: true, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer

Convenience method to index a single website using Wgit::Indexer#index_site.

Crawls a single website's pages and stores them in the database. There is no max download limit, so be careful which sites you index.

Parameters:

  • url: The base Url of the website to crawl.

  • connection_string (defaults to: nil): The database connection string. Set as nil to use ENV['WGIT_CONNECTION_STRING'].

  • insert_externals (defaults to: true): Whether or not to insert the website's external Urls into the database.

  • allow_paths (defaults to: nil): Filters links by selecting them if their path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (defaults to: nil): Filters links by rejecting them if their path matches (via File.fnmatch?) one of disallow_paths.

Yields:

  • (doc)

Given the Wgit::Document of each crawled webpage, before it's inserted into the database, allowing for prior manipulation.

Returns:

  • The total number of pages crawled within the website.



# File 'lib/wgit/indexer.rb', line 50

def self.index_site(
  url, connection_string: nil, insert_externals: true,
  allow_paths: nil, disallow_paths: nil, &block
)
  url = Wgit::Url.parse(url)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_site(
    url, insert_externals: insert_externals,
         allow_paths: allow_paths, disallow_paths: disallow_paths, &block
  )
end
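
A usage sketch (the URL and glob paths are illustrative only):

require 'wgit'

# Index a site's blog pages only, skipping its archive.
total = Wgit.index_site(
  'https://example.com',
  insert_externals: false,
  allow_paths: 'blog/*',
  disallow_paths: 'blog/archive/*'
)
puts "Crawled #{total} pages"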

.index_www(connection_string: nil, max_sites: -1, max_data: 1_048_576_000) ⇒ Object

Convenience method to index the World Wide Web using Wgit::Indexer#index_www.

Retrieves uncrawled Urls from the database and recursively crawls each site, storing their internal pages in the database and adding their external Urls to be crawled later on. Logs info on the crawl using Wgit.logger as it goes along.

Parameters:

  • connection_string (defaults to: nil): The database connection string. Set as nil to use ENV['WGIT_CONNECTION_STRING'].

  • max_sites (defaults to: -1): The number of separate and whole websites to be crawled before the method exits. Defaults to -1, which means the crawl will occur until manually stopped (Ctrl+C etc.).

  • max_data (defaults to: 1_048_576_000): The maximum number of bytes that will be scraped from the web (default is 1GB). Note that this value is used to determine when to stop crawling; it's not a guarantee of the max data that will be obtained.



# File 'lib/wgit/indexer.rb', line 24

def self.index_www(
  connection_string: nil, max_sites: -1, max_data: 1_048_576_000
)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_www(max_sites: max_sites, max_data: max_data)
end
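
A usage sketch (the limits are illustrative; uncrawled Urls are assumed to already exist in the database):

require 'wgit'

# Crawl until 5 whole sites or ~500MB of data, whichever comes first.
Wgit.index_www(max_sites: 5, max_data: 524_288_000)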

.indexed_search(query, connection_string: nil, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Object

Performs a search of the database's indexed documents and pretty prints the results. See Wgit::Database#search and Wgit::Document#search for details of how the search works.

Parameters:

  • query: The text query to search with.

  • connection_string (defaults to: nil): The database connection string. Set as nil to use ENV['WGIT_CONNECTION_STRING'].

  • case_sensitive (defaults to: false): Whether character case must match.

  • whole_sentence (defaults to: true): Whether multiple words should be searched for separately.

  • limit (defaults to: 10): The max number of results to print.

  • skip (defaults to: 0): The number of DB records to skip.

  • sentence_limit (defaults to: 80): The max length of each result's text snippet.

Yields:

  • (doc)

    Given each search result (Wgit::Document) returned from the database.



# File 'lib/wgit/indexer.rb', line 101

def self.indexed_search(
  query, connection_string: nil,
  case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, &block
)
  db = Wgit::Database.new(connection_string)

  results = db.search(
    query,
    case_sensitive: case_sensitive,
    whole_sentence: whole_sentence,
    limit: limit,
    skip: skip,
    &block
  )

  results.each do |doc|
    doc.search!(
      query,
      case_sensitive: case_sensitive,
      whole_sentence: whole_sentence,
      sentence_limit: sentence_limit
    )
  end

  Wgit::Utils.printf_search_results(results)
end
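
A usage sketch (the query text is illustrative):

require 'wgit'

# Search the indexed documents and pretty print the top 5 results,
# with each text snippet capped at 100 characters.
Wgit.indexed_search('ruby web scraping', limit: 5, sentence_limit: 100)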

.logger ⇒ Logger

Returns the current Logger instance.

Returns:

  • The current Logger instance.



# File 'lib/wgit/logger.rb', line 15

def self.logger
  @logger
end

.logger=(logger) ⇒ Logger

Sets the current Logger instance.

Parameters:

  • logger: The Logger instance to use.

Returns:

  • The current Logger instance, having been set.



# File 'lib/wgit/logger.rb', line 23

def self.logger=(logger)
  @logger = logger
end
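
For example, a sketch of swapping in a custom logger (the log file path is illustrative):

require 'wgit'
require 'logger'

# Send Wgit's log output to a file instead of STDOUT.
Wgit.logger = Logger.new('wgit.log', level: :debug)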

.use_default_logger ⇒ Logger

Sets the default Logger instance to be used by Wgit.

Returns:

  • The default Logger instance.



# File 'lib/wgit/logger.rb', line 41

def self.use_default_logger
  @logger = default_logger
end

.version ⇒ Object

Returns the current gem version of Wgit as a String.



# File 'lib/wgit/version.rb', line 11

def self.version
  VERSION
end

.version_str ⇒ Object

Returns the current gem version in a presentation String.



# File 'lib/wgit/version.rb', line 16

def self.version_str
  "wgit v#{VERSION}"
end
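
Given VERSION above:

Wgit.version     # => '0.7.0'
Wgit.version_str # => "wgit v0.7.0"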