Module: Wgit

Defined in:
lib/wgit/version.rb,
lib/wgit/url.rb,
lib/wgit/utils.rb,
lib/wgit/logger.rb,
lib/wgit/crawler.rb,
lib/wgit/indexer.rb,
lib/wgit/document.rb,
lib/wgit/response.rb,
lib/wgit/assertable.rb,
lib/wgit/database/model.rb,
lib/wgit/database/database.rb

Overview

Wgit is a WWW indexer/scraper which crawls URLs and retrieves their page contents for later use.
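
As a quick orientation, a minimal crawl of a single page might look like the sketch below (https://example.com is a placeholder URL; Crawler#crawl returns the crawled page as a Wgit::Document):

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com')

doc = crawler.crawl(url) # Returns the crawled page as a Wgit::Document.
puts doc.title           # The Document exposes extracted fields such as title.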

Author:

  • Michael Telford

Defined Under Namespace

Modules: Assertable, Model, Utils

Classes: Crawler, Database, Document, Indexer, Response, Url

Constant Summary

VERSION = '0.7.0'

The current gem version of Wgit.

Class Method Summary

Class Method Details

.default_logger ⇒ Logger

Returns the default Logger instance.

Returns:

  • The default Logger instance.



# File 'lib/wgit/logger.rb', line 30

def self.default_logger
  logger = Logger.new(STDOUT, progname: 'wgit', level: :info)
  logger.formatter = proc do |_severity, _datetime, progname, msg|
    "[#{progname}] #{msg}\n"
  end
  logger
end
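
Using the formatter above, log output is prefixed with the progname, for example:

Wgit.use_default_logger
Wgit.logger.info('Crawl complete') # Prints "[wgit] Crawl complete"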

.index_page(url, connection_string: nil, insert_externals: true) {|doc| ... } ⇒ Object

Convenience method to index a single webpage using Wgit::Indexer#index_page.

Crawls a single webpage and stores it in the database. There is no max download limit, so be careful of large pages.

Parameters:

  • url: The Url of the webpage to crawl.

  • connection_string (defaults to: nil): The database connection string. Set as nil to use ENV['WGIT_CONNECTION_STRING'].

  • insert_externals (defaults to: true): Whether or not to insert the website's external Urls into the database.

Yields:

  • (doc)

Given the Wgit::Document of the crawled webpage, before it's inserted into the database, allowing for prior manipulation.



# File 'lib/wgit/indexer.rb', line 76

def self.index_page(
  url, connection_string: nil, insert_externals: true, &block
)
  url = Wgit::Url.parse(url)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_page(url, insert_externals: insert_externals, &block)
end
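
A usage sketch (the URL is a placeholder; ENV['WGIT_CONNECTION_STRING'] is assumed to point at a running database):

require 'wgit'

# Index one page, inspecting the Document before it's stored.
Wgit.index_page('https://example.com', insert_externals: false) do |doc|
  puts "Indexing: #{doc.title}"
end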

.index_site(url, connection_string: nil, insert_externals: true, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer

Convenience method to index a single website using Wgit::Indexer#index_site.

Crawls a single website's pages and stores them in the database. There is no max download limit, so be careful which sites you index.

Parameters:

  • url: The base Url of the website to crawl.

  • connection_string (defaults to: nil): The database connection string. Set as nil to use ENV['WGIT_CONNECTION_STRING'].

  • insert_externals (defaults to: true): Whether or not to insert the website's external Urls into the database.

  • allow_paths (defaults to: nil): Filters links by selecting them if their path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (defaults to: nil): Filters links by rejecting them if their path matches (via File.fnmatch?) one of disallow_paths.

Yields:

  • (doc)

Given the Wgit::Document of each crawled webpage, before it's inserted into the database, allowing for prior manipulation.

Returns:

  • The total number of pages crawled within the website.



# File 'lib/wgit/indexer.rb', line 50

def self.index_site(
  url, connection_string: nil, insert_externals: true,
  allow_paths: nil, disallow_paths: nil, &block
)
  url = Wgit::Url.parse(url)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_site(
    url, insert_externals: insert_externals,
         allow_paths: allow_paths, disallow_paths: disallow_paths, &block
  )
end
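
A usage sketch (the URL and glob paths are illustrative only):

require 'wgit'

# Index a site's blog pages only, skipping its archive.
total = Wgit.index_site(
  'https://example.com',
  insert_externals: false,
  allow_paths: 'blog/*',
  disallow_paths: 'blog/archive/*'
)
puts "Crawled #{total} pages"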

.index_www(connection_string: nil, max_sites: -1, max_data: 1_048_576_000) ⇒ Object

Convenience method to index the World Wide Web using Wgit::Indexer#index_www.

Retrieves uncrawled Urls from the database and recursively crawls each site, storing their internal pages in the database and adding their external Urls to be crawled later on. Logs info on the crawl using Wgit.logger as it goes along.

Parameters:

  • connection_string (defaults to: nil): The database connection string. Set as nil to use ENV['WGIT_CONNECTION_STRING'].

  • max_sites (defaults to: -1): The number of separate and whole websites to be crawled before the method exits. Defaults to -1, which means the crawl will occur until manually stopped (Ctrl+C etc.).

  • max_data (defaults to: 1_048_576_000): The maximum number of bytes that will be scraped from the web (default is 1GB). Note that this value is used to determine when to stop crawling; it's not a guarantee of the max data that will be obtained.



# File 'lib/wgit/indexer.rb', line 24

def self.index_www(
  connection_string: nil, max_sites: -1, max_data: 1_048_576_000
)
  db = Wgit::Database.new(connection_string)
  indexer = Wgit::Indexer.new(db)
  indexer.index_www(max_sites: max_sites, max_data: max_data)
end
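
A usage sketch (the limits are illustrative; uncrawled Urls are assumed to already exist in the database):

require 'wgit'

# Crawl until 5 whole sites or ~500MB of data, whichever comes first.
Wgit.index_www(max_sites: 5, max_data: 524_288_000)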

.indexed_search(query, connection_string: nil, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80) {|doc| ... } ⇒ Object

Performs a search of the database's indexed documents and pretty prints the results. See Wgit::Database#search and Wgit::Document#search for details of how the search works.

Parameters:

  • query: The text query to search with.

  • connection_string (defaults to: nil): The database connection string. Set as nil to use ENV['WGIT_CONNECTION_STRING'].

  • case_sensitive (defaults to: false): Whether character case must match.

  • whole_sentence (defaults to: true): Whether multiple words should be searched for separately.

  • limit (defaults to: 10): The max number of results to print.

  • skip (defaults to: 0): The number of DB records to skip.

  • sentence_limit (defaults to: 80): The max length of each result's text snippet.

Yields:

  • (doc)

    Given each search result (Wgit::Document) returned from the database.



# File 'lib/wgit/indexer.rb', line 101

def self.indexed_search(
  query, connection_string: nil,
  case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, &block
)
  db = Wgit::Database.new(connection_string)

  results = db.search(
    query,
    case_sensitive: case_sensitive,
    whole_sentence: whole_sentence,
    limit: limit,
    skip: skip,
    &block
  )

  results.each do |doc|
    doc.search!(
      query,
      case_sensitive: case_sensitive,
      whole_sentence: whole_sentence,
      sentence_limit: sentence_limit
    )
  end

  Wgit::Utils.printf_search_results(results)
end
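
A usage sketch (the query text is illustrative):

require 'wgit'

# Search the indexed documents and pretty print the top 5 results,
# with each text snippet capped at 100 characters.
Wgit.indexed_search('ruby web scraping', limit: 5, sentence_limit: 100)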

.logger ⇒ Logger

Returns the current Logger instance.

Returns:

  • The current Logger instance.



# File 'lib/wgit/logger.rb', line 15

def self.logger
  @logger
end

.logger=(logger) ⇒ Logger

Sets the current Logger instance.

Parameters:

  • logger: The Logger instance to use.

Returns:

  • The current Logger instance, having been set.



# File 'lib/wgit/logger.rb', line 23

def self.logger=(logger)
  @logger = logger
end
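
For example, a sketch of swapping in a custom logger (the log file path is illustrative):

require 'wgit'
require 'logger'

# Send Wgit's log output to a file instead of STDOUT.
Wgit.logger = Logger.new('wgit.log', level: :debug)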

.use_default_logger ⇒ Logger

Sets the default Logger instance to be used by Wgit.

Returns:

  • The default Logger instance.



# File 'lib/wgit/logger.rb', line 41

def self.use_default_logger
  @logger = default_logger
end

.version ⇒ Object

Returns the current gem version of Wgit as a String.



# File 'lib/wgit/version.rb', line 11

def self.version
  VERSION
end

.version_str ⇒ Object

Returns the current gem version in a presentation String.



# File 'lib/wgit/version.rb', line 16

def self.version_str
  "wgit v#{VERSION}"
end
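
Given VERSION above:

Wgit.version     # => '0.7.0'
Wgit.version_str # => "wgit v0.7.0"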