Wgit Change Log

v0.0.0 - BREAKING CHANGES

### Added

- ...

### Changed/Removed

- ...

### Fixed

- ...

v0.11.0 - BREAKING CHANGES

This release is a biggie with the main headline being the introduction of robots.txt support (see below). This release introduces several breaking changes so take care when updating your current version of Wgit.

### Added

- Ability to prevent indexing via `robots.txt` and `noindex` values in HTML `meta` elements and the HTTP response header `X-Robots-Tag`. See the new class `Wgit::RobotsParser` and the updated `Wgit::Indexer#index_*` methods. Also see the wiki article on the subject.
- `Wgit::RobotsParser` class for parsing `robots.txt` files.
- `Wgit::Response#no_index?` and `Wgit::Document#no_index?` methods (see wiki article above).
- Added two new default extractors which extract robots `meta` elements for use in `Wgit::Document#no_index?`.
- Added `Wgit::Document.to_h_ignore_vars` Array for user manipulation.
- Added `Wgit::Utils.pprint` method to aid debugging.
- Added `Wgit::Utils.sanitize_url` method.
- Added `Wgit::Indexer#index_www(max_urls_per_iteration:, ...)` param.
- Added `Wgit::Url#redirects` and `#redirects=` methods.
- Added `Wgit::Url#redirects_journey` used by `Wgit::Indexer` to insert a Url and its redirects.
- Added `Wgit::Database#bulk_upsert` which `Wgit::Indexer` now uses where possible. This reduces the total database calls made during an index operation.

### Changed/Removed

- Updated `Wgit::Indexer#index_*` methods to honour index prevention methods (see the wiki article).
- Updated `Wgit::Utils.sanitize*` methods so they no longer modify the receiver.
- Updated `Wgit::Crawler#crawl_url` to always return the crawled `Wgit::Document`. If relying on `nil` in your code, you should now use `doc.empty?` instead.
- Updated `Wgit::Indexer` method logs.
- Updated/added custom class `#inspect` methods.
- Renamed `Wgit::Utils.printf_search_results` to `pprint_search_results`.
- Renamed `Wgit::Url#concat` to `#join`. The `#concat` method is now `String#concat`.
- Updated `Wgit::Indexer` methods to now write external Urls to the Database as: `doc.external_urls.map(&:to_origin)` meaning `http://example.com/about` becomes `http://example.com`.
- Updated the following methods to no longer omit trailing slashes from Urls: `Wgit::Url` - `#to_path`, `#omit_base`, `#omit_origin` and `Wgit::Document` - `#internal_links`, `#internal_absolute_links`, `#external_links`. For an average website, this results in ~30% fewer network requests when crawling.
- Updated Ruby version to `3.3.0`.
- Updated all bundle dependencies to latest versions, see `Gemfile.lock` for exact versions.

### Fixed

- `Wgit::Crawler#crawl_site` now internally records all redirects for a given Url.
- `Wgit::Crawler#crawl_site` infinite loop when using Wgit on a Ruby version > `3.0.2`.
- Various other minor fixes/improvements throughout the code base.
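The change to how external Urls are stored (reduced to their origin) can be illustrated with a minimal sketch. It uses only Ruby's standard `URI` library; `origin_of` is a hypothetical helper written for this example and is not Wgit's implementation, and the de-duplication step is an assumption added to show why storing origins can cut down the number of stored Urls.

```ruby
require 'uri'

# Hypothetical helper (not part of Wgit): reduce an absolute URL to its
# origin, i.e. scheme + host, plus the port when it is not the default.
def origin_of(url)
  uri = URI.parse(url)
  origin = "#{uri.scheme}://#{uri.host}"
  origin += ":#{uri.port}" unless uri.port == uri.default_port
  origin
end

external_urls = [
  'http://example.com/about',
  'http://example.com/contact?ref=nav',
  'https://other.example.org:8443/docs/'
]

# Mapping each URL to its origin and de-duplicating (an assumption of this
# sketch) leaves a single entry per external site.
origins = external_urls.map { |url| origin_of(url) }.uniq
puts origins
```

Here the two `example.com` pages collapse into one origin, while the non-default port on the third URL is preserved.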

v0.10.8

### Added

- Custom `#inspect` methods to `Wgit::Url` and `Wgit::Document` classes.
- `Document.remove_extractors` method, which removes all default and defined extractors.

### Changed/Removed

- ...

### Fixed

- ...

v0.10.7

### Added

- ...

### Changed/Removed

- ...

### Fixed

- Security vulnerabilities by updating gem dependencies.

v0.10.6

### Added

- `Wgit::DSL` method `#crawl_url` (aliased to `#crawl`).

### Changed/Removed

- Added a `&block` param to `Wgit::Document#extract`, which gets passed to `#extract_from_html`.

### Fixed

- ...

v0.10.5

### Added

- `Database#last_result` getter method to return the most recent raw mongo result.

### Changed/Removed

- ...

### Fixed

- ...

v0.10.4

### Added

- `Database#search_text` method which returns a Hash of `url => text_results` instead of `Wgit::Documents` (like `#search`).

### Changed/Removed

- ...

### Fixed

- ...

v0.10.3

### Added

- ...

### Changed/Removed

- Changed `Database#create_collections` and `#create_unique_indexes` by removing `rescue nil` from their database operations. Now any underlying errors with the database client are not masked.

### Fixed

- ...

v0.10.2

### Added

- `Wgit::Base#setup` and `#teardown` methods (lifecycle hooks) that can be overridden by subclasses.

### Changed/Removed

- ...

### Fixed

- ...

v0.10.1

### Added

- Support for Ruby 3.

### Changed/Removed

- Removed support for Ruby 2.5 (as it's too old).

### Fixed

- ...

v0.10.0

### Added

- `Wgit::Url#scheme_relative?` method.

### Changed/Removed

- Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.

### Fixed

- Scheme-relative bug by adding support for scheme-relative URLs.
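For readers unfamiliar with scheme-relative URLs (those beginning with `//`), the following is a minimal standalone sketch of detecting one and prefixing a scheme via a defaulted positional parameter. The function names echo the entries above purely for readability; this is not Wgit's implementation, and the default of `'http'` is an assumption made for the example.

```ruby
# Hypothetical standalone helpers, not Wgit's code.
def scheme_relative?(url)
  url.start_with?('//')
end

# The scheme is a defaulted positional parameter; 'http' is an assumed default.
def prefix_scheme(url, scheme = 'http')
  # Already has a scheme such as 'https://', so return it unchanged.
  return url if url.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
  # Scheme-relative: '//host/path' only needs 'scheme:' in front.
  return "#{scheme}:#{url}" if scheme_relative?(url)

  # Bare 'host/path' needs the full 'scheme://' prefix.
  "#{scheme}://#{url}"
end

puts prefix_scheme('//example.com/app.js', 'https')
puts prefix_scheme('example.com')
```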

v0.9.0

This release is a big one with the introduction of a `Wgit::DSL` and JavaScript parse support. The README has been revamped as a result with new usage examples, and all of the wiki articles have been updated to reflect the latest code base.

### Added

- `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
- `Wgit::Crawler#parse_javascript` which when set to `true` uses Chrome to parse a page's JavaScript before returning the fully rendered HTML. This feature is disabled by default.
- `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
- `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
- `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next urls to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
- `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
- `Wgit::Database#clear_db!` alias.
- `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call nokogiri underneath.
- `Wgit::Document#extract` method to perform one off content extractions.
- `Wgit::Indexer#index_urls` method which can index several urls in one call.
- `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.

### Changed/Removed

- Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
- Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
- Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
- Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
- Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
- Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
- Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
- Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
- Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
- Breaking change: Changed `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
- Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
- Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
- Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
- Breaking change: Refactored `Wgit::Url#relative?`, which now takes `:origin` instead of `:base`; `:origin` takes the port into account. This has a knock-on effect for some other methods too - check the docs if you're getting parameter errors.
- Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
- Updated `Utils.printf_search_results` to return the number of results.
- Updated `Wgit::Indexer.new` which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new` which works if `ENV['WGIT_CONNECTION_STRING']` is set.
- Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
- Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).

### Fixed

- Re-indexing bug so that indexing content a 2nd time will update it in the database - before it simply discarded the document.
- `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.

v0.8.0

### Added

- To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
- `Wgit::Document#description` default extension.
- `Wgit::Url.parse_or_nil` method.

### Changed/Removed

- Breaking change: Renamed `Document#stats[:text_snippets]` to be `:text`.
- Breaking change: `Wgit::Document.define_extension`'s block return value now becomes the `var` value, even when `nil` is returned. This allows `var` to be set to `nil`.
- Potential breaking change: Renamed `Wgit::Response#crawl_time` (alias) to be `#crawl_duration`.
- Updated `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS` to be `Wgit::Crawler.supported_file_extensions`, making it configurable. Now you can add your own URL extensions if needed.
- Updated the Wgit core extension `String#to_url` to use `Wgit::Url.parse` allowing instances of `Wgit::Url` to be returned as is. This also affects `Enumerable#to_urls` in the same way.

### Fixed

- An issue where too much `Wgit::Document#text` was being extracted from the HTML. This was fixed by reverting the recent commit: "Document.text_elements_xpath is now `//*/text()`".

v0.7.0

### Added

- `Wgit::Indexer.new` optional `crawler:` named param.
- `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
- `Document.extensions` returning a Set of all defined extensions.

### Changed/Removed

- Potential breaking changes: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search`, `Wgit.indexed_search` etc. This brings back more relevant search results by default.
- Updated the Docker image to now include index names; making it easier to identify them.

### Fixed

- ...

v0.6.0

### Added

- Added `Wgit::Utils.process_arr` `encode:` param.

### Changed/Removed

- Breaking changes: Updated `Wgit::Response#success?` and `#failure?` logic.
- Breaking changes: Updated `Wgit::Crawler` redirect logic. See the docs for more info.
- Breaking changes: Updated `Wgit::Crawler#crawl_site` path params logic to support globs e.g. `allow_paths: 'wiki/*'`. See the docs for more info.
- Breaking changes: Refactored references of `encode_html:` to `encode:` in the `Wgit::Document` and `Wgit::Crawler` classes.
- Breaking changes: `Wgit::Document.text_elements_xpath` is now `//*/text()`. This means that more text is extracted from each page and you can no longer be selective of the text elements on a page.
- Improved `Wgit::Url#valid?` and `#relative?`.

### Fixed

- Bug fix in `Wgit::Crawler#crawl_site` where `*.php` URLs weren't being crawled. The fix was to implement `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS`.
- Bug fix in `Wgit::Document#search`.
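To give a feel for glob-style path filtering such as `allow_paths: 'wiki/*'`, here is a minimal sketch built on Ruby's standard `File.fnmatch?`. The `path_allowed?` helper and its `allow:`/`disallow:` keywords are hypothetical names for this example; whether Wgit matches globs this way internally is an assumption, so check the docs for the real semantics.

```ruby
# Hypothetical helper (not Wgit's code): decide whether a URL path may be
# crawled, given glob patterns such as 'wiki/*'.
def path_allowed?(path, allow: [], disallow: [])
  matches = lambda do |patterns|
    patterns.any? { |pattern| File.fnmatch?(pattern, path) }
  end

  # A disallow match always wins; an empty allow list permits everything else.
  return false if matches.call(disallow)

  allow.empty? || matches.call(allow)
end

puts path_allowed?('wiki/Home', allow: ['wiki/*'])
puts path_allowed?('about', allow: ['wiki/*'])
puts path_allowed?('wiki/Special:Log', allow: ['wiki/*'], disallow: ['wiki/Special*'])
```

In this sketch a path is crawled only when it matches an allow pattern (if any are given) and matches no disallow pattern.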

v0.5.1

### Added

- `Wgit.version_str` method.

### Changed/Removed

- Switched to optimistic dependency versioning.

### Fixed

- Bug in `Wgit::Url#concat`.

v0.5.0

### Added

- A Wgit Wiki! https://github.com/michaeltelford/wgit/wiki
- `Wgit::Document#content` alias for `#html`.
- `Wgit::Url#prefix_base` method.
- `Wgit::Url#to_addressable_uri` method.
- Support for partially crawling a site using `Wgit::Crawler#crawl_site(allow_paths: [])` or `disallow_paths:`.
- `Wgit::Url#+` as alias for `#concat`.
- `Wgit::Url#invalid?` method.
- `Wgit.version` method.
- `Wgit::Response` class containing adapter agnostic HTTP response logic.

### Changed/Removed

- Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the docs.
- Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
- Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains the `#` prefix; and the query value no longer contains the `?` prefix.
- Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
- Breaking changes: Renamed `Wgit::Url#prefix_protocol` to `#prefix_scheme`. The `protocol:` param name remains unchanged.
- Breaking changes: Renamed all `Wgit::Url` methods starting with `without_*` to `omit_*`.
- Breaking changes: `Wgit::Indexer` no longer inserts invalid external URLs (to be crawled at a later date).
- Breaking changes: `Wgit::Crawler#last_response` is now of type `Wgit::Response`. You can access the underlying `Typhoeus::Response` object with `crawler.last_response.adapter_response`.

### Fixed

- Bug in `Wgit::Document#base_url` around the handling of invalid base URL scenarios.
- Several bugs in `Wgit::Database` class caused by the recent changes to the data model (in version 0.3.0).
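The prefix-free query and fragment convention mentioned above can be seen with a short sketch. Ruby's standard `URI` library happens to follow the same convention, so it is used here to avoid a dependency; neither `Addressable::URI` nor Wgit is involved, and the URL is purely illustrative.

```ruby
require 'uri'

uri = URI.parse('http://example.com/search?q=ruby&page=2#results')

# Both components come back without their delimiter characters.
query    = uri.query    # no leading '?'
fragment = uri.fragment # no leading '#'

# Code that previously relied on the prefixes must now add them explicitly.
rebuilt = "#{uri.path}?#{query}##{fragment}"
puts rebuilt
```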

v0.4.1

### Added

- ...

### Changed/Removed

- ...

### Fixed

- A crawl bug that resulted in some servers dropping requests due to the use of Typhoeus's default `User-Agent` header. This has now been changed.

v0.4.0

### Added

- `Wgit::Document#stats` alias `#statistics`.
- `Wgit::Crawler#time_out` logic for long crawls. Can also be set via `initialize`.
- `Wgit::Crawler#last_response#redirect_count` method logic.
- `Wgit::Crawler#last_response#total_time` method logic.
- `Wgit::Utils.fetch(hash, key, default = nil)` method which tries multiple key formats before giving up e.g. `:foo, 'foo', 'FOO'` etc.

### Changed/Removed

- Breaking changes: Updated `Wgit::Crawler` crawl logic to use `typhoeus` instead of `Net::HTTP`. Users should see a significant improvement in crawl speed as a result. This means that `Wgit::Crawler#last_response` is now of type `Typhoeus::Response`. See https://rubydoc.info/gems/typhoeus/Typhoeus/Response for more info.

### Fixed

- ...
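The idea of a fetch that tries several key formats can be sketched in a few lines of plain Ruby. `flexible_fetch` is a hypothetical name for this illustration, and the lookup order (Symbol, lowercase String, uppercase String) is an assumption based on the `:foo, 'foo', 'FOO'` example above rather than a description of the actual `Wgit::Utils.fetch` code.

```ruby
# Hypothetical sketch, not Wgit's implementation.
def flexible_fetch(hash, key, default = nil)
  base = key.to_s.downcase

  # Try each key format in turn and return the first one present.
  [base.to_sym, base, base.upcase].each do |candidate|
    return hash[candidate] if hash.key?(candidate)
  end

  default
end

puts flexible_fetch({ 'FOO' => 'upper' }, :foo)
puts flexible_fetch({ foo: 'symbol' }, 'FOO')
puts flexible_fetch({}, :foo, 'fallback')
```

Using `hash.key?` rather than a truthiness check means stored `nil` or `false` values are still returned instead of falling through to the default.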

v0.3.0

### Added

- `Url#crawl_duration` method.
- `Document#crawl_duration` method.
- `Benchmark.measure` to Crawler logic to set `Url#crawl_duration`.

### Changed/Removed

- Breaking changes: Updated data model to embed the full `url` object inside the documents object.
- Breaking changes: Updated data model by removing documents `score` attribute.

### Fixed

- ...

v0.2.0

This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit

### Added

- `Wgit::Url#absolute?` method.
- `Wgit::Url#relative?` `base: url` support.
- `Wgit::Database.connect` method (alias for `Wgit::Database.new`).
- `Wgit::Database#search` and `Wgit::Document#search` methods now support `case_sensitive:` and `whole_sentence:` named parameters.

### Changed/Removed

- Breaking changes: Renamed the following `Wgit` and `Wgit::Indexer` methods: `Wgit.index_the_web` to `Wgit.index_www`, `Wgit::Indexer.index_the_web` to `Wgit::Indexer.index_www`, `Wgit.index_this_site` to `Wgit.index_site`, `Wgit::Indexer.index_this_site` to `Wgit::Indexer.index_site`, `Wgit.index_this_page` to `Wgit.index_page`, `Wgit::Indexer.index_this_page` to `Wgit::Indexer.index_page`.
- Breaking changes: All `Wgit::Indexer` methods now take named parameters.
- Breaking changes: The following `Wgit::Url` method signatures have changed: `initialize` aka `new`.
- Breaking changes: The following `Wgit::Url` class methods have been removed: `.validate`, `.valid?`, `.prefix_protocol`, `.concat` in favour of instance methods by the same names.
- Breaking changes: The following `Wgit::Url` instance methods/aliases have been changed/removed: `#to_protocol` (now `#to_scheme`), `#to_query_string` and `#query_string` (now `#to_query`), `#relative_link?` (now `#relative?`), `#without_query_string` (now `#without_query`), `#is_query_string?` (now `#query?`).
- Breaking changes: The database connection string is now passed directly to `Wgit::Database.new`; or in its absence, obtained from `ENV['WGIT_CONNECTION_STRING']`. See the `README.md` section entitled: `Practical Database Example` for an example.
- Breaking changes: The following `Wgit::Database` instance methods now take named parameters: `#urls`, `#crawled_urls`, `#uncrawled_urls`, `#search`.
- Breaking changes: The following `Wgit::Document` instance methods now take named parameters: `#to_h`, `#to_json`, `#search`, `#search!`.
- Breaking changes: The following `Wgit::Document` instance methods/aliases have been changed/removed: `#internal_full_links` (now `#internal_absolute_links`).
- Breaking changes: Any `Wgit::Document` method alias for returning links containing the word `relative` has been removed for clarity. Use `#internal_links`, `#internal_absolute_links` or `#external_links` instead.
- Breaking changes: `Wgit::Crawler` instance vars `@docs` and `@urls` have been removed causing the following instance methods to also be removed: `#urls=`, `#[]`, `#<<`. Also, `.new` aka `#initialize` now requires no params.
- Breaking changes: `Wgit::Crawler.new` now takes an optional `redirect_limit:` parameter. This is now the only way of customising the redirect crawl behaviour. `Wgit::Crawler.redirect_limit` no longer exists.
- Breaking changes: The following `Wgit::Crawler` instance method signatures have changed: `#crawl_site` and `#crawl_url` now require a `url` param (which no longer defaults), `#crawl_urls` now requires one or more `*urls` (which no longer defaults).
- Breaking changes: The following `Wgit::Assertable` method aliases have been removed: `.type`, `.types` (use `.assert_types` instead) and `.arr_type`, `.arr_types` (use `.assert_arr_types` instead).
- Breaking changes: The following `Wgit::Utils` methods now take named parameters: `.to_h` and `.printf_search_results`.
- Breaking changes: `Wgit::Utils.printf_search_results`'s method signature has changed; the search parameters have been removed. Before calling this method you must call `doc.search!` on each of the results. See the docs for the full details.
- `Wgit::Document` instances can now be instantiated with String Urls (previously only `Wgit::Url`s).

### Fixed

- ...

v0.0.18

### Added

- `Wgit::Url#to_brand` method and updated `Wgit::Url#is_relative?` to support it.

### Changed/Removed

- Updated certain classes by changing some `private` methods to `protected`.

### Fixed

- ...

v0.0.17

### Added

- Support for the `<base>` element in `Wgit::Document`s.
- New `Wgit::Url` methods: `without_query_string`, `is_query_string?`, `is_anchor?`, `replace` (override of `String#replace`).

### Changed/Removed

- Breaking changes: Removed `Wgit::Document#internal_links_without_anchors` method.
- Breaking changes (potentially): `Wgit::Url`s are now replaced with the redirected to Url during a crawl.
- Updated `Wgit::Document#base_url` to support an optional `link:` named parameter.
- Updated `Wgit::Crawler#crawl_site` to allow the initial url to redirect to another host.
- Updated `Wgit::Url#is_relative?` to support an optional `domain:` named parameter.

### Fixed

- Bug in `Wgit::Document#internal_full_links` affecting anchor and query string links including those used during `Wgit::Crawler#crawl_site`.
- Bug causing an 'Invalid URL' error for `Wgit::Crawler#crawl_site`.

v0.0.16

### Added

- Added `Wgit::Url.parse` class method as alias for `Wgit::Url.new`.

### Changed/Removed

- Breaking changes: Removed `Wgit::Url.relative_link?` (class method). Use `Wgit::Url#is_relative?` (instance method) instead e.g. `Wgit::Url.new('/blah').is_relative?`.

### Fixed

- Several URI related bugs in `Wgit::Url` affecting crawls.

v0.0.15

### Added

- Support for IRIs (non-ASCII based URLs).

### Changed/Removed

- Breaking changes: Removed `Document` and `Url#to_hash` aliases. Call `to_h` instead.

### Fixed

- Bug in `Crawler#crawl_site` where an internal redirect to an external site's page was being followed.

v0.0.14

### Added

- `Indexer#index_this_page` method.

### Changed/Removed

- Breaking Changes: `Wgit::CONNECTION_DETAILS` now only requires `DB_CONNECTION_STRING`.

### Fixed

- Found and fixed a bug in `Document#new`.