Class: Html2rss::AutoSource

Inherits:
Object
Defined in:
lib/html2rss/auto_source.rb,
lib/html2rss/auto_source/cleanup.rb,
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

The AutoSource class automatically extracts articles from a given URL by running a collection of scrapers. These scrapers parse popular structured data formats (such as schema, microdata, and Open Graph) to identify article elements and combine them into unified articles.

Scrapers supporting plain HTML are also available for sites without structured data, though results may vary based on page markup.

Defined Under Namespace

Modules: Scraper
Classes: Cleanup

Constant Summary

DEFAULT_CONFIG =
{
  scraper: {
    wordpress_api: {
      enabled: true
    },
    schema: {
      enabled: true
    },
    microdata: {
      enabled: true
    },
    json_state: {
      enabled: true
    },
    semantic_html: {
      enabled: true
    },
    html: {
      enabled: true,
      minimum_selector_frequency: Scraper::Html::DEFAULT_MINIMUM_SELECTOR_FREQUENCY,
      use_top_selectors: Scraper::Html::DEFAULT_USE_TOP_SELECTORS
    }
  },
  cleanup: Cleanup::DEFAULT_CONFIG
}.freeze
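
Because DEFAULT_CONFIG is frozen, a caller who wants to change a single setting has to build a new hash rather than mutate the default. The following is a minimal sketch of one way to do that with a recursive merge. `DEFAULTS`, its placeholder values, and the `deep_merge` helper are assumptions made for illustration; they are not part of the gem, and whether AutoSource merges partial options itself is not shown on this page.

```ruby
# Stand-in for the documented defaults. The html and cleanup values are
# placeholders, not the gem's real constants.
DEFAULTS = {
  scraper: {
    schema: { enabled: true },
    html: { enabled: true, minimum_selector_frequency: 2, use_top_selectors: 5 }
  },
  cleanup: { keep_different_domain: false, min_words_title: 3 }
}.freeze

# Recursively merges +overrides+ into +base+ and returns a new hash.
# Neither argument is mutated; untouched sub-hashes are shared, not copied.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Disable only the plain-HTML scraper and keep every other default.
opts = deep_merge(DEFAULTS, { scraper: { html: { enabled: false } } })
```

The resulting `opts` has the same shape as the defaults, so it could be passed as the second constructor argument in place of DEFAULT_CONFIG.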
Config =
Dry::Schema.Params do
  optional(:scraper).hash(&SCRAPER_CONFIG)

  optional(:cleanup).hash do
    optional(:keep_different_domain).filled(:bool)
    optional(:min_words_title).filled(:integer, gt?: 0)
  end
end
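
The cleanup portion of the schema above declares two optional keys: a boolean and an integer greater than zero. The sketch below restates those two rules in plain Ruby to make them concrete. It is an approximation, not dry-schema: the `cleanup_errors` helper and its messages are hypothetical, and it skips the string coercion that Dry::Schema.Params performs on incoming values.

```ruby
# Plain-Ruby approximation of the two cleanup rules. Returns a hash of
# error messages keyed by option name; an empty hash means the input passes.
def cleanup_errors(cleanup)
  errors = {}

  # optional(:keep_different_domain).filled(:bool)
  if cleanup.key?(:keep_different_domain) &&
     ![true, false].include?(cleanup[:keep_different_domain])
    errors[:keep_different_domain] = 'must be boolean'
  end

  # optional(:min_words_title).filled(:integer, gt?: 0)
  if cleanup.key?(:min_words_title)
    value = cleanup[:min_words_title]
    unless value.is_a?(Integer) && value.positive?
      errors[:min_words_title] = 'must be an integer greater than 0'
    end
  end

  errors
end

cleanup_errors({ keep_different_domain: false, min_words_title: 3 })
cleanup_errors({ min_words_title: 0 })
```

Both keys are optional, so an empty hash passes; a key that is present must satisfy its rule.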

Instance Method Summary

Constructor Details

#initialize(response, opts = DEFAULT_CONFIG, request_session: nil) ⇒ void

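Parameters:

  • response: the fetched page; the constructor reads its parsed_body and url
  • opts: configuration hash (defaults to DEFAULT_CONFIG)
  • request_session: optional request session (defaults to nil)
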
# File 'lib/html2rss/auto_source.rb', line 84

def initialize(response, opts = DEFAULT_CONFIG, request_session: nil)
  @parsed_body = response.parsed_body
  @url = response.url
  @opts = opts
  @request_session = request_session
end

Instance Method Details

#articles ⇒ Array<Html2rss::RssBuilder::Article>

Extracts article candidates by selecting every scraper that can explain the page shape, running those scrapers, and normalizing the resulting hashes into RssBuilder::Article objects.

The contributor-facing flow is:

  1. choose scraper instances that match the page
  2. let each scraper collect its own candidates
  3. clean and deduplicate the merged article list

Scrapers with expensive precomputation, such as SemanticHtml, keep that state on the instance so detection and extraction can reuse the same work.
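
The three-step flow described above can be sketched with toy scrapers. Everything in this block is hypothetical: the scraper classes, the `match?` and `articles` method names, the page-as-hash input, and the choice to deduplicate by URL are illustration only and do not reflect the gem's real scraper interface or Cleanup rules.

```ruby
# Toy scraper for pages that expose a list of structured items.
class ItemScraper
  def self.match?(page)
    page.key?(:items)
  end

  def initialize(page)
    @page = page
  end

  def articles
    @page[:items].map { |item| { title: item[:name], url: item[:link] } }
  end
end

# Toy scraper for pages that only expose plain links.
class LinkScraper
  def self.match?(page)
    page.key?(:links)
  end

  def initialize(page)
    @page = page
  end

  def articles
    @page[:links].map { |link| { title: link[:text], url: link[:href] } }
  end
end

SCRAPERS = [ItemScraper, LinkScraper].freeze

def extract_articles(page, scrapers: SCRAPERS)
  # 1. choose scraper instances that match the page
  matching = scrapers.select { |scraper| scraper.match?(page) }
                     .map { |scraper| scraper.new(page) }

  # 2. let each scraper collect its own candidates
  candidates = matching.flat_map(&:articles)

  # 3. clean and deduplicate the merged list (here: drop blank titles,
  #    then keep the first candidate seen for each URL)
  candidates
    .reject { |article| article[:title].to_s.strip.empty? }
    .uniq { |article| article[:url] }
end
```

Because more than one scraper can match the same page, the same article may be found several times; that is why deduplication happens once, after the candidates are merged, rather than inside each scraper.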

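Returns:

  • (Array<Html2rss::RssBuilder::Article>): the extracted articles, or an empty array when no scraper matches the page
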
# File 'lib/html2rss/auto_source.rb', line 105

def articles
  @articles ||= extract_articles
rescue Html2rss::AutoSource::Scraper::NoScraperFound => error
  Log.warn "#{self.class}: no scraper matched #{url} (#{error.message})"
  []
end
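
One consequence of combining `||=` with a method-level `rescue` is worth spelling out: when extraction raises, the assignment never happens, so the empty array from the rescue branch is returned but not memoized. The sketch below demonstrates this with a hypothetical stand-in class; `FlakyExtractor`, `fail_times`, and `attempts` are invented for the demonstration. In the real class, a retry on the same page would presumably reach the same outcome.

```ruby
# Hypothetical stand-in for AutoSource that counts extraction attempts.
class FlakyExtractor
  class NoScraperFound < StandardError; end

  attr_reader :attempts

  def initialize(fail_times:)
    @fail_times = fail_times
    @attempts = 0
  end

  def articles
    @articles ||= extract_articles
  rescue NoScraperFound
    [] # @articles is still nil here, so the next call tries again
  end

  private

  # Raises for the first +fail_times+ calls, then succeeds.
  def extract_articles
    @attempts += 1
    raise NoScraperFound, 'nothing matched' if @attempts <= @fail_times

    ['article']
  end
end

extractor = FlakyExtractor.new(fail_times: 1)
extractor.articles # => []           (first attempt raised)
extractor.articles # => ["article"]  (retried, then memoized)
extractor.articles # => ["article"]  (served from @articles)
extractor.attempts # => 2
```

A successful extraction that yields an empty array is memoized, since an empty array is truthy in Ruby; only `nil` or `false` would trigger another attempt.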