Class: Html2rss::AutoSource
- Inherits: Object
- Defined in:
- lib/html2rss/auto_source.rb,
lib/html2rss/auto_source/cleanup.rb,
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb
Overview
The AutoSource class automatically extracts articles from a given URL by utilizing a collection of Scrapers. These scrapers analyze and parse popular structured data formats—such as schema, microdata, and open graph—to identify and compile article elements into unified articles.
Scrapers supporting plain HTML are also available for sites without structured data, though results may vary based on page markup.
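The select-every-matching-scraper-then-merge idea can be sketched with simplified stand-ins. All class names, method names, and page content below are illustrative assumptions, not html2rss's real API:

```ruby
# Minimal sketch of the AutoSource idea: each scraper declares whether it can
# handle a page, every matching scraper runs, and the candidates are merged.
# These scraper classes are stand-ins, not the gem's real implementations.
SchemaScraper = Struct.new(:body) do
  def self.articles?(body) = body.include?('application/ld+json')

  def each_article = [{ title: 'From JSON-LD', url: 'https://example.com/a' }]
end

SemanticHtmlScraper = Struct.new(:body) do
  def self.articles?(body) = body.include?('<article')

  def each_article = [{ title: 'From <article> tags', url: 'https://example.com/b' }]
end

SCRAPERS = [SchemaScraper, SemanticHtmlScraper].freeze

def extract(body)
  # Select every scraper that recognizes the page shape, then merge results.
  matching = SCRAPERS.select { |klass| klass.articles?(body) }
  matching.flat_map { |klass| klass.new(body).each_article }
end

page = '<html><article>hi</article><script type="application/ld+json">{}</script></html>'
titles = extract(page).map { |a| a[:title] }
# => ["From JSON-LD", "From <article> tags"]
```

Because both stand-in scrapers match the sample page, both contribute candidates; on a page without JSON-LD, only the semantic-HTML scraper would run.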
Defined Under Namespace
Modules: Scraper
Classes: Cleanup
Constant Summary
- DEFAULT_CONFIG =
  {
    scraper: {
      wordpress_api: { enabled: true },
      schema: { enabled: true },
      microdata: { enabled: true },
      json_state: { enabled: true },
      semantic_html: { enabled: true },
      html: {
        enabled: true,
        minimum_selector_frequency: Scraper::Html::DEFAULT_MINIMUM_SELECTOR_FREQUENCY,
        use_top_selectors: Scraper::Html::DEFAULT_USE_TOP_SELECTORS
      }
    },
    cleanup: Cleanup::DEFAULT_CONFIG
  }.freeze
- Config =
  Dry::Schema.Params do
    optional(:scraper).hash(&SCRAPER_CONFIG)
    optional(:cleanup).hash do
      optional(:keep_different_domain).filled(:bool)
      optional(:min_words_title).filled(:integer, gt?: 0)
    end
  end
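A caller only needs to supply the keys it overrides, with the rest falling back to DEFAULT_CONFIG. A sketch of that merge with plain hashes (the abbreviated defaults, placeholder numbers, and `deep_merge` helper below are illustrative assumptions, not part of html2rss):

```ruby
# Sketch: overriding part of a nested default config. The defaults are
# abbreviated and the numeric values are placeholders, not the gem's.
DEFAULTS = {
  scraper: {
    schema: { enabled: true },
    html: { enabled: true, minimum_selector_frequency: 2 } # placeholder value
  },
  cleanup: { keep_different_domain: false, min_words_title: 3 } # placeholder values
}.freeze

# Recursively merge nested hashes; non-hash values from the override win.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

opts = deep_merge(DEFAULTS, { scraper: { html: { enabled: false } } })
opts[:scraper][:html]   # => { enabled: false, minimum_selector_frequency: 2 }
opts[:scraper][:schema] # => { enabled: true }
```

Untouched branches (`schema`, `cleanup`) survive the merge, so disabling one scraper does not discard the other defaults.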
Instance Method Summary
-
#articles ⇒ Array<Html2rss::RssBuilder::Article>
Extracts article candidates by selecting every scraper that can explain the page shape, running those scrapers, and normalizing the resulting hashes into RssBuilder::Article objects.
- #initialize(response, opts = DEFAULT_CONFIG, request_session: nil) ⇒ void constructor
Constructor Details
#initialize(response, opts = DEFAULT_CONFIG, request_session: nil) ⇒ void
# File 'lib/html2rss/auto_source.rb', line 84

def initialize(response, opts = DEFAULT_CONFIG, request_session: nil)
  @parsed_body = response.parsed_body
  @url = response.url
  @opts = opts
  @request_session = request_session
end
Instance Method Details
#articles ⇒ Array<Html2rss::RssBuilder::Article>
Extracts article candidates by selecting every scraper that can explain the
page shape, running those scrapers, and normalizing the resulting hashes
into RssBuilder::Article objects.
The contributor-facing flow is:
- choose scraper instances that match the page
- let each scraper collect its own candidates
- clean and deduplicate the merged article list
Scrapers with expensive precomputation, such as SemanticHtml, keep that
state on the instance so detection and extraction can reuse the same work.
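The "clean and deduplicate the merged article list" step above can be illustrated with plain hashes. The helper and the `MIN_WORDS_TITLE` threshold below are stand-ins for the Cleanup class, not the gem's implementation:

```ruby
# Sketch of the clean-and-deduplicate step: drop candidates whose titles are
# too short, then keep the first article seen for each URL. Illustrative only.
MIN_WORDS_TITLE = 2 # stand-in for the cleanup config's min_words_title

def cleanup(candidates)
  candidates
    .select { |a| a[:title].to_s.split.size >= MIN_WORDS_TITLE }
    .uniq { |a| a[:url] }
end

merged = [
  { title: 'Big launch announced', url: 'https://example.com/1' },
  { title: 'Big launch announced', url: 'https://example.com/1' }, # from a 2nd scraper
  { title: 'Ad', url: 'https://example.com/2' } # title too short, dropped
]
kept = cleanup(merged)
# => [{ title: 'Big launch announced', url: 'https://example.com/1' }]
```

Running several scrapers over one page naturally yields duplicates for the same story, which is why deduplication happens after the merge rather than inside each scraper.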
# File 'lib/html2rss/auto_source.rb', line 105

def articles
  @articles ||= extract_articles
rescue Html2rss::AutoSource::Scraper::NoScraperFound => error
  Log.warn "#{self.class}: no scraper matched #{url} (#{error.message})"
  []
end
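The shape of #articles (compute once, memoize, and degrade to an empty list when no scraper matches) can be sketched in isolation. `NoScraperFound` and `Source` below are simplified stand-ins for the gem's classes:

```ruby
# Sketch of the #articles error-handling shape: memoize the expensive
# extraction and return [] on a known "no scraper matched" failure.
class NoScraperFound < StandardError; end # stand-in for the gem's error class

class Source
  def initialize(&extractor)
    @extractor = extractor # stand-in for the real extract_articles work
  end

  def articles
    @articles ||= @extractor.call
  rescue NoScraperFound => error
    warn "#{self.class}: no scraper matched (#{error.message})"
    []
  end
end

calls = 0
ok = Source.new { calls += 1; [{ title: 'hit' }] }
ok.articles
ok.articles # memoized: the extractor block ran only once

failing = Source.new { raise NoScraperFound, 'nothing matched' }
failing.articles # => []
```

Note that the rescue returns [] without memoizing it, so a later call would retry extraction; the warning goes to the log rather than raising, keeping feed generation alive for pages no scraper understands.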