Module: Html2rss::AutoSource::Scraper

Defined in:
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

The Scraper module contains all scrapers that can be used to extract articles. Each scraper must implement an each method that yields article hashes, and an articles? method that returns true if the scraper can potentially extract articles from the given HTML.

Detection is intentionally shallow for most scrapers, but instance-based matching is available for scrapers that need to carry expensive selection state forward into extraction. Scrapers run in parallel threads, so implementations must avoid shared mutable state and degrade by returning no articles when a follow-up would be unsafe or unsupported.
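The contract described above can be sketched with a hypothetical scraper. The class name, the string-based parsing, and the regex are illustrative only; real scrapers in this module operate on a Nokogiri::HTML::Document and use proper selectors.

```ruby
# Hypothetical minimal scraper illustrating the each/articles? contract.
# For brevity, parsed_body is treated as a plain string here.
class TitleScraper
  TITLE_PATTERN = %r{<h2>(.*?)</h2>}m

  # Cheap, shallow detection: can this scraper potentially extract articles?
  def self.articles?(parsed_body)
    parsed_body.match?(TITLE_PATTERN)
  end

  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
  end

  # Yields one article hash per match; returns an Enumerator without a block.
  def each
    return enum_for(:each) unless block_given?

    @parsed_body.scan(TITLE_PATTERN) do |(title)|
      yield({ title: title, url: @url })
    end
  end
end
```

Returning an Enumerator when no block is given keeps the scraper composable with the rest of Enumerable without forcing eager extraction.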

Defined Under Namespace

Classes: Html, JsonState, Microdata, NoScraperFound, Schema, SemanticHtml, WordpressApi

Constant Summary

APP_SHELL_ROOT_SELECTORS =
'#app, #root, #__next, [data-reactroot], [ng-app], [id*="app-shell"]'
APP_SHELL_MAX_ANCHORS =
2
APP_SHELL_MAX_VISIBLE_TEXT_LENGTH =
220
SCRAPERS =
[
  WordpressApi,
  Schema,
  Microdata,
  JsonState,
  SemanticHtml,
  Html
].freeze
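The APP_SHELL_* constants suggest a heuristic for recognizing client-rendered app shells: a root node matching the selectors that contains almost no anchors and almost no visible text. The helper below is a hypothetical, string-based sketch of such a check, not the library's actual implementation, which operates on a parsed Nokogiri document.

```ruby
# Hypothetical app-shell heuristic using the documented thresholds:
# at most 2 anchors and at most 220 characters of visible text.
APP_SHELL_MAX_ANCHORS = 2
APP_SHELL_MAX_VISIBLE_TEXT_LENGTH = 220

def app_shell?(root_html)
  anchor_count = root_html.scan(/<a\b/i).size
  visible_text = root_html.gsub(/<[^>]+>/, '').strip

  anchor_count <= APP_SHELL_MAX_ANCHORS &&
    visible_text.length <= APP_SHELL_MAX_VISIBLE_TEXT_LENGTH
end
```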

Class Method Summary

Class Method Details

.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>

Returns an array of scraper classes that claim to find articles in the parsed body.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML body.

  • opts (Hash) (defaults to: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])

    The options hash.

Returns:

  • (Array<Class>)

    An array of scraper classes that can handle the parsed body.



# File 'lib/html2rss/auto_source/scraper.rb', line 70

def self.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  scrapers = SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(parsed_body) }

  raise no_scraper_found_for(parsed_body) if scrapers.empty?

  scrapers
end
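The two-stage selection in .from (first the enabled check via opts.dig, then the articles? probe) can be reproduced with stand-in scrapers. The stub class and method names below are illustrative and not part of the library.

```ruby
# Stand-in scraper: options_key drives the enabled check,
# matches drives the articles? probe.
StubScraper = Struct.new(:options_key, :matches) do
  def articles?(_parsed_body)
    matches
  end
end

STUB_SCRAPERS = [
  StubScraper.new(:schema, true),
  StubScraper.new(:html, false),
  StubScraper.new(:json_state, true)
].freeze

# Mirrors the selection logic of Scraper.from against the stubs.
def select_scrapers(parsed_body, opts)
  scrapers = STUB_SCRAPERS.select { |s| opts.dig(s.options_key, :enabled) }
  scrapers.select! { |s| s.articles?(parsed_body) }

  raise 'no scraper found' if scrapers.empty?

  scrapers
end
```

Note that a scraper survives only if it is both enabled in the options hash and claims to find articles; either filter alone is not enough.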

.instances_for(parsed_body, url:, request_session: nil, opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Object>

Returns scraper instances ready for extraction. instances_for is the main entry point for extraction: it lets each scraper decide whether it matches using the same instance that will later yield article hashes, which keeps precomputed state close to the scraper that owns it.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML body.

  • url (String, Html2rss::Url)

    The page url.

  • request_session (Html2rss::RequestSession, nil) (defaults to: nil)

    Shared follow-up session.

  • opts (Hash) (defaults to: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])

    The options hash.

Returns:

  • (Array<Object>)

    An array of scraper instances that can handle the parsed body.



# File 'lib/html2rss/auto_source/scraper.rb', line 90

def self.instances_for(parsed_body, url:, request_session: nil,
                       opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  instances = SCRAPERS.filter_map do |scraper|
    next unless opts.dig(scraper.options_key, :enabled)

    instance = scraper.new(parsed_body, url:, request_session:, **opts.fetch(scraper.options_key, {}))
    next unless extractable_instance?(instance, parsed_body)

    instance
  end

  raise no_scraper_found_for(parsed_body) if instances.empty?

  instances
end
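The instance-based matching described above can be sketched as follows: the expensive selection runs once, is memoized on the instance, and the same result backs both the match decision and extraction. The class, its regex, and the articles? signature are hypothetical; they only illustrate the pattern, not the library's API.

```ruby
# Hypothetical scraper whose match decision and extraction share state.
class ListScraper
  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
  end

  # Matching reuses the memoized selection rather than re-scanning later.
  def articles?
    !items.empty?
  end

  def each
    return enum_for(:each) unless block_given?

    items.each { |title| yield({ title: title, url: @url }) }
  end

  private

  # Expensive selection, computed once per instance.
  def items
    @items ||= @parsed_body.scan(%r{<li>(.*?)</li>}m).flatten
  end
end
```

Because each instance owns its state and never shares it, this pattern also stays safe when scrapers run in parallel threads, as the overview requires.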