Module: Html2rss::AutoSource::Scraper
Defined in:
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb
Overview
The Scraper module contains all scrapers that can be used to extract articles.
Each scraper should implement an each method that yields article hashes.
Each scraper should also implement an articles? method that returns true if the scraper
can potentially be used to extract articles from the given HTML.
Detection is intentionally shallow for most scrapers, but instance-based matching is available for scrapers that need to carry expensive selection state forward into extraction. Scrapers run in parallel threads, so implementations must avoid shared mutable state and degrade by returning no articles when a follow-up would be unsafe or unsupported.
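The contract above can be sketched with a small, hypothetical scraper. HeadlineScraper is not part of html2rss. The sketch assumes the parsed body responds to css (as a Nokogiri document does), that the constructor takes the same keywords passed by instances_for, and that article hashes use :title and :url keys. All instance state is per-object, so nothing is shared between threads.

```ruby
# Hypothetical scraper illustrating the each / articles? contract.
# Not part of html2rss; selector and hash keys are assumptions.
class HeadlineScraper
  include Enumerable

  SELECTOR = 'article a[href]'

  attr_reader :url

  # Shallow detection: true when at least one candidate anchor exists.
  def self.articles?(parsed_body)
    parsed_body.css(SELECTOR).any?
  end

  def initialize(parsed_body, url:, request_session: nil, **_opts)
    @parsed_body = parsed_body
    @url = url
    @request_session = request_session
  end

  # Yields one article hash per matching anchor.
  # Returns an Enumerator when called without a block.
  def each
    return enum_for(:each) unless block_given?

    @parsed_body.css(SELECTOR).each do |anchor|
      yield({ title: anchor.text.strip, url: anchor['href'] })
    end
  end
end
```

Because the class includes Enumerable, callers can use to_a, map, or first on an instance without the scraper defining anything beyond each.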
Defined Under Namespace
Classes: Html, JsonState, Microdata, NoScraperFound, Schema, SemanticHtml, WordpressApi
Constant Summary
- APP_SHELL_ROOT_SELECTORS =
'#app, #root, #__next, [data-reactroot], [ng-app], [id*="app-shell"]'
- APP_SHELL_MAX_ANCHORS =
2
- APP_SHELL_MAX_VISIBLE_TEXT_LENGTH =
220
- SCRAPERS =
[ WordpressApi, Schema, Microdata, JsonState, SemanticHtml, Html ].freeze
Class Method Summary
- .from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>
Returns an array of scraper classes that claim to find articles in the parsed body.
- .instances_for(parsed_body, url:, request_session: nil, opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Object>
Returns scraper instances ready for extraction.
Class Method Details
.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>
Returns an array of scraper classes that claim to find articles in the parsed body.
    # File 'lib/html2rss/auto_source/scraper.rb', line 70
    def self.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
      scrapers = SCRAPERS.select { |scraper| opts.dig(scraper., :enabled) }
      scrapers.select! { |scraper| scraper.articles?(parsed_body) }
      raise no_scraper_found_for(parsed_body) if scrapers.empty?

      scrapers
    end
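The two-stage filter in .from (keep enabled scrapers, then keep those whose articles? check passes) can be illustrated with stand-in classes. Everything below is hypothetical: FeedScraper, TableScraper, CANDIDATES, NoScraperMatched, and select_scrapers are invented for the sketch, an options_key class method is assumed for the per-scraper config lookup, and a plain String stands in for the parsed body.

```ruby
# Hypothetical stand-ins; none of these classes exist in html2rss.
class FeedScraper
  def self.options_key
    :feed
  end

  def self.articles?(body)
    body.include?('<article')
  end
end

class TableScraper
  def self.options_key
    :table
  end

  def self.articles?(body)
    body.include?('<table')
  end
end

CANDIDATES = [FeedScraper, TableScraper].freeze

class NoScraperMatched < StandardError; end

# Mirrors the shape of .from: filter by config first, then by detection.
def select_scrapers(body, opts)
  enabled = CANDIDATES.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  matching = enabled.select { |scraper| scraper.articles?(body) }
  raise NoScraperMatched, 'no enabled scraper matched the body' if matching.empty?

  matching
end

opts = { feed: { enabled: true }, table: { enabled: false } }
select_scrapers('<article>news</article>', opts) # => [FeedScraper]
```

Checking the enabled flag before calling articles? means a disabled scraper never inspects the body at all, which keeps detection cost proportional to the enabled set.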
.instances_for(parsed_body, url:, request_session: nil, opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Object>
Returns scraper instances ready for extraction.
instances_for is the main entrypoint for extraction. It lets a scraper
decide whether it matches using the same instance that will later yield
article hashes, which keeps precomputed state close to the scraper that
owns it.
    # File 'lib/html2rss/auto_source/scraper.rb', line 90
    def self.instances_for(parsed_body, url:, request_session: nil, opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
      instances = SCRAPERS.filter_map do |scraper|
        next unless opts.dig(scraper., :enabled)

        instance = scraper.new(parsed_body, url:, request_session:, **opts.fetch(scraper., {}))
        next unless extractable_instance?(instance, parsed_body)

        instance
      end
      raise no_scraper_found_for(parsed_body) if instances.empty?

      instances
    end
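Why instance-based matching helps can be shown with a scraper whose detection step is costly. In this sketch the instance memoizes its selection during the match check and reuses it during extraction. ExpensiveScraper, its instance-level articles? method, the selection_runs counter, and instances_for_sketch are all hypothetical, and a String stands in for the parsed body. The sketch omits the enabled check and the raise on an empty result that the real method performs.

```ruby
# Hypothetical scraper: detection and extraction share one memoized selection.
class ExpensiveScraper
  include Enumerable

  # Counts how often the costly selection actually ran (for illustration).
  attr_reader :selection_runs

  def initialize(body, url:, request_session: nil, **_opts)
    @body = body
    @url = url
    @request_session = request_session
    @selection_runs = 0
  end

  # Instance-level match check (assumed name); triggers the selection once.
  def articles?
    !selected.empty?
  end

  # Reuses the selection computed during matching.
  def each(&block)
    return enum_for(:each) unless block

    selected.each { |title| block.call({ title: title, url: @url }) }
  end

  private

  # Stand-in for an expensive DOM walk: every non-blank line is a title.
  def selected
    @selected ||= begin
      @selection_runs += 1
      @body.lines.map(&:strip).reject(&:empty?)
    end
  end
end

# Builds each scraper once and keeps only the instances that match.
def instances_for_sketch(body, url:, scrapers: [ExpensiveScraper], opts: {})
  scrapers.filter_map do |scraper|
    instance = scraper.new(body, url: url, **opts)
    instance if instance.articles?
  end
end
```

With a class-level articles? check, the selection would have to run once for detection and again for extraction; here selection_runs stays at 1 after both steps. The memoized state lives on the instance, so separate threads working on separate instances do not share it.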