Class: Html2rss::AutoSource::Scraper::SemanticHtml
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::SemanticHtml
- Includes: Enumerable
- Defined in: lib/html2rss/auto_source/scraper/semantic_html.rb,
  lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb
Overview
Scrapes semantic containers by choosing one primary content link per block before extraction.
This scraper is intentionally container-first:
- collect candidate semantic containers once
- select the strongest content-like anchor within each container
- extract fields from the container while honoring that anchor choice
The result is lower recall on weak-signal blocks, but much better link quality on modern teaser cards that mix headlines, utility links, and duplicate image overlays.
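The container-first flow above can be sketched in plain Ruby. This is a minimal illustration, not html2rss's implementation: the `Anchor` and `Container` structs, the `score` heuristic, the `UTILITY_PATHS` list, and the sample data are all hypothetical stand-ins for parsed HTML nodes and for whatever ranking `AnchorSelector` actually applies.

```ruby
# Hypothetical stand-ins for parsed HTML nodes (not html2rss types).
Anchor = Struct.new(:href, :text, keyword_init: true)
Container = Struct.new(:tag, :anchors, keyword_init: true)

# Assumed heuristic: utility links and empty overlays score zero,
# otherwise longer link text looks more like a headline.
UTILITY_PATHS = %w[/share /login /tag].freeze

def score(anchor)
  return 0 if anchor.text.strip.empty?
  return 0 if UTILITY_PATHS.any? { |prefix| anchor.href.start_with?(prefix) }

  anchor.text.strip.length
end

# Pick one primary anchor per container; drop containers with no eligible anchor.
def select_primary_anchors(containers)
  containers.filter_map do |container|
    best = container.anchors.max_by { |anchor| score(anchor) }
    next if best.nil? || score(best).zero?

    { container: container, selected_anchor: best }
  end
end

containers = [
  Container.new(tag: 'article', anchors: [
    Anchor.new(href: '/news/ruby-release', text: 'Ruby release brings faster parsing'),
    Anchor.new(href: '/share?u=1', text: 'Share'),
    Anchor.new(href: '/news/ruby-release', text: '') # duplicate image overlay
  ]),
  # Weak-signal block: only a utility link, so it is skipped entirely.
  Container.new(tag: 'li', anchors: [Anchor.new(href: '/login', text: 'Sign in')])
]

entries = select_primary_anchors(containers)
entries.each { |entry| puts "#{entry[:container].tag}: #{entry[:selected_anchor].href}" }
```

In this sketch the second container is dropped rather than emitted with a low-quality link, which is the recall-for-quality trade-off the overview describes.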
Defined Under Namespace
Classes: AnchorSelector, Entry
Constant Summary
- CONTAINER_SELECTORS =
[ 'article:not(:has(article))', 'section:not(:has(section))', 'li:not(:has(li))', 'tr:not(:has(tr))', 'div:not(:has(div))' ].freeze
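Each selector has the form `tag:not(:has(tag))`, which matches only elements that contain no nested element of the same tag, so the innermost candidate wins. The sketch below imitates that rule on a tiny hand-built tree. The `Node` struct and the `innermost` helper are hypothetical and exist only for illustration; the scraper itself relies on CSS selectors rather than code like this.

```ruby
# Hypothetical miniature DOM node (not a Nokogiri or html2rss type).
Node = Struct.new(:tag, :children, keyword_init: true) do
  # All nodes below this one, depth-first.
  def descendants
    children.flat_map { |child| [child, *child.descendants] }
  end
end

# Mirrors the idea of `tag:not(:has(tag))`: keep elements named `tag`
# that have no descendant with the same tag name.
def innermost(root, tag)
  [root, *root.descendants].select do |node|
    node.tag == tag && node.descendants.none? { |inner| inner.tag == tag }
  end
end

# <ul><li><ul><li/><li/></ul></li></ul>
leaf_a = Node.new(tag: 'li', children: [])
leaf_b = Node.new(tag: 'li', children: [])
nested_list = Node.new(tag: 'ul', children: [leaf_a, leaf_b])
outer_item = Node.new(tag: 'li', children: [nested_list])
root = Node.new(tag: 'ul', children: [outer_item])

matches = innermost(root, 'li')
puts matches.length
```

Here the outer `li` is excluded because it wraps other `li` elements, leaving only the two leaf items as candidate containers.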
Instance Attribute Summary
-
#parsed_body ⇒ Object
readonly
Returns the value of attribute parsed_body.
Class Method Summary
-
.articles?(parsed_body) ⇒ Boolean
True when at least one semantic container has an eligible anchor.
-
.options_key ⇒ Symbol
Config key used to enable or configure this scraper.
Instance Method Summary
-
#each {|article_hash| ... } ⇒ Enumerator<Hash>
Yields extracted article hashes for each semantic container that survives anchor selection.
-
#extractable? ⇒ Boolean
Reports whether the page contains at least one semantic container with a selectable primary anchor.
-
#initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml
constructor
A new instance of SemanticHtml.
Constructor Details
#initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml
Returns a new instance of SemanticHtml.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 48
    def initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts)
      @parsed_body = parsed_body
      @url = url
      @extractor = extractor
      @anchor_selector = AnchorSelector.new(url)
    end
Instance Attribute Details
#parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 55
    def parsed_body
      @parsed_body
    end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
Returns true when at least one semantic container has an eligible anchor.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 39
    def self.articles?(parsed_body)
      return false unless parsed_body

      new(parsed_body, url: 'https://example.com').extractable?
    end
.options_key ⇒ Symbol
Returns the config key used to enable or configure this scraper.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 35
    def self.options_key = :semantic_html
Instance Method Details
#each {|article_hash| ... } ⇒ Enumerator<Hash>
Yields extracted article hashes for each semantic container that survives anchor selection.
Detection and extraction share the same memoized entry list so this scraper does not rerun anchor ranking once a page has already been accepted as extractable.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 67
    def each
      return enum_for(:each) unless block_given?

      extractable_entries.each do |entry|
        article_hash = @extractor.new(
          entry.container,
          base_url: @url,
          selected_anchor: entry.selected_anchor
        ).call
        yield article_hash if article_hash
      end
    end
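The shared memoized entry list mentioned above can be illustrated with a small stand-alone class. This is a sketch under assumptions: `extractable_entries` is not shown in this page, so the `||=` memoization, the `MemoizedScraper` class, the `rank_calls` counter, and the string-based "blocks" are all hypothetical. Only the `enum_for(:each)` guard and the `extractable?`/`each` split come from the documented source.

```ruby
# Hypothetical scraper skeleton illustrating shared memoization; not html2rss code.
class MemoizedScraper
  include Enumerable

  attr_reader :rank_calls

  def initialize(blocks)
    @blocks = blocks
    @rank_calls = 0
  end

  # Detection reads the memoized list.
  def extractable?
    extractable_entries.any?
  end

  # Extraction reads the same list; without a block it returns an Enumerator.
  def each
    return enum_for(:each) unless block_given?

    extractable_entries.each { |entry| yield({ url: entry }) }
  end

  private

  # Computed once. `||=` also caches an empty array, since [] is truthy in Ruby.
  def extractable_entries
    @extractable_entries ||= begin
      @rank_calls += 1 # stands in for the expensive anchor-ranking pass
      @blocks.select { |block| block.start_with?('/') }
    end
  end
end

scraper = MemoizedScraper.new(['/a', 'mailto:x@example.com', '/b'])
scraper.extractable?             # detection triggers the single ranking pass
articles = scraper.to_a          # Enumerable#to_a goes through #each
first_article = scraper.each.first # block-less #each yields an Enumerator
puts scraper.rank_calls
```

Because `Enumerable` is included, defining `each` is enough to get `to_a`, `map`, `first`, and similar methods, and the ranking pass runs once no matter how many of them are called.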
#extractable? ⇒ Boolean
Reports whether the page contains at least one semantic container with a selectable primary anchor.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 85
    def extractable?
      extractable_entries.any?
    end