Class: Html2rss::AutoSource::Scraper::SemanticHtml

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

Scrapes semantic containers by choosing one primary content link per block before extraction.

This scraper is intentionally container-first:

  1. collect candidate semantic containers once
  2. select the strongest content-like anchor within each container
  3. extract fields from the container while honoring that anchor choice

The result is lower recall on weak-signal blocks, but much better link quality on modern teaser cards that mix headlines, utility links, and duplicate image overlays.
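
The three steps above can be sketched in plain Ruby. This is a minimal illustration, not the library's implementation: the `Anchor`, `Container`, and `Entry` Structs stand in for Nokogiri nodes, and `score` is a hypothetical ranking rule (longer link text wins, anchors without an href are ineligible) chosen only to make the example runnable. The real ranking lives in AnchorSelector and may differ.

```ruby
# Hypothetical stand-ins for parsed HTML nodes (not the library's classes).
Anchor = Struct.new(:href, :text)
Container = Struct.new(:anchors)
Entry = Struct.new(:container, :selected_anchor)

# Assumed scoring rule for illustration only: an anchor without an href
# scores 0; otherwise the length of its visible text is the score.
def score(anchor)
  return 0 if anchor.href.to_s.empty?

  anchor.text.to_s.strip.length
end

# Step 2: pick the strongest anchor per container and drop containers
# that have no eligible anchor (this is where recall is traded away).
def select_entries(containers)
  containers.filter_map do |container|
    best = container.anchors.max_by { |anchor| score(anchor) }
    next if best.nil? || score(best).zero?

    Entry.new(container, best)
  end
end

# Step 1: candidate containers, collected once.
containers = [
  Container.new([Anchor.new('/story-1', 'City council approves new budget'),
                 Anchor.new('/share', 'Share')]),
  Container.new([Anchor.new('', 'Menu')]), # weak-signal block, dropped
  Container.new([])                        # no anchors, dropped
]

# Step 3: extraction would read fields from entry.container while
# treating entry.selected_anchor as the article link.
select_entries(containers).each { |entry| puts entry.selected_anchor.href }
```

Only the first container survives: the headline link outranks the utility link, and the other two blocks are skipped rather than emitted with a poor link.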

Defined Under Namespace

Classes: AnchorSelector, Entry

Constant Summary

CONTAINER_SELECTORS =
[
  'article:not(:has(article))',
  'section:not(:has(section))',
  'li:not(:has(li))',
  'tr:not(:has(tr))',
  'div:not(:has(div))'
].freeze
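
Each selector has the form `tag:not(:has(tag))`, which matches only elements that contain no descendant of the same tag, i.e. the innermost container of each kind. The sketch below mimics that rule on a tiny hand-built tree. `Node` and `innermost` are hypothetical helpers written for this illustration; the scraper itself relies on CSS selectors evaluated against the parsed document.

```ruby
# Hypothetical tree node standing in for an HTML element.
Node = Struct.new(:name, :children) do
  # All nodes below this one, in document order.
  def descendants
    children.flat_map { |child| [child, *child.descendants] }
  end
end

# Plain-Ruby equivalent of "name:not(:has(name))": keep nodes with the
# given name that have no descendant carrying the same name.
def innermost(root, name)
  [root, *root.descendants].select do |node|
    node.name == name && node.descendants.none? { |d| d.name == name }
  end
end

inner_a = Node.new('article', [])
inner_b = Node.new('article', [Node.new('p', [])])
outer   = Node.new('article', [inner_a, Node.new('div', [inner_b])])

# The outer article wraps other articles, so only the two inner ones match.
puts innermost(outer, 'article').size
```

Matching only innermost containers avoids treating a wrapper that holds many teasers as a single article.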

Constructor Details

#initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml

Returns a new instance of SemanticHtml.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

  • url (String, Html2rss::Url)

    base URL

  • extractor (Class) (defaults to: HtmlExtractor)

    extractor class used for article extraction



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 48

def initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts)
  @parsed_body = parsed_body
  @url = url
  @extractor = extractor
  @anchor_selector = AnchorSelector.new(url)
end
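
Because `extractor:` accepts a class, a custom extractor can be swapped in. Judging from how #each invokes it, the class needs a constructor taking the container plus `base_url:` and `selected_anchor:` keywords, and a `#call` method returning a hash or a falsy value. The sketch below is a hypothetical extractor matching that shape. `TitleOnlyExtractor`, the `:title` and `:url` keys, and the hash used in place of a real anchor node are all assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical extractor with the constructor and #call shape that
# SemanticHtml#each relies on. Not part of html2rss.
class TitleOnlyExtractor
  def initialize(container, base_url:, selected_anchor:)
    @container = container
    @base_url = base_url.to_s
    @selected_anchor = selected_anchor
  end

  # Returns an article hash, or nil so the caller skips the container.
  def call
    title = @selected_anchor[:text].to_s.strip
    return nil if title.empty?

    { title: title, url: URI.join(@base_url, @selected_anchor[:href]).to_s }
  end
end

# Stand-in for the anchor node chosen by anchor selection (assumption:
# a plain hash is used here instead of a Nokogiri element).
anchor = { href: '/story-1', text: ' Budget approved ' }

article = TitleOnlyExtractor.new(
  nil,
  base_url: 'https://example.com',
  selected_anchor: anchor
).call

puts article[:url] # prints https://example.com/story-1
```

Returning nil for an unusable container fits the `yield article_hash if article_hash` guard in #each.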

Instance Attribute Details

#parsed_body ⇒ Object (readonly)

Returns the value of attribute parsed_body.



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 55

def parsed_body
  @parsed_body
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Returns true when at least one semantic container has an eligible anchor.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

Returns:

  • (Boolean)

    true when at least one semantic container has an eligible anchor



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 39

def self.articles?(parsed_body)
  return false unless parsed_body

  new(parsed_body, url: 'https://example.com').extractable?
end

.options_key ⇒ Symbol

Returns config key used to enable or configure this scraper.

Returns:

  • (Symbol)

    config key used to enable or configure this scraper



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 35

def self.options_key = :semantic_html

Instance Method Details

#each {|article_hash| ... } ⇒ Enumerator<Hash>

Yields extracted article hashes for each semantic container that survives anchor selection.

Detection and extraction share the same memoized entry list so this scraper does not rerun anchor ranking once a page has already been accepted as extractable.

Yield Parameters:

  • article_hash (Hash)

    extracted article hash

Returns:

  • (Enumerator<Hash>)


# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 67

def each
  return enum_for(:each) unless block_given?

  extractable_entries.each do |entry|
    article_hash = @extractor.new(
      entry.container,
      base_url: @url,
      selected_anchor: entry.selected_anchor
    ).call
    yield article_hash if article_hash
  end
end
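
The shared, memoized entry list can be illustrated with a small stand-in class. This is a sketch under stated assumptions: `SketchScraper`, its `ranking_runs` counter, and the body of `extractable_entries` are hypothetical, since the real private method is not shown on this page. It demonstrates only the pattern: detection and iteration both read one cached list, and `enum_for(:each)` supplies an Enumerator when no block is given.

```ruby
# Hypothetical stand-in showing the memoization pattern, not the real scraper.
class SketchScraper
  include Enumerable

  # Counts how often the (assumed expensive) ranking step actually runs.
  attr_reader :ranking_runs

  def initialize(containers)
    @containers = containers
    @ranking_runs = 0
  end

  def extractable?
    extractable_entries.any?
  end

  def each
    return enum_for(:each) unless block_given?

    extractable_entries.each { |entry| yield({ url: entry }) }
  end

  private

  # Memoized: the block runs once, later calls reuse the stored array.
  def extractable_entries
    @extractable_entries ||= begin
      @ranking_runs += 1
      # Assumed selection rule: take the first anchor, drop empty containers.
      @containers.filter_map { |container| container[:anchors].first }
    end
  end
end

scraper = SketchScraper.new([{ anchors: ['/a'] }, { anchors: [] }])
puts scraper.extractable?  # prints true
puts scraper.to_a.size     # prints 1
puts scraper.ranking_runs  # prints 1, ranking ran only once
```

Including Enumerable means `to_a`, `map`, and similar methods come for free once `each` is defined.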

#extractable? ⇒ Boolean

Reports whether the page contains at least one semantic container with a selectable primary anchor.

Returns:

  • (Boolean)

    true when at least one candidate container yields a primary anchor



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 85

def extractable?
  extractable_entries.any?
end