Class: Html2rss::AutoSource::Scraper::SemanticHtml::AnchorSelector

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

Selects the best content-like anchor from a semantic container.

The selector turns raw DOM anchors into ranked facts so semantic scraping can reason about link intent instead of DOM order. It favors heading-aligned article links and suppresses utility links, duplicate destinations, and weak textless affordances.

Defined Under Namespace

Classes: AnchorFacts

Constant Summary

HEADING_SELECTOR =
HtmlExtractor::HEADING_TAGS.join(',').freeze
UTILITY_PATH_SEGMENTS =
%w[
  about account author category comment comments contact feedback help
  login newsletter profile register search settings share signup subscribe
  topic topics view-all archive archives
  feed feeds
  recommended
  for-you
  preference preferences
  notification notifications
  privacy terms
  cookie cookies
  logout
  user users
].to_set.freeze
CONTENT_PATH_SEGMENTS =
%w[
  article articles news post posts story stories update updates
].to_set.freeze
UTILITY_LANDMARK_TAGS =
%w[nav aside footer menu].freeze
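
As a rough illustration of how segment sets like these could be consulted, the sketch below splits a URL path into segments and reports whether any segment appears in either set. The module `PathSignalsSketch` and the method `path_signals` are hypothetical names introduced for this example, and the utility set is abbreviated; how the library actually weighs these signals is not shown on this page.

```ruby
require 'set'
require 'uri'

# Hypothetical sketch; not part of html2rss.
module PathSignalsSketch
  # Abbreviated copy of the utility set documented above.
  UTILITY_PATH_SEGMENTS = %w[
    about account login search share subscribe privacy terms
  ].to_set.freeze

  # Same entries as the content set documented above.
  CONTENT_PATH_SEGMENTS = %w[
    article articles news post posts story stories update updates
  ].to_set.freeze

  module_function

  # Reports which kinds of segment occur in the URL's path.
  # Precedence between the two signals is deliberately left undecided.
  def path_signals(url)
    segments = URI.parse(url).path.to_s.split('/').reject(&:empty?).map(&:downcase)

    {
      utility: segments.any? { |segment| UTILITY_PATH_SEGMENTS.include?(segment) },
      content: segments.any? { |segment| CONTENT_PATH_SEGMENTS.include?(segment) }
    }
  end
end

PathSignalsSketch.path_signals('https://example.com/news/2024/some-story')
PathSignalsSketch.path_signals('https://example.com/account/login')
```

Storing the segments in a `Set` keeps each membership check constant-time, which matters when every anchor in a container is inspected.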

Instance Method Summary

Constructor Details

#initialize(base_url) ⇒ AnchorSelector

Returns a new instance of AnchorSelector.



# File 'lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb', line 48

def initialize(base_url)
  @base_url = base_url
end

Instance Method Details

#primary_anchor_for(container) ⇒ Nokogiri::XML::Element?

Chooses the single anchor that best represents the story contained in a semantic block.

Ranking is scoped to one container at a time. That keeps the logic local, makes duplicate links to the same destination collapse into one candidate, and avoids page-wide heuristics leaking across cards.

Parameters:

  • container (Nokogiri::XML::Element)

    semantic container being evaluated

Returns:

  • (Nokogiri::XML::Element, nil)

    selected primary anchor or nil when none qualify



# File 'lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb', line 62

def primary_anchor_for(container)
  facts_for(container).max_by(&:score)&.anchor
end
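
The selection step above can be sketched in isolation. The example below assumes a simplified stand-in for `AnchorFacts` with `anchor`, `destination`, and `score` fields; the real class's fields and scoring are not shown on this page, so `FactsSketch` and `primary_anchor_sketch` are hypothetical names, and plain strings stand in for Nokogiri elements.

```ruby
# Hypothetical stand-in for AnchorFacts; field names are assumptions.
FactsSketch = Struct.new(:anchor, :destination, :score, keyword_init: true)

# Picks the anchor of the highest-scoring candidate, or nil when there are none.
def primary_anchor_sketch(facts)
  # Collapse links that share a destination, keeping the best-scoring one.
  unique = facts.group_by(&:destination).values.map { |group| group.max_by(&:score) }

  # max_by returns nil for an empty collection; &. passes that nil through.
  unique.max_by(&:score)&.anchor
end

facts = [
  FactsSketch.new(anchor: 'image link',    destination: '/news/a', score: 2),
  FactsSketch.new(anchor: 'headline link', destination: '/news/a', score: 5),
  FactsSketch.new(anchor: 'share link',    destination: '/share',  score: 0)
]

primary_anchor_sketch(facts) # => "headline link"
primary_anchor_sketch([])    # => nil
```

The safe-navigation operator is what lets the method return `nil` for a container with no qualifying anchors instead of raising `NoMethodError`.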