Class: Html2rss::AutoSource::Scraper::SemanticHtml::AnchorSelector
- Inherits:
- Object
- Defined in:
- lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb
Overview
Selects the best content-like anchor from a semantic container.
The selector turns raw DOM anchors into ranked facts so semantic scraping can reason about link intent instead of DOM order. It favors heading-aligned article links and suppresses utility links, duplicate destinations, and weak textless affordances.
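To illustrate the idea of ranking anchors by facts rather than by DOM order, here is a minimal sketch. The `SketchFacts` struct, its fields, and the weights are all assumptions made for this example; the real `AnchorFacts` class and its scoring are not shown on this page and may differ.

```ruby
# Hypothetical stand-in for the facts recorded per anchor.
# Field names and weights are illustrative, not the library's.
SketchFacts = Struct.new(:href, :text, :in_heading, :in_utility_landmark, keyword_init: true) do
  def score
    total = 0
    total += 10 if in_heading                 # heading-aligned links are favored
    total -= 10 if in_utility_landmark        # links inside nav/aside/footer/menu are suppressed
    total -= 5 if text.to_s.strip.empty?      # textless affordances are weak candidates
    total
  end
end

facts = [
  SketchFacts.new(href: '/login', text: 'Sign in', in_heading: false, in_utility_landmark: true),
  SketchFacts.new(href: '/news/launch', text: 'Launch day', in_heading: true, in_utility_landmark: false),
  SketchFacts.new(href: '/news/launch', text: '', in_heading: false, in_utility_landmark: false)
]

# Rank by score, highest first, regardless of the order the anchors appeared in.
ranked = facts.sort_by { |f| -f.score }
ranked.each { |f| puts "#{f.score} #{f.href}" }
```

The heading link ranks first even though it is second in document order, which is the behavior the overview describes.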
Defined Under Namespace
Classes: AnchorFacts
Constant Summary
- HEADING_SELECTOR = HtmlExtractor::HEADING_TAGS.join(',').freeze
- UTILITY_PATH_SEGMENTS = %w[ about account author category comment comments contact feedback help login newsletter profile register search settings share signup subscribe topic topics view-all archive archives feed feeds recommended for-you preference preferences notification notifications privacy terms cookie cookies logout user users ].to_set.freeze
- CONTENT_PATH_SEGMENTS = %w[ article articles news post posts story stories update updates ].to_set.freeze
- UTILITY_LANDMARK_TAGS = %w[nav aside footer menu].freeze
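The path-segment sets suggest that a link's destination can be labeled by the segments in its URL path. The sketch below shows one way that could work. `classify_path`, the shortened constant copies, and the rule that a utility segment wins over a content segment are assumptions made for illustration; the page does not show how the library actually combines these sets.

```ruby
require 'set'
require 'uri'

# Shortened copies of the documented sets, renamed to mark them as sketch-only.
SKETCH_UTILITY_SEGMENTS = %w[about account login search share subscribe].to_set.freeze
SKETCH_CONTENT_SEGMENTS = %w[article articles news post posts story stories].to_set.freeze

# Hypothetical helper: labels a URL by the segments found in its path.
# Assumption: any utility segment outranks a content segment.
def classify_path(url)
  segments = URI.parse(url).path.to_s.split('/').reject(&:empty?).map(&:downcase)
  return :utility if segments.any? { |segment| SKETCH_UTILITY_SEGMENTS.include?(segment) }
  return :content if segments.any? { |segment| SKETCH_CONTENT_SEGMENTS.include?(segment) }

  :unknown
end

classify_path('https://example.com/news/2024/launch') # => :content
classify_path('https://example.com/account/login')    # => :utility
classify_path('https://example.com/')                 # => :unknown
```

Storing the segments in a `Set` keeps each membership check constant-time, which is presumably why the documented constants call `to_set`.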
Instance Method Summary
- #initialize(base_url) ⇒ AnchorSelector (constructor)
  A new instance of AnchorSelector.
- #primary_anchor_for(container) ⇒ Nokogiri::XML::Element?
  Chooses the single anchor that best represents the story contained in a semantic block.
Constructor Details
#initialize(base_url) ⇒ AnchorSelector
Returns a new instance of AnchorSelector.
# File 'lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb', line 48
def initialize(base_url)
  @base_url = base_url
end
Instance Method Details
#primary_anchor_for(container) ⇒ Nokogiri::XML::Element?
Chooses the single anchor that best represents the story contained in a semantic block.
Ranking is scoped to one container at a time. That keeps the logic local, makes duplicate links to the same destination collapse into one candidate, and avoids page-wide heuristics leaking across cards.
# File 'lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb', line 62
def primary_anchor_for(container)
  facts_for(container).max_by(&:score)&.anchor
end
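The `max_by(&:score)&.anchor` idiom returns the highest-scoring candidate's anchor, or `nil` when a container yields no candidates. The following sketch reproduces that selection step with plain Ruby objects. `Candidate` and `pick_primary` are hypothetical names, anchors are represented as strings, and the explicit `group_by` step is only one assumed way to collapse duplicate destinations; the private `facts_for` method is not shown on this page.

```ruby
# Hypothetical stand-in for AnchorFacts with a precomputed score.
Candidate = Struct.new(:anchor, :destination, :score, keyword_init: true)

# Keeps the best candidate per destination, then picks the overall best.
# Returns nil when there are no candidates, thanks to safe navigation (&.).
def pick_primary(candidates)
  best_per_destination = candidates
                         .group_by(&:destination)
                         .map { |_destination, group| group.max_by(&:score) }
  best_per_destination.max_by(&:score)&.anchor
end

candidates = [
  Candidate.new(anchor: 'thumbnail link', destination: '/news/launch', score: 2),
  Candidate.new(anchor: 'headline link', destination: '/news/launch', score: 9),
  Candidate.new(anchor: 'share link', destination: '/share', score: -4)
]

pick_primary(candidates) # => "headline link"
pick_primary([])         # => nil
```

Because the input is limited to one container's candidates, a high-scoring link in one card cannot affect the choice made for another.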