Class: Html2rss::AutoSource::Scraper::Html

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/html.rb

Overview

Scrapes article-like blocks from plain HTML by looking for repeated link structures when richer structured data is unavailable.

The approach is intentionally heuristic:

  1. collect repeated anchor paths
  2. walk upward to a shared container shape
  3. extract the best anchor found inside each container

This scraper is broader and noisier than SemanticHtml, so it acts as a fallback for pages without stronger semantic signals.
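Step 1 of the heuristic can be sketched with plain Ruby. The paths and the local constant names below are illustrative stand-ins for the library's internals, not its actual implementation:

```ruby
# Hypothetical simplified anchor paths collected from a page (step 1).
paths = [
  '/html/body/div/ul/li/a',
  '/html/body/div/ul/li/a',
  '/html/body/div/ul/li/a',
  '/html/body/main/article/h2/a',
  '/html/body/main/article/h2/a',
  '/html/body/nav/a'
]

minimum_frequency = 2  # mirrors DEFAULT_MINIMUM_SELECTOR_FREQUENCY
use_top_selectors = 5  # mirrors DEFAULT_USE_TOP_SELECTORS

# Keep only paths repeated often enough, most frequent first.
frequent = paths.tally
                .select { |_, count| count >= minimum_frequency }
                .sort_by { |_, count| -count }
                .first(use_top_selectors)
                .map(&:first)

p frequent
# => ["/html/body/div/ul/li/a", "/html/body/main/article/h2/a"]
```

The one-off `nav` anchor falls below the frequency threshold, which is how navigation chrome gets excluded before the container walk even starts.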

Constant Summary

TAGS_TO_IGNORE = /(nav|footer|header|svg|script|style)/i
DEFAULT_MINIMUM_SELECTOR_FREQUENCY = 2
DEFAULT_USE_TOP_SELECTORS = 5

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(parsed_body, url:, extractor: HtmlExtractor, **opts) ⇒ Html

Returns a new instance of Html.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML document.

  • url (String)

    The base URL.

  • extractor (Class) (defaults to: HtmlExtractor)

    The extractor class to handle article extraction.

  • opts (Hash)

    Additional options.



# File 'lib/html2rss/auto_source/scraper/html.rb', line 56

def initialize(parsed_body, url:, extractor: HtmlExtractor, **opts)
  @parsed_body = parsed_body
  @url = url
  @extractor = extractor
  @opts = opts
end

Instance Attribute Details

#parsed_body ⇒ Object (readonly)

Returns the value of attribute parsed_body.



# File 'lib/html2rss/auto_source/scraper/html.rb', line 63

def parsed_body
  @parsed_body
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Probes whether the document appears to contain repeated anchor structures that this fallback scraper can cluster into article-like containers.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

Returns:

  • (Boolean)

    true when the scraper can likely extract articles



# File 'lib/html2rss/auto_source/scraper/html.rb', line 38

def self.articles?(parsed_body)
  new(parsed_body, url: '').any?
end
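Because the class includes Enumerable, `any?` is driven by `#each` and stops at the first yielded article hash, which makes `new(parsed_body, url: '').any?` a cheap probe. A minimal stand-in (the `FakeScraper` class is hypothetical) shows the same pattern:

```ruby
class FakeScraper
  include Enumerable

  def initialize(hashes)
    @hashes = hashes
  end

  # Mirrors the guard-clause shape of the real #each: skip falsy results.
  def each
    return enum_for(:each) unless block_given?

    @hashes.each { |hash| yield hash if hash }
  end
end

p FakeScraper.new([nil, { title: 'Hi' }]).any? # => true
p FakeScraper.new([nil, nil]).any?             # => false
```

`Enumerable#any?` consumes the enumeration only until the first truthy yield, so the probe does not have to extract every article on the page.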

.options_key ⇒ Symbol

Returns config key used to enable or configure this scraper.

Returns:

  • (Symbol)

    config key used to enable or configure this scraper



# File 'lib/html2rss/auto_source/scraper/html.rb', line 29

def self.options_key = :html

.simplify_xpath(xpath) ⇒ String

Simplify an XPath selector by removing the index notation. This keeps repeated anchor paths comparable across sibling blocks.

Parameters:

  • xpath (String)

    original XPath

Returns:

  • (String)

    XPath without positional indexes



# File 'lib/html2rss/auto_source/scraper/html.rb', line 48

def self.simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end
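The substitution is self-contained and easy to verify in isolation; this sketch re-declares it outside the class:

```ruby
# Remove positional predicates like [2] so sibling paths become comparable.
def simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

p simplify_xpath('/html/body/div[2]/ul/li[7]/a')
# => "/html/body/div/ul/li/a"
```

After simplification, anchors that occupy the same structural slot in different sibling blocks collapse onto one path, which is what makes the frequency counting in step 1 of the heuristic possible.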

Instance Method Details

#article_tag_condition?(node) ⇒ Boolean

Decides whether a traversed node has reached a useful article-like boundary for the generic HTML scraper.

The predicate prefers containers that add surrounding link context, which helps the scraper move from a leaf anchor toward a repeated teaser/card wrapper.

Parameters:

  • node (Nokogiri::XML::Node)

    candidate boundary node

Returns:

  • (Boolean)

    true when the node is a good extraction boundary



# File 'lib/html2rss/auto_source/scraper/html.rb', line 87

def article_tag_condition?(node)
  # Reject nodes whose path contains a tag from TAGS_TO_IGNORE.
  return false if node.path.match?(TAGS_TO_IGNORE)
  return true if %w[body html].include?(node.name)
  return false unless (parent = node.parent)

  anchor_count(parent) > anchor_count(node)
end
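The anchor-count comparison can be illustrated with stdlib REXML; `anchor_count` below is a hypothetical stand-in for the private helper the predicate relies on:

```ruby
require 'rexml/document'

# Hypothetical stand-in: count <a> descendants of a node.
def anchor_count(node)
  node.get_elements('.//a').size
end

doc = REXML::Document.new(<<~HTML)
  <body><ul><li><a href="/1">One</a></li><li><a href="/2">Two</a></li></ul></body>
HTML

li = doc.get_elements('//li').first
ul = li.parent

# The parent <ul> adds link context (2 anchors vs. 1), so it qualifies
# as an extraction boundary when walking upward from the <li>.
p anchor_count(ul) > anchor_count(li) # => true
```

Requiring the parent to contain strictly more anchors than the current node stops the upward walk once a node has absorbed all nearby link context, i.e. at the teaser/card wrapper rather than somewhere inside it.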

#each {|article_hash| ... } ⇒ Enumerator

Returns an Enumerator over the scraped articles.

Yield Parameters:

  • article_hash (Hash)

    the scraped article hash

Returns:

  • (Enumerator)

    Enumerator for the scraped articles



# File 'lib/html2rss/auto_source/scraper/html.rb', line 68

def each
  return enum_for(:each) unless block_given?

  each_article_tag do |article_tag|
    article_hash = extract_article(article_tag)
    yield article_hash if article_hash
  end
end
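The `enum_for` guard is a standard Ruby pattern: called without a block, the method hands back an Enumerator instead of raising. A generic sketch (the `numbers` method is illustrative, not part of the library):

```ruby
def numbers
  return enum_for(:numbers) unless block_given?

  yield 1
  yield 2
end

p numbers.class # => Enumerator
p numbers.to_a  # => [1, 2]
p numbers.first # => 1
```

This is what lets callers chain any Enumerable method onto the scraper, e.g. `scraper.each.first(3)` or `scraper.map { ... }`, without the scraper implementing anything beyond `#each`.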