Class: Html2rss::AutoSource::Scraper::Html
- Inherits:
- Object
- Inheritance hierarchy:
- Object > Html2rss::AutoSource::Scraper::Html
- Includes:
- Enumerable
- Defined in:
- lib/html2rss/auto_source/scraper/html.rb
Overview
Scrapes article-like blocks from plain HTML by looking for repeated link structures when richer structured data is unavailable.
The approach is intentionally heuristic:
- collect repeated anchor paths
- walk upward to a shared container shape
- extract the best anchor found inside each container
This scraper is broader and noisier than SemanticHtml, so it acts as a
fallback for pages without stronger semantic signals.
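The first step of the heuristic, collecting repeated anchor paths, can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the library's code: `frequent_anchor_paths` is a hypothetical helper, the input is a hand-written list of XPath strings rather than a parsed document, and the way the two thresholds are applied (minimum frequency of 2, top 5 selectors) is inferred from the constant names and values documented on this page.

```ruby
# Hypothetical sketch; the thresholds copy the documented constant values,
# but how the library applies them is an assumption.
MINIMUM_SELECTOR_FREQUENCY = 2
USE_TOP_SELECTORS = 5

# Strip positional indexes so anchors in sibling blocks share one path.
def simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

# Returns the simplified anchor paths that repeat often enough,
# most frequent first, capped at USE_TOP_SELECTORS entries.
def frequent_anchor_paths(anchor_xpaths)
  anchor_xpaths
    .map { |xpath| simplify_xpath(xpath) }
    .tally
    .select { |_path, count| count >= MINIMUM_SELECTOR_FREQUENCY }
    .max_by(USE_TOP_SELECTORS) { |_path, count| count }
    .map(&:first)
end

anchor_xpaths = [
  '/html/body/main/div[1]/h2/a',
  '/html/body/main/div[2]/h2/a',
  '/html/body/main/div[3]/h2/a',
  '/html/body/aside/p/a' # appears once, so it is filtered out
]

p frequent_anchor_paths(anchor_xpaths)
# prints ["/html/body/main/div/h2/a"]
```

Counting simplified paths rather than raw ones is what lets three sibling teasers register as one repeated structure.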
Constant Summary
- TAGS_TO_IGNORE = /(nav|footer|header|svg|script|style)/i
- DEFAULT_MINIMUM_SELECTOR_FREQUENCY = 2
- DEFAULT_USE_TOP_SELECTORS = 5
Instance Attribute Summary
- #parsed_body ⇒ Object (readonly)
  Returns the value of attribute parsed_body.
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
  Probes whether the document appears to contain repeated anchor structures that this fallback scraper can cluster into article-like containers.
- .options_key ⇒ Symbol
  Config key used to enable or configure this scraper.
- .simplify_xpath(xpath) ⇒ String
  Simplify an XPath selector by removing the index notation.
Instance Method Summary
- #article_tag_condition?(node) ⇒ Boolean
  Decides whether a traversed node has reached a useful article-like boundary for the generic HTML scraper.
- #each {|article_hash| ... } ⇒ Enumerator
  Enumerator for the scraped articles.
- #initialize(parsed_body, url:, extractor: HtmlExtractor, **opts) ⇒ Html (constructor)
  A new instance of Html.
Constructor Details
#initialize(parsed_body, url:, extractor: HtmlExtractor, **opts) ⇒ Html
Returns a new instance of Html.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 56
def initialize(parsed_body, url:, extractor: HtmlExtractor, **opts)
  @parsed_body = parsed_body
  @url = url
  @extractor = extractor
  @opts = opts
end
Instance Attribute Details
#parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 63
def parsed_body
  @parsed_body
end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
Probes whether the document appears to contain repeated anchor structures that this fallback scraper can cluster into article-like containers.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 38
def self.articles?(parsed_body)
  new(parsed_body, url: '').any?
end
.options_key ⇒ Symbol
Returns the config key used to enable or configure this scraper.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 29
def self.options_key = :html
.simplify_xpath(xpath) ⇒ String
Simplify an XPath selector by removing the index notation. This keeps repeated anchor paths comparable across sibling blocks.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 48
def self.simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end
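A short standalone example may make the effect concrete. The method body is copied from the listing above into a top-level method so the snippet runs without the html2rss gem; the sample XPath strings are made up for illustration.

```ruby
# Same substitution as Html.simplify_xpath, redefined at top level
# so this example has no gem dependency.
def simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

first  = simplify_xpath('/html/body/ul/li[1]/a')
second = simplify_xpath('/html/body/ul/li[2]/a')

puts first            # prints /html/body/ul/li/a
puts first == second  # prints true

# Only numeric index predicates are removed; other predicates are kept.
puts simplify_xpath('/html/body/div[@id="main"]/p[4]')
# prints /html/body/div[@id="main"]/p
```

Because both sibling paths collapse to the same string, they can be counted as occurrences of one repeated structure.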
Instance Method Details
#article_tag_condition?(node) ⇒ Boolean
Decides whether a traversed node has reached a useful article-like boundary for the generic HTML scraper.
The predicate prefers containers that add surrounding link context, which helps the scraper move from a leaf anchor toward a repeated teaser/card wrapper.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 87
def article_tag_condition?(node)
  # Ignore tags that are below a tag which is in TAGS_TO_IGNORE.
  return false if node.path.match?(TAGS_TO_IGNORE)
  return true if %w[body html].include?(node.name)
  return false unless (parent = node.parent)

  anchor_count(parent) > anchor_count(node)
end
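The predicate can be exercised against a hand-built tree to see which nodes it accepts. The sketch below makes several assumptions: `Node` is a hypothetical Struct standing in for a parsed HTML node, the `anchors` field holds a precomputed count of descendant links, and `anchor_count` is a stand-in for the private helper, whose real implementation is not shown on this page.

```ruby
TAGS_TO_IGNORE = /(nav|footer|header|svg|script|style)/i

# Hypothetical stand-in for a parsed node: tag name, XPath, parent node,
# and the number of anchors it contains.
Node = Struct.new(:name, :path, :parent, :anchors)

# Stand-in for the private helper (assumption: it counts contained anchors).
def anchor_count(node)
  node.anchors
end

# Predicate body copied from the listing above.
def article_tag_condition?(node)
  return false if node.path.match?(TAGS_TO_IGNORE)
  return true if %w[body html].include?(node.name)
  return false unless (parent = node.parent)

  anchor_count(parent) > anchor_count(node)
end

body   = Node.new('body', '/html/body', nil, 3)
list   = Node.new('ul', '/html/body/ul', body, 3)
card   = Node.new('li', '/html/body/ul/li[1]', list, 1)
title  = Node.new('h2', '/html/body/ul/li[1]/h2', card, 1)
nav_li = Node.new('li', '/html/body/nav/ul/li[1]', nil, 1)

puts article_tag_condition?(title)  # prints false: the card adds no extra links
puts article_tag_condition?(card)   # prints true: the list holds more links
puts article_tag_condition?(nav_li) # prints false: the path contains "nav"
puts article_tag_condition?(body)   # prints true: body always qualifies
```

In this toy tree the `li` card is the first node whose parent contains more links than it does, which matches the described move from a leaf anchor toward a repeated teaser wrapper.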
#each {|article_hash| ... } ⇒ Enumerator
Yields each scraped article hash to the block. Returns an Enumerator for the scraped articles when no block is given.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 68
def each
  return enum_for(:each) unless block_given?

  each_article_tag do |article_tag|
    article_hash = extract_article(article_tag)
    yield article_hash if article_hash
  end
end
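The `enum_for` guard combined with `include Enumerable` is what lets `.articles?` call `any?` on a fresh instance. A minimal sketch of the same pattern follows; `TeaserList` and its `extract_article` are hypothetical and operate on plain hashes instead of HTML nodes.

```ruby
# Hypothetical toy class that mirrors only the shape of Html#each.
class TeaserList
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  def each
    # Without a block, hand back a lazy Enumerator over the same method.
    return enum_for(:each) unless block_given?

    @rows.each do |row|
      article_hash = extract_article(row)
      yield article_hash if article_hash
    end
  end

  private

  # Stand-in extraction: rows without a URL produce nil and are skipped.
  def extract_article(row)
    return nil unless row[:url]

    { title: row[:title], url: row[:url] }
  end
end

list = TeaserList.new([
  { title: 'First', url: '/first' },
  { title: 'No link' },
  { title: 'Second', url: '/second' }
])

puts list.each.class                      # prints Enumerator
puts list.map { |a| a[:title] }.inspect   # prints ["First", "Second"]
puts list.any?                            # prints true
puts TeaserList.new([]).any?              # prints false
```

Skipping nil results inside `each` means every Enumerable method, including `any?`, only ever sees successfully extracted articles.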