Class: Html2rss::HtmlExtractor

Inherits:
Object
Defined in:
lib/html2rss/html_extractor.rb,
lib/html2rss/html_extractor/date_extractor.rb,
lib/html2rss/html_extractor/image_extractor.rb,
lib/html2rss/html_extractor/enclosure_extractor.rb

Overview

HtmlExtractor is responsible for extracting details (headline, url, images, etc.) from an article_tag.

Defined Under Namespace

Modules: Extractors
Classes: DateExtractor, EnclosureExtractor, ImageExtractor

Constant Summary

INVISIBLE_CONTENT_TAGS =
%w[svg script noscript style template].to_set.freeze
HEADING_TAGS =
%w[h1 h2 h3 h4 h5 h6].freeze
NON_HEADLINE_SELECTOR =
(HEADING_TAGS.map { |tag| ":not(#{tag})" } + INVISIBLE_CONTENT_TAGS.to_a).freeze
MAIN_ANCHOR_SELECTOR =
begin
  buf = +'a[href]:not([href=""])'
  %w[# javascript: mailto: tel: file:// sms: data:].each do |prefix|
    buf << %[:not([href^="#{prefix}"])]
  end
  buf.freeze
end
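
To make the effect of MAIN_ANCHOR_SELECTOR easier to see, the following sketch rebuilds the selector string the same way and mirrors its rule in plain Ruby. The names EXCLUDED_PREFIXES, SELECTOR and eligible_href? are hypothetical and exist only for this illustration; they are not part of html2rss.

```ruby
# Prefixes that disqualify an href, copied from the constant above.
EXCLUDED_PREFIXES = %w[# javascript: mailto: tel: file:// sms: data:].freeze

# Rebuild the selector: an anchor with a non-empty href that does not
# start with any excluded prefix.
SELECTOR = EXCLUDED_PREFIXES.each_with_object(+'a[href]:not([href=""])') do |prefix, buf|
  buf << %[:not([href^="#{prefix}"])]
end.freeze

# Hypothetical helper: the same rule expressed on a plain href string.
# Like the CSS prefix match, start_with? is case-sensitive.
def eligible_href?(href)
  return false if href.nil? || href.empty?

  EXCLUDED_PREFIXES.none? { |prefix| href.start_with?(prefix) }
end

puts SELECTOR
puts eligible_href?('/articles/42')            # true
puts eligible_href?('mailto:team@example.com') # false
puts eligible_href?('#comments')               # false
```

Building the selector once and freezing it avoids re-allocating the string on every lookup.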

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(article_tag, base_url:, selected_anchor:) ⇒ HtmlExtractor

Returns a new instance of HtmlExtractor.

Parameters:

  • article_tag (Nokogiri::XML::Node)

    article-like container to extract from

  • base_url (String, Html2rss::Url)

    base url used to resolve relative links

  • selected_anchor (Nokogiri::XML::Node, nil)

    explicit primary anchor for the container

Raises:

  • (ArgumentError)


# File 'lib/html2rss/html_extractor.rb', line 51

def initialize(article_tag, base_url:, selected_anchor:)
  raise ArgumentError, 'article_tag is required' unless article_tag

  @article_tag = article_tag
  @base_url = base_url
  @selected_anchor = selected_anchor
end

Class Method Details

.extract_visible_text(tag, separator: ' ') ⇒ String?

Extracts visible text from a given node and its children.

Parameters:

  • tag (Nokogiri::XML::Node)

    the node from which to extract visible text

  • separator (String) (defaults to: ' ')

    separator used to join text fragments (default is a space)

Returns:

  • (String, nil)

    the concatenated visible text, or nil if none is found



# File 'lib/html2rss/html_extractor.rb', line 27

def extract_visible_text(tag, separator: ' ')
  parts = tag.children.filter_map do |child|
    next unless visible_child?(child)

    raw_text = child.children.empty? ? child.text : extract_visible_text(child)
    text = raw_text&.strip
    text unless text.to_s.empty?
  end

  parts.join(separator).squeeze(' ').strip unless parts.empty?
end
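
The traversal above can be tried without Nokogiri using stand-in nodes. This is only a sketch: Node and leaf are hypothetical substitutes for Nokogiri::XML::Node, and visible_child? is not shown on this page, so the version here simply assumes a child is visible when its tag name is not in INVISIBLE_CONTENT_TAGS.

```ruby
require 'set'

# Hypothetical stand-in for Nokogiri::XML::Node, just enough for the sketch.
Node = Struct.new(:name, :text_content, :children) do
  def text
    children.empty? ? text_content.to_s : children.map(&:text).join
  end
end

INVISIBLE_CONTENT_TAGS = %w[svg script noscript style template].to_set.freeze

# Assumption: the real visible_child? may apply further checks.
def visible_child?(node)
  !INVISIBLE_CONTENT_TAGS.include?(node.name)
end

# Same body as the listing above.
def extract_visible_text(tag, separator: ' ')
  parts = tag.children.filter_map do |child|
    next unless visible_child?(child)

    raw_text = child.children.empty? ? child.text : extract_visible_text(child)
    text = raw_text&.strip
    text unless text.to_s.empty?
  end

  parts.join(separator).squeeze(' ').strip unless parts.empty?
end

# Builds a childless node carrying text.
def leaf(name, text)
  Node.new(name, text, [])
end

article = Node.new('article', nil, [
  Node.new('h2', nil, [leaf('text', '  Breaking  news ')]),
  leaf('script', 'track()'),
  Node.new('p', nil, [leaf('text', 'Story body'), leaf('style', '.x{}')])
])

puts extract_visible_text(article)                   # Breaking news Story body
puts extract_visible_text(article, separator: ' | ') # Breaking news | Story body
p extract_visible_text(Node.new('div', nil, [leaf('script', 'x')])) # nil
```

Note that the recursive call does not forward separator, so a custom separator only joins the direct children of the node passed in; nested fragments are joined with the default space.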

.main_anchor_for(article_tag) ⇒ Nokogiri::XML::Node?

Returns first eligible descendant anchor.

Parameters:

  • article_tag (Nokogiri::XML::Node)

    article-like container to search within

Returns:

  • (Nokogiri::XML::Node, nil)

    first eligible descendant anchor



# File 'lib/html2rss/html_extractor.rb', line 80

def main_anchor_for(article_tag)
  return article_tag if article_tag.name == 'a' && article_tag.matches?(MAIN_ANCHOR_SELECTOR)

  article_tag.at_css(MAIN_ANCHOR_SELECTOR)
end

Instance Method Details

#call ⇒ Object



# File 'lib/html2rss/html_extractor.rb', line 59

def call
  {
    title: extract_title,
    url: extract_url,
    image: extract_image,
    description: extract_description,
    id: generate_id,
    published_at: extract_published_at,
    enclosures: extract_enclosures,
    categories: extract_categories
  }
end