Class: Html2rss::AutoSource::Scraper::SemanticHtml::Extractor

Inherits:
Object
  • Object
Defined in:
lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

Extractor is responsible for extracting the details of an article. It finds a headline first, then traverses the DOM upwards from that headline as far as possible to find the other details.

Constant Summary

INVISIBLE_CONTENT_TAG_SELECTORS =
%w[svg script noscript style template].to_set.freeze
HEADING_TAGS =
%w[h1 h2 h3 h4 h5 h6].freeze
NOT_HEADLINE_SELECTOR =
(HEADING_TAGS.map { |selector| ":not(#{selector})" } +
INVISIBLE_CONTENT_TAG_SELECTORS.to_a).freeze
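
To make the constant definitions above concrete, the following sketch evaluates them in isolation and prints the resulting array. It only re-declares the three constants exactly as listed; how the extractor later combines the entries into a selector is not shown in this section, so nothing here should be read as describing that step.

```ruby
require 'set'

# Re-declared verbatim from the constant summary so the snippet runs on its own.
INVISIBLE_CONTENT_TAG_SELECTORS = %w[svg script noscript style template].to_set.freeze
HEADING_TAGS = %w[h1 h2 h3 h4 h5 h6].freeze
NOT_HEADLINE_SELECTOR = (HEADING_TAGS.map { |selector| ":not(#{selector})" } +
                         INVISIBLE_CONTENT_TAG_SELECTORS.to_a).freeze

# Heading tags are wrapped in :not(...); the invisible tag names are appended as-is.
puts NOT_HEADLINE_SELECTOR.inspect
```

Note that only the heading tags are wrapped in `:not(...)`; the invisible-content tag names are appended unchanged.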

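Class Method Summary

  • .visible_text_from_tag(tag, separator: ' ') ⇒ Object

Instance Method Summary

  • #call ⇒ Hash?
    The scraped article or nil.
  • #initialize(article_tag, url:) ⇒ Extractor (constructor)
    A new instance of Extractor.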

Constructor Details

#initialize(article_tag, url:) ⇒ Extractor

Returns a new instance of Extractor.



# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 35

def initialize(article_tag, url:)
  @article_tag = article_tag
  @url = url
end

Class Method Details

.visible_text_from_tag(tag, separator: ' ') ⇒ Object



# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 19

def self.visible_text_from_tag(tag, separator: ' ')
  text = if (children = tag.children).empty?
           tag.text.strip
         else
           children.filter_map do |child|
             next if INVISIBLE_CONTENT_TAG_SELECTORS.include?(child.name)

             visible_text_from_tag(child)
           end.join(separator)
         end

  return if (sanitized_text = text.gsub(/\s+/, ' ').strip).empty?

  sanitized_text
end
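
The behaviour of the method above can be tried without a real HTML parser. The sketch below copies the method body into a top-level method and feeds it a hypothetical `Node` struct that only mimics the three calls the method relies on (`name`, `children`, `text`). `Node` and `text_node` are illustrative stand-ins, not part of html2rss; in the library the argument would be a parsed HTML node.

```ruby
require 'set'

INVISIBLE_CONTENT_TAG_SELECTORS = %w[svg script noscript style template].to_set.freeze

# Hypothetical stand-in for a parsed HTML node (assumption, not html2rss API).
Node = Struct.new(:name, :children, :text)

# Hypothetical helper: a leaf node carrying raw text.
def text_node(content)
  Node.new('text', [], content)
end

# Same logic as the class method listed above, as a top-level method.
def visible_text_from_tag(tag, separator: ' ')
  text = if (children = tag.children).empty?
           tag.text.strip
         else
           children.filter_map do |child|
             # Skip whole subtrees that never render visible text.
             next if INVISIBLE_CONTENT_TAG_SELECTORS.include?(child.name)

             visible_text_from_tag(child)
           end.join(separator)
         end

  # Collapse whitespace runs; an empty result becomes nil.
  return if (sanitized_text = text.gsub(/\s+/, ' ').strip).empty?

  sanitized_text
end

def sample_article
  Node.new('article', [
             Node.new('h2', [text_node("  Breaking\n news ")], nil),
             Node.new('script', [text_node('track();')], nil),
             Node.new('p', [text_node('Details follow.')], nil)
           ], nil)
end

puts visible_text_from_tag(sample_article)
puts visible_text_from_tag(sample_article, separator: ' | ')
```

The `script` subtree is skipped, whitespace is collapsed, and a tag with no visible text yields `nil`. The recursive call does not forward `separator:`, so a custom separator applies only between the direct children of the tag passed in.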

Instance Method Details

#call ⇒ Hash?

Returns the scraped article or nil.

Returns:

  • (Hash, nil)

    The scraped article or nil.



# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 41

def call
  @heading = find_heading || closest_anchor || return

  @extract_url = find_url

  {
    title: extract_title,
    url: extract_url,
    image: extract_image,
    description: extract_description,
    id: generate_id,
    published_at: extract_published_at
  }
end
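
The first line of `#call` uses a guard idiom: take the heading if one is found, fall back to the closest anchor, and otherwise return `nil` before any field is extracted. The sketch below isolates that idiom. `build_article` and its keyword arguments are hypothetical stand-ins for the private lookups, which are not shown in this section.

```ruby
# Hypothetical stand-in for the lookup order in #call (assumption, not html2rss API).
def build_article(heading: nil, anchor: nil)
  # Prefer the heading, fall back to the anchor; with neither, return nil early.
  source = heading || anchor || return

  { title: source.strip }
end

p build_article(heading: ' Hello ')
p build_article(anchor: 'Link text')
p build_article
```

This is why the documented return type is `Hash?`: callers should expect `nil` whenever neither a heading nor an anchor is available.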