Class: Html2rss::AutoSource::Scraper::SemanticHtml::Extractor

Inherits:

Object

Defined in: lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

Extractor is responsible for extracting the details of an article. It first looks for a headline, then traverses the DOM upward from that headline as far as possible to find the remaining details.

Constant Summary

INVISIBLE_CONTENT_TAG_SELECTORS =

%w[svg script noscript style template].to_set.freeze

HEADING_TAGS =

%w[h1 h2 h3 h4 h5 h6].freeze

NOT_HEADLINE_SELECTOR =

(HEADING_TAGS.map { |selector| ":not(#{selector})" } +
INVISIBLE_CONTENT_TAG_SELECTORS.to_a).freeze
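
As a rough illustration, the snippet below evaluates the NOT_HEADLINE_SELECTOR expression as printed above, in isolation from the library. Read literally, it yields an array of eleven strings: six `:not(...)` selectors for the headings, followed by the five bare invisible tag names. The lowercase local names are placeholders for this sketch, and how the library consumes the constant is not shown on this page.

```ruby
require 'set'

# Local copies of the constants listed above (names are placeholders).
invisible_tags = %w[svg script noscript style template].to_set.freeze
heading_tags = %w[h1 h2 h3 h4 h5 h6].freeze

# Same expression as NOT_HEADLINE_SELECTOR: heading tags are wrapped in
# :not(...), the invisible tag names are appended unchanged.
not_headline = (heading_tags.map { |selector| ":not(#{selector})" } +
                invisible_tags.to_a).freeze

puts not_headline.inspect
```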

Class Method Summary

.visible_text_from_tag(tag, separator: ' ') ⇒ Object

Instance Method Summary

#call ⇒ Hash?

The scraped article or nil.

#initialize(article_tag, url:) ⇒ Extractor (constructor)

A new instance of Extractor.

Constructor Details

#initialize(article_tag, url:) ⇒ `Extractor`

Returns a new instance of Extractor.

# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 35

def initialize(article_tag, url:)
  @article_tag = article_tag
  @url = url
end

Class Method Details

.visible_text_from_tag(tag, separator: ' ') ⇒ `Object`

# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 19

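def self.visible_text_from_tag(tag, separator: ' ')
  text = if (children = tag.children).empty?
           # Leaf node: use its own text.
           tag.text.strip
         else
           # Recurse into children, skipping tags listed in
           # INVISIBLE_CONTENT_TAG_SELECTORS. The recursive call does not
           # forward `separator`, so nested levels are joined with the default ' '.
           children.filter_map do |child|
             next if INVISIBLE_CONTENT_TAG_SELECTORS.include?(child.name)

             visible_text_from_tag(child)
           end.join(separator)
         end

  # Collapse whitespace runs; return nil when no visible text remains.
  return if (sanitized_text = text.gsub(/\s+/, ' ').strip).empty?

  sanitized_text
end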
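
To make the recursion easier to follow, here is a minimal standalone sketch of the same logic. It assumes a hypothetical `Node` struct in place of the parsed HTML node the real method receives, and `visible_text`, `leaf`, and `INVISIBLE_TAGS` are names invented for this illustration, not part of the library.

```ruby
require 'set'

# Hypothetical stand-in for a parsed HTML node; only the three members
# used by the method (name, text, children) are modelled.
Node = Struct.new(:name, :text, :children)

INVISIBLE_TAGS = %w[svg script noscript style template].to_set.freeze

# Mirrors the logic of visible_text_from_tag shown above.
def visible_text(tag, separator: ' ')
  text = if tag.children.empty?
           tag.text.strip
         else
           tag.children.filter_map do |child|
             next if INVISIBLE_TAGS.include?(child.name)

             visible_text(child)
           end.join(separator)
         end

  sanitized = text.gsub(/\s+/, ' ').strip
  sanitized.empty? ? nil : sanitized
end

# Helper that builds a childless node.
def leaf(name, text)
  Node.new(name, text, [])
end

article = Node.new('article', '', [
  leaf('h2', "  Breaking \n news "),
  leaf('script', 'track()'),
  Node.new('p', '', [leaf('text', 'Hello'), leaf('style', '.a{}'), leaf('text', 'world')])
])

result = visible_text(article)
puts result
```

In this sketch the `script` and `style` nodes are skipped, whitespace runs collapse to single spaces, and a node with no visible text yields nil, which `filter_map` then drops from its parent's result.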

Instance Method Details

#call ⇒ `Hash`?

Returns the scraped article or nil.

Returns:

(Hash, nil): the scraped article or nil.

# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 41

def call
  @heading = find_heading || closest_anchor || return

  @extract_url = find_url

  {
    title: extract_title,
    url: extract_url,
    image: extract_image,
    description: extract_description,
    id: generate_id,
    published_at: extract_published_at
  }
end
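
The first line of #call relies on `return` being usable as an expression: if neither lookup produces a value, the method exits early and returns nil, which is why the return type is (Hash, nil). The sketch below isolates that guard in a hypothetical `TinyExtractor` class; the class, its keyword arguments, and the one-key hash are invented for illustration and do not reflect the library's private helpers.

```ruby
# Hypothetical miniature of the guard used in #call; not the library's class.
class TinyExtractor
  def initialize(heading: nil, anchor: nil)
    @found_heading = heading
    @found_anchor = anchor
  end

  def call
    # Prefer the heading, fall back to the anchor, otherwise leave the
    # method immediately; a bare `return` makes the method return nil.
    @heading = @found_heading || @found_anchor || return

    { title: @heading }
  end
end

p TinyExtractor.new(heading: 'Headline').call
p TinyExtractor.new(anchor: 'Link text').call
p TinyExtractor.new.call
```

Callers therefore need to handle a nil result, for example by discarding it when collecting articles.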