Class: Html2rss::AutoSource::Scraper::SemanticHtml::Extractor
- Inherits:
- Object (hierarchy: Html2rss::AutoSource::Scraper::SemanticHtml::Extractor < Object)
- Defined in:
- lib/html2rss/auto_source/scraper/semantic_html/extractor.rb
Overview
Extractor is responsible for extracting the details of an article. It first looks for a headline and then traverses the DOM upwards from it, as far as possible, to find the other details.
Constant Summary
- INVISIBLE_CONTENT_TAG_SELECTORS =
%w[svg script noscript style template].to_set.freeze
- HEADING_TAGS =
%w[h1 h2 h3 h4 h5 h6].freeze
- NOT_HEADLINE_SELECTOR =
(HEADING_TAGS.map { |selector| ":not(#{selector})" } + INVISIBLE_CONTENT_TAG_SELECTORS.to_a).freeze
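To make the value of NOT_HEADLINE_SELECTOR concrete, the three constants can be evaluated outside the class. This is a minimal sketch: the wrapping module name ConstantsSketch is hypothetical, and how the extractor consumes the resulting array is not shown on this page.

```ruby
require 'set'

# Hypothetical wrapper module; the constant definitions are copied from the summary above.
module ConstantsSketch
  INVISIBLE_CONTENT_TAG_SELECTORS = %w[svg script noscript style template].to_set.freeze
  HEADING_TAGS = %w[h1 h2 h3 h4 h5 h6].freeze
  NOT_HEADLINE_SELECTOR =
    (HEADING_TAGS.map { |selector| ":not(#{selector})" } + INVISIBLE_CONTENT_TAG_SELECTORS.to_a).freeze
end

puts ConstantsSketch::NOT_HEADLINE_SELECTOR.inspect
# prints [":not(h1)", ":not(h2)", ":not(h3)", ":not(h4)", ":not(h5)", ":not(h6)", "svg", "script", "noscript", "style", "template"]
```

Note that only the heading tags are wrapped in :not(...); the invisible-content tag names are appended as bare tag names.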
Class Method Summary
-
.visible_text_from_tag(tag, separator: ' ') ⇒ Object
Instance Method Summary
-
#call ⇒ Hash?
The scraped article or nil.
-
#initialize(article_tag, url:) ⇒ Extractor
constructor
A new instance of Extractor.
Constructor Details
#initialize(article_tag, url:) ⇒ Extractor
Returns a new instance of Extractor.
# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 35
def initialize(article_tag, url:)
  @article_tag = article_tag
  @url = url
end
Class Method Details
.visible_text_from_tag(tag, separator: ' ') ⇒ Object
# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 19
def self.visible_text_from_tag(tag, separator: ' ')
  text = if (children = tag.children).empty?
           tag.text.strip
         else
           children.filter_map do |child|
             next if INVISIBLE_CONTENT_TAG_SELECTORS.include?(child.name)

             visible_text_from_tag(child)
           end.join(separator)
         end

  return if (sanitized_text = text.gsub(/\s+/, ' ').strip).empty?

  sanitized_text
end
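The recursion in visible_text_from_tag can be tried without an HTML parser. The sketch below assumes a hypothetical stand-in Node struct that responds only to name, text and children; VisibleTextSketch and the leaf helper are likewise invented for illustration. A real parsed document has a richer node tree (for example, separate text nodes), so this only illustrates the filtering and whitespace handling.

```ruby
require 'set'

module VisibleTextSketch
  INVISIBLE_CONTENT_TAG_SELECTORS = %w[svg script noscript style template].to_set.freeze

  # Hypothetical stand-in for a parsed HTML node.
  Node = Struct.new(:name, :text, :children)

  # Builds a node without children.
  def self.leaf(name, text)
    Node.new(name, text, [])
  end

  # Same body as the class method documented above.
  def self.visible_text_from_tag(tag, separator: ' ')
    text = if (children = tag.children).empty?
             tag.text.strip
           else
             children.filter_map do |child|
               # Skip tags whose content is never rendered as text.
               next if INVISIBLE_CONTENT_TAG_SELECTORS.include?(child.name)

               visible_text_from_tag(child)
             end.join(separator)
           end

    # Collapse whitespace runs; an empty result becomes nil.
    return if (sanitized_text = text.gsub(/\s+/, ' ').strip).empty?

    sanitized_text
  end
end

article = VisibleTextSketch::Node.new('article', nil, [
  VisibleTextSketch.leaf('h2', "  Breaking\n news "),
  VisibleTextSketch.leaf('script', 'track()'),
  VisibleTextSketch.leaf('p', 'Details follow.')
])

puts VisibleTextSketch.visible_text_from_tag(article)
# prints "Breaking news Details follow."
```

Two details are visible in the code itself: a tag whose children are all invisible yields nil rather than an empty string, and the recursive call does not forward separator, so a custom separator only joins the direct children of the tag passed in.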
Instance Method Details
#call ⇒ Hash?
Returns the scraped article, or nil when neither a heading nor a closest anchor is found.
# File 'lib/html2rss/auto_source/scraper/semantic_html/extractor.rb', line 41
def call
  @heading = find_heading || closest_anchor || return

  @extract_url = find_url

  {
    title: extract_title,
    url: extract_url,
    image: extract_image,
    description: extract_description,
    id: generate_id,
    published_at: extract_published_at
  }
end
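The first line of #call is what makes the return type Hash?: a bare return at the end of the || chain exits the method with nil when both lookups fail. The following is a minimal sketch of that control flow only. CallFlowSketch and its keyword arguments are hypothetical, and the lookups are stubbed because the real find_heading and closest_anchor are private and not shown on this page.

```ruby
# Hypothetical miniature of the guard used in #call.
class CallFlowSketch
  def initialize(heading: nil, anchor: nil)
    @stub_heading = heading
    @stub_anchor = anchor
  end

  def call
    # Prefer the heading, fall back to the anchor, otherwise return nil.
    @heading = find_heading || closest_anchor || return

    { title: @heading }
  end

  private

  # Stubbed lookups; the real ones inspect the article tag.
  def find_heading
    @stub_heading
  end

  def closest_anchor
    @stub_anchor
  end
end

p CallFlowSketch.new.call
# prints nil
p CallFlowSketch.new(anchor: 'Read more').call == { title: 'Read more' }
# prints true
```

Callers therefore need to handle a nil result before reading keys such as :title or :url from the returned hash.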