Class: Html2rss::AutoSource::Scraper::SemanticHtml

Inherits:

Object
Html2rss::AutoSource::Scraper::SemanticHtml

Includes: Enumerable

Defined in: lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/semantic_html/image.rb,
lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

Scrapes articles by looking for common markup tags (article, section, li, tr) that contain an <a href> tag.

See:

developer.mozilla.org/en-US/docs/Web/HTML/Element/article

Defined Under Namespace

Classes: Extractor, Image

Constant Summary

ANCHOR_TAG_SELECTORS =

Map of parent element names to CSS selectors for finding <a href> tags.

{
  'section' => ['section :not(section) a[href]'],
  'tr' => ['table tr :not(tr) a[href]'],
  'article' => [
    'article :not(article) a[href]',
    'article a[href]'
  ],
  'li' => [
    'ul > li :not(li) a[href]',
    'ol > li :not(li) a[href]'
  ]
}.freeze

Instance Attribute Summary

#parsed_body ⇒ Object readonly

Returns the value of attribute parsed_body.

Class Method Summary

.anchor_tag_selector_pairs ⇒ Array<[String, String]>

Returns an array of [tag_name, selector] pairs.
.articles?(parsed_body) ⇒ Boolean

Check if the parsed_body contains articles.
.find_closest_selector(current_tag, selector: 'a[href]:not([href=""])') ⇒ Nokogiri::XML::Node?

Finds the closest matching selector upwards in the DOM tree.
.find_closest_selector_upwards(current_tag, selector:) ⇒ Nokogiri::XML::Node?

Helper method to find a matching selector upwards.
.find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html') ⇒ Nokogiri::XML::Node

Finds the closest ancestor tag matching the specified tag name.

Instance Method Summary

#each {|article_hash| ... } ⇒ Enumerator

Enumerator for the scraped articles.
#initialize(parsed_body, url:) ⇒ SemanticHtml constructor

A new instance of SemanticHtml.

Constructor Details

#initialize(parsed_body, url:) ⇒ `SemanticHtml`

Returns a new instance of SemanticHtml.

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 93

def initialize(parsed_body, url:)
  @parsed_body = parsed_body
  @url = url
end

Instance Attribute Details

#parsed_body ⇒ `Object` (readonly)

Returns the value of attribute parsed_body.



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 98

def parsed_body
  @parsed_body
end

Class Method Details

.anchor_tag_selector_pairs ⇒ `Array<[String, String]>`

Returns an array of [tag_name, selector] pairs

Returns:

(Array<[String, String]>) —

Array of tag name and selector pairs

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 87

def self.anchor_tag_selector_pairs
  ANCHOR_TAG_SELECTORS.flat_map do |tag_name, selectors|
    selectors.map { |selector| [tag_name, selector] }
  end
end
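
To illustrate the shape of the returned array, the sketch below applies the same flat_map expansion to a smaller map. SELECTORS and selector_pairs are hypothetical stand-ins for this example, not part of the library.

```ruby
# Hypothetical, reduced selector map; the real constant is ANCHOR_TAG_SELECTORS.
SELECTORS = {
  'section' => ['section :not(section) a[href]'],
  'li' => [
    'ul > li :not(li) a[href]',
    'ol > li :not(li) a[href]'
  ]
}.freeze

# Expand each tag's selector list into one [tag_name, selector] pair per selector.
def selector_pairs(map)
  map.flat_map do |tag_name, selectors|
    selectors.map { |selector| [tag_name, selector] }
  end
end

selector_pairs(SELECTORS).each do |tag_name, selector|
  puts "#{tag_name}: #{selector}"
end
```

A tag with several selectors contributes one pair per selector, so the flattened list can be iterated with a single two-argument block.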

.articles?(parsed_body) ⇒ `Boolean`

Check if the parsed_body contains articles

Parameters:

parsed_body (Nokogiri::HTML::Document) —

The parsed HTML document

Returns:

(Boolean) —

True if articles are found, otherwise false.

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 36

def self.articles?(parsed_body)
  return false unless parsed_body

  ANCHOR_TAG_SELECTORS.each_value do |selectors|
    return true if selectors.any? { |selector| parsed_body.at_css(selector) }
  end
  false
end

.find_closest_selector(current_tag, selector: 'a[href]:not([href=""])') ⇒ `Nokogiri::XML::Node?`

Finds the closest matching selector upwards in the DOM tree

Parameters:

current_tag (Nokogiri::XML::Node) —

The current tag to start searching from
selector (String) (defaults to: 'a[href]:not([href=""])') —

The CSS selector to search for

Returns:

(Nokogiri::XML::Node, nil) —

The closest matching tag or nil if not found



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 66

def self.find_closest_selector(current_tag, selector: 'a[href]:not([href=""])')
  current_tag.at_css(selector) || find_closest_selector_upwards(current_tag, selector:)
end

.find_closest_selector_upwards(current_tag, selector:) ⇒ `Nokogiri::XML::Node?`

Helper method that searches within current_tag and then within each successive ancestor, returning the first node that matches the selector.

Parameters:

current_tag (Nokogiri::XML::Node) —

The current tag to start searching from
selector (String) —

The CSS selector to search for

Returns:

(Nokogiri::XML::Node, nil) —

The closest matching tag or nil if not found

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 74

def self.find_closest_selector_upwards(current_tag, selector:)
  while current_tag
    found = current_tag.at_css(selector)
    return found if found

    return nil unless current_tag.respond_to?(:parent)

    current_tag = current_tag.parent
  end
end
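
The upward search can be tried without Nokogiri using a small stand-in. TreeNode below is hypothetical: its at_css understands only a bare tag name and searches descendants, which is an assumption made to keep the sketch self-contained. The loop itself is copied from the method above.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node. Its #at_css only understands a
# bare tag name and searches descendants depth-first; real CSS is far richer.
class TreeNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name, children = [])
    @name = name
    @children = children
    children.each { |child| child.parent = self }
  end

  def at_css(tag_name)
    children.each do |child|
      return child if child.name == tag_name

      found = child.at_css(tag_name)
      return found if found
    end
    nil
  end
end

# Same loop as the method above, copied so the sketch is self-contained.
def find_closest_selector_upwards(current_tag, selector:)
  while current_tag
    found = current_tag.at_css(selector)
    return found if found

    return nil unless current_tag.respond_to?(:parent)

    current_tag = current_tag.parent
  end
end

image = TreeNode.new('img')
link  = TreeNode.new('a')
TreeNode.new('li', [TreeNode.new('figure', [image]), TreeNode.new('p', [link])])

# Starting at <img>, the search climbs to <figure>, then to <li>, where <a> is found.
closest = find_closest_selector_upwards(image, selector: 'a')
puts closest.equal?(link)
```

When the walk runs past the topmost node without a match, the while loop ends and the method returns nil.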

.find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html') ⇒ `Nokogiri::XML::Node`

Finds the closest ancestor tag matching the specified tag name

Parameters:

current_tag (Nokogiri::XML::Node) —

The current tag to start searching from
tag_name (String) —

The tag name to search for
stop_tag (String) (defaults to: 'html') —

The tag name to stop searching at

Returns:

(Nokogiri::XML::Node) —

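The closest ancestor named tag_name, or current_tag itself if it already matches. If no such ancestor exists, the traversal ends at the stop_tag element, which is returned instead.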

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 50

def self.find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html')
  return current_tag if current_tag.name == tag_name

  stop_tags = Set[tag_name, stop_tag]

  while current_tag.respond_to?(:parent) && !stop_tags.member?(current_tag.name)
    current_tag = current_tag.parent
  end

  current_tag
end
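
The traversal depends only on #name and #parent, so its behaviour can be sketched with a plain Struct. FakeNode is a hypothetical stand-in for Nokogiri::XML::Node, and the method body is copied from above so the example runs on its own.

```ruby
require 'set'

# Hypothetical stand-in for Nokogiri::XML::Node: only #name and #parent are modelled.
FakeNode = Struct.new(:name, :parent)

# Same traversal as the method above.
def find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html')
  return current_tag if current_tag.name == tag_name

  stop_tags = Set[tag_name, stop_tag]

  while current_tag.respond_to?(:parent) && !stop_tags.member?(current_tag.name)
    current_tag = current_tag.parent
  end

  current_tag
end

html    = FakeNode.new('html', nil)
article = FakeNode.new('article', html)
heading = FakeNode.new('h2', article)
anchor  = FakeNode.new('a', heading)
orphan  = FakeNode.new('a', FakeNode.new('div', html))

# An anchor nested inside <article> resolves to that article.
puts find_tag_in_ancestors(anchor, 'article').name

# An anchor with no <article> ancestor climbs until the stop tag and returns it.
puts find_tag_in_ancestors(orphan, 'article').name
```

Callers that need a node of exactly tag_name may therefore want to check the name of the returned node.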

Instance Method Details

#each {|article_hash| ... } ⇒ `Enumerator`

Returns Enumerator for the scraped articles.

Yield Parameters:

article_hash (Hash) —

The scraped article hash

Returns:

(Enumerator) —

Enumerator for the scraped articles

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 103

def each
  return enum_for(:each) unless block_given?

  SemanticHtml.anchor_tag_selector_pairs.each do |tag_name, selector|
    parsed_body.css(selector).each do |selected_tag|
      article_tag = SemanticHtml.find_tag_in_ancestors(selected_tag, tag_name)
      article_hash = Extractor.new(article_tag, url: @url).call

      yield article_hash if article_hash
    end
  end
end
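
Because the class includes Enumerable, defining #each in this way makes methods such as map and first available, and calling it without a block returns an Enumerator. The sketch below shows the same contract on a hypothetical ArticleList class; it is not part of the library and does no scraping.

```ruby
# Hypothetical collection mirroring the #each contract: yield each item when a
# block is given, otherwise return an Enumerator.
class ArticleList
  include Enumerable

  def initialize(articles)
    @articles = articles
  end

  def each
    return enum_for(:each) unless block_given?

    @articles.each do |article|
      # Skip nil entries, much as #each above skips a nil article_hash.
      yield article if article
    end
  end
end

list = ArticleList.new([{ title: 'First' }, nil, { title: 'Second' }])

# Enumerable#map is built on #each, so nil entries never reach the block.
titles = list.map { |article| article[:title] }
puts titles.inspect

# Without a block, #each hands back an Enumerator for external iteration.
enumerator = list.each
puts enumerator.next.inspect
```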
