Class: Html2rss::AutoSource::Scraper::SemanticHtml

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/semantic_html/image.rb,
lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

Scrapes articles by looking for common markup tags (article, section, li, tr) that contain an <a href> tag.

See:

  1. developer.mozilla.org/en-US/docs/Web/HTML/Element/article

Defined Under Namespace

Classes: Extractor, Image

Constant Summary

ANCHOR_TAG_SELECTORS =

Map of parent element names to CSS selectors for finding <a href> tags.

{
  'section' => ['section :not(section) a[href]'],
  'tr' => ['table tr :not(tr) a[href]'],
  'article' => [
    'article :not(article) a[href]',
    'article a[href]'
  ],
  'li' => [
    'ul > li :not(li) a[href]',
    'ol > li :not(li) a[href]'
  ]
}.freeze

Constructor Details

#initialize(parsed_body, url:) ⇒ SemanticHtml

Returns a new instance of SemanticHtml.



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 93

def initialize(parsed_body, url:)
  @parsed_body = parsed_body
  @url = url
end

Instance Attribute Details

#parsed_body ⇒ Object (readonly)

Returns the value of attribute parsed_body.



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 98

def parsed_body
  @parsed_body
end

Class Method Details

.anchor_tag_selector_pairs ⇒ Array<[String, String]>

Returns a flat array of [tag_name, selector] pairs, one pair per selector in ANCHOR_TAG_SELECTORS.

Returns:

  • (Array<[String, String]>)

    Array of tag name and selector pairs



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 87

def self.anchor_tag_selector_pairs
  ANCHOR_TAG_SELECTORS.flat_map do |tag_name, selectors|
    selectors.map { |selector| [tag_name, selector] }
  end
end
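
To make the shape of the returned pairs concrete, here is a minimal sketch that applies the same flat_map to a trimmed copy of the selector map. SELECTORS and PAIRS are hypothetical names used only for this illustration, so the snippet runs without loading html2rss.

```ruby
# Hypothetical, trimmed copy of ANCHOR_TAG_SELECTORS (illustration only).
SELECTORS = {
  'section' => ['section :not(section) a[href]'],
  'li' => [
    'ul > li :not(li) a[href]',
    'ol > li :not(li) a[href]'
  ]
}.freeze

# Same flattening as .anchor_tag_selector_pairs: one [tag_name, selector]
# pair per selector, in hash insertion order.
PAIRS = SELECTORS.flat_map do |tag_name, selectors|
  selectors.map { |selector| [tag_name, selector] }
end

PAIRS.each { |tag_name, selector| puts "#{tag_name}: #{selector}" }
```

A tag name with several selectors, such as 'li', appears once per selector, which is why #each can iterate over the pairs without a nested loop over the hash.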

.articles?(parsed_body) ⇒ Boolean

Checks whether parsed_body contains articles, that is, whether any selector in ANCHOR_TAG_SELECTORS matches at least one element.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML document

Returns:

  • (Boolean)

    True if articles are found, otherwise false.



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 36

def self.articles?(parsed_body)
  return false unless parsed_body

  ANCHOR_TAG_SELECTORS.each_value do |selectors|
    return true if selectors.any? { |selector| parsed_body.at_css(selector) }
  end
  false
end

.find_closest_selector(current_tag, selector: 'a[href]:not([href=""])') ⇒ Nokogiri::XML::Node?

Finds the closest node matching the selector: first among the descendants of current_tag, then by walking upwards in the DOM tree.

Parameters:

  • current_tag (Nokogiri::XML::Node)

    The current tag to start searching from

  • selector (String) (defaults to: 'a[href]:not([href=""])')

    The CSS selector to search for

Returns:

  • (Nokogiri::XML::Node, nil)

    The closest matching tag or nil if not found



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 66

def self.find_closest_selector(current_tag, selector: 'a[href]:not([href=""])')
  current_tag.at_css(selector) || find_closest_selector_upwards(current_tag, selector:)
end

.find_closest_selector_upwards(current_tag, selector:) ⇒ Nokogiri::XML::Node?

Helper method that searches current_tag and then each successive ancestor with at_css, returning the first match.

Parameters:

  • current_tag (Nokogiri::XML::Node)

    The current tag to start searching from

  • selector (String)

    The CSS selector to search for

Returns:

  • (Nokogiri::XML::Node, nil)

    The closest matching tag or nil if not found



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 74

def self.find_closest_selector_upwards(current_tag, selector:)
  while current_tag
    found = current_tag.at_css(selector)
    return found if found

    return nil unless current_tag.respond_to?(:parent)

    current_tag = current_tag.parent
  end
end

.find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html') ⇒ Nokogiri::XML::Node

Finds the closest ancestor tag matching the specified tag name, starting with the current tag itself.

Parameters:

  • current_tag (Nokogiri::XML::Node)

    The current tag to start searching from

  • tag_name (String)

    The tag name to search for

  • stop_tag (String) (defaults to: 'html')

    The tag name to stop searching at

Returns:

  • (Nokogiri::XML::Node)

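The current tag if it already matches, otherwise the closest matching ancestor. If no ancestor matches, the node at which the walk stopped (typically the stop_tag element) is returned.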



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 50

def self.find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html')
  return current_tag if current_tag.name == tag_name

  stop_tags = Set[tag_name, stop_tag]

  while current_tag.respond_to?(:parent) && !stop_tags.member?(current_tag.name)
    current_tag = current_tag.parent
  end

  current_tag
end
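
The ancestor walk can be tried without Nokogiri. The sketch below copies the method body into a free-standing method and runs it against FakeNode, a hypothetical stand-in that only provides #name and #parent; the node constants are likewise invented for the example.

```ruby
require 'set'

# Hypothetical stand-in for Nokogiri::XML::Node: only #name and #parent.
FakeNode = Struct.new(:name, :parent)

# Free-standing copy of the ancestor walk shown above.
def find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html')
  return current_tag if current_tag.name == tag_name

  stop_tags = Set[tag_name, stop_tag]

  # Climb until the wanted tag or the stop tag is reached.
  while current_tag.respond_to?(:parent) && !stop_tags.member?(current_tag.name)
    current_tag = current_tag.parent
  end

  current_tag
end

# Hand-built tree: html > body > ul > li > a
HTML_NODE = FakeNode.new('html', nil)
BODY_NODE = FakeNode.new('body', HTML_NODE)
UL_NODE   = FakeNode.new('ul', BODY_NODE)
LI_NODE   = FakeNode.new('li', UL_NODE)
A_NODE    = FakeNode.new('a', LI_NODE)

puts find_tag_in_ancestors(A_NODE, 'li').name      # closest <li> ancestor
puts find_tag_in_ancestors(A_NODE, 'article').name # no <article>: stops at <html>
```

The second call shows that a missing ancestor does not produce nil: the walk ends at the stop tag and returns that node, so callers that need a real match would have to check the returned node's name themselves.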

Instance Method Details

#each {|article_hash| ... } ⇒ Enumerator

Yields each scraped article hash. Returns an Enumerator when no block is given.

Yield Parameters:

  • article_hash (Hash)

    The scraped article hash

Returns:

  • (Enumerator)

    Enumerator for the scraped articles, when called without a block



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 103

def each
  return enum_for(:each) unless block_given?

  SemanticHtml.anchor_tag_selector_pairs.each do |tag_name, selector|
    parsed_body.css(selector).each do |selected_tag|
      article_tag = SemanticHtml.find_tag_in_ancestors(selected_tag, tag_name)
      article_hash = Extractor.new(article_tag, url: @url).call

      yield article_hash if article_hash
    end
  end
end
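
The guard on the first line of #each is the usual Ruby idiom for making a class work with Enumerable both with and without a block. The following is a minimal sketch of that idiom only; TitleList is a hypothetical class, not part of html2rss, and its hashes merely imitate the article hashes yielded above.

```ruby
# Hypothetical class illustrating the
# "return enum_for(:each) unless block_given?" guard.
class TitleList
  include Enumerable

  def initialize(titles)
    @titles = titles
  end

  def each
    # Without a block, hand back an Enumerator over this same method.
    return enum_for(:each) unless block_given?

    @titles.each do |title|
      article_hash = title.empty? ? nil : { title: title }
      yield article_hash if article_hash # skip entries that produced nothing
    end
  end
end

list = TitleList.new(['First', '', 'Second'])
p list.each.class              # an Enumerator, because no block was given
p list.map { |h| h[:title] }   # Enumerable methods are built on #each
```

Because Enumerable methods such as map, select and to_a are all built on #each, skipping nil results inside #each keeps them out of every derived collection as well.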