Class: Html2rss::AutoSource::Scraper::Microdata

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/microdata.rb

Overview

Scrapes Schema.org Microdata items embedded directly in HTML markup.

Constant Summary

ITEM_SELECTOR = '[itemscope][itemtype]'
SUPPORTED_TYPES = (Schema::Thing::SUPPORTED_TYPES | Set['Product']).freeze
VALUE_ATTRIBUTES = %w[content datetime href src data value].freeze

Constructor Details

#initialize(parsed_body, url:, **_opts) ⇒ void

Builds a Microdata scraper for an already parsed response body.

Parameters:

  • parsed_body (Nokogiri::HTML5::Document, Nokogiri::HTML4::Document, Nokogiri::XML::Node, nil)

    the parsed response body to inspect for top-level Microdata items.

  • url (Html2rss::Url)

    the absolute page URL used to resolve relative links.

  • _opts (Hash)

    unused scraper-specific options.



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 57

def initialize(parsed_body, url:, **_opts)
  @parsed_body = parsed_body
  @url = url
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Returns:

  • (Boolean)

    true if the parsed body contains at least one supported top-level Microdata item.


# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 17

def articles?(parsed_body)
  supported_roots(parsed_body).any?
end

.normalized_types(itemtype) ⇒ Object



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 35

def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end
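
The normalization above reduces each whitespace-separated itemtype URL to its final path or fragment segment. A minimal standalone sketch of that behavior, copying the method body outside the class so it can be run in isolation (the sample URLs are illustrative assumptions, not taken from the library's tests):

```ruby
# Standalone copy of the normalization logic shown above, for experimentation only.
def normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    # Keep whatever follows the last '/' and then the last '#'.
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

normalized_types('https://schema.org/NewsArticle')
# => ["NewsArticle"]

# Multiple types in one attribute; the example.com vocabulary is hypothetical.
normalized_types('http://schema.org/Article https://example.com/vocab#BlogPosting')
# => ["Article", "BlogPosting"]

# A missing attribute (nil) yields an empty list instead of raising.
normalized_types(nil)
# => []
```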

.options_key ⇒ Object



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 14

def self.options_key = :microdata

.supported_root?(node) ⇒ Boolean

Returns:

  • (Boolean)

    truthy if the node declares a supported type and is a top-level item.


# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 27

def supported_root?(node)
  supported_type_name(node) && top_level_item?(node)
end

.supported_roots(parsed_body) ⇒ Object



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 21

def supported_roots(parsed_body)
  return [] unless parsed_body

  parsed_body.css(ITEM_SELECTOR).select { supported_root?(_1) }
end

.supported_type_name(node) ⇒ Object



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 31

def supported_type_name(node)
  normalized_types(node['itemtype']).find { SUPPORTED_TYPES.include?(_1) }
end
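
The lookup above returns the first normalized type that appears in SUPPORTED_TYPES, or nil when none matches. A rough sketch of that selection, assuming a hypothetical stand-in set (the real set is built from Schema::Thing::SUPPORTED_TYPES, whose contents are not shown here) and using a plain Hash in place of a Nokogiri node, since only `node['itemtype']` is read. All `demo_`/`DEMO_` names are invented for this illustration:

```ruby
require 'set'

# Hypothetical stand-in; the real SUPPORTED_TYPES is Schema::Thing::SUPPORTED_TYPES plus 'Product'.
DEMO_SUPPORTED_TYPES = Set['Article', 'NewsArticle', 'Product'].freeze

# Same normalization as the class method documented above.
def demo_normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

# Returns the first supported type name, or nil if the node declares none.
def demo_supported_type_name(node)
  demo_normalized_types(node['itemtype']).find { DEMO_SUPPORTED_TYPES.include?(_1) }
end

demo_supported_type_name({ 'itemtype' => 'https://schema.org/Product' })
# => "Product"

# Unsupported types are skipped until a supported one is found.
demo_supported_type_name({ 'itemtype' => 'https://schema.org/Person https://schema.org/NewsArticle' })
# => "NewsArticle"

# No itemtype at all.
demo_supported_type_name({})
# => nil
```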

.top_level_item?(node) ⇒ Boolean

Returns:

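  • (Boolean)

    false if the node carries an itemprop attribute or sits inside an ancestor that has both itemscope and itemprop; true otherwise.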


# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 42

def top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end
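
The check above rejects items that are themselves a property of another item, as well as items nested anywhere beneath such a property. A small sketch of that rule, using a hypothetical `DemoNode` struct in place of a Nokogiri node (it mimics only `attribute` and `ancestors`; the real method operates on parsed HTML):

```ruby
# Hypothetical stand-in for a parsed node: an attribute hash plus a parent link.
DemoNode = Struct.new(:attributes, :parent) do
  # Returns the attribute value, or nil when absent (an empty string is still truthy in Ruby).
  def attribute(name)
    attributes[name]
  end

  # Walks parent links from the nearest ancestor up to the root.
  def ancestors
    chain = []
    current = parent
    while current
      chain << current
      current = current.parent
    end
    chain
  end
end

# Same rule as the class method documented above.
def demo_top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end

article = DemoNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, nil)
author  = DemoNode.new({ 'itemscope' => '', 'itemprop' => 'author' }, article)
nested  = DemoNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, author)

demo_top_level_item?(article) # => true  (no itemprop, no property-item ancestor)
demo_top_level_item?(author)  # => false (it is a property of the article)
demo_top_level_item?(nested)  # => false (it sits inside the author property item)
```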

Instance Method Details

#each {|article| ... } ⇒ Enumerator, void

Iterates over normalized article hashes extracted from supported Microdata roots.

Yield Parameters:

  • article (Hash<Symbol, Object>)

    the normalized article attributes.

Returns:

  • (Enumerator, void)

    an enumerator when no block is given.



# File 'lib/html2rss/auto_source/scraper/microdata.rb', line 67

def each
  return enum_for(:each) unless block_given?

  self.class.supported_roots(parsed_body).each do |root|
    article = article_from(root)
    yield article if article
  end
end
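
Because the class includes Enumerable and `#each` returns an enumerator when called without a block, callers can use `map`, `first`, `to_a`, and similar methods directly. A minimal sketch of the same pattern, with a hypothetical `DemoScraper` and an invented `article_from` that drops roots without a title (the library's actual `article_from` is not shown in this page):

```ruby
# Hypothetical class mirroring the each/enum_for pattern documented above.
class DemoScraper
  include Enumerable

  def initialize(roots)
    @roots = roots
  end

  def each
    # Without a block, hand back a lazy handle onto this same method.
    return enum_for(:each) unless block_given?

    @roots.each do |root|
      article = article_from(root)
      yield article if article
    end
  end

  private

  # Invented extraction rule for illustration: skip roots without a title.
  def article_from(root)
    title = root[:title]
    return nil if title.nil? || title.empty?

    { title: title }
  end
end

scraper = DemoScraper.new([{ title: 'First' }, { title: '' }, { title: 'Second' }])

scraper.each.class            # => Enumerator
scraper.map { _1[:title] }    # => ["First", "Second"]
scraper.first                 # => {:title=>"First"} (Ruby 3.4+ prints {title: "First"})
```

Roots for which extraction returns nil are silently skipped, so the number of yielded articles can be smaller than the number of supported roots.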