Class: Html2rss::AutoSource::Scraper::Microdata
- Inherits: Object
  - Object
    - Html2rss::AutoSource::Scraper::Microdata
- Includes: Enumerable
- Defined in: lib/html2rss/auto_source/scraper/microdata.rb
Overview
Scrapes Schema.org Microdata items embedded directly in HTML markup.
Constant Summary
- ITEM_SELECTOR = '[itemscope][itemtype]'
- SUPPORTED_TYPES = (Schema::Thing::SUPPORTED_TYPES | Set['Product']).freeze
- VALUE_ATTRIBUTES = %w[content datetime href src data value].freeze
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
- .normalized_types(itemtype) ⇒ Object
- .options_key ⇒ Object
- .supported_root?(node) ⇒ Boolean
- .supported_roots(parsed_body) ⇒ Object
- .supported_type_name(node) ⇒ Object
- .top_level_item?(node) ⇒ Boolean
Instance Method Summary
- #each {|article| ... } ⇒ Enumerator, void
  Iterates over normalized article hashes extracted from supported Microdata roots.
- #initialize(parsed_body, url:, **_opts) ⇒ void (constructor)
  Builds a Microdata scraper for an already parsed response body.
Constructor Details
#initialize(parsed_body, url:, **_opts) ⇒ void
Builds a Microdata scraper for an already parsed response body.
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 57
    def initialize(parsed_body, url:, **_opts)
      @parsed_body = parsed_body
      @url = url
    end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 17
    def articles?(parsed_body)
      supported_roots(parsed_body).any?
    end
.normalized_types(itemtype) ⇒ Object
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 35
    def normalized_types(itemtype)
      itemtype.to_s.split.filter_map do |value|
        type = value.split('/').last.to_s.split('#').last.to_s
        type unless type.empty?
      end
    end
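To make the normalization concrete, here is a minimal standalone sketch that copies the method body shown above into a top-level method and runs it on a few sample `itemtype` values. The sample URLs are illustrative assumptions, not values taken from the library's tests.

```ruby
# Standalone copy of the logic above, for experimentation outside the gem.
def normalized_types(itemtype)
  # An itemtype attribute may hold several whitespace-separated URLs.
  itemtype.to_s.split.filter_map do |value|
    # Keep only the segment after the last '/' and then after the last '#'.
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

single   = normalized_types('https://schema.org/Article')
multiple = normalized_types('https://schema.org/NewsArticle http://example.com/vocab#Custom')
missing  = normalized_types(nil)

p single    # ["Article"]
p multiple  # ["NewsArticle", "Custom"]
p missing   # []
```

Because of the leading `to_s`, a node without an `itemtype` attribute yields an empty array rather than raising.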
.options_key ⇒ Object
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 14
    def self.options_key = :microdata
.supported_root?(node) ⇒ Boolean
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 27
    def supported_root?(node)
      supported_type_name(node) && top_level_item?(node)
    end
.supported_roots(parsed_body) ⇒ Object
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 21
    def supported_roots(parsed_body)
      return [] unless parsed_body

      parsed_body.css(ITEM_SELECTOR).select { supported_root?(_1) }
    end
.supported_type_name(node) ⇒ Object
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 31
    def supported_type_name(node)
      normalized_types(node['itemtype']).find { SUPPORTED_TYPES.include?(_1) }
    end
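The lookup above returns the first normalized type that is in the supported set, or `nil` when none matches. The sketch below illustrates that behavior under stated assumptions: `DEMO_SUPPORTED_TYPES` is a hypothetical stand-in (the real contents of `Schema::Thing::SUPPORTED_TYPES` are not shown here), and a plain Hash stands in for a parsed HTML node because both respond to `['itemtype']`.

```ruby
require 'set'

# Hypothetical stand-in for the real SUPPORTED_TYPES constant.
DEMO_SUPPORTED_TYPES = Set['Article', 'Product'].freeze

# Same normalization as in the class, copied so the sketch is self-contained.
def demo_normalized_types(itemtype)
  itemtype.to_s.split.filter_map do |value|
    type = value.split('/').last.to_s.split('#').last.to_s
    type unless type.empty?
  end
end

# Returns the first supported type name, or nil if none is supported.
def demo_supported_type_name(node)
  demo_normalized_types(node['itemtype']).find { |type| DEMO_SUPPORTED_TYPES.include?(type) }
end

matched   = demo_supported_type_name({ 'itemtype' => 'https://schema.org/Person https://schema.org/Product' })
unmatched = demo_supported_type_name({ 'itemtype' => 'https://schema.org/Person' })

p matched    # "Product"
p unmatched  # nil
```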
.top_level_item?(node) ⇒ Boolean
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 42
    def top_level_item?(node)
      return false if node.attribute('itemprop')

      node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
    end
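The check above rejects a node in two cases: when the node itself carries `itemprop`, and when any ancestor is an item (`itemscope`) that is itself a property (`itemprop`) of another item. The following sketch exercises both cases with `DemoNode`, a hypothetical minimal node type that only mimics the `attribute` and `ancestors` calls used by the method; the real code receives parsed HTML nodes.

```ruby
# Hypothetical node type: a hash of attributes plus a parent pointer.
DemoNode = Struct.new(:attrs, :parent) do
  # Returns the attribute value, or nil when the attribute is absent.
  def attribute(name)
    attrs[name]
  end

  # Ancestors ordered from the nearest parent outward.
  def ancestors
    parent ? [parent, *parent.ancestors] : []
  end
end

# Same body as the class method above.
def top_level_item?(node)
  return false if node.attribute('itemprop')

  node.ancestors.none? { |ancestor| ancestor.attribute('itemscope') && ancestor.attribute('itemprop') }
end

body    = DemoNode.new({}, nil)
article = DemoNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, body)
author  = DemoNode.new({ 'itemscope' => '', 'itemprop' => 'author' }, article)
nested  = DemoNode.new({ 'itemscope' => '', 'itemtype' => 'https://schema.org/Article' }, author)

p top_level_item?(article)  # true: no itemprop, no property-item ancestor
p top_level_item?(author)   # false: the node is itself a property
p top_level_item?(nested)   # false: it sits inside the author property item
```

Note that an ancestor with `itemscope` alone, such as `article` here, does not disqualify a descendant; only ancestors carrying both attributes do.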
Instance Method Details
#each {|article| ... } ⇒ Enumerator, void
Iterates over normalized article hashes extracted from supported Microdata roots.
    # File 'lib/html2rss/auto_source/scraper/microdata.rb', line 67
    def each
      return enum_for(:each) unless block_given?

      self.class.supported_roots(parsed_body).each do |root|
        article = article_from(root)
        yield article if article
      end
    end
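The `each` above follows the usual Ruby pattern for `Enumerable` classes: return an `Enumerator` when no block is given, and skip roots for which no article could be built. The sketch below shows the same pattern in isolation. `DemoScraper` and its `article_from` are hypothetical; the real `article_from` is not shown in this section, so the nil-on-missing-title rule is an assumption made only for illustration.

```ruby
class DemoScraper
  include Enumerable

  def initialize(roots)
    @roots = roots
  end

  def each
    # Without a block, hand back a lazy-friendly Enumerator instead of iterating.
    return enum_for(:each) unless block_given?

    @roots.each do |root|
      article = article_from(root)
      yield article if article # roots that produce nil are skipped
    end
  end

  private

  # Hypothetical extractor: returns nil when the root has no title.
  def article_from(root)
    title = root[:title]
    title ? { title: title } : nil
  end
end

scraper = DemoScraper.new([{ title: 'First' }, {}, { title: 'Second' }])

titles = scraper.map { |article| article[:title] }
p scraper.each.class  # Enumerator
p titles              # ["First", "Second"]
p scraper.count       # 2
```

Including `Enumerable` is what makes `map` and `count` available once `each` is defined.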