Class: Html2rss::AutoSource::Scraper::Schema
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::Schema
- Includes: Enumerable
- Defined in: lib/html2rss/auto_source/scraper/schema.rb, lib/html2rss/auto_source/scraper/schema/base.rb
Overview
Scrapes article data from <script type="application/ld+json"> "schema" tags.
Defined Under Namespace
Classes: Base
Constant Summary
- TAG_SELECTOR = 'script[type="application/ld+json"]'
- SCHEMA_OBJECT_TYPES =
    %w[
      AdvertiserContentArticle AnalysisNewsArticle APIReference Article
      AskPublicNewsArticle BackgroundNewsArticle BlogPosting
      DiscussionForumPosting LiveBlogPosting NewsArticle OpinionNewsArticle
      Report ReportageNewsArticle ReviewNewsArticle SatiricalArticle
      ScholarlyArticle SocialMediaPosting TechArticle
    ].to_set.freeze
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
- .from(object) ⇒ Array<Hash>
  Returns a flat array of all supported schema objects by recursively traversing the given `object`.
- .scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Base, ...
- .supported_schema_object?(object) ⇒ Boolean
Instance Method Summary
- #each {|Hash| ... } ⇒ Array<Hash>
  The scraped article_hashes.
- #initialize(parsed_body, url:) ⇒ Schema (constructor)
  A new instance of Schema.
Constructor Details
#initialize(parsed_body, url:) ⇒ Schema
Returns a new instance of Schema.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 97
    def initialize(parsed_body, url:)
      @parsed_body = parsed_body
      @url = url
    end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 45
    def articles?(parsed_body)
      parsed_body.css(TAG_SELECTOR).any? do |script|
        SCHEMA_OBJECT_TYPES.any? { |type| script.text.match?(/"@type"\s*:\s*"#{Regexp.escape(type)}"/) }
      end
    end
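The type check in .articles? is a regular-expression match on the raw script text rather than a JSON parse. A minimal stdlib-only sketch of that match follows. SAMPLE_TYPES and mentions_article_type? are hypothetical names used for illustration, and the type list is shortened; this is not the library's own code.

```ruby
# Hypothetical, shortened stand-in for SCHEMA_OBJECT_TYPES.
SAMPLE_TYPES = %w[Article NewsArticle BlogPosting].freeze

# Returns true if the raw JSON-LD text declares one of the sample types.
# Uses the same pattern as .articles?; no JSON parsing happens here.
def mentions_article_type?(text)
  SAMPLE_TYPES.any? do |type|
    text.match?(/"@type"\s*:\s*"#{Regexp.escape(type)}"/)
  end
end

mentions_article_type?('{"@type": "NewsArticle", "headline": "Hi"}') # matches
mentions_article_type?('{"@type":"Person","name":"Ada"}')            # no match
mentions_article_type?('{"@type": ["Article", "WebPage"]}')          # no match: array form
```

Because the pattern expects a quoted string directly after the colon, a script whose "@type" is given as an array is not detected by this check.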
.from(object) ⇒ Array<Hash>
Returns a flat array of all supported schema objects by recursively traversing the given `object`.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 59
    def from(object)
      case object
      when Nokogiri::XML::Element
        from(parse_script_tag(object))
      when Hash
        supported_schema_object?(object) ? [object] : object.values.flat_map { |item| from(item) }
      when Array
        object.flat_map { |item| from(item) }
      else
        []
      end
    end
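The recursion in .from can be illustrated on plain Ruby data. The sketch below is a simplified stand-in, not the library's code: it omits the Nokogiri::XML::Element branch and the logging, and SUPPORTED_TYPES and collect_supported are hypothetical names. It assumes the JSON-LD has already been parsed with symbol keys.

```ruby
require 'json'
require 'set'

# Hypothetical, shortened stand-in for SCHEMA_OBJECT_TYPES.
SUPPORTED_TYPES = %w[Article NewsArticle BlogPosting].to_set.freeze

# Walks nested hashes and arrays and returns every hash whose :@type is
# supported. A supported hash is returned as-is and not searched further.
def collect_supported(object)
  case object
  when Hash
    if SUPPORTED_TYPES.member?(object[:@type])
      [object]
    else
      object.values.flat_map { |item| collect_supported(item) }
    end
  when Array
    object.flat_map { |item| collect_supported(item) }
  else
    [] # strings, numbers, nil: nothing to collect
  end
end

json = <<~JSON
  {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "WebSite", "name": "Example" },
      { "@type": "NewsArticle", "headline": "First" },
      { "@type": "ItemList", "itemListElement": [
        { "@type": "BlogPosting", "headline": "Second" }
      ] }
    ]
  }
JSON

found = collect_supported(JSON.parse(json, symbolize_names: true))
puts found.map { |item| item[:headline] }.inspect
```

Unsupported hashes such as the WebSite and ItemList entries are not returned themselves, but their values are still searched, which is how the nested BlogPosting is found.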
.scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Base, ...
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 78
    def scraper_for_schema_object(schema_object)
      if SCHEMA_OBJECT_TYPES.member?(schema_object[:@type])
        Base
      else
        Log.warn("Schema#scraper_for_schema_object: Unsupported schema object @type: #{schema_object[:@type]}")
        nil
      end
    end
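The lookup above reads the type through the symbol key :@type, so it only recognizes hashes with symbol keys. The sketch below illustrates that detail with the standard JSON library. KNOWN_TYPES and known_type? are hypothetical names, and how the library itself parses the script tag is not shown on this page, so the parsing calls here are an assumption for illustration only.

```ruby
require 'json'
require 'set'

# Hypothetical, shortened stand-in for SCHEMA_OBJECT_TYPES.
KNOWN_TYPES = %w[Article NewsArticle].to_set.freeze

# Same membership test as .scraper_for_schema_object, without the logging.
def known_type?(schema_object)
  KNOWN_TYPES.member?(schema_object[:@type])
end

text = '{"@type": "NewsArticle", "headline": "Hi"}'

known_type?(JSON.parse(text, symbolize_names: true)) # true: key is :@type
known_type?(JSON.parse(text))                        # false: key is the string "@type"
```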
.supported_schema_object?(object) ⇒ Boolean
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 72
    def supported_schema_object?(object)
      scraper_for_schema_object(object) ? true : false
    end
Instance Method Details
#each {|Hash| ... } ⇒ Array<Hash>
Returns the scraped article_hashes.
    # File 'lib/html2rss/auto_source/scraper/schema.rb', line 105
    def each(&)
      return enum_for(:each) unless block_given?

      schema_objects.filter_map do |schema_object|
        next unless (klass = self.class.scraper_for_schema_object(schema_object))
        next unless (article_hash = klass.new(schema_object, url:).call)

        yield article_hash
      end
    end
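Since the class includes Enumerable, defining #each this way makes methods such as map, first, and count available, and calling #each without a block returns an Enumerator. The sketch below shows the same pattern on a toy class. ArticleList and its :title check are hypothetical stand-ins for the scraper lookup, not part of html2rss.

```ruby
# Hypothetical toy class that mirrors the shape of Schema#each.
class ArticleList
  include Enumerable

  def initialize(items)
    @items = items
  end

  # Yields each usable item; returns an Enumerator when no block is given.
  def each
    return enum_for(:each) unless block_given?

    @items.filter_map do |item|
      next unless item[:title] # skip entries that produce no article

      yield item
    end
  end
end

list = ArticleList.new([{ title: 'A' }, { url: 'x' }, { title: 'B' }])

list.map { |item| item[:title] } # Enumerable#map is built on #each
list.each                        # no block: an Enumerator
```

Entries skipped with next are never yielded, so Enumerable methods only see the items that produced a result.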