Class: Html2rss::AutoSource::Scraper::Schema

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/base.rb

Overview

  1. <script type="application/ld+json"> "schema" tag.

  2. tbd

See:

  1. schema.org/NewsArticle

  2. developers.google.com/search/docs/appearance/structured-data/article#microdata

Defined Under Namespace

Classes: Base

Constant Summary

TAG_SELECTOR =
'script[type="application/ld+json"]'
SCHEMA_OBJECT_TYPES =
%w[
  AdvertiserContentArticle
  AnalysisNewsArticle
  APIReference
  Article
  AskPublicNewsArticle
  BackgroundNewsArticle
  BlogPosting
  DiscussionForumPosting
  LiveBlogPosting
  NewsArticle
  OpinionNewsArticle
  Report
  ReportageNewsArticle
  ReviewNewsArticle
  SatiricalArticle
  ScholarlyArticle
  SocialMediaPosting
  TechArticle
].to_set.freeze

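Class Method Summary

  • .articles?(parsed_body) ⇒ Boolean

  • .from(object) ⇒ Array<Hash>

    Returns a flat array of all supported schema objects by recursively traversing the given object.

  • .scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Base, ...

  • .supported_schema_object?(object) ⇒ Boolean

Instance Method Summary

  • #each {|Hash| ... } ⇒ Array<Hash>

    Returns the scraped article_hashes.

  • #initialize(parsed_body, url:) ⇒ Schema

    Returns a new instance of Schema.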

Constructor Details

#initialize(parsed_body, url:) ⇒ Schema

Returns a new instance of Schema.



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 97

def initialize(parsed_body, url:)
  @parsed_body = parsed_body
  @url = url
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Returns:

  • (Boolean)

    true if the text of any JSON-LD script tag in parsed_body declares a supported @type


# File 'lib/html2rss/auto_source/scraper/schema.rb', line 45

def articles?(parsed_body)
  parsed_body.css(TAG_SELECTOR).any? do |script|
    SCHEMA_OBJECT_TYPES.any? { |type| script.text.match?(/"@type"\s*:\s*"#{Regexp.escape(type)}"/) }
  end
end
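
The check above runs a regular expression over the raw script text and does not parse the JSON. The following is a minimal sketch of that pattern match using only the standard library. `SUPPORTED_TYPES` and `article_type?` are hypothetical names introduced for illustration, and the plain strings stand in for `script.text`.

```ruby
require 'set'

# Hypothetical subset of SCHEMA_OBJECT_TYPES, for illustration only.
SUPPORTED_TYPES = %w[Article NewsArticle BlogPosting].to_set.freeze

# Hypothetical helper mirroring the per-script check: true if the raw
# JSON-LD text contains a "@type" key whose string value is supported.
def article_type?(script_text)
  SUPPORTED_TYPES.any? do |type|
    script_text.match?(/"@type"\s*:\s*"#{Regexp.escape(type)}"/)
  end
end

news    = '{"@context": "https://schema.org", "@type": "NewsArticle", "headline": "Hi"}'
spaced  = '{"@type" : "BlogPosting"}'
product = '{"@type": "Product", "name": "Widget"}'
listed  = '{"@type": ["NewsArticle"]}'

puts article_type?(news)    # true
puts article_type?(spaced)  # true  (whitespace around the colon is allowed)
puts article_type?(product) # false (type is not in the supported set)
puts article_type?(listed)  # false (the pattern expects a plain string value)
```

As the last case suggests, a pattern of this shape would not match an array-valued `@type`, because it expects a quoted string directly after the colon.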

.from(object) ⇒ Array<Hash>

Returns a flat array of all supported schema objects by recursively traversing the given `object`.

Parameters:

  • object (Hash, Array, Nokogiri::XML::Element)

Returns:

  • (Array<Hash>)

    the schema_objects, or an empty array



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 59

def from(object)
  case object
  when Nokogiri::XML::Element
    from(parse_script_tag(object))
  when Hash
    supported_schema_object?(object) ? [object] : object.values.flat_map { |item| from(item) }
  when Array
    object.flat_map { |item| from(item) }
  else
    []
  end
end
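
To illustrate the traversal, here is a minimal sketch that applies the same Hash/Array recursion to a parsed JSON-LD document. It assumes the JSON is parsed with symbolized keys (the listing looks up `:@type`), omits the `Nokogiri::XML::Element` branch, and uses the hypothetical names `SUPPORTED`, `supported?`, and `collect_schema_objects` in place of the library's own methods.

```ruby
require 'json'
require 'set'

# Hypothetical subset of the supported types, for illustration only.
SUPPORTED = %w[Article NewsArticle].to_set.freeze

# Hypothetical stand-in for supported_schema_object?.
def supported?(hash)
  SUPPORTED.member?(hash[:@type])
end

# Same shape as .from: a supported Hash is returned as-is (and not
# traversed further); any other Hash or Array is searched recursively;
# scalars contribute nothing.
def collect_schema_objects(object)
  case object
  when Hash
    supported?(object) ? [object] : object.values.flat_map { |item| collect_schema_objects(item) }
  when Array
    object.flat_map { |item| collect_schema_objects(item) }
  else
    []
  end
end

json = <<~JSON
  {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "WebSite", "name": "Example" },
      { "@type": "NewsArticle", "headline": "First" },
      { "@type": "ItemList", "itemListElement": [
        { "@type": "Article", "headline": "Second" }
      ] }
    ]
  }
JSON

data = JSON.parse(json, symbolize_names: true)
found = collect_schema_objects(data)
puts found.map { |obj| obj[:headline] }.inspect # ["First", "Second"]
```

Because a supported object is returned without descending into it, any supported objects nested inside it are not listed separately.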

.scraper_for_schema_object(schema_object) ⇒ Scraper::Schema::Base, ...

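Returns:

  • (Scraper::Schema::Base, nil)

    the Base scraper class if the schema object's @type is supported, otherwise nil (after logging a warning)
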

# File 'lib/html2rss/auto_source/scraper/schema.rb', line 78

def scraper_for_schema_object(schema_object)
  if SCHEMA_OBJECT_TYPES.member?(schema_object[:@type])
    Base
  else
    Log.warn("Schema#scraper_for_schema_object: Unsupported schema object @type: #{schema_object[:@type]}")
    nil
  end
end

.supported_schema_object?(object) ⇒ Boolean

Returns:

  • (Boolean)

    true if scraper_for_schema_object returns a scraper for the object, false otherwise


# File 'lib/html2rss/auto_source/scraper/schema.rb', line 72

def supported_schema_object?(object)
  scraper_for_schema_object(object) ? true : false
end

Instance Method Details

#each {|Hash| ... } ⇒ Array<Hash>

Returns the scraped article_hashes.

Yields:

  • (Hash)

    Each scraped article_hash

Returns:

  • (Array<Hash>)

    the scraped article_hashes; when no block is given, an Enumerator is returned instead



# File 'lib/html2rss/auto_source/scraper/schema.rb', line 105

def each(&)
  return enum_for(:each) unless block_given?

  schema_objects.filter_map do |schema_object|
    next unless (klass = self.class.scraper_for_schema_object(schema_object))
    next unless (article_hash = klass.new(schema_object, url:).call)

    yield article_hash
  end
end
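
Because the class includes Enumerable and defines `#each`, methods such as `map` and `first` become available on an instance. The following is a minimal sketch of that pattern with a hypothetical stand-in class, `ArticleList`, which is not part of html2rss; it extracts a title from plain hashes where the real method delegates to a scraper class.

```ruby
# Hypothetical stand-in for Schema: includes Enumerable and defines #each
# with the same enum_for / filter_map / yield structure as the listing.
class ArticleList
  include Enumerable

  def initialize(schema_objects)
    @schema_objects = schema_objects
  end

  def each
    # Without a block, hand back an Enumerator over the same items.
    return enum_for(:each) unless block_given?

    @schema_objects.filter_map do |schema_object|
      # Skip objects that do not produce an article (here: no headline).
      next unless (title = schema_object[:headline])

      yield({ title: title })
    end
  end
end

list = ArticleList.new(
  [{ headline: 'First' }, { name: 'no headline' }, { headline: 'Second' }]
)

titles = list.map { |article| article[:title] }
puts titles.inspect  # ["First", "Second"]
puts list.count      # 2
puts list.each.class # Enumerator
```

The `enum_for` guard is what allows calls such as `list.each.with_index` to work without a block.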