Class: Html2rss::AutoSource::Scraper::WordpressApi

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb

Overview

Scrapes WordPress sites through their REST API instead of parsing article HTML.

Defined Under Namespace

Classes: PageScope, PostsEndpoint

Constant Summary

API_LINK_SELECTOR =
'link[rel="https://api.w.org/"][href]'
'link[rel="canonical"][href]'
POSTS_FIELDS =
%w[id title excerpt content link date categories].freeze
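
The WordPress REST API accepts a `_fields` query parameter that limits each post in the response to the named fields, which is presumably what a list like POSTS_FIELDS is for. The following is a minimal sketch of building such a posts URL with Ruby's standard library only. The `posts_url` helper, the `per_page` default, and the example host are assumptions for illustration and are not taken from the scraper's source.

```ruby
require 'uri'

# Mirrors the POSTS_FIELDS constant documented above.
POSTS_FIELDS = %w[id title excerpt content link date categories].freeze

# Hypothetical helper (not part of html2rss): builds a posts URL below a
# WordPress REST API root such as the href of the page's api.w.org link.
# Assumes a path-style root that ends with a slash.
def posts_url(api_root, per_page: 10)
  uri = URI.join(api_root, 'wp/v2/posts')
  # `_fields` takes a comma-separated list; encode_www_form escapes the commas.
  uri.query = URI.encode_www_form(_fields: POSTS_FIELDS.join(','), per_page: per_page)
  uri.to_s
end

puts posts_url('https://example.com/wp-json/')
```

Requesting only the needed fields keeps responses small; sites that expose the API through a `?rest_route=` query string would need different URL handling than this sketch shows.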

Constructor Details

#initialize(parsed_body, url:, request_session: nil, **_opts) ⇒ void

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

  • url (String, Html2rss::Url)

    canonical page URL

  • request_session (Html2rss::RequestSession, nil) (defaults to: nil)

    shared request session for follow-up fetches

  • _opts (Hash)

    unused scraper-specific options



# File 'lib/html2rss/auto_source/scraper/wordpress_api.rb', line 33

def initialize(parsed_body, url:, request_session: nil, **_opts)
  @parsed_body = parsed_body
  @url = Html2rss::Url.from_absolute(url)
  @request_session = request_session
  @page_scope = PageScope.from(parsed_body:, url: @url)
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Returns whether the page advertises a WordPress REST API endpoint.

Parameters:

  • parsed_body (Nokogiri::HTML::Document, nil)

    parsed HTML document

Returns:

  • (Boolean)

    whether the page advertises a WordPress REST API endpoint



# File 'lib/html2rss/auto_source/scraper/wordpress_api.rb', line 21

def self.articles?(parsed_body)
  return false unless parsed_body

  !parsed_body.at_css(API_LINK_SELECTOR).nil?
end
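
The contract of the check above can be exercised without Nokogiri. This is a minimal sketch only: `FakeDocument` is a hypothetical stand-in that answers `at_css` from a lookup table, and the value of `API_LINK_SELECTOR` is assumed to be the api.w.org selector listed in the Constant Summary.

```ruby
# Assumed value, taken from the first selector in the Constant Summary.
API_LINK_SELECTOR = 'link[rel="https://api.w.org/"][href]'

# Hypothetical stand-in for a parsed document. A real caller would pass a
# Nokogiri::HTML::Document, whose at_css returns a node or nil.
class FakeDocument
  def initialize(matches)
    @matches = matches
  end

  def at_css(selector)
    @matches[selector]
  end
end

# Same logic as the documented class method, lifted into a plain method.
def articles?(parsed_body)
  return false unless parsed_body

  !parsed_body.at_css(API_LINK_SELECTOR).nil?
end

wordpress_page = FakeDocument.new(API_LINK_SELECTOR => :link_node)
plain_page = FakeDocument.new({})

p articles?(wordpress_page) # true: the page advertises the API link
p articles?(plain_page)     # false: no matching link element
p articles?(nil)            # false: the nil guard short-circuits
```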

.options_key ⇒ Object



# File 'lib/html2rss/auto_source/scraper/wordpress_api.rb', line 16

def self.options_key = :wordpress_api

Instance Method Details

#each {|article| ... } ⇒ Enumerator, void

Yields article hashes from the WordPress posts API.

Yield Parameters:

  • article (Hash<Symbol, Object>)

    normalized article hash

Returns:

  • (Enumerator, void)

    an enumerator when no block is given; otherwise the return value is not meaningful



# File 'lib/html2rss/auto_source/scraper/wordpress_api.rb', line 45

def each
  return enum_for(:each) unless block_given?
  return unless (posts = fetch_posts)

  posts.filter_map { article_from(_1) }.each { yield(_1) }
end
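
Because the class includes Enumerable and `#each` falls back to `enum_for`, callers can iterate with or without a block. The sketch below reproduces that pattern in a hypothetical `SketchScraper` class. Its `fetch_posts` and `article_from` are stubs invented for illustration (the real private methods are not shown in this documentation), and the article hash keys are assumptions.

```ruby
# Hypothetical stand-in that mirrors the shape of #each shown above.
class SketchScraper
  include Enumerable

  def initialize(posts)
    @posts = posts
  end

  def each
    return enum_for(:each) unless block_given?
    return unless (posts = fetch_posts)

    # filter_map drops posts for which article_from returns nil.
    posts.filter_map { article_from(_1) }.each { yield(_1) }
  end

  private

  # Stub: the real scraper fetches posts over HTTP; this returns canned data.
  def fetch_posts = @posts

  # Stub: skips posts without a link, otherwise builds a small hash.
  def article_from(post)
    return nil unless post['link']

    { id: post['id'], url: post['link'], title: post['title'] }
  end
end

posts = [
  { 'id' => 1, 'link' => 'https://example.com/a', 'title' => 'A' },
  { 'id' => 2, 'title' => 'No link' }
]

scraper = SketchScraper.new(posts)
p scraper.each.class            # Enumerator
p scraper.map { _1[:id] }       # [1]
p SketchScraper.new(nil).to_a   # []
```

Returning early when `fetch_posts` yields nothing means a failed or empty fetch produces no articles rather than raising, so Enumerable methods such as `to_a` still return an empty array.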