Class: Html2rss::AutoSource

Inherits:
Object
Defined in:
lib/html2rss/auto_source.rb,
lib/html2rss/auto_source/article.rb,
lib/html2rss/auto_source/channel.rb,
lib/html2rss/auto_source/cleanup.rb,
lib/html2rss/auto_source/reducer.rb,
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/rss_builder.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/base.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/semantic_html/image.rb,
lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

The AutoSource class extracts a channel and its articles from the response fetched for a given URL. It relies on a set of article extractors (the classes under the Scraper namespace) that detect articles through popular markup conventions, e.g. schema, microdata, Open Graph, and semantic HTML.

Defined Under Namespace

Modules: Scraper

Classes: Article, Channel, Cleanup, NoArticlesFound, Reducer, RssBuilder, UnsupportedUrlScheme

Constant Summary

SUPPORTED_URL_SCHEMES =
%w[http https].to_set.freeze

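Instance Method Summary

  • #articles ⇒ Object

  • #build ⇒ Object

  • #channel ⇒ Object

  • #initialize(url, body:, headers: {}) ⇒ AutoSource (constructor)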

Constructor Details

#initialize(url, body:, headers: {}) ⇒ AutoSource

Returns a new instance of AutoSource.

Parameters:

  • url (Addressable::URI)

    The URL to extract articles from.

  • body (String)

    The body of the response.

  • headers (Hash) (defaults to: {})

    The headers of the response.

Raises:

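  • (ArgumentError)

    If url is not an Addressable::URI, or if it is not absolute.

  • (UnsupportedUrlScheme)

    If the URL scheme is not included in SUPPORTED_URL_SCHEMES.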


# File 'lib/html2rss/auto_source.rb', line 23

def initialize(url, body:, headers: {})
  raise ArgumentError, 'URL must be a Addressable::URI' unless url.is_a?(Addressable::URI)
  raise ArgumentError, 'URL must be absolute' unless url.absolute?
  raise UnsupportedUrlScheme, "#{url.scheme} not supported" unless SUPPORTED_URL_SCHEMES.include?(url.scheme)

  @url = url
  @body = body
  @headers = headers
end
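
The order of the three guard clauses above determines which error a caller sees first. The following is a minimal sketch of that validation order, not the library's code: it assumes Ruby's standard-library URI as a stand-in for Addressable::URI, and the names SKETCH_SUPPORTED_URL_SCHEMES, SketchUnsupportedUrlScheme, sketch_validate_url! and sketch_outcome are hypothetical.

```ruby
require 'set'
require 'uri'

# Hypothetical stand-ins for the constant and error class documented above.
SKETCH_SUPPORTED_URL_SCHEMES = %w[http https].to_set.freeze

class SketchUnsupportedUrlScheme < StandardError; end

# Mirrors the guard order of the constructor: type, then absoluteness, then scheme.
# Assumption: stdlib URI::Generic replaces Addressable::URI so the sketch runs without gems.
def sketch_validate_url!(url)
  raise ArgumentError, 'URL must be a URI' unless url.is_a?(URI::Generic)
  raise ArgumentError, 'URL must be absolute' unless url.absolute?

  unless SKETCH_SUPPORTED_URL_SCHEMES.include?(url.scheme)
    raise SketchUnsupportedUrlScheme, "#{url.scheme} not supported"
  end

  url
end

# Reports either :ok or the error class and message, to make the outcomes easy to compare.
def sketch_outcome(url)
  sketch_validate_url!(url)
  :ok
rescue ArgumentError, SketchUnsupportedUrlScheme => e
  "#{e.class}: #{e.message}"
end

puts sketch_outcome(URI.parse('https://example.com/blog'))
puts sketch_outcome('https://example.com/blog')
puts sketch_outcome(URI.parse('/blog'))
puts sketch_outcome(URI.parse('ftp://example.com/file'))
```

In this sketch a plain String fails the type check before its scheme is ever inspected, which matches the constructor's requirement that callers parse the URL themselves before passing it in.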

Instance Method Details

#articles ⇒ Object



# File 'lib/html2rss/auto_source.rb', line 47

def articles
  @articles ||= Scraper.from(parsed_body).flat_map do |scraper|
    instance = scraper.new(parsed_body, url:)

    articles_in_thread = Parallel.map(instance.each) do |article_hash|
      Log.debug "Scraper: #{scraper} in worker: #{Parallel.worker_number} [#{article_hash[:url]}]"

      Article.new(**article_hash, scraper:)
    end

    Reducer.call(articles_in_thread, url:)

    articles_in_thread
  end
end
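
The shape of #articles (one pass per scraper class, flattened into a single memoized list) can be illustrated in isolation. This is a simplified sketch under stated assumptions: the scraper classes, SketchArticle and SketchSource are hypothetical, a plain map replaces Parallel.map, and the Reducer step is left out.

```ruby
# Hypothetical article value object; the real Article class is not reproduced here.
SketchArticle = Struct.new(:url, :title, :scraper, keyword_init: true)

# Hypothetical scraper: #each yields article hashes, or returns an Enumerator without a block.
class SketchListScraper
  def initialize(_parsed_body, url:)
    @url = url
  end

  def each
    return enum_for(:each) unless block_given?

    yield({ url: "#{@url}/list-1", title: 'List 1' })
    yield({ url: "#{@url}/list-2", title: 'List 2' })
  end
end

# A second hypothetical scraper that finds a single article.
class SketchSchemaScraper
  def initialize(_parsed_body, url:)
    @url = url
  end

  def each
    return enum_for(:each) unless block_given?

    yield({ url: "#{@url}/schema-1", title: 'Schema 1' })
  end
end

class SketchSource
  attr_reader :scraper_runs

  def initialize(url, scrapers)
    @url = url
    @scrapers = scrapers
    @scraper_runs = 0
  end

  # flat_map merges the per-scraper arrays; ||= ensures scrapers run only once.
  def articles
    @articles ||= @scrapers.flat_map do |scraper|
      @scraper_runs += 1
      instance = scraper.new(nil, url: @url)
      # Sequential map used here; the documented method uses Parallel.map instead.
      instance.each.map { |article_hash| SketchArticle.new(**article_hash, scraper: scraper) }
    end
  end
end

source = SketchSource.new('https://example.com', [SketchListScraper, SketchSchemaScraper])
source.articles.each { |article| puts "#{article.scraper}: #{article.url}" }
```

Calling source.articles a second time returns the same array object without instantiating the scrapers again, which is the effect of the ||= memoization.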

#build ⇒ Object

Raises:

  • (NoArticlesFound)

    If no articles could be extracted.



# File 'lib/html2rss/auto_source.rb', line 33

def build
  raise NoArticlesFound if articles.empty?

  Reducer.call(articles, url:)
  Cleanup.call(articles, url:, keep_different_domain: true)

  channel.articles = articles

  Html2rss::AutoSource::RssBuilder.new(
    channel:,
    articles:
  ).call
end
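
The control flow of #build (fail fast on an empty list, run post-processing steps, then hand the result to a builder) can be sketched as follows. This is an illustration only: it assumes the steps mutate the array in place, which the discarded return values in the method above suggest but the source does not confirm. The names SketchNoArticlesFound, SKETCH_DEDUPE, SKETCH_DROP_UNTITLED, SKETCH_BUILDER and sketch_build are hypothetical, and the two steps are invented placeholders rather than the behavior of Reducer or Cleanup.

```ruby
# Hypothetical stand-in for the NoArticlesFound error.
class SketchNoArticlesFound < StandardError; end

# Invented placeholder steps; each mutates the array it receives in place.
SKETCH_DEDUPE = ->(items) { items.uniq! { |item| item[:url] } }
SKETCH_DROP_UNTITLED = ->(items) { items.reject! { |item| item[:title].to_s.strip.empty? } }

# Invented placeholder builder; the real RssBuilder produces an RSS feed instead.
SKETCH_BUILDER = ->(items) { { item_count: items.size, titles: items.map { |item| item[:title] } } }

# Guard first, then apply each step in order, then build from what is left.
def sketch_build(articles, steps:, builder:)
  raise SketchNoArticlesFound if articles.empty?

  steps.each { |step| step.call(articles) }
  builder.call(articles)
end

articles = [
  { url: 'https://example.com/1', title: 'One' },
  { url: 'https://example.com/1', title: 'One again' },
  { url: 'https://example.com/2', title: ' ' }
]

result = sketch_build(articles, steps: [SKETCH_DEDUPE, SKETCH_DROP_UNTITLED], builder: SKETCH_BUILDER)
puts result.inspect
```

Raising before any step runs means the later stages never have to handle an empty list.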

#channel ⇒ Object



# File 'lib/html2rss/auto_source.rb', line 63

def channel
  @channel ||= Channel.new(parsed_body, headers: @headers, url:)
end