Class: Html2rss::AutoSource

Inherits:
Object
Defined in:
lib/html2rss/auto_source.rb,
lib/html2rss/auto_source/article.rb,
lib/html2rss/auto_source/channel.rb,
lib/html2rss/auto_source/cleanup.rb,
lib/html2rss/auto_source/reducer.rb,
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/rss_builder.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/semantic_html/image.rb,
lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

The AutoSource class is responsible for extracting the channel and the articles from a given URL. It uses a set of article extractors (the scrapers under the Scraper namespace) that rely on popular ways of marking up articles, e.g. schema, microdata, open graph, etc.

Defined Under Namespace

Modules: Scraper

Classes: Article, Channel, Cleanup, NoArticlesFound, Reducer, RssBuilder

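Instance Method Summary

  • #articles ⇒ Object

  • #build ⇒ Object

  • #channel ⇒ Object

  • #initialize(url, body:, headers: {}) ⇒ AutoSource (constructor)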

Constructor Details

#initialize(url, body:, headers: {}) ⇒ AutoSource

Returns a new instance of AutoSource.

Parameters:

  • url (Addressable::URI)

    The URL to extract articles from.

  • body (String)

    The body of the response.

  • headers (Hash) (defaults to: {})

    The headers of the response.

# File 'lib/html2rss/auto_source.rb', line 20

def initialize(url, body:, headers: {})
  @url = url
  @body = body
  @headers = headers
end
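
The signature above makes `body:` a required keyword and leaves `headers:` optional. The following is a minimal sketch of what that means for callers. It assumes a stand-in class (`AutoSourceSketch`, hypothetical) that copies only the constructor shown above, and it uses the stdlib `URI` in place of `Addressable::URI` so that it runs without any gems:

```ruby
require 'uri'

# Hypothetical stand-in that mirrors only the constructor signature shown above.
class AutoSourceSketch
  attr_reader :url, :body, :headers

  def initialize(url, body:, headers: {})
    @url = url
    @body = body
    @headers = headers
  end
end

# Stdlib URI is an assumption for illustration; the documented type is Addressable::URI.
url = URI.parse('https://example.com/news')

# headers: may be omitted and then defaults to an empty Hash.
source = AutoSourceSketch.new(url, body: '<html></html>')

# Omitting the required body: keyword raises ArgumentError.
missing_body_error =
  begin
    AutoSourceSketch.new(url)
  rescue ArgumentError => e
    e.message
  end
```

In practice the body and headers would come from an HTTP response fetched by the caller; the constructor itself only stores them.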

Instance Method Details

#articles ⇒ Object

# File 'lib/html2rss/auto_source.rb', line 40

def articles
  @articles ||= Scraper.from(parsed_body).flat_map do |scraper|
    instance = scraper.new(parsed_body, url:)

    articles_in_thread = Parallel.map(instance.each) do |article_hash|
      Log.debug "Scraper: #{scraper} in worker: #{Parallel.worker_number} [#{article_hash[:url]}]"

      Article.new(**article_hash, scraper:)
    end

    Reducer.call(articles_in_thread, url:)

    articles_in_thread
  end
end
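
The method above combines two patterns: each applicable scraper contributes its own list of articles, which `flat_map` merges into one array, and `||=` memoizes the result. Here is a rough sketch of that shape, assuming hypothetical stand-ins (`ArticleSketch`, `SchemaScraperSketch`, `HtmlScraperSketch`, `ArticlesSketch`) and a plain `map` in place of `Parallel.map` to stay dependency-free:

```ruby
# Hypothetical stand-ins; the real scrapers live under Html2rss::AutoSource::Scraper.
ArticleSketch = Struct.new(:url, :title, :scraper, keyword_init: true)

class SchemaScraperSketch
  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
  end

  # Yields one Hash per article, like the `instance.each` call above.
  def each
    return enum_for(:each) unless block_given?

    yield({ url: "#{@url}/a", title: 'From schema' })
  end
end

class HtmlScraperSketch
  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
  end

  def each
    return enum_for(:each) unless block_given?

    yield({ url: "#{@url}/a", title: 'From HTML' })
    yield({ url: "#{@url}/b", title: 'Another from HTML' })
  end
end

class ArticlesSketch
  SCRAPERS = [SchemaScraperSketch, HtmlScraperSketch].freeze

  def initialize(url, body:)
    @url = url
    @body = body
  end

  # Memoized: the scrapers run once, later calls return the same Array.
  def articles
    @articles ||= SCRAPERS.flat_map do |scraper|
      instance = scraper.new(@body, url: @url)

      # Plain map stands in for Parallel.map in this sketch.
      instance.each.map { |article_hash| ArticleSketch.new(**article_hash, scraper: scraper) }
    end
  end
end

sketch = ArticlesSketch.new('https://example.com', body: '<html></html>')
sketch.articles.size                    # => 3
sketch.articles.equal?(sketch.articles) # => true
```

Both stand-in scrapers report the `/a` URL, so the merged list contains it twice. Overlaps of this kind are presumably what the `Reducer` step is meant to handle.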

#build ⇒ Object

Raises:

  • (NoArticlesFound)

    When no articles are found.

# File 'lib/html2rss/auto_source.rb', line 26

def build
  raise NoArticlesFound if articles.empty?

  Reducer.call(articles, url:)
  Cleanup.call(articles, url:, keep_different_domain: true)

  channel.articles = articles

  Html2rss::AutoSource::RssBuilder.new(
    channel:,
    articles:
  ).call
end
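
Because `build` raises when the article list is empty, callers may want to rescue that case explicitly. The following sketch shows only the control flow. `NoArticlesFoundSketch`, `BuildSketch`, and `build_or_fallback` are hypothetical names, and the returned string is a placeholder for the feed object that the real method builds via `RssBuilder`:

```ruby
# Hypothetical stand-ins; only the control flow mirrors the `build` method above.
class NoArticlesFoundSketch < StandardError; end

class BuildSketch
  def initialize(articles)
    @articles = articles
  end

  def build
    raise NoArticlesFoundSketch if @articles.empty?

    # The real method reduces, cleans up, and hands off to RssBuilder here.
    "feed with #{@articles.size} article(s)"
  end
end

# Returns the built feed, or nil when there was nothing to build from.
def build_or_fallback(articles)
  BuildSketch.new(articles).build
rescue NoArticlesFoundSketch
  # A caller might report the failure or try a different source here.
  nil
end

build_or_fallback(%w[a b]) # => "feed with 2 article(s)"
build_or_fallback([])      # => nil
```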

#channel ⇒ Object

# File 'lib/html2rss/auto_source.rb', line 56

def channel
  @channel ||= Channel.new(parsed_body, headers: @headers, url:)
end