Class: Html2rss::AutoSource

Inherits: Object

Defined in: lib/html2rss/auto_source.rb,
lib/html2rss/auto_source/article.rb,
lib/html2rss/auto_source/channel.rb,
lib/html2rss/auto_source/cleanup.rb,
lib/html2rss/auto_source/reducer.rb,
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/rss_builder.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/semantic_html/image.rb,
lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

The AutoSource class is responsible for extracting the channel and the articles from a given URL. It uses a set of ArticleExtractors to extract articles, relying on popular ways of marking up articles, e.g. schema, microdata, and Open Graph.

Defined Under Namespace

Modules: Scraper

Classes: Article, Channel, Cleanup, NoArticlesFound, Reducer, RssBuilder

Instance Method Summary

#articles ⇒ Object
#build ⇒ Object
#channel ⇒ Object
#initialize(url, body:, headers: {}) ⇒ AutoSource (constructor)

A new instance of AutoSource.

Constructor Details

#initialize(url, body:, headers: {}) ⇒ `AutoSource`

Returns a new instance of AutoSource.

Parameters:

url (Addressable::URI): The URL to extract articles from.
body (String): The body of the response.
headers (Hash) (defaults to: {}): The headers of the response.

# File 'lib/html2rss/auto_source.rb', line 20

def initialize(url, body:, headers: {})
  @url = url
  @body = body
  @headers = headers
end

Instance Method Details

#articles ⇒ `Object`

# File 'lib/html2rss/auto_source.rb', line 40

def articles
  @articles ||= Scraper.from(parsed_body).flat_map do |scraper|
    instance = scraper.new(parsed_body, url:)

    articles_in_thread = Parallel.map(instance.each) do |article_hash|
      Log.debug "Scraper: #{scraper} in worker: #{Parallel.worker_number} [#{article_hash[:url]}]"

      Article.new(**article_hash, scraper:)
    end

    Reducer.call(articles_in_thread, url:)

    articles_in_thread
  end
end
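The fan-out in #articles (one scraper class per markup convention, each yielding article hashes that are wrapped and flattened into a single memoized list) can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in: `ArticleSketch`, `SchemaSketch`, `SemanticHtmlSketch`, and `AutoSourceSketch` are not part of html2rss, the scraper list is hard-coded rather than chosen by `Scraper.from`, and plain `map` replaces `Parallel.map` so the example runs without extra gems.

```ruby
# Hypothetical stand-in for Html2rss::AutoSource::Article.
ArticleSketch = Struct.new(:url, :title, :scraper, keyword_init: true)

# Hypothetical scraper: yields one hash per article it finds.
class SchemaSketch
  def initialize(body, url:)
    @body = body
    @url = url
  end

  def each
    return enum_for(:each) unless block_given?

    yield({ url: "#{@url}/a", title: 'From schema' })
  end
end

# A second hypothetical scraper with a different result set.
class SemanticHtmlSketch
  def initialize(body, url:)
    @body = body
    @url = url
  end

  def each
    return enum_for(:each) unless block_given?

    yield({ url: "#{@url}/b", title: 'From semantic HTML' })
    yield({ url: "#{@url}/c", title: 'Also from semantic HTML' })
  end
end

class AutoSourceSketch
  # Assumption: a fixed list; the real class selects scrapers from the parsed body.
  SCRAPERS = [SchemaSketch, SemanticHtmlSketch].freeze

  def initialize(url, body:)
    @url = url
    @body = body
  end

  # Memoized: the scrapers run once, later calls return the same array.
  def articles
    @articles ||= SCRAPERS.flat_map do |scraper|
      instance = scraper.new(@body, url: @url)

      # Sequential map here; the documented method uses Parallel.map instead.
      instance.each.map do |article_hash|
        ArticleSketch.new(**article_hash, scraper: scraper)
      end
    end
  end
end

source = AutoSourceSketch.new('https://example.com', body: '<html></html>')
source.articles.each { |a| puts "#{a.scraper}: #{a.title} (#{a.url})" }
```

Tagging each article with the scraper that produced it, as the documented method does via `scraper:`, keeps the origin of every item traceable after the per-scraper lists are flattened.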

#build ⇒ `Object`

Raises:

(NoArticlesFound)

# File 'lib/html2rss/auto_source.rb', line 26

def build
  raise NoArticlesFound if articles.empty?

  Reducer.call(articles, url:)
  Cleanup.call(articles, url:, keep_different_domain: true)

  channel.articles = articles

  Html2rss::AutoSource::RssBuilder.new(
    channel:,
    articles:
  ).call
end
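Because #build raises NoArticlesFound when nothing could be extracted, callers will likely want to rescue that error and fall back to something else. The sketch below shows one way to do so using hypothetical stand-ins (`NoArticlesFoundSketch`, `BuilderSketch`, `feed_or_nil`); it assumes the real error can be rescued by naming its class, and it returns a placeholder string rather than a real RSS object.

```ruby
# Hypothetical stand-in for Html2rss::AutoSource::NoArticlesFound.
class NoArticlesFoundSketch < StandardError; end

# Hypothetical stand-in that mirrors only the guard clause of #build.
class BuilderSketch
  def initialize(articles)
    @articles = articles
  end

  def build
    raise NoArticlesFoundSketch if @articles.empty?

    # Placeholder output; the documented method delegates to RssBuilder.
    "<rss><!-- #{@articles.size} item(s) --></rss>"
  end
end

# Returns the built feed, or nil when no articles were found.
def feed_or_nil(builder)
  builder.build
rescue NoArticlesFoundSketch
  nil
end

empty_result = feed_or_nil(BuilderSketch.new([]))
full_result = feed_or_nil(BuilderSketch.new(['one article']))

puts empty_result.inspect
puts full_result
```

Rescuing the specific error class, rather than a bare `rescue`, lets unrelated failures such as parsing errors continue to surface.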

#channel ⇒ `Object`

# File 'lib/html2rss/auto_source.rb', line 56

def channel
  @channel ||= Channel.new(parsed_body, headers: @headers, url:)
end
