Module: Html2rss::AutoSource::Scraper

Defined in:
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/schema/base.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/semantic_html/image.rb,
lib/html2rss/auto_source/scraper/semantic_html/extractor.rb

Overview

The Scraper module contains all scrapers that can be used to extract articles. Each scraper should implement a `call` method that returns an array of article hashes, and an `articles?` method that returns true if the scraper can potentially extract articles from the given HTML.
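The contract above can be sketched with a small, hypothetical scraper. `HeadlineLinkScraper`, its regular expression, and the `:title`/`:url` hash keys are assumptions made for illustration and are not part of html2rss. The sketch also assumes `articles?` is a class-level check and `call` an instance method, and it works on a raw HTML string so that it runs without Nokogiri, whereas the real scrapers receive a parsed document.

```ruby
# Hypothetical scraper, not part of html2rss. It takes a raw HTML string so
# the sketch has no dependencies; real scrapers work on a parsed document.
class HeadlineLinkScraper
  # Matches <h2><a href="...">title</a></h2>; deliberately simplistic.
  PATTERN = %r{<h2>\s*<a href="([^"]+)">([^<]+)</a>\s*</h2>}

  # Cheap check: could this scraper find articles in the given HTML?
  def self.articles?(html)
    html.match?(PATTERN)
  end

  def initialize(html)
    @html = html
  end

  # Returns an array of article hashes (the keys are assumptions).
  def call
    @html.scan(PATTERN).map { |url, title| { title: title.strip, url: url } }
  end
end

html = '<h2><a href="/a">First</a></h2><h2><a href="/b">Second</a></h2>'

# Only run the extraction when the scraper claims it can handle the input.
articles = HeadlineLinkScraper.articles?(html) ? HeadlineLinkScraper.new(html).call : []
```

Keeping `articles?` cheap and separate from `call` lets a caller probe several scrapers before committing to the more expensive extraction.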

Defined Under Namespace

Classes: Html, NoScraperFound, Schema, SemanticHtml

Constant Summary

SCRAPERS =
[
  Html,
  Schema,
  SemanticHtml
].freeze

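Class Method Summary

  • .from(parsed_body) ⇒ Array<Class>

    Returns an array of scrapers that claim to find articles in the parsed body.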

Class Method Details

.from(parsed_body) ⇒ Array<Class>

Returns an array of scrapers that claim to find articles in the parsed body.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML body.

Returns:

  • (Array<Class>)

    An array of scraper classes that can handle the parsed body.

Raises:

  • (NoScraperFound)

    If no scraper claims to find articles in the parsed body.

# File 'lib/html2rss/auto_source/scraper.rb', line 26

def self.from(parsed_body)
  scrapers = SCRAPERS.select { |scraper| scraper.articles?(parsed_body) }
  raise NoScraperFound, 'No suitable scraper found for URL.' if scrapers.empty?

  scrapers
end
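
The select-then-raise pattern used by `.from` can be tried in isolation with stand-in classes. Everything in this sketch (`LinkScraper`, `JsonLdScraper`, `SKETCH_SCRAPERS`, `scrapers_for`, and the locally defined `NoScraperFound`) is hypothetical and only mirrors the shape of the method above. It does not load html2rss, and it uses plain strings in place of a parsed document.

```ruby
# Stand-ins for illustration only; none of these are html2rss classes.
class NoScraperFound < StandardError; end

class LinkScraper
  def self.articles?(body)
    body.include?('<a ')
  end
end

class JsonLdScraper
  def self.articles?(body)
    body.include?('application/ld+json')
  end
end

SKETCH_SCRAPERS = [LinkScraper, JsonLdScraper].freeze

# Same shape as .from: keep every scraper that claims the body,
# and raise if none of them do.
def scrapers_for(body)
  found = SKETCH_SCRAPERS.select { |scraper| scraper.articles?(body) }
  raise NoScraperFound, 'No suitable scraper found.' if found.empty?

  found
end

matched = scrapers_for('<a href="/post">Post</a>')

# Callers that prefer an empty result over an exception can rescue it.
fallback = begin
  scrapers_for('<p>plain text</p>')
rescue NoScraperFound
  []
end
```

Because the method returns every matching scraper instead of the first one, a caller can run several scrapers and merge their results. Raising on an empty result keeps "no scraper applies" distinct from "a scraper ran and found nothing".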