Class: Html2rss::Selectors

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/selectors.rb,
lib/html2rss/selectors/config.rb,
lib/html2rss/selectors/extractors.rb,
lib/html2rss/selectors/extractors/href.rb,
lib/html2rss/selectors/extractors/html.rb,
lib/html2rss/selectors/extractors/text.rb,
lib/html2rss/selectors/post_processors.rb,
lib/html2rss/selectors/extractors/static.rb,
lib/html2rss/selectors/extractors/attribute.rb,
lib/html2rss/selectors/post_processors/base.rb,
lib/html2rss/selectors/post_processors/gsub.rb,
lib/html2rss/selectors/object_to_xml_converter.rb,
lib/html2rss/selectors/post_processors/template.rb,
lib/html2rss/selectors/post_processors/parse_uri.rb,
lib/html2rss/selectors/post_processors/substring.rb,
lib/html2rss/selectors/post_processors/parse_time.rb,
lib/html2rss/selectors/post_processors/sanitize_html.rb,
lib/html2rss/selectors/post_processors/html_to_markdown.rb,
lib/html2rss/selectors/post_processors/markdown_to_html.rb,
lib/html2rss/selectors/post_processors/html_transformers/wrap_img_in_a.rb,
lib/html2rss/selectors/post_processors/html_transformers/transform_urls_to_absolute_ones.rb

Overview

Scrapes articles from an HTML page using the CSS selectors defined in the feed config.

It supports the traditional feed configs that html2rss originally provided, so existing setups keep working.

It can also convert JSON into XML (see ObjectToXmlConverter).

Defined Under Namespace

Modules: Extractors, PostProcessors
Classes: Config, Context, InvalidSelectorName, ObjectToXmlConverter

Constant Summary

DEFAULT_CONFIG =
{ items: { enhance: true } }.freeze
ITEMS_SELECTOR_KEY =
:items
ITEM_TAGS =
%i[title url description author comments published_at guid enclosure categories].freeze
SPECIAL_ATTRIBUTES =
Set[:guid, :enclosure, :categories].freeze
RENAMED_ATTRIBUTES =
{ published_at: %i[updated pubDate] }.freeze

Mapping of new attribute names to their legacy names for backward compatibility.
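The mapping is keyed by the current attribute name, with the legacy names as values. A minimal sketch of how a config written with legacy keys could be normalized; `normalize_keys` is a hypothetical helper for illustration and is not part of html2rss:

```ruby
# Copy of the constant shown above.
RENAMED_ATTRIBUTES = { published_at: %i[updated pubDate] }.freeze

# Hypothetical helper (not part of html2rss): moves values stored under a
# legacy key to the current key, unless the current key is already present.
def normalize_keys(selectors)
  result = selectors.dup
  RENAMED_ATTRIBUTES.each do |new_name, legacy_names|
    legacy_names.each do |legacy|
      next unless result.key?(legacy)

      value = result.delete(legacy)
      result[new_name] = value unless result.key?(new_name)
    end
  end
  result
end

legacy_config = { title: { selector: 'h2' }, pubDate: { selector: 'time' } }
normalized = normalize_keys(legacy_config)
```

Here `normalized` keeps `:title` and carries the former `:pubDate` entry under `:published_at`.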

Constructor Details

#initialize(response, selectors:, time_zone:) ⇒ Selectors

Initializes a new Selectors instance.

Parameters:

  • response (RequestService::Response)

    The response object.

  • selectors (Hash)

    A hash of CSS selectors.

  • time_zone (String)

    Time zone string used for date parsing.



# File 'lib/html2rss/selectors.rb', line 38

def initialize(response, selectors:, time_zone:)
  @response = response
  @url = response.url
  @selectors = selectors
  @time_zone = time_zone

  prepare_selectors!
  @rss_item_attributes = @selectors.keys & Html2rss::RssBuilder::Article::PROVIDED_KEYS
end
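The last line of the constructor keeps only the selector names that are also article attributes, using array intersection. A small sketch of that step; the `provided_keys` list is an assumed stand-in, since the real contents of `Html2rss::RssBuilder::Article::PROVIDED_KEYS` are not shown here:

```ruby
# Assumed stand-in for Html2rss::RssBuilder::Article::PROVIDED_KEYS.
provided_keys = %i[title url description published_at]

selectors = {
  items: { selector: 'article' },
  title: { selector: 'h2' },
  foo: { selector: '.foo' },
  url: { selector: 'a' }
}

# Array#& keeps the order of the left operand and drops everything
# that is not in the right operand (:items and :foo in this case).
rss_item_attributes = selectors.keys & provided_keys
```

Under these assumptions only `:title` and `:url` would later be extracted per item.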

Instance Method Details

#articles ⇒ Array<Html2rss::RssBuilder::Article>

Returns the articles extracted from the response. The order is reversed if the items config sets order to 'reverse'. The result is memoized.

Returns:

  • (Array<Html2rss::RssBuilder::Article>)

    The extracted articles.


# File 'lib/html2rss/selectors.rb', line 53

def articles
  @articles ||= @selectors.dig(ITEMS_SELECTOR_KEY, :order) == 'reverse' ? to_a.tap(&:reverse!) : to_a
end
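The ordering check compares against the string 'reverse', so any other value leaves the scraped order untouched. A sketch of the same ternary applied to a plain array; `ordered` is a hypothetical helper and the sample data is made up:

```ruby
ITEMS_SELECTOR_KEY = :items

# Hypothetical helper mirroring the ternary in #articles on a plain array.
def ordered(list, selectors)
  selectors.dig(ITEMS_SELECTOR_KEY, :order) == 'reverse' ? list.dup.tap(&:reverse!) : list.dup
end

scraped = %w[newest middle oldest]

reversed     = ordered(scraped, { items: { selector: 'li', order: 'reverse' } })
unchanged    = ordered(scraped, { items: { selector: 'li' } })
symbol_order = ordered(scraped, { items: { selector: 'li', order: :reverse } })
```

In this sketch only `reversed` comes back in reverse order; the symbol `:reverse` does not equal the string and so has no effect.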

#each {|article| ... } ⇒ Enumerator

Iterates over each scraped article.

Yields:

  • (article)

    Gives each article as an Html2rss::RssBuilder::Article.

Returns:

  • (Enumerator)

    An enumerator if no block is given.



# File 'lib/html2rss/selectors.rb', line 62

def each(&)
  return enum_for(:each) unless block_given?

  enhance = enhance?

  parsed_body.css(items_selector).each do |item|
    article_hash = extract_article(item, response)

    enhance_article_hash(article_hash, item, response.url) if enhance

    yield Html2rss::RssBuilder::Article.new(**article_hash, scraper: self.class)
  end
end
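Because the class includes Enumerable, defining #each is enough to get map, first, to_a and the like, and the enum_for guard returns an Enumerator when no block is passed. A self-contained sketch of that pattern; `TinyScraper` is a hypothetical stand-in that yields plain hashes instead of Article objects:

```ruby
# Hypothetical stand-in for the scraper: yields one hash per raw item.
class TinyScraper
  include Enumerable

  def initialize(raw_items)
    @raw_items = raw_items
  end

  def each
    # Same guard as in Selectors#each: no block means "give me an Enumerator".
    return enum_for(:each) unless block_given?

    @raw_items.each { |raw| yield({ title: raw.strip }) }
  end
end

scraper = TinyScraper.new([' First ', 'Second '])

titles     = scraper.map { |article| article[:title] } # Enumerable#map builds on #each
enumerator = scraper.each                              # no block given
```

The same calls should work on a Selectors instance, with Article objects in place of the hashes.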

#enhance? ⇒ Boolean

Returns whether to enhance the article hash with auto_source's semantic HTML extraction.

Returns:

  • (Boolean)

    true if the items config has a truthy enhance value, false otherwise.



# File 'lib/html2rss/selectors.rb', line 82

def enhance? = !!@selectors.dig(ITEMS_SELECTOR_KEY, :enhance)
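The double negation turns whatever dig returns into a strict true or false, including nil when the key or the whole items entry is absent. A short sketch with made-up selector hashes; whether DEFAULT_CONFIG has been merged into @selectors before this check is not shown in this document:

```ruby
# Same expression as in #enhance?, wrapped in a lambda for illustration.
enhance = ->(selectors) { !!selectors.dig(:items, :enhance) }

enabled  = enhance.call({ items: { selector: 'article', enhance: true } })
disabled = enhance.call({ items: { selector: 'article', enhance: false } })
missing  = enhance.call({ items: { selector: 'article' } }) # dig returns nil
no_items = enhance.call({})                                 # dig returns nil
```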

#enhance_article_hash(article_hash, article_tag, base_url = @url) ⇒ Hash

Enhances the article hash using semantic HTML extraction. Only adds keys that are missing from the original hash or whose current value is nil or false. The hash is modified in place and returned.

Parameters:

  • article_hash (Hash)

    The original article hash.

  • article_tag (Nokogiri::XML::Element)

    HTML element to extract additional info from.

  • base_url

    Base URL passed on to HtmlExtractor. Defaults to @url.

Returns:

  • (Hash)

    The enhanced article hash.



# File 'lib/html2rss/selectors.rb', line 100

def enhance_article_hash(article_hash, article_tag, base_url = @url)
  selected_anchor = HtmlExtractor.main_anchor_for(article_tag)
  return article_hash unless selected_anchor

  extracted = HtmlExtractor.new(article_tag, base_url:, selected_anchor:).call
  return article_hash unless extracted

  extracted.each_with_object(article_hash) do |(key, value), hash|
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end
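The merge loop at the end decides which extracted values win. A sketch that reproduces only that loop on plain hashes; `fill_missing` is a hypothetical helper and the sample values are made up:

```ruby
# Hypothetical helper reproducing only the merge loop of #enhance_article_hash.
def fill_missing(article_hash, extracted)
  extracted.each_with_object(article_hash) do |(key, value), hash|
    # Skip empty extractions and keys that already hold a truthy value.
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end

article   = { title: 'From selector', description: nil }
extracted = { title: 'From markup', description: 'Teaser', image: nil, url: 'https://example.com/a' }

result = fill_missing(article, extracted)
```

In this sketch the selector-provided title is kept, the nil description is filled in, the nil image is skipped, and url is added. `result` is the same object as `article`, since each_with_object mutates and returns the hash it was given.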

#extract_article(item, page_response = response) ⇒ Hash

Extracts an article hash for a given item element.

Parameters:

  • item (Nokogiri::XML::Element)

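    The element to extract from.

  • page_response (RequestService::Response)

    Response whose url is used as the base URL. Defaults to the instance's response.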

Returns:

  • (Hash)

    Hash of attributes for the article.



# File 'lib/html2rss/selectors.rb', line 89

def extract_article(item, page_response = response)
  @rss_item_attributes.to_h { |key| [key, select(key, item, base_url: page_response.url)] }.compact
end
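The method builds a hash from the attribute names and then drops every attribute whose selection returned nil. A sketch of the to_h plus compact step, with a plain lookup hash standing in for #select (an assumption made for illustration):

```ruby
rss_item_attributes = %i[title url author]

# Stand-in for the values #select would return for one item.
selected_values = { title: 'Hello', url: 'https://example.com/hello', author: nil }

# Array#to_h with a block builds [key, value] pairs; compact removes nil values.
article_hash = rss_item_attributes.to_h { |key| [key, selected_values[key]] }.compact
```

Here `:author` does not appear in `article_hash` at all, which is what later allows #enhance_article_hash to fill it in.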

#items_selector ⇒ String

Returns the CSS selector for the items.

Returns:

  • (String)

    the CSS selector for the items



# File 'lib/html2rss/selectors.rb', line 79

def items_selector = @selectors.dig(ITEMS_SELECTOR_KEY, :selector)

#select(name, item, base_url: @url) ⇒ Object, Array<Object>

Selects the value for a given attribute from an HTML element.

Parameters:

  • name (Symbol, String)

    Name of the attribute.

  • item (Nokogiri::XML::Element)

    The HTML element to process.

  • base_url

    Base URL forwarded to the selection helpers. Defaults to @url.

Returns:

  • (Object, Array<Object>)

    The selected value(s).

Raises:

  • (InvalidSelectorName)

    If name is the reserved items key (:items).


# File 'lib/html2rss/selectors.rb', line 121

def select(name, item, base_url: @url)
  name = name.to_sym

  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  selector_key, config = selector_config_for(name)

  if SPECIAL_ATTRIBUTES.member?(selector_key)
    select_special(selector_key, item:, config:, base_url:)
  else
    select_regular(selector_key, item:, config:, base_url:)
  end
end
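The method first rejects the reserved items key, then dispatches on membership in SPECIAL_ATTRIBUTES. A sketch of just that control flow; `branch_for` is a hypothetical helper that reports which branch would run, and it skips the selector_config_for lookup, whose behavior is not shown here:

```ruby
require 'set'

# Local stand-ins for the class and constants documented above.
class InvalidSelectorName < StandardError; end

ITEMS_SELECTOR_KEY = :items
SPECIAL_ATTRIBUTES = Set[:guid, :enclosure, :categories].freeze

# Hypothetical helper: returns :special or :regular instead of selecting a value.
def branch_for(name)
  name = name.to_sym

  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  SPECIAL_ATTRIBUTES.member?(name) ? :special : :regular
end

guid_branch  = branch_for('guid') # strings are converted with to_sym
title_branch = branch_for(:title)
```

Calling `branch_for(:items)` or `branch_for('items')` raises InvalidSelectorName in this sketch, matching the guard in #select.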