Class: Html2rss::Selectors
- Inherits:
Object
- Object
- Html2rss::Selectors
- Includes:
- Enumerable
- Defined in:
- lib/html2rss/selectors.rb,
lib/html2rss/selectors/config.rb,
lib/html2rss/selectors/extractors.rb,
lib/html2rss/selectors/extractors/href.rb,
lib/html2rss/selectors/extractors/html.rb,
lib/html2rss/selectors/extractors/text.rb,
lib/html2rss/selectors/post_processors.rb,
lib/html2rss/selectors/extractors/static.rb,
lib/html2rss/selectors/extractors/attribute.rb,
lib/html2rss/selectors/post_processors/base.rb,
lib/html2rss/selectors/post_processors/gsub.rb,
lib/html2rss/selectors/object_to_xml_converter.rb,
lib/html2rss/selectors/post_processors/template.rb,
lib/html2rss/selectors/post_processors/parse_uri.rb,
lib/html2rss/selectors/post_processors/substring.rb,
lib/html2rss/selectors/post_processors/parse_time.rb,
lib/html2rss/selectors/post_processors/sanitize_html.rb,
lib/html2rss/selectors/post_processors/html_to_markdown.rb,
lib/html2rss/selectors/post_processors/markdown_to_html.rb,
lib/html2rss/selectors/post_processors/html_transformers/wrap_img_in_a.rb,
lib/html2rss/selectors/post_processors/html_transformers/transform_urls_to_absolute_ones.rb
Overview
This scraper extracts articles from a given HTML page using the CSS selectors defined in the feed config.
It supports the traditional feed configs that html2rss originally provided, so existing setups keep working.
It can also convert JSON into XML, which extends it to sources that are not HTML.
Defined Under Namespace
Modules: Extractors, PostProcessors
Classes: Config, Context, InvalidSelectorName, ObjectToXmlConverter
Constant Summary
- DEFAULT_CONFIG =
{ items: { enhance: true } }.freeze
- ITEMS_SELECTOR_KEY =
:items
- ITEM_TAGS =
%i[title url description author comments published_at guid enclosure categories].freeze
- SPECIAL_ATTRIBUTES =
Set[:guid, :enclosure, :categories].freeze
- RENAMED_ATTRIBUTES =
Mapping of new attribute names to their legacy names for backward compatibility.
{ published_at: %i[updated pubDate] }.freeze
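The mapping above can be read as a lookup from a new attribute name to the legacy names it replaces. The following is a minimal sketch of how such a mapping could be applied to a feed config; `resolve_renamed` is a hypothetical helper written for illustration, not a method of html2rss, and the library's own normalization is not shown on this page.

```ruby
# Constant copied from the summary above.
RENAMED_ATTRIBUTES = { published_at: %i[updated pubDate] }.freeze

# Hypothetical helper: returns a copy of the selectors hash in which legacy
# keys (:updated, :pubDate) are stored under their new name (:published_at).
# A new name that is already configured wins over a legacy one.
def resolve_renamed(selectors)
  result = selectors.dup
  RENAMED_ATTRIBUTES.each do |new_name, legacy_names|
    legacy_names.each do |legacy|
      next unless result.key?(legacy)

      value = result.delete(legacy)
      result[new_name] = value unless result.key?(new_name)
    end
  end
  result
end

legacy_config = {
  items: { selector: 'article' },
  title: { selector: 'h2' },
  pubDate: { selector: 'time' }
}

resolved = resolve_renamed(legacy_config)
p resolved.keys # => [:items, :title, :published_at]
```

The sketch copies the input instead of mutating it, so the original config hash stays usable for error messages or logging.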
Instance Method Summary
- #articles ⇒ Array<Html2rss::RssBuilder::Article>
Returns articles extracted from the response.
- #each {|article| ... } ⇒ Enumerator
Iterates over each scraped article.
- #enhance? ⇒ Boolean
Whether to enhance the article hash with auto_source's semantic HTML extraction.
- #enhance_article_hash(article_hash, article_tag, base_url = @url) ⇒ Hash
Enhances the article hash using semantic HTML extraction.
- #extract_article(item, page_response = response) ⇒ Hash
Extracts an article hash for a given item element.
- #initialize(response, selectors:, time_zone:) ⇒ Selectors
constructor
Initializes a new Selectors instance.
- #items_selector ⇒ String
Returns the CSS selector for the items.
- #select(name, item, base_url: @url) ⇒ Object+
Selects the value for a given attribute from an HTML element.
Constructor Details
#initialize(response, selectors:, time_zone:) ⇒ Selectors
Initializes a new Selectors instance.
# File 'lib/html2rss/selectors.rb', line 38
def initialize(response, selectors:, time_zone:)
  @response = response
  @url = response.url
  @selectors = selectors
  @time_zone = time_zone

  prepare_selectors!

  @rss_item_attributes = @selectors.keys & Html2rss::RssBuilder::Article::PROVIDED_KEYS
end
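The last line of the constructor intersects the configured selector names with the keys an article accepts. A small sketch of that intersection follows; the `PROVIDED_KEYS` values and the `teaser` selector are assumptions made for illustration, since the real constant is defined elsewhere in the library.

```ruby
# Stand-in for Html2rss::RssBuilder::Article::PROVIDED_KEYS (assumed values).
PROVIDED_KEYS = %i[title url description author published_at].freeze

selectors = {
  items: { selector: 'article' },
  title: { selector: 'h2' },
  url: { selector: 'a', extractor: 'href' },
  teaser: { selector: 'p' } # hypothetical helper selector, not an RSS item attribute
}

# Array#& keeps the elements present in both arrays, in the receiver's order.
rss_item_attributes = selectors.keys & PROVIDED_KEYS
p rss_item_attributes # => [:title, :url]
```

Under these assumptions, `:items` and `:teaser` are dropped because they are not article keys, so only `:title` and `:url` would be extracted per item.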
Instance Method Details
#articles ⇒ Array<Html2rss::RssBuilder::Article>
Returns articles extracted from the response. Reverses order if config specifies reverse ordering.
# File 'lib/html2rss/selectors.rb', line 53
def articles
  @articles ||= @selectors.dig(ITEMS_SELECTOR_KEY, :order) == 'reverse' ? to_a.tap(&:reverse!) : to_a
end
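The ordering and memoization in `#articles` can be tried in isolation. The sketch below uses a hypothetical `FakeScraper` class that yields plain strings instead of scraped articles; only the shape of `articles` is taken from the source above.

```ruby
# Hypothetical stand-in for a scraper: yields items in document order.
class FakeScraper
  include Enumerable

  def initialize(items, order: nil)
    @items = items
    @order = order
  end

  def each(&block)
    return enum_for(:each) unless block_given?

    @items.each(&block)
  end

  # Same shape as #articles: build the array once, reverse it in place when
  # the configured order is 'reverse', and memoize the result.
  def articles
    @articles ||= @order == 'reverse' ? to_a.tap(&:reverse!) : to_a
  end
end

forward = FakeScraper.new(%w[a b c])
reversed = FakeScraper.new(%w[a b c], order: 'reverse')

p forward.articles  # => ["a", "b", "c"]
p reversed.articles # => ["c", "b", "a"]
```

Because of `||=`, a second call returns the same array object rather than scraping again.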
#each {|article| ... } ⇒ Enumerator
Iterates over each scraped article.
# File 'lib/html2rss/selectors.rb', line 62
def each(&)
  return enum_for(:each) unless block_given?

  enhance = enhance?

  parsed_body.css(items_selector).each do |item|
    article_hash = extract_article(item, response)

    enhance_article_hash(article_hash, item, response.url) if enhance

    yield Html2rss::RssBuilder::Article.new(**article_hash, scraper: self.class)
  end
end
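Since `#each` returns an Enumerator when called without a block and the class includes Enumerable, callers get the standard collection methods for free. A minimal sketch of that pattern, using a hypothetical `Feed` class over plain hashes rather than a parsed HTML body:

```ruby
# Hypothetical class that follows the same #each contract.
class Feed
  include Enumerable

  def initialize(rows)
    @rows = rows
  end

  def each
    # Without a block, hand back an Enumerator over this same method.
    return enum_for(:each) unless block_given?

    @rows.each { |row| yield row.transform_keys(&:to_sym) }
  end
end

feed = Feed.new([{ 'title' => 'One' }, { 'title' => 'Two' }])

titles = feed.map { |article| article[:title] } # Enumerable#map builds on #each
first = feed.each.next                          # external iteration via the Enumerator

p titles # => ["One", "Two"]
p first  # => {:title=>"One"} (printed as {title: "One"} on Ruby 3.4+)
```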
#enhance? ⇒ Boolean
Returns whether to enhance the article hash with auto_source's semantic HTML extraction.
# File 'lib/html2rss/selectors.rb', line 82
def enhance? = !!@selectors.dig(ITEMS_SELECTOR_KEY, :enhance)
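The combination of `Hash#dig` and `!!` means a missing key yields `false` rather than `nil` or an error. A short sketch, with the selectors hash passed in as an argument (an adaptation for illustration; the real method reads `@selectors`):

```ruby
ITEMS_SELECTOR_KEY = :items

# Same expression as #enhance?, parameterized on the selectors hash.
def enhance?(selectors)
  !!selectors.dig(ITEMS_SELECTOR_KEY, :enhance)
end

p enhance?({ items: { selector: 'article', enhance: true } }) # => true
p enhance?({ items: { selector: 'article' } })                # => false
p enhance?({})                                                # => false
```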
#enhance_article_hash(article_hash, article_tag, base_url = @url) ⇒ Hash
Enhances the article hash using semantic HTML extraction. Only adds keys that are missing from the original hash.
# File 'lib/html2rss/selectors.rb', line 100
def enhance_article_hash(article_hash, article_tag, base_url = @url)
  selected_anchor = HtmlExtractor.main_anchor_for(article_tag)
  return article_hash unless selected_anchor

  extracted = HtmlExtractor.new(article_tag, base_url:, selected_anchor:).call
  return article_hash unless extracted

  extracted.each_with_object(article_hash) do |(key, value), hash|
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end
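The merge rule at the end of this method is easy to misread: a key counts as missing when it is absent or when its current value is falsy. The sketch below isolates that loop in a hypothetical `fill_missing` helper and feeds it hand-written hashes in place of real `HtmlExtractor` output.

```ruby
# Same merge rule as the loop in #enhance_article_hash: copy a value only when
# it is non-nil and the target holds no truthy value under that key.
def fill_missing(article_hash, extracted)
  extracted.each_with_object(article_hash) do |(key, value), hash|
    next if value.nil? || (hash.key?(key) && hash[key])

    hash[key] = value
  end
end

article = { title: 'From selector', url: nil }
extracted = {
  title: 'From semantic HTML',  # ignored: title is already set
  url: 'https://example.com/a', # used: existing value is nil
  image: nil,                   # ignored: nothing to copy
  description: 'Teaser'         # used: key is absent
}

result = fill_missing(article, extracted)
p result
```

Note that the hash is modified in place and also returned, so `result` and `article` refer to the same object.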
#extract_article(item, page_response = response) ⇒ Hash
Extracts an article hash for a given item element.
# File 'lib/html2rss/selectors.rb', line 89
def extract_article(item, page_response = response)
  @rss_item_attributes.to_h { |key| [key, select(key, item, base_url: page_response.url)] }.compact
end
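The method builds a hash from the attribute names and then drops entries whose value is `nil`. A minimal sketch of that `to_h` plus `compact` idiom, where `select_value` is a hypothetical stand-in for `#select` that reads from a plain hash instead of an HTML element:

```ruby
# Hypothetical stand-in for #select: look the value up in a plain hash.
def select_value(key, item)
  item[key]
end

rss_item_attributes = %i[title url author]
item = { title: 'Hello', url: 'https://example.com/hello' }

# Array#to_h with a block maps each key to a [key, value] pair;
# Hash#compact then removes pairs whose value is nil (here: :author).
article_hash = rss_item_attributes.to_h { |key| [key, select_value(key, item)] }.compact
p article_hash.keys # => [:title, :url]
```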
#items_selector ⇒ String
Returns the CSS selector for the items.
# File 'lib/html2rss/selectors.rb', line 79
def items_selector = @selectors.dig(ITEMS_SELECTOR_KEY, :selector)
#select(name, item, base_url: @url) ⇒ Object+
Selects the value for a given attribute from an HTML element.
# File 'lib/html2rss/selectors.rb', line 121
def select(name, item, base_url: @url)
  name = name.to_sym

  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  selector_key, config = selector_config_for(name)

  if SPECIAL_ATTRIBUTES.member?(selector_key)
    select_special(selector_key, item:, config:, base_url:)
  else
    select_regular(selector_key, item:, config:, base_url:)
  end
end
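The control flow of `#select` can be sketched without any HTML: reject the reserved `items` name, then branch on membership in `SPECIAL_ATTRIBUTES`. The `route` helper below is hypothetical and only reports which branch would run; it skips the `selector_config_for` lookup, whose behavior is not shown on this page, and defines its own stand-in error class.

```ruby
require 'set'

# Stand-in for Html2rss::Selectors::InvalidSelectorName.
class InvalidSelectorName < StandardError; end

ITEMS_SELECTOR_KEY = :items
SPECIAL_ATTRIBUTES = Set[:guid, :enclosure, :categories].freeze

# Hypothetical dispatcher: returns the branch that would handle the attribute.
def route(name)
  name = name.to_sym
  raise InvalidSelectorName, "Attribute selector '#{name}' is reserved for items." if name == ITEMS_SELECTOR_KEY

  SPECIAL_ATTRIBUTES.member?(name) ? :special : :regular
end

p route('guid') # => :special
p route(:title) # => :regular

begin
  route(:items)
rescue InvalidSelectorName => e
  puts e.message # prints: Attribute selector 'items' is reserved for items.
end
```

String names are accepted because the name is converted with `to_sym` before any comparison.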