Class: Html2rss::CategoryExtractor
- Inherits:
- Object
- Defined in:
- lib/html2rss/category_extractor.rb
Overview
CategoryExtractor extracts categories from HTML elements. It collects the visible text of elements whose CSS class names contain common category-related terms, and the values of attributes whose names contain those terms.
Constant Summary
- CATEGORY_TERMS =
Common category-related terms to look for in class names
%w[category tag topic section label theme subject].freeze
- CATEGORY_SELECTORS =
CSS selectors to find elements with category-related class names
CATEGORY_TERMS.map { |term| "[class*=\"#{term}\"]" }.freeze
- CATEGORY_ATTR_PATTERN =
Regex pattern for matching category-related attribute names
/#{CATEGORY_TERMS.join('|')}/i
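As a rough illustration of what these constants evaluate to, the sketch below reproduces them outside the class and checks a few names against the pattern. The sample names are hypothetical and chosen only to show that the pattern is an unanchored, case-insensitive substring match, so a name such as `stage` also matches because it contains `tag`.

```ruby
# Constants copied from the listing above so the snippet runs on its own.
CATEGORY_TERMS = %w[category tag topic section label theme subject].freeze
CATEGORY_SELECTORS = CATEGORY_TERMS.map { |term| "[class*=\"#{term}\"]" }.freeze
CATEGORY_ATTR_PATTERN = /#{CATEGORY_TERMS.join('|')}/i

# Each term becomes a substring attribute selector on the class attribute.
first_selector = CATEGORY_SELECTORS.first
puts first_selector # prints [class*="category"]

# Hypothetical attribute and class names, for illustration only.
samples = %w[data-category post-tags Topic-List author stage]
matches = samples.select { |name| name.match?(CATEGORY_ATTR_PATTERN) }
p matches # prints ["data-category", "post-tags", "Topic-List", "stage"]
```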
Class Method Summary
-
.call(article_tag) ⇒ Array<String>
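Extracts categories from the given article tag: the text of elements whose class names contain common category-related terms, plus the values of attributes whose names contain those terms. Returns an empty array when article_tag is nil.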
-
.extract_all_categories(article_tag) ⇒ Set<String>
Optimized single DOM traversal that extracts all category types.
-
.extract_element_data_categories(element) ⇒ Set<String>
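Extracts categories from the attributes of a single element whose names match CATEGORY_ATTR_PATTERN (for example, data-category), keeping each non-empty, stripped value.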
-
.extract_text_categories(element) ⇒ Set<String>
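Extracts text-based categories from an element. When descendant anchors yield text, their visible text is used; otherwise the element's own visible text is split on newlines into discrete values.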
Class Method Details
.call(article_tag) ⇒ Array<String>
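Extracts categories from the given article tag: the text of elements whose class names contain common category-related terms, plus the values of attributes whose names contain those terms. Returns an empty array when article_tag is nil.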
# File 'lib/html2rss/category_extractor.rb', line 23

def self.call(article_tag)
  return [] unless article_tag

  # Single optimized traversal that extracts all category types
  extract_all_categories(article_tag)
    .map(&:strip)
    .reject(&:empty?)
end
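The final normalization step in `.call` can be tried in isolation. The sketch below is not the library's code path end to end: `raw` is a hypothetical stand-in for the Set that `extract_all_categories` would return, with values picked to show the effect of `strip` and `reject`.

```ruby
require 'set'

# Hypothetical stand-in for the Set returned by extract_all_categories.
raw = Set.new(['News', ' News ', '', "Tech\n"])

# Same chain as in .call: Set#map yields an Array, so the result is an Array.
normalized = raw.map(&:strip).reject(&:empty?)
p normalized # prints ["News", "News", "Tech"]
```

Because the Set removes duplicates before stripping, two values that differ only in surrounding whitespace would both survive, as above. Whether such values can reach the Set in practice depends on what the extraction helpers return; a caller that needs strict uniqueness could append `.uniq` to the result.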
.extract_all_categories(article_tag) ⇒ Set<String>
Optimized single DOM traversal that extracts all category types.
# File 'lib/html2rss/category_extractor.rb', line 37

def self.extract_all_categories(article_tag)
  Set.new.tap do |categories|
    article_tag.css('*').each do |element|
      # Extract text categories from elements with category-related class names
      categories.merge(extract_text_categories(element)) if element['class']&.match?(CATEGORY_ATTR_PATTERN)

      # Extract data categories from all elements
      categories.merge(extract_element_data_categories(element))
    end
  end
end
.extract_element_data_categories(element) ⇒ Set<String>
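Extracts categories from the attributes of a single element whose names match CATEGORY_ATTR_PATTERN (for example, data-category), keeping each non-empty, stripped value.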
# File 'lib/html2rss/category_extractor.rb', line 54

def self.extract_element_data_categories(element)
  Set.new.tap do |categories|
    element.attributes.each_value do |attr|
      next unless attr.name.match?(CATEGORY_ATTR_PATTERN)

      value = attr.value&.strip
      categories.add(value) if value && !value.empty?
    end
  end
end
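To make the filtering logic easier to follow without a parsed document, the sketch below repeats the same loop over plain Ruby objects. `FakeAttr` and `data_categories` are hypothetical names introduced for this illustration; `FakeAttr` only mimics the `name` and `value` readers that the method calls on each attribute.

```ruby
require 'set'

CATEGORY_ATTR_PATTERN = /category|tag|topic|section|label|theme|subject/i

# Hypothetical stand-in for a parsed attribute (only name and value are used).
FakeAttr = Struct.new(:name, :value)

# Same filtering as the method above, applied to a plain list of attributes.
def data_categories(attrs)
  Set.new.tap do |categories|
    attrs.each do |attr|
      next unless attr.name.match?(CATEGORY_ATTR_PATTERN)

      value = attr.value&.strip
      categories.add(value) if value && !value.empty?
    end
  end
end

attrs = [
  FakeAttr.new('data-category', ' Politics '), # matched, value stripped
  FakeAttr.new('data-topic', ''),              # matched, but empty value is skipped
  FakeAttr.new('href', '/politics'),           # name does not match
  FakeAttr.new('aria-label', 'Read more')      # matched, because the name contains "label"
]

result = data_categories(attrs)
p result.to_a # prints ["Politics", "Read more"]
```

The check is applied to the attribute name only and is not limited to a `data-` prefix, so any attribute whose name contains one of the terms contributes its value.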
.extract_text_categories(element) ⇒ Set<String>
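Extracts text-based categories from an element. When descendant anchors yield text, their visible text is used; otherwise the element's own visible text is split on newlines into discrete values.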
# File 'lib/html2rss/category_extractor.rb', line 70

def self.extract_text_categories(element)
  anchor_values = element.css('a').filter_map do |node|
    HtmlExtractor.extract_visible_text(node)
  end
  return Set.new(anchor_values.reject(&:empty?)) if anchor_values.any?

  text = HtmlExtractor.extract_visible_text(element)
  return Set.new unless text

  Set.new(text.split(/\n+/).map(&:strip).reject(&:empty?))
end
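The fallback branch, used when no anchor text is found, splits the element's visible text on runs of newlines. A minimal sketch of that splitting step, assuming a hypothetical `text` value in place of whatever `HtmlExtractor.extract_visible_text` would return:

```ruby
# Hypothetical visible text; the real value comes from HtmlExtractor.
text = "\nNews\n\n  World Affairs \n"

# Same chain as the last line of the method, without the Set wrapper.
categories = text.split(/\n+/).map(&:strip).reject(&:empty?)
p categories # prints ["News", "World Affairs"]
```

Splitting only on newlines keeps multi-word values such as "World Affairs" intact; text separated by commas or other delimiters on a single line would remain one value.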