Class: Html2rss::CategoryExtractor

Inherits: Object

Defined in: lib/html2rss/category_extractor.rb

Overview

CategoryExtractor extracts categories from HTML elements. It collects the visible text of elements whose CSS class names contain common category-related terms, along with the values of attributes whose names contain those terms.

Constant Summary

CATEGORY_TERMS =
Common category-related terms to look for in class names

%w[category tag topic section label theme subject].freeze

CATEGORY_SELECTORS =
CSS selectors to find elements with category-related class names

CATEGORY_TERMS.map { |term| "[class*=\"#{term}\"]" }.freeze

CATEGORY_ATTR_PATTERN =
Case-insensitive regex pattern for matching category-related attribute names (also applied to class attribute values)

/#{CATEGORY_TERMS.join('|')}/i
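The following sketch shows what the three constants evaluate to and how the pattern behaves. The constant definitions are copied from the summary above; the sample names in the loop are hypothetical and chosen only for illustration.

```ruby
# Constants copied from the summary above.
CATEGORY_TERMS = %w[category tag topic section label theme subject].freeze
CATEGORY_SELECTORS = CATEGORY_TERMS.map { |term| "[class*=\"#{term}\"]" }.freeze
CATEGORY_ATTR_PATTERN = /#{CATEGORY_TERMS.join('|')}/i

# Each selector is a CSS substring match on the class attribute.
puts CATEGORY_SELECTORS.first # [class*="category"]

# The pattern is an unanchored, case-insensitive alternation, so it matches
# a term anywhere inside a name. The sample names below are hypothetical.
%w[post-tags data-category Topic-List stage sidebar].each do |name|
  puts "#{name}: #{name.match?(CATEGORY_ATTR_PATTERN)}"
end
# post-tags: true
# data-category: true
# Topic-List: true
# stage: true   (contains "tag")
# sidebar: false
```

Because the match is a plain substring test, unrelated names such as "stage" also qualify. This is worth keeping in mind when reading the extraction methods below.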

Class Method Summary

.call(article_tag) ⇒ Array<String>
Extracts categories from the given article tag by looking for elements with class names containing common category-related terms.
.extract_all_categories(article_tag) ⇒ Set<String>
Optimized single DOM traversal that extracts all category types.
.extract_element_data_categories(element) ⇒ Set<String>
Extracts categories from data attributes of a single element.
.extract_text_categories(element) ⇒ Set<String>
Extracts text-based categories from elements, splitting content into discrete values.

Class Method Details

.call(article_tag) ⇒ `Array<String>`

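Extracts categories from the given article tag. Returns an empty array when article_tag is nil; otherwise delegates to .extract_all_categories, strips surrounding whitespace from each value, and drops values that are empty after stripping.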

Parameters:

article_tag (Nokogiri::XML::Element) —
The article element to extract categories from

Returns:

(Array<String>) —
Array of category strings, empty if none found

# File 'lib/html2rss/category_extractor.rb', line 23

def self.call(article_tag)
  return [] unless article_tag

  # Single optimized traversal that extracts all category types
  extract_all_categories(article_tag)
    .map(&:strip)
    .reject(&:empty?)
end
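The post-processing at the end of .call can be tried in isolation. This is a minimal sketch: the `raw` set is a hypothetical stand-in for the return value of extract_all_categories, not real extractor output.

```ruby
require 'set'

# Hypothetical stand-in for the Set returned by extract_all_categories.
raw = Set.new(['  Ruby ', 'RSS', '   ', 'Ruby'])

# Mirrors the tail of .call: Set#map returns an Array, then blanks are dropped.
categories = raw.map(&:strip).reject(&:empty?)

p categories # ["Ruby", "RSS", "Ruby"]
```

Stripping happens after the Set has removed exact duplicates, so two values that differ only in surrounding whitespace would both survive. Whether such input can reach this point depends on HtmlExtractor.extract_visible_text, which is not shown on this page; a caller that needs unique values could apply `uniq` to the result.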

.extract_all_categories(article_tag) ⇒ `Set<String>`

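Visits every element matched by article_tag.css('*') once and collects both category types in a single pass: text categories from elements whose class attribute matches CATEGORY_ATTR_PATTERN, and attribute-based categories from every element.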

Parameters:

article_tag (Nokogiri::XML::Element) —
The article element

Returns:

(Set<String>) —
Set of category strings

# File 'lib/html2rss/category_extractor.rb', line 37

def self.extract_all_categories(article_tag)
  Set.new.tap do |categories|
    article_tag.css('*').each do |element|
      # Extract text categories from elements with category-related class names
      categories.merge(extract_text_categories(element)) if element['class']&.match?(CATEGORY_ATTR_PATTERN)

      # Extract data categories from all elements
      categories.merge(extract_element_data_categories(element))
    end
  end
end
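The two branches of the traversal can be illustrated without a DOM. The sketch below is a simplification under stated assumptions: `TraversalDemo`, its `Element` struct, and `collect` are hypothetical stand-ins, and each element's text is taken as a single ready-made value rather than going through extract_text_categories.

```ruby
require 'set'

module TraversalDemo
  PATTERN = /category|tag|topic|section|label|theme|subject/i

  # Hypothetical flattened element: class string, visible text, attribute hash.
  Element = Struct.new(:css_class, :text, :attrs)

  def self.collect(elements)
    elements.each_with_object(Set.new) do |el, categories|
      # Text is only considered when the class name matches the pattern.
      categories << el.text if el.css_class&.match?(PATTERN)

      # Attribute names are checked on every element, matching class or not.
      el.attrs.each { |name, value| categories << value if name.match?(PATTERN) }
    end
  end
end

elements = [
  TraversalDemo::Element.new('post-tag', 'Ruby', {}),
  TraversalDemo::Element.new('byline', 'Jane Doe', {}),
  TraversalDemo::Element.new(nil, 'Body', { 'data-section' => 'Tech' })
]

p TraversalDemo.collect(elements).to_a # ["Ruby", "Tech"]
```

The third element has no class at all, yet still contributes a category through its attribute, which is the behavior the second comment in the method source describes.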

.extract_element_data_categories(element) ⇒ `Set<String>`

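Extracts categories from the attributes of a single element. An attribute contributes its stripped value when its name matches CATEGORY_ATTR_PATTERN (for example, data-category) and the value is not empty. The check applies to every attribute name, not only to data-* attributes.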

Parameters:

element (Nokogiri::XML::Element) —
The element to process

Returns:

(Set<String>) —
Set of category strings

# File 'lib/html2rss/category_extractor.rb', line 54

def self.extract_element_data_categories(element)
  Set.new.tap do |categories|
    element.attributes.each_value do |attr|
      next unless attr.name.match?(CATEGORY_ATTR_PATTERN)

      value = attr.value&.strip
      categories.add(value) if value && !value.empty?
    end
  end
end
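To make the name-based filtering concrete, here is a minimal sketch that applies the same logic to a plain Hash. `AttrDemo` and `categories_from` are hypothetical names, and the Hash of name to value is a simplified stand-in for Nokogiri's `element.attributes`.

```ruby
require 'set'

module AttrDemo
  PATTERN = /category|tag|topic|section|label|theme|subject/i

  # attributes: Hash of attribute name => value (stand-in for Nokogiri attributes).
  def self.categories_from(attributes)
    attributes.each_with_object(Set.new) do |(name, value), categories|
      next unless name.match?(PATTERN)

      stripped = value&.strip
      categories.add(stripped) if stripped && !stripped.empty?
    end
  end
end

attrs = {
  'class' => 'teaser',
  'data-category' => ' Politics ',
  'data-topic' => '',
  'aria-label' => 'Read more',
  'href' => '/news/1'
}

p AttrDemo.categories_from(attrs).to_a # ["Politics", "Read more"]
```

In this sketch the `aria-label` value is picked up because the name contains "label", while the empty `data-topic` value is skipped.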

.extract_text_categories(element) ⇒ `Set<String>`

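Extracts text-based categories from an element. If the element contains anchors, the visible text of each anchor becomes one category. Otherwise the element's own visible text is split on newlines into discrete values.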

Parameters:

element (Nokogiri::XML::Element) —
The element to process

Returns:

(Set<String>) —
Set of category strings

# File 'lib/html2rss/category_extractor.rb', line 70

def self.extract_text_categories(element)
  anchor_values = element.css('a').filter_map do |node|
    HtmlExtractor.extract_visible_text(node)
  end
  return Set.new(anchor_values.reject(&:empty?)) if anchor_values.any?

  text = HtmlExtractor.extract_visible_text(element)
  return Set.new unless text

  Set.new(text.split(/\n+/).map(&:strip).reject(&:empty?))
end
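The anchor-first, newline-fallback logic can be sketched on plain strings. This is an illustration under assumptions: `text_categories` is a hypothetical helper, and its two arguments stand in for the results of HtmlExtractor.extract_visible_text, whose exact whitespace handling is not shown on this page.

```ruby
require 'set'

# anchor_texts: visible text of each <a> inside the element (may contain nil).
# full_text: visible text of the whole element, or nil.
def text_categories(anchor_texts, full_text)
  anchors = anchor_texts.compact
  # Anchors win: when any exist, the element's own text is never consulted.
  return Set.new(anchors.reject(&:empty?)) if anchors.any?
  return Set.new unless full_text

  # Fallback: one category per non-blank line.
  Set.new(full_text.split(/\n+/).map(&:strip).reject(&:empty?))
end

p text_categories(%w[Ruby RSS], 'Tags: Ruby RSS').to_a # ["Ruby", "RSS"]
p text_categories([], "News\n\n  Sports \n").to_a      # ["News", "Sports"]
p text_categories([], 'News, Sports').to_a             # ["News, Sports"]
```

Only newlines act as separators in the fallback branch, so a comma-separated list on one line stays a single category.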
