Class: Html2rss::CategoryExtractor

Inherits:
Object
Defined in:
lib/html2rss/category_extractor.rb

Overview

CategoryExtractor extracts categories from HTML elements. It reads the text of elements whose CSS class names contain common category-related terms, and the values of attributes whose names contain those terms.

Constant Summary

CATEGORY_TERMS = %w[category tag topic section label theme subject].freeze

Common category-related terms to look for in class names

CATEGORY_SELECTORS = CATEGORY_TERMS.map { |term| "[class*=\"#{term}\"]" }.freeze

CSS selectors to find elements with category-related class names

CATEGORY_ATTR_PATTERN = /#{CATEGORY_TERMS.join('|')}/i

Regex pattern for matching category-related attribute names
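
To make the matching behaviour of these constants concrete, here is a minimal sketch that rebuilds them as local variables (the names terms, selectors and attr_pattern are illustrative, not part of the library) and probes a few strings. Because the pattern is an unanchored, case-insensitive alternation, any name that contains one of the terms as a substring matches.

```ruby
# Local copies of the constant values, rebuilt for illustration only.
terms = %w[category tag topic section label theme subject]
selectors = terms.map { |term| "[class*=\"#{term}\"]" }
attr_pattern = /#{terms.join('|')}/i

first_selector = selectors.first                        # CSS selector for the first term
class_match = 'Post-Category'.match?(attr_pattern)      # case-insensitive substring match
attr_match = 'data-topic'.match?(attr_pattern)          # attribute names are tested the same way
no_match = 'href'.match?(attr_pattern)                  # contains none of the terms
plural_match = 'categories'.match?(attr_pattern)        # "categories" does not contain "category"
substring_match = 'stage'.match?(attr_pattern)          # "stage" contains "tag"

puts first_selector
puts [class_match, attr_match, no_match, plural_match, substring_match].inspect
```

Two consequences are worth keeping in mind: a plural such as "categories" is not matched by the term "category", while an unrelated name such as "stage" is matched because it contains "tag".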

Class Method Details

.call(article_tag) ⇒ Array<String>

Extracts categories from the given article tag by looking for elements whose class names, or attribute names, contain common category-related terms. Returns an empty array when article_tag is nil.

Parameters:

  • article_tag (Nokogiri::XML::Element)

    The article element to extract categories from

Returns:

  • (Array<String>)

    Array of category strings, empty if none found



# File 'lib/html2rss/category_extractor.rb', line 23

def self.call(article_tag)
  return [] unless article_tag

  # Single optimized traversal that extracts all category types
  extract_all_categories(article_tag)
    .map(&:strip)
    .reject(&:empty?)
end
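
The last two steps of .call can be tried in isolation. The sketch below assumes a hand-built Set as a stand-in for the result of extract_all_categories (the variable raw and its contents are hypothetical) and applies the same strip and reject chain. Since Set#map returns an Array, this is also where the documented Array<String> return type comes from.

```ruby
require 'set'

# Hypothetical stand-in for the Set returned by extract_all_categories.
raw = Set.new([' Ruby ', '', 'RSS'])

# Same post-processing chain as .call: strip each value, drop empty strings.
categories = raw.map(&:strip).reject(&:empty?)

puts categories.inspect
puts categories.class
```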

.extract_all_categories(article_tag) ⇒ Set<String>

Optimized single DOM traversal that extracts all category types.

Parameters:

  • article_tag (Nokogiri::XML::Element)

    The article element

Returns:

  • (Set<String>)

    Set of category strings



# File 'lib/html2rss/category_extractor.rb', line 37

def self.extract_all_categories(article_tag)
  Set.new.tap do |categories|
    article_tag.css('*').each do |element|
      # Extract text categories from elements with category-related class names
      categories.merge(extract_text_categories(element)) if element['class']&.match?(CATEGORY_ATTR_PATTERN)

      # Extract data categories from all elements
      categories.merge(extract_element_data_categories(element))
    end
  end
end

.extract_element_data_categories(element) ⇒ Set<String>

Extracts categories from the attributes of a single element whose names match CATEGORY_ATTR_PATTERN (for example, data-category).

Parameters:

  • element (Nokogiri::XML::Element)

    The element to process

Returns:

  • (Set<String>)

    Set of category strings



# File 'lib/html2rss/category_extractor.rb', line 54

def self.extract_element_data_categories(element)
  Set.new.tap do |categories|
    element.attributes.each_value do |attr|
      next unless attr.name.match?(CATEGORY_ATTR_PATTERN)

      value = attr.value&.strip
      categories.add(value) if value && !value.empty?
    end
  end
end
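
The attribute filter above can be exercised without Nokogiri. This is a minimal sketch under stated assumptions: fake_attr is a hypothetical Struct standing in for an attribute object that exposes name and value, the attributes hash imitates what element.attributes would return, and extract is an illustrative lambda that repeats the method body.

```ruby
require 'set'

# Same alternation as CATEGORY_ATTR_PATTERN, written out for a standalone run.
attr_pattern = /category|tag|topic|section|label|theme|subject/i

# Hypothetical stand-in for an attribute object with name and value.
fake_attr = Struct.new(:name, :value)

# Illustrative lambda mirroring the body of extract_element_data_categories.
extract = lambda do |attributes|
  Set.new.tap do |categories|
    attributes.each_value do |attr|
      next unless attr.name.match?(attr_pattern)

      value = attr.value&.strip
      categories.add(value) if value && !value.empty?
    end
  end
end

attributes = {
  'data-category' => fake_attr.new('data-category', ' Tech '),  # kept, stripped
  'data-topic' => fake_attr.new('data-topic', '   '),           # blank value, skipped
  'aria-label' => fake_attr.new('aria-label', 'Read more'),     # name contains "label"
  'href' => fake_attr.new('href', '/posts/1')                   # name does not match
}

result = extract.call(attributes)
puts result.to_a.inspect
```

The aria-label entry shows that the filter is not limited to data-* attributes: any attribute name containing one of the terms contributes its value.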

.extract_text_categories(element) ⇒ Set<String>

Extracts text-based categories from elements, splitting content into discrete values.

Parameters:

  • element (Nokogiri::XML::Element)

    The element to process

Returns:

  • (Set<String>)

    Set of category strings



# File 'lib/html2rss/category_extractor.rb', line 70

def self.extract_text_categories(element)
  anchor_values = element.css('a').filter_map do |node|
    HtmlExtractor.extract_visible_text(node)
  end
  return Set.new(anchor_values.reject(&:empty?)) if anchor_values.any?

  text = HtmlExtractor.extract_visible_text(element)
  return Set.new unless text

  Set.new(text.split(/\n+/).map(&:strip).reject(&:empty?))
end