Class: Html2rss::AutoSource::Cleanup

Inherits:
Object
Defined in:
lib/html2rss/auto_source/cleanup.rb

Overview

Cleanup filters and refines the list of extracted articles. It drops invalid articles, removes duplicates by URL, keeps only articles with HTTP or HTTPS URLs, rejects articles from other domains unless configured otherwise, and enforces a minimum word count for titles.

Constant Summary

DEFAULT_CONFIG =
{
  keep_different_domain: false,
  min_words_title: 3
}.freeze
VALID_SCHEMES =
%w[http https].to_set.freeze

Class Method Details

.call(articles, url:, keep_different_domain:, min_words_title:) ⇒ Object



# File 'lib/html2rss/auto_source/cleanup.rb', line 18

def call(articles, url:, keep_different_domain:, min_words_title:)
  Log.debug "Cleanup: start with #{articles.size} articles"

  articles.select!(&:valid?)

  deduplicate_by!(articles, :url)

  keep_only_http_urls!(articles)
  reject_different_domain!(articles, url) unless keep_different_domain
  keep_only_with_min_words_title!(articles, min_words_title:)

  Log.debug "Cleanup: end with #{articles.size} articles"
  articles
end
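
The keyword arguments of call match the keys of DEFAULT_CONFIG, so a caller could plausibly splat the defaults and override single values. The following is a minimal sketch of that pattern in plain Ruby. CallSketch and its stand-in call method (which only echoes the options it received) are hypothetical and not part of html2rss.

```ruby
# Hypothetical stand-in module: it only demonstrates keyword splatting
# and does not perform any cleanup.
module CallSketch
  DEFAULT_CONFIG = {
    keep_different_domain: false,
    min_words_title: 3
  }.freeze

  # Same keyword signature as Cleanup.call; returns the options it was given.
  def self.call(articles, url:, keep_different_domain:, min_words_title:)
    {
      count: articles.size,
      url: url,
      keep_different_domain: keep_different_domain,
      min_words_title: min_words_title
    }
  end
end

# Use the defaults unchanged.
defaults = CallSketch.call([], url: "https://example.com",
                               **CallSketch::DEFAULT_CONFIG)

# Override one value. Hash#merge returns a new hash, so the frozen
# default hash is left untouched.
custom = CallSketch.call([], url: "https://example.com",
                             **CallSketch::DEFAULT_CONFIG.merge(min_words_title: 5))

puts defaults[:min_words_title]
puts custom[:min_words_title]
```

Because all three keywords are required, omitting one raises an ArgumentError instead of silently falling back to a default.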

.deduplicate_by!(articles, key) ⇒ Object

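Deduplicates articles by a given key. The first article for each distinct value is kept; later articles with the same value, and articles whose value is nil, are removed.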

Parameters:

  • articles (Array<Article>)

    The list of articles to process.

  • key (Symbol)

    The name of the article method whose return value is used to detect duplicates.



# File 'lib/html2rss/auto_source/cleanup.rb', line 38

def deduplicate_by!(articles, key)
  seen = {}
  articles.reject! do |article|
    value = article.public_send(key)
    value.nil? || seen.key?(value).tap { seen[value] = true }
  end
end
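
The behavior of the tap idiom above is easy to misread, so here is a self-contained sketch that runs the same logic on sample data. DedupSketch and its Article struct are hypothetical stand-ins for the library's classes, and the URLs are plain strings for brevity.

```ruby
module DedupSketch
  # Hypothetical stand-in for the library's Article class.
  Article = Struct.new(:url, :title)

  def self.deduplicate_by!(articles, key)
    seen = {}
    articles.reject! do |article|
      value = article.public_send(key)
      # nil values are rejected outright. Otherwise the article is rejected
      # only if the value was seen before; tap records the value after the
      # lookup, so the first occurrence survives.
      value.nil? || seen.key?(value).tap { seen[value] = true }
    end
  end
end

articles = [
  DedupSketch::Article.new("https://example.com/a", "First"),
  DedupSketch::Article.new("https://example.com/a", "Duplicate of first"),
  DedupSketch::Article.new(nil, "No URL"),
  DedupSketch::Article.new("https://example.com/b", "Second")
]

DedupSketch.deduplicate_by!(articles, :url)
puts articles.map(&:title).inspect
```

Only "First" and "Second" remain. Note that Array#reject! returns nil when nothing was removed, so the return value should not be relied on as the resulting list.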

.keep_only_http_urls!(articles) ⇒ Object

Keeps only articles with HTTP or HTTPS URLs.

Parameters:

  • articles (Array<Article>)

    The list of articles to process.



# File 'lib/html2rss/auto_source/cleanup.rb', line 50

def keep_only_http_urls!(articles)
  articles.select! { |article| VALID_SCHEMES.include?(article.url&.scheme) }
end
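
As an illustration of the scheme filter, the sketch below applies the same check to a few sample URLs. It assumes the article URL responds to scheme and uses Ruby's standard URI class as a stand-in for Html2rss::Url; SchemeSketch and its Article struct are hypothetical.

```ruby
require "set"
require "uri"

module SchemeSketch
  VALID_SCHEMES = %w[http https].to_set.freeze

  # Hypothetical stand-in for the library's Article class.
  Article = Struct.new(:url)

  def self.keep_only_http_urls!(articles)
    # The safe navigation operator yields nil for articles without a URL,
    # and nil is not in the set, so those articles are dropped too.
    articles.select! { |article| VALID_SCHEMES.include?(article.url&.scheme) }
  end
end

articles = [
  SchemeSketch::Article.new(URI("https://example.com/post")),
  SchemeSketch::Article.new(URI("http://example.com/post")),
  SchemeSketch::Article.new(URI("mailto:editor@example.com")),
  SchemeSketch::Article.new(URI("ftp://example.com/file")),
  SchemeSketch::Article.new(nil)
]

SchemeSketch.keep_only_http_urls!(articles)
puts articles.map { |article| article.url.to_s }.inspect
```

Only the https and http entries survive; the mailto, ftp, and URL-less articles are removed.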

.keep_only_with_min_words_title!(articles, min_words_title:) ⇒ Object

Keeps only articles whose title has at least min_words_title words. Articles without a title are not checked and are kept.

Parameters:

  • articles (Array<Article>)

    The list of articles to process.

  • min_words_title (Integer)

    The minimum number of words in the title.



# File 'lib/html2rss/auto_source/cleanup.rb', line 69

def keep_only_with_min_words_title!(articles, min_words_title:)
  articles.select! do |article|
    article.title ? word_count_at_least?(article.title, min_words_title) : true
  end
end
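
The helper word_count_at_least? is not shown on this page. The sketch below assumes a simple whitespace-based word count purely for illustration; the library's actual helper may count words differently. TitleSketch and its Article struct are hypothetical stand-ins.

```ruby
module TitleSketch
  # Hypothetical stand-in for the library's Article class.
  Article = Struct.new(:title)

  # Assumed implementation: counts whitespace-separated words.
  def self.word_count_at_least?(text, min_words)
    text.split.size >= min_words
  end

  def self.keep_only_with_min_words_title!(articles, min_words_title:)
    articles.select! do |article|
      # A nil title skips the word count, so the article is kept.
      article.title ? word_count_at_least?(article.title, min_words_title) : true
    end
  end
end

articles = [
  TitleSketch::Article.new("Read more"),
  TitleSketch::Article.new("Local bakery wins national award"),
  TitleSketch::Article.new(nil)
]

TitleSketch.keep_only_with_min_words_title!(articles, min_words_title: 3)
puts articles.map(&:title).inspect
```

With a minimum of three words, the two-word title is dropped while the five-word title and the title-less article remain.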

.reject_different_domain!(articles, base_url) ⇒ Object

Rejects articles whose URL host is not equal to the host of the source URL.

Parameters:

  • articles (Array<Article>)

    The list of articles to process.

  • base_url (Html2rss::Url)

    The source URL to compare against.



# File 'lib/html2rss/auto_source/cleanup.rb', line 59

def reject_different_domain!(articles, base_url)
  base_host = base_url.host
  articles.select! { |article| article.url&.host == base_host }
end
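
Since the comparison is a plain equality check on host, a subdomain is treated as a different domain. The sketch below demonstrates this with Ruby's standard URI class as a stand-in for Html2rss::Url, which is an assumption: the library's own URL class may normalize hosts differently. DomainSketch and its Article struct are hypothetical.

```ruby
require "uri"

module DomainSketch
  # Hypothetical stand-in for the library's Article class.
  Article = Struct.new(:url)

  def self.reject_different_domain!(articles, base_url)
    base_host = base_url.host
    # Keep an article only if its host equals the base host exactly.
    articles.select! { |article| article.url&.host == base_host }
  end
end

articles = [
  "https://example.com/a",
  "https://www.example.com/b",
  "https://other.test/c"
].map { |link| DomainSketch::Article.new(URI(link)) }

DomainSketch.reject_different_domain!(articles, URI("https://example.com/"))
puts articles.map { |article| article.url.to_s }.inspect
```

Only the article on example.com is kept; the www.example.com and other.test articles are rejected.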