Class: Html2rss::AutoSource::Cleanup
- Inherits: Object
- Defined in: lib/html2rss/auto_source/cleanup.rb
Overview
Cleanup is responsible for cleaning up the extracted articles. It applies various strategies to filter and refine the article list.
Constant Summary
- DEFAULT_CONFIG = { keep_different_domain: false, min_words_title: 3 }.freeze
- VALID_SCHEMES = %w[http https].to_set.freeze
Class Method Summary
- .call(articles, url:, keep_different_domain:, min_words_title:) ⇒ Object
- .deduplicate_by!(articles, key) ⇒ Object
  Deduplicates articles by a given key.
- .keep_only_http_urls!(articles) ⇒ Object
  Keeps only articles with HTTP or HTTPS URLs.
- .keep_only_with_min_words_title!(articles, min_words_title:) ⇒ Object
  Keeps articles whose title has at least min_words_title words; articles without a title are kept.
- .reject_different_domain!(articles, base_url) ⇒ Object
  Rejects articles that have a URL not on the same domain as the source.
Class Method Details
.call(articles, url:, keep_different_domain:, min_words_title:) ⇒ Object
    # File 'lib/html2rss/auto_source/cleanup.rb', line 18
    def call(articles, url:, keep_different_domain:, min_words_title:)
      Log.debug "Cleanup: start with #{articles.size} articles"

      articles.select!(&:valid?)

      deduplicate_by!(articles, :url)

      keep_only_http_urls!(articles)
      reject_different_domain!(articles, url) unless keep_different_domain
      keep_only_with_min_words_title!(articles, min_words_title:)

      Log.debug "Cleanup: end with #{articles.size} articles"
      articles
    end
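The order of the filters in .call can be traced with a small standalone sketch. Everything here is an assumption made for illustration: the Article struct and its valid? method are hypothetical stand-ins, CleanupSketch is not the real module, stdlib URI replaces whatever URL class the library uses, the word count is a plain split, and logging is omitted. Passing the defaults with **DEFAULT_CONFIG is likewise only one plausible way to call it.

```ruby
require 'set'
require 'uri'

# Hypothetical article stand-in; the real article class is not shown on this page.
Article = Struct.new(:url, :title) do
  def valid?
    !url.nil?
  end
end

# Illustrative re-creation of the pipeline, not the library's module.
module CleanupSketch
  DEFAULT_CONFIG = { keep_different_domain: false, min_words_title: 3 }.freeze
  VALID_SCHEMES = %w[http https].to_set.freeze

  module_function

  def call(articles, url:, keep_different_domain:, min_words_title:)
    articles.select!(&:valid?)

    # Deduplicate by URL string, keeping the first occurrence.
    seen = {}
    articles.reject! { |a| seen.key?(a.url.to_s).tap { seen[a.url.to_s] = true } }

    articles.select! { |a| VALID_SCHEMES.include?(a.url.scheme) }
    articles.select! { |a| a.url.host == url.host } unless keep_different_domain
    # Assumed word count: whitespace-separated tokens.
    articles.select! { |a| a.title ? a.title.split.size >= min_words_title : true }
    articles
  end
end

articles = [
  Article.new(URI('https://example.com/a'), 'A long enough title'),
  Article.new(URI('https://example.com/a'), 'A long enough title'),
  Article.new(URI('https://other.test/x'), 'Another long enough title'),
  Article.new(URI('https://example.com/b'), 'Short'),
  Article.new(nil, 'Article without any URL')
]

result = CleanupSketch.call(articles, url: URI('https://example.com/'), **CleanupSketch::DEFAULT_CONFIG)
puts result.map(&:title).inspect
```

Because every step uses a bang method, the array passed in is filtered in place and the same object is returned.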
.deduplicate_by!(articles, key) ⇒ Object
Deduplicates articles by a given key. Articles whose value for the key is nil are removed as well.
    # File 'lib/html2rss/auto_source/cleanup.rb', line 38
    def deduplicate_by!(articles, key)
      seen = {}
      articles.reject! do |article|
        value = article.public_send(key)
        value.nil? || seen.key?(value).tap { seen[value] = true }
      end
    end
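The tap idiom in deduplicate_by! is compact enough to be worth tracing. The sketch below copies the method body from the listing and runs it against a hypothetical Article struct (an assumption; the real article class is not shown here) with plain strings as URLs.

```ruby
# Hypothetical stand-in for the article objects.
Article = Struct.new(:url, :title)

# Method body copied from the listing so the example runs on its own.
def deduplicate_by!(articles, key)
  seen = {}
  articles.reject! do |article|
    value = article.public_send(key)
    # `seen.key?(value)` is evaluated first; `tap` then records the value and
    # returns that earlier result, so the first occurrence is never rejected.
    value.nil? || seen.key?(value).tap { seen[value] = true }
  end
end

articles = [
  Article.new('https://example.com/a', 'First'),
  Article.new('https://example.com/a', 'Duplicate of first'),
  Article.new(nil, 'No URL'),
  Article.new('https://example.com/b', 'Second')
]

deduplicate_by!(articles, :url)
titles = articles.map(&:title)
puts titles.inspect
```

The first article seen for each key wins, and an article with a nil key is dropped before it is ever recorded in seen.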
.keep_only_http_urls!(articles) ⇒ Object
Keeps only articles with HTTP or HTTPS URLs.
    # File 'lib/html2rss/auto_source/cleanup.rb', line 50
    def keep_only_http_urls!(articles)
      articles.select! { |article| VALID_SCHEMES.include?(article.url&.scheme) }
    end
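A short sketch of the scheme filter follows. It assumes article URLs respond to #scheme the way stdlib URI objects do; the library may use a different URL class, and the Article struct is a hypothetical stand-in.

```ruby
require 'set'
require 'uri'

# Hypothetical stand-in for the article objects.
Article = Struct.new(:url)

VALID_SCHEMES = %w[http https].to_set.freeze

# Method body copied from the listing so the example runs on its own.
def keep_only_http_urls!(articles)
  # `&.` yields nil for articles without a URL, and nil is not in the set.
  articles.select! { |article| VALID_SCHEMES.include?(article.url&.scheme) }
end

articles = [
  Article.new(URI('https://example.com/post')),
  Article.new(URI('mailto:someone@example.com')),
  Article.new(URI('ftp://example.com/file')),
  Article.new(nil)
]

keep_only_http_urls!(articles)
kept = articles.map { |article| article.url.to_s }
puts kept.inspect
```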
.keep_only_with_min_words_title!(articles, min_words_title:) ⇒ Object
Keeps articles whose title has at least min_words_title words. Articles without a title are kept.
    # File 'lib/html2rss/auto_source/cleanup.rb', line 69
    def keep_only_with_min_words_title!(articles, min_words_title:)
      articles.select! do |article|
        article.title ? word_count_at_least?(article.title, min_words_title) : true
      end
    end
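The helper word_count_at_least? is not documented on this page, so the sketch below substitutes a hypothetical version that counts whitespace-separated tokens; the real implementation may differ. The Article struct is also a stand-in. The point of the example is the ternary: a nil title passes the filter.

```ruby
# Hypothetical stand-in for the article objects.
Article = Struct.new(:title)

# Assumed helper, NOT the library's implementation: counts whitespace-separated words.
def word_count_at_least?(str, min_words)
  str.split.size >= min_words
end

# Method body copied from the listing so the example runs on its own.
def keep_only_with_min_words_title!(articles, min_words_title:)
  articles.select! do |article|
    article.title ? word_count_at_least?(article.title, min_words_title) : true
  end
end

articles = [
  Article.new('Breaking news from the city'),
  Article.new('Read more'),
  Article.new(nil)
]

keep_only_with_min_words_title!(articles, min_words_title: 3)
titles = articles.map(&:title)
puts titles.inspect
```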
.reject_different_domain!(articles, base_url) ⇒ Object
Rejects articles whose URL host differs from the host of the source URL. Hosts are compared exactly, and articles without a URL are rejected.
    # File 'lib/html2rss/auto_source/cleanup.rb', line 59
    def reject_different_domain!(articles, base_url)
      base_host = base_url.host
      articles.select! { |article| article.url&.host == base_host }
    end
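Since the comparison is a plain equality check on #host, a subdomain such as www.example.com does not match example.com. The sketch below illustrates this, assuming stdlib URI objects and a hypothetical Article struct in place of the library's own classes.

```ruby
require 'uri'

# Hypothetical stand-in for the article objects.
Article = Struct.new(:url)

# Method body copied from the listing so the example runs on its own.
def reject_different_domain!(articles, base_url)
  base_host = base_url.host
  articles.select! { |article| article.url&.host == base_host }
end

articles = [
  Article.new(URI('https://example.com/a')),
  Article.new(URI('https://www.example.com/b')), # different host string
  Article.new(URI('https://other.test/c')),
  Article.new(nil)                               # nil host never equals base_host
]

reject_different_domain!(articles, URI('https://example.com/'))
hosts = articles.map { |article| article.url.host }
puts hosts.inspect
```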