Class: Html2rss::AutoSource::Scraper::Html
- Inherits:
- Object
- Includes:
- Enumerable
- Defined in:
- lib/html2rss/auto_source/scraper/html.rb
Overview
Scrapes articles from HTML pages by finding similar structures around anchor tags in the parsed_body.
Instance Attribute Summary
- #parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
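Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
Returns true if scraping parsed_body yields at least one article.
- .parent_until_condition(node, condition) ⇒ Object
Walks up from node and returns the first node for which condition is true, or nil if none is found below the html element.
- .simplify_xpath(xpath) ⇒ Object
Simplify an XPath selector by removing the index notation.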
Instance Method Summary
- #each {|article_hash| ... } ⇒ Enumerator
Yields each scraped article; returns an Enumerator when no block is given.
- #frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2) ⇒ Set<String>
Finds all anchors in root and returns the simplified XPath selectors that occur at least min_frequency times.
- #initialize(parsed_body, url:) ⇒ Html (constructor)
Returns a new instance of Html.
Constructor Details
#initialize(parsed_body, url:) ⇒ Html
Returns a new instance of Html.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 32

def initialize(parsed_body, url:)
  @parsed_body = parsed_body
  @url = url
  @css_selectors = Hash.new(0)
end
Instance Attribute Details
#parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 38

def parsed_body
  @parsed_body
end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
# File 'lib/html2rss/auto_source/scraper/html.rb', line 15

def self.articles?(parsed_body)
  new(parsed_body, url: '').any?
end
.parent_until_condition(node, condition) ⇒ Object
# File 'lib/html2rss/auto_source/scraper/html.rb', line 19

def self.parent_until_condition(node, condition)
  return nil if !node || node.parent.name == 'html'
  return node if condition.call(node)

  parent_until_condition(node.parent, condition)
end
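The upward walk performed by this method can be tried without a parsed document. The following is a minimal sketch, not part of the library: `Node` is a hypothetical stand-in that models only the `name` and `parent` accessors the method relies on, and the method body is copied from the listing above into a top-level method.

```ruby
# Hypothetical stand-in for a parsed HTML node; only `name` and `parent` are modelled.
Node = Struct.new(:name, :parent)

# Standalone copy of the method body from the listing above.
def parent_until_condition(node, condition)
  return nil if !node || node.parent.name == 'html'
  return node if condition.call(node)

  parent_until_condition(node.parent, condition)
end

# Build the chain html > body > article > h2 > a.
html    = Node.new('html', nil)
body    = Node.new('body', html)
article = Node.new('article', body)
heading = Node.new('h2', article)
anchor  = Node.new('a', heading)

is_article = ->(node) { node.name == 'article' }
is_section = ->(node) { node.name == 'section' }

# The walk starts at the anchor and stops at the first matching ancestor.
found = parent_until_condition(anchor, is_article)

# With no match, the walk ends with nil once it reaches the direct child of html.
missing = parent_until_condition(anchor, is_section)

puts found.name      # article
puts missing.inspect # nil
```

Note that the guard on the first line returns before the condition is evaluated, so the node directly below html (here body) is never tested against the condition.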
.simplify_xpath(xpath) ⇒ Object
Simplify an XPath selector by removing the index notation.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 28

def self.simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end
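To see why removing the index notation matters, consider sibling elements whose paths differ only by position. The sketch below copies the one-line substitution into a top-level method so it runs without html2rss installed; the sample paths are made up for illustration.

```ruby
# Standalone copy of the substitution from the listing above.
def simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

# Hypothetical paths of three anchors in sibling article elements.
paths = [
  '/html/body/main/article[1]/h2/a',
  '/html/body/main/article[2]/h2/a',
  '/html/body/main/article[3]/h2/a'
]

simplified = paths.map { |path| simplify_xpath(path) }

# All three paths collapse to the same selector.
puts simplified.uniq
# /html/body/main/article/h2/a
```

Once the positional indices are gone, structurally identical anchors share one selector, which is what allows them to be counted together.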
Instance Method Details
#each {|article_hash| ... } ⇒ Enumerator
Yields each scraped article hash. Returns an Enumerator for the scraped articles when no block is given.
# File 'lib/html2rss/auto_source/scraper/html.rb', line 43

def each
  return enum_for(:each) unless block_given?

  return if frequent_selectors.empty?

  frequent_selectors.each do |selector|
    parsed_body.xpath(selector).each do |selected_tag|
      article_tag = self.class.parent_until_condition(selected_tag, method(:article_condition))
      article_hash = SemanticHtml::Extractor.new(article_tag, url: @url).call

      yield article_hash if article_hash
    end
  end
end
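The `enum_for` guard and the conditional `yield` in the listing above are a common pattern for classes that include Enumerable. The sketch below isolates that pattern in a hypothetical `TitleList` class; it is an illustration only and does not use html2rss.

```ruby
# Hypothetical class that borrows only the structure of #each shown above.
class TitleList
  include Enumerable

  def initialize(titles)
    @titles = titles
  end

  def each
    # Without a block, hand back an Enumerator instead of iterating.
    return enum_for(:each) unless block_given?

    @titles.each do |title|
      # Mirrors `yield article_hash if article_hash`: nil results are skipped.
      yield title if title
    end
  end
end

list = TitleList.new(['First post', nil, 'Second post'])

puts list.each.class            # Enumerator
puts list.to_a.inspect          # ["First post", "Second post"]
puts TitleList.new([nil]).any?  # false
```

Because Enumerable builds `any?` on top of `each`, a check such as `.articles?` only needs to ask whether `each` yields at least one item.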
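#frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2) ⇒ Set<String>
Finds all anchors in root, counts their simplified XPath selectors, and returns the selectors that occur at least min_frequency times. The result is memoized, so arguments passed on later calls have no effect.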
# File 'lib/html2rss/auto_source/scraper/html.rb', line 62

def frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2)
  @frequent_selectors ||= begin
    root.traverse do |node|
      next if !node.element? || node.name != 'a'

      @css_selectors[self.class.simplify_xpath(node.path)] += 1
    end

    @css_selectors.keys
                  .select { |selector| (@css_selectors[selector]).to_i >= min_frequency }
                  .to_set
  end
end
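The counting step can be sketched on plain strings. The example below assumes the anchors have already been reduced to simplified paths (the `anchor_paths` array is made-up input) and reuses the `Hash.new(0)` counter and the `min_frequency` threshold from the listing above.

```ruby
require 'set'

# Hypothetical input: the simplified XPath of every anchor found on a page.
anchor_paths = [
  '/html/body/nav/a',
  '/html/body/main/article/h2/a',
  '/html/body/main/article/h2/a',
  '/html/body/main/article/h2/a',
  '/html/body/footer/p/a'
]

min_frequency = 2

# Missing keys default to 0, so each path can be incremented directly.
counts = Hash.new(0)
anchor_paths.each { |path| counts[path] += 1 }

# Keep only the selectors that occur at least min_frequency times.
frequent = counts.keys.select { |path| counts[path] >= min_frequency }.to_set

puts frequent.to_a
# /html/body/main/article/h2/a
```

Selectors that appear only once, such as a single navigation or footer link in this sample, are dropped, while the repeated article structure survives.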