Class: Html2rss::AutoSource::Scraper::Html

Inherits:

Object

Object
Html2rss::AutoSource::Scraper::Html

show all

Includes:: Enumerable

Defined in:: lib/html2rss/auto_source/scraper/html.rb

Overview

Scrapes articles from HTML pages by finding similar structures around anchor tags in the parsed_body.

Instance Attribute Summary collapse

#parsed_body ⇒ Object readonly

Returns the value of attribute parsed_body.

Class Method Summary collapse

.articles?(parsed_body) ⇒ Boolean
.parent_until_condition(node, condition) ⇒ Object
.simplify_xpath(xpath) ⇒ Object

Simplify an XPath selector by removing the index notation.

Instance Method Summary collapse

#each {|The| ... } ⇒ Enumerator

Enumerator for the scraped articles.
#frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2) ⇒ Set<String>

Find all the anchors in root.
#initialize(parsed_body, url:) ⇒ Html constructor

A new instance of Html.

Constructor Details

#initialize(parsed_body, url:) ⇒ `Html`

Returns a new instance of Html.

# File 'lib/html2rss/auto_source/scraper/html.rb', line 32

def initialize(parsed_body, url:)
  @parsed_body = parsed_body
  @url = url
  @css_selectors = Hash.new(0)
end

Instance Attribute Details

#parsed_body ⇒ `Object` (readonly)

Returns the value of attribute parsed_body.



38
39
40

# File 'lib/html2rss/auto_source/scraper/html.rb', line 38

def parsed_body
  @parsed_body
end

Class Method Details

.articles?(parsed_body) ⇒ `Boolean`

Returns:

(Boolean)



15
16
17

# File 'lib/html2rss/auto_source/scraper/html.rb', line 15

def self.articles?(parsed_body)
  new(parsed_body, url: '').any?
end

.parent_until_condition(node, condition) ⇒ `Object`

# File 'lib/html2rss/auto_source/scraper/html.rb', line 19

def self.parent_until_condition(node, condition)
  return nil if !node || node.parent.name == 'html'
  return node if condition.call(node)

  parent_until_condition(node.parent, condition)
end

.simplify_xpath(xpath) ⇒ `Object`

Simplify an XPath selector by removing the index notation.



28
29
30

# File 'lib/html2rss/auto_source/scraper/html.rb', line 28

def self.simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

Instance Method Details

#each {|The| ... } ⇒ `Enumerator`

Returns Enumerator for the scraped articles.

Yield Parameters:

The (Hash) —

scraped article hash

Returns:

(Enumerator) —

Enumerator for the scraped articles

# File 'lib/html2rss/auto_source/scraper/html.rb', line 43

def each
  return enum_for(:each) unless block_given?

  return if frequent_selectors.empty?

  frequent_selectors.each do |selector|
    parsed_body.xpath(selector).each do |selected_tag|
      article_tag = self.class.parent_until_condition(selected_tag, method(:article_condition))
      article_hash = SemanticHtml::Extractor.new(article_tag, url: @url).call

      yield article_hash if article_hash
    end
  end
end

#frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2) ⇒ `Set<String>`

Find all the anchors in root.

Parameters:

root (Nokogiri::XML::Node) (defaults to: @parsed_body.at_css('body')) —

The root node to search for anchors

Returns:

(Set<String>) —

The set of CSS selectors which exist at least min_frequency times

# File 'lib/html2rss/auto_source/scraper/html.rb', line 62

def frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2)
  @frequent_selectors ||= begin
    root.traverse do |node|
      next if !node.element? || node.name != 'a'

      @css_selectors[self.class.simplify_xpath(node.path)] += 1
    end

    @css_selectors.keys
                  .select { |selector| (@css_selectors[selector]).to_i >= min_frequency }
                  .to_set
  end
end

Class: Html2rss::AutoSource::Scraper::Html

Overview

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(parsed_body, url:) ⇒ Html

Instance Attribute Details

#parsed_body ⇒ Object (readonly)

Class Method Details

.articles?(parsed_body) ⇒ Boolean

.parent_until_condition(node, condition) ⇒ Object

.simplify_xpath(xpath) ⇒ Object

Instance Method Details

#each {|The| ... } ⇒ Enumerator

#frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2) ⇒ Set<String>

#initialize(parsed_body, url:) ⇒ `Html`

#parsed_body ⇒ `Object` (readonly)

.articles?(parsed_body) ⇒ `Boolean`

.parent_until_condition(node, condition) ⇒ `Object`

.simplify_xpath(xpath) ⇒ `Object`

#each {|The| ... } ⇒ `Enumerator`

#frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2) ⇒ `Set<String>`