Class: Html2rss::AutoSource::Scraper::SemanticHtml
- Inherits: Object
  - Object
    - Html2rss::AutoSource::Scraper::SemanticHtml
- Includes: Enumerable
- Defined in:
  lib/html2rss/auto_source/scraper/semantic_html.rb,
  lib/html2rss/auto_source/scraper/semantic_html/image.rb,
  lib/html2rss/auto_source/scraper/semantic_html/extractor.rb
Overview
Scrapes articles by looking for common container elements (article, section, li, tr) that contain an <a> tag with an href attribute.
Defined Under Namespace
Constant Summary
- ANCHOR_TAG_SELECTORS =
  Map of parent element names to CSS selectors for finding <a href> tags.
  {
    'section' => ['section :not(section) a[href]'],
    'tr' => ['table tr :not(tr) a[href]'],
    'article' => [
      'article :not(article) a[href]',
      'article a[href]'
    ],
    'li' => [
      'ul > li :not(li) a[href]',
      'ol > li :not(li) a[href]'
    ]
  }.freeze
Instance Attribute Summary
- #parsed_body ⇒ Object (readonly)
  Returns the value of attribute parsed_body.
Class Method Summary
- .anchor_tag_selector_pairs ⇒ Array<[String, String]>
  Returns an array of [tag_name, selector] pairs.
- .articles?(parsed_body) ⇒ Boolean
  Checks whether the parsed_body contains articles.
- .find_closest_selector(current_tag, selector: 'a[href]:not([href=""])') ⇒ Nokogiri::XML::Node?
  Finds the closest matching selector upwards in the DOM tree.
- .find_closest_selector_upwards(current_tag, selector:) ⇒ Nokogiri::XML::Node?
  Helper method to find a matching selector upwards.
- .find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html') ⇒ Nokogiri::XML::Node
  Finds the closest ancestor tag matching the specified tag name.
Instance Method Summary
- #each {|article_hash| ... } ⇒ Enumerator
  Enumerator for the scraped articles.
- #initialize(parsed_body, url:) ⇒ SemanticHtml (constructor)
  Returns a new instance of SemanticHtml.
Constructor Details
#initialize(parsed_body, url:) ⇒ SemanticHtml
Returns a new instance of SemanticHtml.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 93
def initialize(parsed_body, url:)
  @parsed_body = parsed_body
  @url = url
end
Instance Attribute Details
#parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 98
def parsed_body
  @parsed_body
end
Class Method Details
.anchor_tag_selector_pairs ⇒ Array<[String, String]>
Returns an array of [tag_name, selector] pairs.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 87
def self.anchor_tag_selector_pairs
  # Expand each tag name's selector list into flat [tag_name, selector] pairs.
  ANCHOR_TAG_SELECTORS.flat_map do |tag_name, selectors|
    selectors.map { |selector| [tag_name, selector] }
  end
end
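To make the shape of the return value concrete, here is a minimal sketch that applies the same flat_map expansion to a reduced selector map. SELECTORS, selector_pairs and PAIRS are illustrative names that are not part of html2rss; the map is a shortened copy of ANCHOR_TAG_SELECTORS.

```ruby
# Reduced, hypothetical stand-in for ANCHOR_TAG_SELECTORS.
SELECTORS = {
  'section' => ['section :not(section) a[href]'],
  'article' => ['article :not(article) a[href]', 'article a[href]']
}.freeze

# Expands each tag name's selector list into flat [tag_name, selector] pairs.
def selector_pairs(map)
  map.flat_map do |tag_name, selectors|
    selectors.map { |selector| [tag_name, selector] }
  end
end

PAIRS = selector_pairs(SELECTORS)
PAIRS.each { |tag_name, selector| puts "#{tag_name}: #{selector}" }
```

A tag name with several selectors appears once per selector, so 'article' contributes two pairs here.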
.articles?(parsed_body) ⇒ Boolean
Checks whether the parsed_body contains articles. Returns false when parsed_body is nil.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 36
def self.articles?(parsed_body)
  return false unless parsed_body

  # True as soon as any selector matches at least one element.
  ANCHOR_TAG_SELECTORS.each_value do |selectors|
    return true if selectors.any? { |selector| parsed_body.at_css(selector) }
  end
  false
end
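The check can be sketched without a real document. Everything in the following block is hypothetical: FakeBody stands in for a parsed Nokogiri document and only imitates at_css, SELECTOR_MAP is a shortened copy of the constant, and articles_present? mirrors the logic of .articles? rather than calling it.

```ruby
# Hypothetical stand-in for a parsed document: at_css returns a truthy value
# only for selectors listed in `matching`, and nil otherwise.
FakeBody = Struct.new(:matching) do
  def at_css(selector)
    matching.include?(selector) ? selector : nil
  end
end

# Reduced, hypothetical selector map in the shape of ANCHOR_TAG_SELECTORS.
SELECTOR_MAP = {
  'article' => ['article :not(article) a[href]', 'article a[href]'],
  'li' => ['ul > li :not(li) a[href]', 'ol > li :not(li) a[href]']
}.freeze

# Mirrors .articles?: nil-safe, and true as soon as one selector matches.
def articles_present?(body)
  return false unless body

  SELECTOR_MAP.each_value.any? do |selectors|
    selectors.any? { |selector| body.at_css(selector) }
  end
end

puts articles_present?(nil)
puts articles_present?(FakeBody.new(['article a[href]']))
puts articles_present?(FakeBody.new([]))
```

With a real page, the argument would presumably be a parsed Nokogiri document rather than the stand-in used here.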
.find_closest_selector(current_tag, selector: 'a[href]:not([href=""])') ⇒ Nokogiri::XML::Node?
Finds the closest element matching the selector: first among the descendants of current_tag, then within the subtree of each successive ancestor.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 66
def self.find_closest_selector(current_tag, selector: 'a[href]:not([href=""])')
  current_tag.at_css(selector) || find_closest_selector_upwards(current_tag, selector:)
end
.find_closest_selector_upwards(current_tag, selector:) ⇒ Nokogiri::XML::Node?
Helper that walks from current_tag up through its ancestors and returns the first match found within any of them, or nil if nothing matches.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 74
def self.find_closest_selector_upwards(current_tag, selector:)
  while current_tag
    found = current_tag.at_css(selector)
    return found if found

    # Stop once the current object no longer exposes a parent.
    return nil unless current_tag.respond_to?(:parent)

    current_tag = current_tag.parent
  end
end
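The upward search can be traced on a toy tree. This is a sketch under loud assumptions: TreeNode is a hypothetical stand-in for a Nokogiri node, its at_css accepts only a bare tag name and searches descendants depth-first, and closest_upwards copies the loop above instead of calling the library.

```ruby
# Hypothetical stand-in for a DOM node. at_css is simplified: the "selector"
# is a bare tag name, and only descendants are searched (depth-first).
class TreeNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name, children = [])
    @name = name
    @children = children
    children.each { |child| child.parent = self }
  end

  def at_css(selector)
    children.each do |child|
      return child if child.name == selector

      found = child.at_css(selector)
      return found if found
    end
    nil
  end
end

# Same loop as .find_closest_selector_upwards, against the stand-in nodes.
def closest_upwards(current_tag, selector:)
  while current_tag
    found = current_tag.at_css(selector)
    return found if found
    return nil unless current_tag.respond_to?(:parent)

    current_tag = current_tag.parent
  end
end

# div > [h2 > span, a]
SPAN = TreeNode.new('span')
ANCHOR = TreeNode.new('a')
CARD = TreeNode.new('div', [TreeNode.new('h2', [SPAN]), ANCHOR])

puts closest_upwards(SPAN, selector: 'a').equal?(ANCHOR)
puts closest_upwards(SPAN, selector: 'img').inspect
```

Starting at the span, neither the span nor the h2 contains an a element, so the match is found one level higher, inside the div. When nothing matches anywhere, the walk runs past the root and the result is nil.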
.find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html') ⇒ Nokogiri::XML::Node
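Finds the closest ancestor tag matching the specified tag name. Returns current_tag itself when it already matches; otherwise the upward walk ends at the first ancestor named tag_name or stop_tag, and that node is returned.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 50
def self.find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html')
  return current_tag if current_tag.name == tag_name

  # Tag names that end the upward walk (the local variable name was lost in
  # extraction; `stop_tags` is a reconstruction).
  stop_tags = Set[tag_name, stop_tag]

  while current_tag.respond_to?(:parent) && !stop_tags.member?(current_tag.name)
    current_tag = current_tag.parent
  end

  current_tag
end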
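A minimal sketch of this ancestor walk, assuming a hypothetical Node struct that carries only a tag name and a parent link in place of a Nokogiri node. The method body follows the listing above; the local variable name stop_tags is an assumption.

```ruby
require 'set'

# Hypothetical stand-in for a DOM node: only a tag name and a parent link.
Node = Struct.new(:name, :parent)

# Same walk as .find_tag_in_ancestors, written against the stand-in nodes.
def find_tag_in_ancestors(current_tag, tag_name, stop_tag: 'html')
  return current_tag if current_tag.name == tag_name

  stop_tags = Set[tag_name, stop_tag]

  while current_tag.respond_to?(:parent) && !stop_tags.member?(current_tag.name)
    current_tag = current_tag.parent
  end

  current_tag
end

# html > body > article > h2 > a
HTML = Node.new('html', nil)
BODY = Node.new('body', HTML)
ARTICLE = Node.new('article', BODY)
HEADING = Node.new('h2', ARTICLE)
LINK = Node.new('a', HEADING)

puts find_tag_in_ancestors(LINK, 'article').name # walks a -> h2 -> article
puts find_tag_in_ancestors(LINK, 'section').name # no <section>: stops at html
```

Note that a failed search does not return nil: when no ancestor has the requested name, the caller gets the stop_tag element back instead.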
Instance Method Details
#each {|article_hash| ... } ⇒ Enumerator
Yields one article hash per scraped article. Returns an Enumerator when no block is given.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 103
def each
  return enum_for(:each) unless block_given?

  SemanticHtml.anchor_tag_selector_pairs.each do |tag_name, selector|
    parsed_body.css(selector).each do |selected_tag|
      # Resolve the matched anchor to its enclosing container element.
      article_tag = SemanticHtml.find_tag_in_ancestors(selected_tag, tag_name)
      article_hash = Extractor.new(article_tag, url: @url).call

      yield article_hash if article_hash
    end
  end
end
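The enum_for / block_given? pattern used by #each, combined with Enumerable, can be sketched in isolation. TitleScraper is a hypothetical miniature class that is not part of html2rss; it yields a hash per non-blank title and skips blank ones, much as #each skips candidates for which extraction returns nothing.

```ruby
# Hypothetical miniature scraper: yields one hash per non-blank title,
# following the same enum_for / block_given? pattern as #each.
class TitleScraper
  include Enumerable

  def initialize(titles)
    @titles = titles
  end

  def each
    return enum_for(:each) unless block_given?

    @titles.each do |title|
      stripped = title.strip
      article_hash = stripped.empty? ? nil : { title: stripped }

      yield article_hash if article_hash # skip candidates that produce nothing
    end
  end
end

SCRAPER = TitleScraper.new(['First post', '   ', 'Second post'])

puts SCRAPER.map { |article| article[:title] }.inspect
puts SCRAPER.each.class
```

Because the class includes Enumerable, defining each is enough to get map, select, to_a and the other Enumerable methods, and calling each without a block hands back an Enumerator for lazy or external iteration.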