Module: RateBeer::Scraping
- Included in:
- Beer::Beer, Brewery::BeerList, Brewery::Brewery, Location, Search, Style
- Defined in:
- lib/ratebeer/scraping.rb
Overview
The Scraping module contains methods that assist with scraping pages from RateBeer.com and with handling the results.
Defined Under Namespace
Classes: PageNotFoundError
Instance Attribute Summary
- #id ⇒ Object (readonly)
  Returns the value of attribute id.
Class Method Summary
- .included(base) ⇒ Object
  Run method on inclusion in class.
- .nbsp ⇒ Object
  Emulate the &nbsp; character for stripping, substitution, etc.
- .noko_doc(url) ⇒ Object
  Create Nokogiri doc from url.
Instance Method Summary
- #==(other_entity) ⇒ Object
- #fix_characters(string) ⇒ Object
  Fix characters in string scraped from website.
- #full_details ⇒ Object
  Return full details of the scraped entity in a Hash.
- #id_from_link(node) ⇒ Object
  Extracts an ID# from an <a> element containing a link to an entity.
- #initialize(id, name: nil, **options) ⇒ Object
  Create RateBeer::Scraper instance.
- #inspect ⇒ Object
- #page_count(doc) ⇒ Integer
  Determine the number of pages in a document.
- #pagination?(doc) ⇒ Boolean
  Determine if data is paginated, or not.
- #post_request(url, params) ⇒ Object
  Make POST request to RateBeer form.
- #symbolize_text(text) ⇒ Object
  Convert text keys to symbols.
- #to_s ⇒ Object
- #url ⇒ Object
Instance Attribute Details
#id ⇒ Object (readonly)
Returns the value of attribute id.
    # File 'lib/ratebeer/scraping.rb', line 13
    def id
      @id
    end
Class Method Details
.included(base) ⇒ Object
Run method on inclusion in class.
    # File 'lib/ratebeer/scraping.rb', line 16
    def self.included(base)
      if base.respond_to?(:data_keys)
        base.data_keys.each do |attr|
          define_method(attr) do
            send("scrape_#{attr}") unless instance_variable_defined?("@#{attr}")
            instance_variable_get("@#{attr}")
          end
        end
      end
    end
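The hook above defines one lazy reader per entry in data_keys: the first call runs the matching scrape_ method, and later calls return the stored instance variable. A minimal sketch of that pattern follows; LazyScraping, DemoBeer, scrape_calls, and the "Demo Stout" value are hypothetical stand-ins, not part of the gem.

```ruby
# Hypothetical stand-in for the module, reduced to the included hook.
module LazyScraping
  def self.included(base)
    return unless base.respond_to?(:data_keys)

    base.data_keys.each do |attr|
      # define_method runs with self == LazyScraping, so the reader
      # is added to the module and reaches the class through include.
      define_method(attr) do
        send("scrape_#{attr}") unless instance_variable_defined?("@#{attr}")
        instance_variable_get("@#{attr}")
      end
    end
  end
end

# Hypothetical including class.
class DemoBeer
  # data_keys must exist before the include, or the hook skips the class.
  def self.data_keys
    [:name]
  end

  include LazyScraping

  attr_reader :scrape_calls

  def initialize
    @scrape_calls = 0
  end

  private

  # Stands in for a real page scrape; counts how often it runs.
  def scrape_name
    @scrape_calls += 1
    @name = "Demo Stout"
  end
end

beer = DemoBeer.new
beer.name              # first call triggers scrape_name
beer.name              # second call reads @name directly
puts beer.scrape_calls # prints 1
```

Because the hook checks respond_to?(:data_keys) at include time, a class that declares data_keys after the include line would get no readers in this sketch.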
.nbsp ⇒ Object
Emulate the &nbsp; character for stripping, substitution, etc.
    # File 'lib/ratebeer/scraping.rb', line 115
    def nbsp
      Nokogiri::HTML("&nbsp;").text
    end
.noko_doc(url) ⇒ Object
Create Nokogiri doc from url.
    # File 'lib/ratebeer/scraping.rb', line 103
    def noko_doc(url)
      begin
        Nokogiri::HTML(open(url).read)
      rescue OpenURI::HTTPError => msg
        raise PageNotFoundError.new("Page not found - #{url}")
      end
    end
Instance Method Details
#==(other_entity) ⇒ Object
    # File 'lib/ratebeer/scraping.rb', line 53
    def ==(other_entity)
      other_entity.is_a?(self.class) && id == other_entity.id
    end
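Since the comparison uses is_a?(self.class), equality depends on which side is the receiver when subclasses are involved. The sketch below illustrates this with two hypothetical classes (DemoEntity and DemoSpecial) that copy only the == logic shown above.

```ruby
# Hypothetical entity carrying only what == relies on.
class DemoEntity
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def ==(other_entity)
    other_entity.is_a?(self.class) && id == other_entity.id
  end
end

# Hypothetical subclass, used to show the receiver-dependent result.
class DemoSpecial < DemoEntity; end

a = DemoEntity.new(1)
b = DemoEntity.new(1)
c = DemoSpecial.new(1)

puts a == b # true: same class, same id
puts a == c # true: c is a DemoEntity
puts c == a # false: a is not a DemoSpecial
```

Only == is defined here, so methods that rely on eql? and hash (such as Array#uniq or Hash key lookup) would still treat a and b as distinct objects.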
#fix_characters(string) ⇒ Object
Fix characters in string scraped from website.
This method substitutes problematic characters found in strings scraped from RateBeer.com
    # File 'lib/ratebeer/scraping.rb', line 132
    def fix_characters(string)
      string = string.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
      characters = { nbsp     => " ",
                     "\u0093" => "ž",
                     "\u0092" => "'",
                     "\u0096" => "–",
                     / {2,}/ => " " }
      characters.each { |c, r| string.gsub!(c, r) }
      string.strip
    end
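To see the substitutions in action without Nokogiri, the following sketch copies the method and replaces the nbsp helper with a literal U+00A0 constant. That replacement is an assumption (it relies on &nbsp; decoding to U+00A0), and the input string is invented for illustration.

```ruby
# Assumption: the nbsp helper yields U+00A0, so a literal stands in for it.
NBSP = "\u00A0"

def fix_characters(string)
  string = string.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
  characters = { NBSP     => " ",   # non-breaking space -> plain space
                 "\u0093" => "ž",
                 "\u0092" => "'",   # C1 control left over from Windows-1252 text
                 "\u0096" => "–",
                 / {2,}/  => " " }  # collapse runs of spaces last
  characters.each { |c, r| string.gsub!(c, r) }
  string.strip
end

raw = "  Demo\u00A0Brewery\u0092s   Pale \u0096 Ale  "
puts fix_characters(raw) # prints: Demo Brewery's Pale – Ale
```

The replacements run in hash insertion order, so the space-collapsing pattern also cleans up any doubled spaces produced by the earlier substitutions.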
#full_details ⇒ Object
Return full details of the scraped entity in a Hash.
    # File 'lib/ratebeer/scraping.rb', line 70
    def full_details
      data = self.class
                 .data_keys
                 .map { |k| [k, send("#{k}")] }
                 .to_h

      { id: id, url: url }.merge(data)
    end
#id_from_link(node) ⇒ Object
Extracts an ID# from an <a> element containing a link to an entity.
    # File 'lib/ratebeer/scraping.rb', line 64
    def id_from_link(node)
      node.attribute('href').value.split('/').last.to_i
    end
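The extraction is plain string handling on the href value, which the sketch below isolates. DemoNode and DemoAttr are hypothetical stand-ins that mimic only the attribute('href').value chain of a parsed node, and the paths are invented examples rather than real RateBeer URLs.

```ruby
# Hypothetical stand-ins for a parsed <a> node and its href attribute.
DemoAttr = Struct.new(:value)
DemoNode = Struct.new(:href) do
  def attribute(name)
    DemoAttr.new(href) if name == 'href'
  end
end

def id_from_link(node)
  node.attribute('href').value.split('/').last.to_i
end

# split('/') drops the trailing empty segment, so the ID is the last element.
puts id_from_link(DemoNode.new('/beer/demo-stout/12345/')) # prints 12345

# A non-numeric final segment falls through to_i and yields 0.
puts id_from_link(DemoNode.new('/brewers/demo-brewery/'))  # prints 0
```

A return value of 0 therefore signals a link whose final path segment was not numeric, which callers may want to check for.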
#initialize(id, name: nil, **options) ⇒ Object
Create RateBeer::Scraper instance.
Requires an ID#, and optionally accepts a name and options parameters.
    # File 'lib/ratebeer/scraping.rb', line 35
    def initialize(id, name: nil, **options)
      @id   = id
      @name = name unless name.nil?
      options.each do |k, v|
        instance_variable_set("@#{k.to_s}", v)
      end
    end
#inspect ⇒ Object
    # File 'lib/ratebeer/scraping.rb', line 43
    def inspect
      val = "#<#{self.class} ##{@id}"
      val << " - #{@name}" if instance_variable_defined?("@name")
      val << ">"
    end
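The constructor and inspect work together: @name is only set when a name is supplied, and inspect appends it only when that instance variable exists. A minimal sketch with a hypothetical DemoEntity class and invented values:

```ruby
# Hypothetical class reproducing the constructor and inspect logic.
class DemoEntity
  def initialize(id, name: nil, **options)
    @id   = id
    @name = name unless name.nil?
    # Every extra keyword becomes an instance variable of the same name.
    options.each do |k, v|
      instance_variable_set("@#{k}", v)
    end
  end

  def inspect
    val = "#<#{self.class} ##{@id}"
    val << " - #{@name}" if instance_variable_defined?("@name")
    val << ">"
  end
end

anonymous = DemoEntity.new(42)
named     = DemoEntity.new(42, name: "Demo IPA", style: "IPA")

puts anonymous.inspect                    # prints #<DemoEntity #42>
puts named.inspect                        # prints #<DemoEntity #42 - Demo IPA>
puts named.instance_variable_get("@style") # prints IPA
```

Pre-seeding instance variables through the options hash means a lazy reader for the same attribute would find the value already present and skip scraping it.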
#page_count(doc) ⇒ Integer
Determine the number of pages in a document.
    # File 'lib/ratebeer/scraping.rb', line 93
    def page_count(doc)
      doc.at_css('.pagination') && doc.at_css('.pagination')
        .css('b')
        .map(&:text)
        .map(&:to_i)
        .max
    end
#pagination?(doc) ⇒ Boolean
Determine if data is paginated, or not.
    # File 'lib/ratebeer/scraping.rb', line 84
    def pagination?(doc)
      !page_count(doc).nil?
    end
#post_request(url, params) ⇒ Object
Make POST request to RateBeer form. Return a Nokogiri doc.
    # File 'lib/ratebeer/scraping.rb', line 145
    def post_request(url, params)
      res = Net::HTTP.post_form(url, params)
      Nokogiri::HTML(res.body)
    end
#symbolize_text(text) ⇒ Object
Convert text keys to symbols.
    # File 'lib/ratebeer/scraping.rb', line 123
    def symbolize_text(text)
      text.downcase.gsub(' ', '_').gsub('.', '').to_sym
    end
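A short illustration of the conversion, using the method body as documented above. The label strings are invented examples of the kind of text a scraped table heading might contain.

```ruby
def symbolize_text(text)
  # Lowercase, turn spaces into underscores, drop periods, then symbolize.
  text.downcase.gsub(' ', '_').gsub('.', '').to_sym
end

p symbolize_text("ABV")           # prints :abv
p symbolize_text("Avg. Rating")   # prints :avg_rating
p symbolize_text("Est. Calories") # prints :est_calories
```

Both gsub calls take string patterns, so the period is matched literally rather than as a regular-expression wildcard.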
#to_s ⇒ Object
    # File 'lib/ratebeer/scraping.rb', line 49
    def to_s
      inspect
    end
#url ⇒ Object
    # File 'lib/ratebeer/scraping.rb', line 57
    def url
      @url ||= if respond_to?("#{demodularized_class_name.downcase}_url", id)
                 send("#{demodularized_class_name.downcase}_url", id)
               end
    end
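The method dispatches by name: it derives a helper name such as brewery_url from the class name and calls it if the object responds to it. The sketch below shows that convention under several assumptions: DemoUrl, the Demo classes, the example.com address, and the body of demodularized_class_name (which is not documented on this page) are all hypothetical.

```ruby
# Hypothetical module holding the url logic plus an assumed helper.
module DemoUrl
  def url
    @url ||= if respond_to?("#{demodularized_class_name.downcase}_url", id)
               send("#{demodularized_class_name.downcase}_url", id)
             end
  end

  private

  # Assumed implementation: strip the namespace, "Demo::Brewery" -> "Brewery".
  def demodularized_class_name
    self.class.name.split('::').last
  end
end

module Demo
  class Brewery
    include DemoUrl
    attr_reader :id

    def initialize(id)
      @id = id
    end

    private

    # Found by name from url; the address is a placeholder.
    def brewery_url(id)
      "https://example.com/brewers/#{id}/"
    end
  end

  # No unknown_url helper exists, so url has nothing to dispatch to.
  class Unknown
    include DemoUrl
    attr_reader :id

    def initialize(id)
      @id = id
    end
  end
end

puts Demo::Brewery.new(7).url # prints https://example.com/brewers/7/
p Demo::Unknown.new(7).url    # prints nil
```

In Ruby, the second argument to respond_to? is include_all, so passing a truthy id makes the check consider private helpers as well, which is why the private brewery_url is found in this sketch.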