Class: CraigScrape::Scraper
- Inherits:
- CraigScrape::Scraper < Object
- Defined in:
- lib/scraper.rb
Overview
Scraper is a general-purpose base class for all libcraigscrape objects. It handles all HTTP-related functionality and adds helpers for eager-loading HTTP-backed objects and for general HTML handling. It also contains the HTTP-related cattr_accessors:
logger - a Logger object to send HTTP debug notices to. Defaults to nil
Defined Under Namespace
Classes: BadConstructionError, BadUrlError, FetchError, ParseError, ResourceNotFoundError
Constant Summary
- URL_PARTS =
/^(?:([^\:]+)\:\/\/([^\/]*))?(.*)$/
- HTML_TAG =
/<\/?[^>]*>/
- HTML_ENCODING =
We have to specify this to Nokogiri explicitly. Left to itself, it sometimes tries to guess the encoding, and craigslist users sometimes post unexpected bytes.
"UTF-8"
- HTTP_HEADERS =
{ "Cache-Control" => "no-cache", "Pragma" => "no-cache", "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19"}
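To make the two regular expressions above concrete, here is a minimal sketch of what they match. The patterns are copied from the constant summary; the helper names `split_url` and `strip_tags` are hypothetical and are not part of libcraigscrape, and the way the library itself applies these constants is not shown on this page.

```ruby
# Patterns copied verbatim from the constant summary.
URL_PARTS = /^(?:([^\:]+)\:\/\/([^\/]*))?(.*)$/
HTML_TAG  = /<\/?[^>]*>/

# Hypothetical helper: split a URL into scheme, host and remaining path.
# Scheme and host are nil when the URL is relative.
def split_url(url)
  scheme, host, path = URL_PARTS.match(url).captures
  { scheme: scheme, host: host, path: path }
end

# Hypothetical helper: remove anything that looks like an HTML tag.
def strip_tags(html)
  html.gsub(HTML_TAG, '')
end

absolute = split_url('http://miami.craigslist.org/search/sss?query=bike')
relative = split_url('/search/sss')
plain    = strip_tags('<b>Road bike</b> for <i>sale</i>')

puts absolute.inspect
puts relative.inspect
puts plain
```

Because the scheme and host group is optional, the same pattern accepts both absolute and relative URLs; only the third capture is always present.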
Instance Attribute Summary
-
#url ⇒ Object
readonly
Returns the full url that corresponds to this resource.
Instance Method Summary
-
#attributes ⇒ Object
This method is mostly useful for our specs, but it’s included in case anyone else wants it.
-
#downloaded? ⇒ Boolean
Indicates whether the resource has yet been retrieved from its associated url.
-
#initialize(init_via = nil) ⇒ Scraper
constructor
Scraper Objects can be created from either a full URL (string), or a Hash.
-
#uri ⇒ Object
A URI object corresponding to this Scraped URL.
Constructor Details
#initialize(init_via = nil) ⇒ Scraper
Scraper Objects can be created from either a full URL (string), or a Hash. Currently, this initializer isn’t intended to be called by libcraigscrape API users, though if you know what you’re doing, feel free to try it out.
A URL string can use either the ‘http://’ scheme or the ‘file://’ scheme.
When constructing from a hash, each key in the hash sets the instance variable of the same name. This is useful for creating an object without actually making an HTTP request: it sets up the object before it eager-loads any values not already passed in by the constructor hash. The :url key is optional, but if you’re going to set this object up for eager-loading, be sure to include it in your hash. Otherwise the eager load will fail.
# File 'lib/scraper.rb', line 65
def initialize(init_via = nil)
  if init_via.nil?
    # Do nothing - possibly not a great idea, but we'll allow it
  elsif init_via.kind_of? String
    @url = init_via
  elsif init_via.kind_of? Hash
    init_via.each_pair{|k,v| instance_variable_set "@#{k}", v}
  else
    raise BadConstructionError, ("Unrecognized parameter passed to %s.new %s}" % [self.class.to_s, init_via.class.inspect])
  end
end
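The three construction paths described above can be tried without installing the gem. The following is a rough sketch only: `SketchScraper` and the top-level `BadConstructionError` are stand-ins invented for this example (the real classes live under CraigScrape::Scraper), and the `title` key is an arbitrary illustrative attribute.

```ruby
# Stand-in error class; the real one is CraigScrape::Scraper::BadConstructionError.
class BadConstructionError < StandardError; end

# Stand-in for CraigScrape::Scraper that mirrors the documented constructor logic.
class SketchScraper
  attr_reader :url

  def initialize(init_via = nil)
    case init_via
    when nil    then nil # allowed: nothing is set
    when String then @url = init_via
    when Hash   then init_via.each_pair { |k, v| instance_variable_set("@#{k}", v) }
    else
      raise BadConstructionError,
            format('Unrecognized parameter passed to %s.new %s', self.class, init_via.class.inspect)
    end
  end
end

# From a URL string: only @url is set.
from_string = SketchScraper.new('http://miami.craigslist.org/sss/')

# From a hash: every key becomes an instance variable, and no request is made.
from_hash = SketchScraper.new(url: 'http://miami.craigslist.org/sss/', title: 'Road bike')

# Anything else is rejected.
begin
  SketchScraper.new(42)
rescue BadConstructionError => e
  rejected = e.message
end

puts from_string.url
puts from_hash.instance_variable_get(:@title)
puts rejected
```

Note that the hash path accepts any key, so a misspelled key silently creates an unused instance variable rather than raising.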
Instance Attribute Details
#url ⇒ Object (readonly)
Returns the full url that corresponds to this resource
# File 'lib/scraper.rb', line 37
def url
  @url
end
Instance Method Details
#attributes ⇒ Object
This method is mostly useful for our specs, but it’s included in case anyone else wants it. It returns all currently defined instance variables. It probably doesn’t do what you think: use it only to determine what the object has parsed so far. (It does not include parseable attributes that have yet to be determined.)
# File 'lib/scraper.rb', line 93
def attributes
  Hash[self.instance_variables.collect{|i|
    [i.to_s.tr('@','').to_sym, instance_variable_get(i) ] }]
end
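As an illustration of the caveat above, this sketch shows that the returned hash contains only what has been set so far. `AttributesSketch` is a hypothetical stand-in that reuses the documented method body; `price` is an invented attribute name used to show what is absent.

```ruby
# Stand-in class: hash-based constructor plus the documented #attributes body.
class AttributesSketch
  def initialize(values = {})
    values.each_pair { |k, v| instance_variable_set("@#{k}", v) }
  end

  # Maps every currently defined instance variable to a symbol key.
  def attributes
    Hash[instance_variables.collect { |i| [i.to_s.tr('@', '').to_sym, instance_variable_get(i)] }]
  end
end

partial = AttributesSketch.new(url: 'http://miami.craigslist.org/sss/', title: 'Road bike')
snapshot = partial.attributes

puts snapshot.inspect
puts snapshot.key?(:price) # nothing has set @price yet
```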
#downloaded? ⇒ Boolean
Indicates whether the resource has yet been retrieved from its associated url. This is useful to distinguish whether the instance was instantiated for the purpose of an eager-load, but hasn’t yet been fetched.
# File 'lib/scraper.rb', line 80
def downloaded?; !@html_source.nil?; end
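The check depends only on whether @html_source is set, which a small stand-in can demonstrate. `DownloadSketch` is hypothetical, and seeding `html_source` through the constructor hash is an assumption made for this sketch; how the library populates it during a real fetch is not shown on this page.

```ruby
# Stand-in class: hash-based constructor plus the documented #downloaded? body.
class DownloadSketch
  def initialize(values = {})
    values.each_pair { |k, v| instance_variable_set("@#{k}", v) }
  end

  def downloaded?
    !@html_source.nil?
  end
end

# Set up for an eager-load: a URL is known, but no source has been stored.
pending = DownloadSketch.new(url: 'http://miami.craigslist.org/sss/')

# Assumption for the sketch: simulate a completed fetch by seeding @html_source.
fetched = DownloadSketch.new(url: 'file:///tmp/listing.html', html_source: '<html></html>')

puts pending.downloaded?
puts fetched.downloaded?
```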
#uri ⇒ Object
A URI object corresponding to this Scraped URL
# File 'lib/scraper.rb', line 83
def uri
  @uri ||= URI.parse @url if @url
  @uri
end
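The method above memoizes the parsed URI and returns nil when no URL was given. A minimal sketch of that behavior, using the standard library's URI.parse and a hypothetical stand-in class named `UriSketch`:

```ruby
require 'uri'

# Stand-in class with the documented #uri body.
class UriSketch
  def initialize(url = nil)
    @url = url
  end

  # Parses @url once and caches the result; stays nil without a URL.
  def uri
    @uri ||= URI.parse(@url) if @url
    @uri
  end
end

with_url    = UriSketch.new('http://miami.craigslist.org/search/sss?query=bike')
without_url = UriSketch.new

puts with_url.uri.host
puts with_url.uri.path
puts without_url.uri.inspect
```

Repeated calls return the same URI object, so callers can use `uri` freely without re-parsing the string.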