Class: Scrapes::Page
- Inherits:
- Object
- Includes:
- Hpricot::Extractors, RuleParser
- Defined in:
- lib/scrapes/page.rb
Overview
The Page class is used as a base class for scraping data out of one web page. To use it, you inherit from it and set up some rules. You can also use validators to ensure that the page was scraped correctly.
Setup
class MyPageScraper < Scrapes::Page
  rule :rule_name, blah
end
Scrapes::RuleParser explains the use of rules.
Auto Loading
Scrapes will automatically ‘require’ ruby files placed in a special ‘pages’ directory. The idea is to place one Scrapes::Page derived class per file in the pages directory, and have it required for you.
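A minimal sketch of the idea, not the loader Scrapes itself uses: the helper name require_pages, the DemoPage class, and the temporary directory standing in for 'pages' are all assumptions made for illustration.

```ruby
require 'tmpdir'

# Hypothetical helper: require every Ruby file found directly in +dir+.
# Files are sorted so that the load order is deterministic.
def require_pages(dir)
  Dir.glob(File.join(dir, '*.rb')).sort.each do |file|
    require File.expand_path(file)
  end
end

# Demonstration with a throwaway directory standing in for 'pages'.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'demo_page.rb'), "class DemoPage; end\n")
  loaded = require_pages(dir)
  puts "required #{loaded.size} file(s)"
end
```

Keeping one class per file means a new scraper becomes available simply by dropping its file into the directory.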
Validations
There are a few class methods that you can use to validate the contents you scraped from a given web page.
Constant Summary
- XSLTPROC = 'xsltproc'
Instance Attribute Summary
-
#hpricot ⇒ Object
Access the Hpricot object that the selectors are passed.
-
#session ⇒ Object
Access the session object that was used to fetch this page’s data.
-
#uri ⇒ Object
Access the URI where this page’s data came from.
Class Method Summary
-
.acts_as_array(method_to_call) ⇒ Object
Make Page.extract return an array by calling the given method.
-
.extract(data, uri, session, &block) ⇒ Object
Called by the crawler to process a web page.
-
.paginated ⇒ Object
If the page that you are parsing is paginated (one of many pages of similar data), you can use this class method to automatically fetch all pages.
-
.to(other_class) ⇒ Object
If using acts_as_array that returns links, send them to another class.
-
.validates_format_of(*attrs) ⇒ Object
Ensure that the given attributes have the correct format.
-
.validates_inclusion_of(*attrs) ⇒ Object
Ensure that the given attributes have values in the given list.
-
.validates_not_blank(*attrs) ⇒ Object
Ensure that the given attributes are not #blank?.
-
.validates_numericality_of(*attrs) ⇒ Object
Ensure that the given attribute is a number.
-
.validates_presence_of(*attrs) ⇒ Object
Ensure that the given attributes have been set by matching rules.
-
.with_xslt(filename) ⇒ Object
Preprocess the HTML by sending it through an XSLT stylesheet.
Instance Method Summary
-
#after_parse ⇒ Object
Have a chance to do something after parsing, but before validation.
-
#validate ⇒ Object
Called by the extract method to validate scraped data.
Methods included from RuleParser
#extractor, #parse, #rule, #rule_1, #rules, #selector
Methods included from Hpricot::Extractors
content, contents, text, text_process, texts, word, words, #xml
Instance Attribute Details
#hpricot ⇒ Object
Access the Hpricot object that the selectors are passed
# File 'lib/scrapes/page.rb', line 72
def hpricot
  @hpricot
end
#session ⇒ Object
Access the session object that was used to fetch this page’s data
# File 'lib/scrapes/page.rb', line 68
def session
  @session
end
#uri ⇒ Object
Access the URI where this page’s data came from
# File 'lib/scrapes/page.rb', line 64
def uri
  @uri
end
Class Method Details
.acts_as_array(method_to_call) ⇒ Object
Make Page.extract return an array by calling the given method. This can be very useful when your class does nothing more than collect a set of links for some other page to process. It causes Session#page to call the given block once for each object returned from method_to_call.
# File 'lib/scrapes/page.rb', line 119
def self.acts_as_array (method_to_call)
  { @as_array = method_to_call }
end
.extract(data, uri, session, &block) ⇒ Object
Called by the crawler to process a web page
# File 'lib/scrapes/page.rb', line 203
def self.extract (data, uri, session, &block)
  obj = process_page(data, uri, session)

  if {@paginated}
    if obj.respond_to?(:next_page)
      sister = obj
      while sister_uri = sister.next_page
        sister = extract_sister(session, obj, sister_uri)
      end
    elsif obj.respond_to?(:link_for_page)
      (2 .. obj.pages).each do |page|
        sister_uri = obj.link_for_page(page)
        extract_sister(session, obj, sister_uri)
      end
    end
  end

  as_array = {@as_array}
  obj = obj.send(as_array) if as_array

  return obj unless block
  obj.respond_to?(:each) ? obj.each {|o| yield(o)} : yield(obj)
end
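The last three statements of extract decide what the caller's block receives. The sketch below isolates that dispatch so it can be run without Scrapes; the deliver method and the LinkList class are hypothetical stand-ins, not part of the library.

```ruby
# Hypothetical stand-in for the tail of Page.extract: optionally convert the
# object using the method named by acts_as_array, then hand the result to
# the block (once per element if the result is enumerable).
def deliver(obj, as_array = nil, &block)
  obj = obj.send(as_array) if as_array
  return obj unless block
  obj.respond_to?(:each) ? obj.each { |o| block.call(o) } : block.call(obj)
end

# Stand-in for a page class that only collects links.
class LinkList
  attr_reader :links

  def initialize(links)
    @links = links
  end
end

seen = []
deliver(LinkList.new(%w[/a /b]), :links) { |link| seen << link }
p seen
```

Without a block the (possibly converted) object is returned as-is, so a caller can choose between iterating and receiving the whole result.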
.paginated ⇒ Object
If the page that you are parsing is paginated (one of many pages of similar data), you can use this class method to automatically fetch all pages. For this to work, you need to provide a few special methods:
Next Page
If you know the URL to the next page, then provide an instance method called next_page. It should return the URL for the next page, or nil when the current page is the last page.
class NextPageExample < Scrapes::Page
  rule(:next_page, 'a[href~=next]', '@href', 1)
end
Link for Page
Alternatively, you can provide an instance method called link_for_page and another one called pages. The pages method should return the number of pages in this paginated set. The link_for_page method should take a page number and return a URL to fetch that page.
class LinkForPageExample < Scrapes::Page
  # Named :pages because the paginated support calls the pages method.
  rule_1(:pages) {|e| m = e.text.match(/Page\s+\d+\s+of\s+(\d+)/) and m[1].to_i}

  def link_for_page (page)
    uri.sub(/page=\d+/, "page=#{page}")
  end
end
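The two pieces of LinkForPageExample can be tried on their own. This is a stand-alone sketch: the sample text and URL are made up, the method names page_count and link_for are hypothetical, and the URI is assumed to be a String (which is what the call to sub suggests).

```ruby
PAGE_COUNT = /Page\s+\d+\s+of\s+(\d+)/

# Returns the total page count, or nil when the text has no "Page X of Y".
def page_count(text)
  m = text.match(PAGE_COUNT)
  m && m[1].to_i
end

# Rewrites the page=N query parameter to point at the requested page.
def link_for(uri, page)
  uri.sub(/page=\d+/, "page=#{page}")
end

puts page_count("Page 2 of 17")
puts link_for("http://example.com/list?page=1", 3)
```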
Append to Page
Finally, you must provide an append_page method. It takes an instance of your Scrapes::Page derived class as an argument. Its job is to add the data found on that page to its own instance variables. This is necessary because paginated returns only one instance of your class.
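How next_page and append_page cooperate can be sketched without any network access. Everything here is a stand-in: FAKE_SITE replaces real fetches, ListingPage replaces a Scrapes::Page subclass, and fetch_all is a hypothetical loop that assumes the first page is the one that receives append_page calls.

```ruby
# Canned data standing in for three fetched pages.
FAKE_SITE = {
  '/list?page=1' => { items: %w[a b], next: '/list?page=2' },
  '/list?page=2' => { items: %w[c],   next: '/list?page=3' },
  '/list?page=3' => { items: %w[d e], next: nil },
}.freeze

class ListingPage
  attr_reader :items, :next_page

  def initialize(uri)
    data = FAKE_SITE.fetch(uri)
    @items = data[:items].dup
    @next_page = data[:next]
  end

  # Merge the data scraped from another page into this instance.
  def append_page(other)
    @items.concat(other.items)
  end
end

# Follow next_page links, folding every later page into the first one.
def fetch_all(start_uri)
  first = ListingPage.new(start_uri)
  current = first
  while (uri = current.next_page)
    current = ListingPage.new(uri)
    first.append_page(current)
  end
  first
end

p fetch_all('/list?page=1').items
```

Only the first instance is returned, which is why append_page has to copy whatever the caller will need from each later page.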
# File 'lib/scrapes/page.rb', line 110
def self.paginated
  { @paginated = true }
end
.to(other_class) ⇒ Object
If using acts_as_array that returns links, send them to another class
# File 'lib/scrapes/page.rb', line 197
def self.to (other_class)
  ToProxy.new(self, other_class)
end
.validates_format_of(*attrs) ⇒ Object
Ensure that the given attributes have the correct format
# File 'lib/scrapes/page.rb', line 155
def self.validates_format_of (*attrs)
  attrs, = (attrs, {
    :message => 'did not match regular expression',
    :with => /.*/,
  })

  validates_from(attrs, , lambda {|a| a.to_s.match([:with])})
end
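The check itself is a plain regular expression match on the value's string form, so an unanchored pattern accepts any value that merely contains a match. A stand-alone sketch of that predicate follows; the method name format_ok? is hypothetical.

```ruby
# The predicate used by validates_format_of, pulled out for experimentation.
def format_ok?(value, with)
  !value.to_s.match(with).nil?
end

puts format_ok?("abc123", /\d+/)        # unanchored: a substring match is enough
puts format_ok?("abc123", /\A\d+\z/)    # anchored: the whole value must be digits
puts format_ok?(123, /\A\d+\z/)         # non-strings are converted with to_s
```

Anchoring the pattern with \A and \z is usually what is wanted when the entire scraped value has to conform.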
.validates_inclusion_of(*attrs) ⇒ Object
Ensure that the given attributes have values in the given list
# File 'lib/scrapes/page.rb', line 166
def self.validates_inclusion_of (*attrs)
  attrs, = (attrs, {
    :message => 'is not in the list of accepted values',
    :in => [],
  })

  validates_from(attrs, , lambda {|a| [:in].include?(a)})
end
.validates_not_blank(*attrs) ⇒ Object
Ensure that the given attributes are not #blank?
# File 'lib/scrapes/page.rb', line 145
def self.validates_not_blank (*attrs)
  attrs, = (attrs, {
    :message => 'rule never matched',
  })

  validates_from(attrs, , lambda {|a| !a.blank?})
end
.validates_numericality_of(*attrs) ⇒ Object
Ensure that the given attribute is a number
# File 'lib/scrapes/page.rb', line 177
def self.validates_numericality_of (*attrs)
  attrs, = (attrs, {
    :message => 'is not a number',
  })

  closure = lambda do |a|
    begin
      Kernel.Float(a.to_s)
    rescue ArgumentError, TypeError
      false
    else
      true
    end
  end

  validates_from(attrs, , closure)
end
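The closure relies on Kernel.Float raising for strings that are not numbers. The same check as a stand-alone method, with the hypothetical name numeric?, makes it easy to see which scraped values would pass:

```ruby
# Same logic as the closure in validates_numericality_of.
def numeric?(value)
  Kernel.Float(value.to_s)
  true
rescue ArgumentError, TypeError
  false
end

["42", "3.14", "1e3", "abc", ""].each do |s|
  puts "#{s.inspect}: #{numeric?(s)}"
end
```

Because the value is converted with to_s first, nil becomes an empty string and is rejected like any other non-numeric text.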
.validates_presence_of(*attrs) ⇒ Object
Ensure that the given attributes have been set by matching rules
# File 'lib/scrapes/page.rb', line 135
def self.validates_presence_of (*attrs)
  attrs, = (attrs, {
    :message => 'rule never matched',
  })

  validates_from(attrs, , lambda {|a| !a.nil?})
end
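validates_presence_of and validates_not_blank share a message but differ in their test: nil? versus blank?. The sketch below compares them side by side. This page does not say which blank? implementation Scrapes relies on, so blank_value? is a hypothetical stand-in that assumes nil, empty strings, and whitespace-only strings count as blank.

```ruby
# Hypothetical stand-in for #blank? (assumed semantics, see above).
def blank_value?(value)
  value.nil? || value.to_s.strip.empty?
end

PRESENCE_CHECK  = lambda { |a| !a.nil? }            # as in validates_presence_of
NOT_BLANK_CHECK = lambda { |a| !blank_value?(a) }   # as in validates_not_blank

[nil, '', '  ', 'text'].each do |v|
  puts "#{v.inspect}: presence=#{PRESENCE_CHECK.call(v)} " \
       "not_blank=#{NOT_BLANK_CHECK.call(v)}"
end
```

Under that assumption, a rule that matched an empty element passes the presence check but fails the not-blank check.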
.with_xslt(filename) ⇒ Object
Preprocess the HTML by sending it through an XSLT stylesheet. The stylesheet should return a document that can be then processed using your rules. Using this feature requires that you have the xsltproc utility in your PATH. You can get xsltproc from libxslt: xmlsoft.org/XSLT/
# File 'lib/scrapes/page.rb', line 128
def self.with_xslt (filename)
  raise "#{XSLTPROC} could not be found" unless `#{XSLTPROC} --version 2>&1`.match(/libxslt/)
  { @with_xslt = filename }
end
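Before relying on with_xslt it can help to confirm that xsltproc is reachable. The sketch below repeats the availability test from the listing above and builds a command line for a transformation. How Scrapes actually invokes xsltproc is not shown on this page, so xslt_command, the --html option, and the file names are assumptions for illustration.

```ruby
XSLTPROC = 'xsltproc'

# True when `xsltproc --version` runs and mentions libxslt.
def xsltproc_available?
  !`#{XSLTPROC} --version 2>&1`.match(/libxslt/).nil?
rescue SystemCallError
  false
end

# Build the argument list for transforming one HTML file.
# --html asks xsltproc to parse the input as HTML rather than XML.
def xslt_command(stylesheet, input)
  [XSLTPROC, '--html', stylesheet, input]
end

if xsltproc_available?
  puts xslt_command('clean.xsl', 'page.html').join(' ')
else
  puts "#{XSLTPROC} could not be found"
end
```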
Instance Method Details
#after_parse ⇒ Object
Have a chance to do something after parsing, but before validation
# File 'lib/scrapes/page.rb', line 230
def after_parse
end
#validate ⇒ Object
Called by the extract method to validate scraped data. If you override this method, you should call super. This method will probably be changed in the future so that you don’t have to call super.
# File 'lib/scrapes/page.rb', line 237
def validate
  validations = self.class. { @validations }

  validations.each do |v|
    raise "#{self.class}.#{v[:name]} #{v[:options][:message]}" unless
      v[:proc].call(send(v[:name]))
  end

  self
end
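The flow from a class-level validates_* call to the exception raised by validate can be shown in miniature. This is a simplified sketch, not the Scrapes source: MiniPage, its validations accessor, and TitlePage are hypothetical, and only the presence check is reproduced.

```ruby
class MiniPage
  # Per-class list of { name:, options:, proc: } hashes.
  def self.validations
    @validations ||= []
  end

  def self.validates_presence_of(*attrs)
    attrs.each do |name|
      validations << { name: name,
                       options: { message: 'rule never matched' },
                       proc: lambda { |a| !a.nil? } }
    end
  end

  # Raise on the first failing validation, otherwise return self.
  def validate
    self.class.validations.each do |v|
      unless v[:proc].call(send(v[:name]))
        raise "#{self.class}.#{v[:name]} #{v[:options][:message]}"
      end
    end
    self
  end
end

class TitlePage < MiniPage
  attr_accessor :title
  validates_presence_of :title
end

page = TitlePage.new
begin
  page.validate
rescue RuntimeError => e
  puts e.message
end
```

The error message names both the class and the attribute, which makes it straightforward to tell which rule failed to match.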