Class: Scraper::Base
- Inherits:
-
Object
- Object
- Scraper::Base
- Defined in:
- lib/scraper/base.rb
Direct Known Subclasses
Microformats::HAtom, Microformats::HAtom::Entry, Microformats::HAtom::Feed, Microformats::HCard
Defined Under Namespace
Classes: PageInfo
Constant Summary
- READER_OPTIONS =
[:last_modified, :etag, :redirect_limit, :user_agent, :timeout]
Instance Attribute Summary
-
#extracted ⇒ Object
Set to true when the first extractor returns true.
-
#options ⇒ Object
Returns the options for this object.
-
#page_info ⇒ Object
Information about the HTML page scraped.
Class Method Summary
-
.array(*symbols) ⇒ Object
Declares which accessors are arrays.
-
.element(element) ⇒ Object
Returns the element itself.
-
.extractor(map) ⇒ Object
Creates an extractor that will extract values from the selected element and place them in instance variables of the scraper.
-
.options ⇒ Object
Returns the options for this class.
-
.parser(name = :tidy) ⇒ Object
Specifies which parser to use.
-
.parser_options(options) ⇒ Object
Options to pass to the parser.
-
.process(*selector, &block) ⇒ Object
:call-seq: process(symbol?, selector, values?, extractor) process(symbol?, selector, values?) { |element| … }.
-
.process_first(*selector, &block) ⇒ Object
Similar to #process, but only extracts from the first selected element.
-
.result(*symbols) ⇒ Object
Modifies this scraper to return a single value or a structure.
-
.root_element(name) ⇒ Object
The root element to scrape.
-
.rules ⇒ Object
Returns an array of rules defined for this class.
-
.scrape(source, options = nil) ⇒ Object
Scrapes the document and returns the result.
-
.selector(symbol, *selector, &block) ⇒ Object
:call-seq: selector(symbol, selector, values?) selector(symbol, selector, values?) { |elements| … }.
-
.text(element) ⇒ Object
Returns the text of the element.
Instance Method Summary
-
#collect ⇒ Object
Called by #scrape after scraping the document, and before calling #result.
-
#document ⇒ Object
Returns the document being processed.
-
#initialize(source, options = nil) ⇒ Base
constructor
Create a new scraper instance.
-
#option(symbol) ⇒ Object
Returns the value of an option.
-
#prepare(document) ⇒ Object
Called by #scrape after creating the document, but before running any processing rules.
- #request(url, options) ⇒ Object
-
#result ⇒ Object
Returns the result of a successful scrape.
-
#scrape ⇒ Object
Scrapes the document and returns the result.
-
#skip(elements = nil) ⇒ Object
:call-seq: skip() => true skip(element) => true skip([element …]) => true.
-
#stop ⇒ Object
Stops processing this page.
Constructor Details
#initialize(source, options = nil) ⇒ Base
Create a new scraper instance.
The argument source is a URL, a string containing HTML, or an HTML::Node. The optional argument options holds the options passed to the scraper. See Base#scrape for more details.
For example:
# The page we want to scrape
url = URI.parse("http://example.com")
# Skip the header
scraper = MyScraper.new(url, :root_element=>"body")
result = scraper.scrape
# File 'lib/scraper/base.rb', line 715
def initialize(source, options = nil)
  @page_info = PageInfo[]
  @options = options || {}
  case source
  when URI
    @document = source
  when String, HTML::Node
    @document = source
    # TODO: document and test case these two.
    @page_info.url = @page_info.original_url = @options[:url]
    @page_info.encoding = @options[:encoding]
  else
    raise ArgumentError, "Can only scrape URI, String or HTML::Node"
  end
end
Instance Attribute Details
#extracted ⇒ Object
Set to true when the first extractor returns true.
# File 'lib/scraper/base.rb', line 692
def extracted
  @extracted
end
#options ⇒ Object
Returns the options for this object.
# File 'lib/scraper/base.rb', line 700
def options
  @options
end
#page_info ⇒ Object
Information about the HTML page scraped. See PageInfo.
# File 'lib/scraper/base.rb', line 696
def page_info
  @page_info
end
Class Method Details
.array(*symbols) ⇒ Object
Declares which accessors are arrays. You can declare the accessor here, or use “symbol[]” as the target.
For example:
array :urls
process "a[href]", :urls=>"@href"
Is equivalent to:
process "a[href]", "urls[]"=>"@href"
# File 'lib/scraper/base.rb', line 473
def array(*symbols)
  @arrays ||= []
  symbols.each do |symbol|
    symbol = symbol.to_sym
    @arrays << symbol
    begin
      self.instance_method(symbol)
    rescue NameError
      attr_accessor symbol
    end
  end
end
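The check-then-define pattern used by array can be tried in isolation. The following is a minimal sketch, assuming a hypothetical ArrayDeclarations module and LinkHolder class (neither is part of the library); it shows that an accessor is generated only when no instance method with that name exists yet.

```ruby
# Hypothetical stand-in for the class-level `array` declaration.
module ArrayDeclarations
  def array(*symbols)
    @arrays ||= []
    symbols.each do |symbol|
      symbol = symbol.to_sym
      @arrays << symbol
      begin
        # Raises NameError when no such instance method exists yet.
        instance_method(symbol)
      rescue NameError
        attr_accessor symbol
      end
    end
  end

  # Lists the symbols declared as arrays so far.
  def arrays
    @arrays ||= []
  end
end

class LinkHolder
  extend ArrayDeclarations

  # An existing method that must not be replaced by a generated accessor.
  def urls
    "custom reader"
  end

  array :urls, :titles
end

holder = LinkHolder.new
holder.titles = ["first"]
```

Because urls was already defined, only titles receives a generated reader and writer.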
.element(element) ⇒ Object
Returns the element itself.
You can use this method from an extractor, e.g.:
process "h1", :header=>:element
# File 'lib/scraper/base.rb', line 373
def element(element)
  element
end
.extractor(map) ⇒ Object
Creates an extractor that will extract values from the selected element and place them in instance variables of the scraper. You can pass the result to #process.
Example
This example processes a document looking for an element with the class name article. It extracts the attribute id and stores it in the instance variable @id. It extracts the article node itself and puts it in the instance variable @article.
class ArticleScraper < Scraper::Base
process ".article", extractor(:id=>"@id", :article=>:element)
attr_reader :id, :article
end
result = ArticleScraper.scrape(html)
puts result.id
puts result.article
Sources
Extractors operate on the selected element, and can extract the following values:
- "elem_name" – Extracts the element itself if it matches the element name (e.g. “h2” will extract only level 2 header elements).
- "@attr_name" – Extracts the attribute value from the element if specified (e.g. “@id” will extract the id attribute).
- "elem_name@attr_name" – Extracts the attribute value from the element if specified, but only if the element has the specified name (e.g. “h2@id”).
- :element – Extracts the element itself.
- :text – Extracts the text value of the node.
- Scraper – Using this class creates a scraper to process the current element and extract the result. This can be used for handling complex structures.
If you use an array of sources, the first source that matches anything is used. For example, ["abbr@title", :text] extracts the value of the title attribute if the element is abbr, otherwise the text value of the element.
If you use a hash, you can extract multiple values at the same time. For example, {:id=>"@id", :class=>"@class"} extracts the id and class attribute values.
:element and :text are special cases of symbols. You can pass any symbol that matches a class method and that class method will be called to extract a value from the selected element. You can also pass a Proc or Method directly.
It is also possible to pass a static value, which is useful when processing an element with more than one rule (:skip=>false).
Targets
Extractors assign the extracted value to an instance variable of the scraper. The instance variable contains the last value extracted.
An accessor for that instance variable is also created if no method with that name exists. For example, :title=>:text creates an accessor for title. However, :id=>"@id" does not create an accessor since each object already has a method called id.
If you want to extract multiple values into the same variable, use #array to declare that accessor as an array.
Alternatively, you can append [] to the variable name. For example:
process "*", "ids[]"=>"@id"
result :ids
The special target :skip allows you to control whether other rules can apply to the same element. By default a processing rule without a block (or a block that returns true) will skip that element so no other processing rule sees it.
You can change this with :skip=>false.
# File 'lib/scraper/base.rb', line 283
def extractor(map)
  extracts = []
  map.each_pair do |target, source|
    source = extract_value_from(source)
    target = extract_value_to(target)
    define_method :__extractor do |element|
      value = source.call(element)
      target.call(self, value) if !value.nil?
    end
    extracts << instance_method(:__extractor)
    remove_method :__extractor
  end
  lambda do |element|
    extracts.each do |extract|
      extract.bind(self).call(element)
    end
    true
  end
end
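The source above turns each block into an unbound method by defining a temporary method, capturing it, and removing it again. The following is a minimal sketch of that technique on its own, assuming a hypothetical MethodCapture module and Counter class that are not part of the library.

```ruby
# Hypothetical helper module; the names are illustrative only.
module MethodCapture
  # Turns a block into an UnboundMethod so it can later run with any
  # instance of the class as self.
  def capture(&block)
    define_method(:__captured, &block)
    unbound = instance_method(:__captured)
    remove_method(:__captured)
    unbound
  end
end

class Counter
  extend MethodCapture
  attr_reader :seen

  # The block refers to @seen, which only exists on instances.
  RECORD = capture { |value| (@seen ||= []) << value }

  def record(value)
    # Bind the captured method to this instance before calling it.
    RECORD.bind(self).call(value)
  end
end

counter = Counter.new
counter.record("a")
counter.record("b")
```

The temporary method name never stays on the class, so captured blocks do not pollute its public interface.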
.options ⇒ Object
Returns the options for this class.
# File 'lib/scraper/base.rb', line 412
def options()
  @options ||= {}
end
.parser(name = :tidy) ⇒ Object
Specifies which parser to use. The default is :tidy.
# File 'lib/scraper/base.rb', line 379
def parser(name = :tidy)
  self.options[:parser] = name
end
.parser_options(options) ⇒ Object
Options to pass to the parser.
For example, when using Tidy, you can use these options to tell Tidy how to clean up the HTML.
This method sets the option for the class. Classes inherit options from their parents. You can also pass options to the scraper object itself using the :parser_options option.
# File 'lib/scraper/base.rb', line 392
def parser_options(options)
  self.options[:parser_options] = options
end
.process(*selector, &block) ⇒ Object
:call-seq:
process(symbol?, selector, values?, extractor)
process(symbol?, selector, values?) { |element| ... }
Defines a processing rule. A processing rule consists of a selector that matches elements, and an extractor that does something interesting with their value.
Symbol
Rules are processed in the order in which they are defined. Use #rules if you need to change the order of processing.
Rules can be named or anonymous. If the first argument is a symbol, it is used as the rule name. You can use the rule name to position, remove or replace it.
Selector
The first argument is a selector. It selects elements from the document that are potential candidates for extraction. Each selected element is passed to the extractor.
The selector argument may be a string, an HTML::Selector object or any object that responds to the select method. Passing an Array (responds to select) will not do anything useful.
String selectors support value substitution, replacing question marks (?) in the selector expression with values from the method arguments. See HTML::Selector for more information.
Extractor
The last argument or block is the extractor. The extractor does something interesting with the selected element, typically assigning it to an instance variable of the scraper.
Since the extractor is called on the scraper, it can also use the scraper to maintain state, e.g. this extractor counts how many div elements appear in the document:
process("div") { |element| @count = (@count || 0) + 1 }
The extractor returns true if the element was processed and should not be passed to any other extractor (including any child elements).
The default implementation of #result returns self only if at least one extractor returned true. However, you can override #result and use extractors that return false.
A block extractor is called with a single element.
You can also use the #extractor method to create extractors that assign elements, attributes and text values to instance variables, or pass a Hash as the last argument to #process. See #extractor for more information.
When using a block, the last statement is the response. Do not use return, use next if you want to return a value before the last statement. return does not do what you expect it to.
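The difference between next and return is general Ruby block behavior and can be observed without the library. The sketch below assumes two hypothetical helpers, keep_block and block_with_return; it does not show how the scraper itself invokes extractor blocks.

```ruby
# Hypothetical helper that stores a block for later use, the way a
# class-level declaration would.
def keep_block(&block)
  block
end

# `next` leaves the block early and hands its argument back to the caller.
with_next = keep_block do |value|
  next :skipped if value.nil?
  :processed
end

# `return` tries to return from the method that created the proc. That
# method has already finished by the time the proc is called, so Ruby
# raises LocalJumpError.
def block_with_return
  proc { |value| return :skipped if value.nil?; :processed }
end

with_return = block_with_return

outcome =
  begin
    with_return.call(nil)
  rescue LocalJumpError
    :local_jump_error
  end
```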
Example
class ScrapePosts < Scraper::Base
# Select the title of a post
selector :select_title, "h2"
# Select the body of a post
selector :select_body, ".body"
# All elements with class name post.
process ".post" do |element|
title = select_title(element)
body = select_body(element)
(@posts ||= []) << Post.new(title, body)
true
end
attr_reader :posts
end
posts = ScrapePosts.scrape(html).posts
To process only a single element:
class ScrapeTitle < Scraper::Base
process "html>head>title", :title=>:text
result :title
end
puts ScrapeTitle.scrape(html)
# File 'lib/scraper/base.rb', line 123
def process(*selector, &block)
  create_process(false, *selector, &block)
end
.process_first(*selector, &block) ⇒ Object
Similar to #process, but only extracts from the first selected element. Faster if you know the document contains only one applicable element, or if you are only interested in processing the first one.
# File 'lib/scraper/base.rb', line 132
def process_first(*selector, &block)
  create_process(true, *selector, &block)
end
.result(*symbols) ⇒ Object
Modifies this scraper to return a single value or a structure. Use in combination with accessors.
When called with one symbol, scraping returns the result of calling that method (typically an accessor). When called with two or more symbols, scraping returns a structure of values, one for each symbol.
For example:
class ScrapeTitle < Scraper::Base
process_first "html>head>title", :title=>:text
result :title
end
puts "Title: " + ScrapeTitle.scrape(html)
class ScrapeDts < Scraper::Base
process ".dtstart", :dtstart=>["abbr@title", :text]
process ".dtend", :dtend=>["abbr@title", :text]
result :dtstart, :dtend
end
dts = ScrapeDts.scrape(html)
puts "Starts: #{dts.dtstart}"
puts "Ends: #{dts.dtend}"
# File 'lib/scraper/base.rb', line 449
def result(*symbols)
  raise ArgumentError, "Use one symbol to return the value of this accessor, multiple symbols to returns a structure" if symbols.empty?
  symbols = symbols.map {|s| s.to_sym}
  if symbols.size == 1
    define_method :result do
      return self.send(symbols[0])
    end
  else
    struct = Struct.new(*symbols)
    define_method :result do
      return struct.new(*symbols.collect {|s| self.send(s) })
    end
  end
end
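The two return shapes can be seen without scraping anything. This is a minimal sketch that mirrors the source above, assuming a hypothetical ResultDeclaration module and two illustrative classes; values are assigned by hand instead of being extracted from a document.

```ruby
# Hypothetical stand-in for the class-level `result` declaration.
module ResultDeclaration
  def result(*symbols)
    raise ArgumentError, "expected at least one symbol" if symbols.empty?
    symbols = symbols.map(&:to_sym)
    if symbols.size == 1
      # One symbol: return that accessor's value directly.
      define_method(:result) { send(symbols[0]) }
    else
      # Several symbols: return a Struct with one member per symbol.
      struct = Struct.new(*symbols)
      define_method(:result) { struct.new(*symbols.map { |s| send(s) }) }
    end
  end
end

class SingleValue
  extend ResultDeclaration
  attr_accessor :title
  result :title
end

class TwoValues
  extend ResultDeclaration
  attr_accessor :dtstart, :dtend
  result :dtstart, :dtend
end

single = SingleValue.new
single.title = "Hello"

pair = TwoValues.new
pair.dtstart = "2006-01-01"
pair.dtend = "2006-01-02"
dts = pair.result
```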
.root_element(name) ⇒ Object
The root element to scrape.
The root element for an HTML document is html. However, if you want to scrape only the header or body, you can set the root_element to head or body.
This method sets the root element for the class. Classes inherit this option from their parents. You can also pass a root element to the scraper object itself using the :root_element option.
# File 'lib/scraper/base.rb', line 406
def root_element(name)
  self.options[:root_element] = name ? name.to_s : nil
end
.rules ⇒ Object
Returns an array of rules defined for this class. You can use this array to change the order of rules.
# File 'lib/scraper/base.rb', line 419
def rules()
  @rules ||= []
end
.scrape(source, options = nil) ⇒ Object
Scrapes the document and returns the result.
The first argument provides the input document. It can be one of:
- URI – Retrieve an HTML page from this URL and scrape it.
- String – The HTML page as a string.
- HTML::Node – An HTML node, can be a document or element.
You can specify options for the scraper class, or override these by passing options in the second argument. Some options only make sense in the constructor.
The following options are supported for reading HTML pages:
- :last_modified – Last-Modified header used for caching.
- :etag – ETag header used for caching.
- :redirect_limit – Limits number of redirects to follow.
- :user_agent – Value for User-Agent header.
- :timeout – HTTP open connection/read timeouts (in seconds).
The following options are supported for parsing the HTML:
- :root_element – The root element to scrape, see also #root_element.
- :parser – Specifies which parser to use. (Typically, you set this for the class).
- :parser_options – Options to pass to the parser.
The result is returned by calling the #result method. The default implementation returns self if any extractor returned true, nil otherwise.
For example:
result = MyScraper.scrape(url, :root_element=>"body")
The method may raise any number of exceptions. HTTPError indicates it failed to retrieve the HTML page, and HTMLParseError that it failed to parse the page. Other exceptions come from extractors and the #result method.
# File 'lib/scraper/base.rb', line 345
def scrape(source, options = nil)
  scraper = self.new(source, options);
  return scraper.scrape
end
.selector(symbol, *selector, &block) ⇒ Object
:call-seq:
selector(symbol, selector, values?)
selector(symbol, selector, values?) { |elements| ... }
Create a selector method. You can call a selector method directly to select elements.
For example, define a selector:
selector(:five_divs, "div") { |elems| elems[0..4] }
And call it to retrieve the first five div elements:
divs = five_divs(element)
Call a selector method with an element and it returns an array of elements that match the selector, beginning with the element argument itself. It returns an empty array if nothing matches.
If the selector is defined with a block, all selected elements are passed to the block and the result of the block is returned.
For convenience, a first_ method is also created that returns (and yields) only the first selected element. For example:
selector :post, "#post"
@post = first_post(element)
When the selector is defined with a block, both methods call that block with an array of elements.
The selector argument may be a string, an HTML::Selector object or any object that responds to the select method. Passing an Array (responds to select) will not do anything useful.
String selectors support value substitution, replacing question marks (?) in the selector expression with values from the method arguments. See HTML::Selector for more information.
When using a block, the last statement is the response. Do not use return, use next if you want to return a value before the last statement. return does not do what you expect it to.
# File 'lib/scraper/base.rb', line 175
def selector(symbol, *selector, &block)
  raise ArgumentError, "Missing selector: the first argument tells us what to select" if selector.empty?
  if selector[0].is_a?(String)
    selector = HTML::Selector.new(*selector)
  else
    raise ArgumentError, "Selector must respond to select() method" unless selector.respond_to?(:select)
    selector = selector[0]
  end
  if block
    define_method symbol do |element|
      selected = selector.select(element)
      return block.call(selected) unless selected.empty?
    end
    define_method "first_#{symbol}" do |element|
      selected = selector.select_first(element)
      return block.call([selected]) if selected
    end
  else
    define_method symbol do |element|
      return selector.select(element)
    end
    define_method "first_#{symbol}" do |element|
      return selector.select_first(element)
    end
  end
end
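The pairing of a selector method with its first_ counterpart can be sketched with plain arrays. The example below assumes a hypothetical EvenSelector (an object that responds to select and select_first), a hypothetical SelectorDeclaration module and a NumberPicker class; it selects numbers instead of HTML elements.

```ruby
# Hypothetical selector: anything that responds to select and select_first.
class EvenSelector
  def select(numbers)
    numbers.select(&:even?)
  end

  def select_first(numbers)
    numbers.find(&:even?)
  end
end

# Hypothetical stand-in for the class-level `selector` declaration.
module SelectorDeclaration
  def selector(symbol, matcher, &block)
    if block
      define_method(symbol) do |input|
        selected = matcher.select(input)
        block.call(selected) unless selected.empty?
      end
      define_method("first_#{symbol}") do |input|
        selected = matcher.select_first(input)
        # The block always receives an array, even for a single match.
        block.call([selected]) if selected
      end
    else
      define_method(symbol) { |input| matcher.select(input) }
      define_method("first_#{symbol}") { |input| matcher.select_first(input) }
    end
  end
end

class NumberPicker
  extend SelectorDeclaration
  selector :evens, EvenSelector.new
  selector(:doubled_evens, EvenSelector.new) { |found| found.map { |n| n * 2 } }
end

picker = NumberPicker.new
```

Without a block the methods return the raw selection; with a block both methods return whatever the block returns, or nil when nothing matched.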
.text(element) ⇒ Object
Returns the text of the element.
You can use this method from an extractor, e.g.:
process "title", :title=>:text
# File 'lib/scraper/base.rb', line 355
def text(element)
  text = ""
  stack = element.children.reverse
  while node = stack.pop
    if node.tag?
      stack.concat node.children.reverse
    else
      text << node.content
    end
  end
  return text
end
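The stack-based text collection above can be exercised on a small hand-built tree. This sketch assumes a hypothetical Node struct with tag?, content and children, standing in for HTML::Node; the helper names tag, text_node and collect_text are illustrative.

```ruby
# Hypothetical node type; the real library works with HTML::Node objects.
Node = Struct.new(:name, :content, :children) do
  # Text nodes have no name in this sketch.
  def tag?
    !name.nil?
  end
end

def tag(name, *children)
  Node.new(name, nil, children)
end

def text_node(content)
  Node.new(nil, content, [])
end

# Collects text in document order without recursion.
def collect_text(element)
  text = String.new
  stack = element.children.reverse
  while (node = stack.pop)
    if node.tag?
      # Push children reversed so the first child is popped first.
      stack.concat(node.children.reverse)
    else
      text << node.content
    end
  end
  text
end

paragraph = tag("p",
                text_node("Hello, "),
                tag("em", text_node("wide")),
                text_node(" world"))
```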
Instance Method Details
#collect ⇒ Object
Called by #scrape after scraping the document, and before calling #result. Typically used to run any validation, post-processing steps, resolving referenced elements, etc.
# File 'lib/scraper/base.rb', line 939
def collect()
end
#document ⇒ Object
Returns the document being processed.
If the scraper was created with a URL, this method will attempt to retrieve the page and parse it.
If the scraper was created with a string, this method will attempt to parse the page.
Be advised that calling this method may raise an exception (HTTPError or HTMLParseError).
The document is parsed only the first time this method is called.
# File 'lib/scraper/base.rb', line 856
def document
  if @document.is_a?(URI)
    # Attempt to read page. May raise HTTPError.
    options = {}
    READER_OPTIONS.each { |key| options[key] = option(key) }
    request(@document, options)
  end
  if @document.is_a?(String)
    # Parse the page. May raise HTMLParseError.
    parsed = Reader.parse_page(@document, @page_info.encoding,
                               option(:parser_options), option(:parser))
    @document = parsed.document
    @page_info.encoding = parsed.encoding
  end
  return @document if @document.is_a?(HTML::Node)
  raise RuntimeError, "No document to process"
end
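The document is parsed only once because the method replaces the raw source held in @document with the parsed value. The sketch below illustrates that idea with a hypothetical LazyDocument class; String#split stands in for the real parser, and an Array stands in for HTML::Node.

```ruby
# Hypothetical class showing the parse-once pattern.
class LazyDocument
  attr_reader :parse_count

  def initialize(source)
    @document = source
    @parse_count = 0
  end

  def document
    if @document.is_a?(String)
      # Stand-in for parsing: replace the raw string with the parsed form.
      @parse_count += 1
      @document = @document.split
    end
    return @document if @document.is_a?(Array)
    raise RuntimeError, "No document to process"
  end
end

lazy = LazyDocument.new("one two")
first_call = lazy.document
second_call = lazy.document
```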
#option(symbol) ⇒ Object
Returns the value of an option.
Returns the value of an option passed to the scraper on creation. If not specified, return the value of the option set for this scraper class. Options are inherited from the parent class.
# File 'lib/scraper/base.rb', line 967
def option(symbol)
  return options.has_key?(symbol) ? options[symbol] : self.class.options[symbol]
end
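The lookup order (instance options first, then class options) can be sketched as follows. The ConfigurableBase and BodyOnly classes are hypothetical, and the inherited hook that copies the parent's options is an assumption made for this sketch; the source shown here does not reveal how the library implements inheritance of options.

```ruby
# Hypothetical classes; only the lookup order mirrors the source above.
class ConfigurableBase
  def self.options
    @options ||= {}
  end

  # Assumption: subclasses start with a copy of the parent's options.
  def self.inherited(subclass)
    super
    subclass.instance_variable_set(:@options, options.dup)
  end

  attr_reader :options

  def initialize(options = nil)
    @options = options || {}
  end

  # An option passed to the instance wins, even when its value is nil.
  def option(symbol)
    options.key?(symbol) ? options[symbol] : self.class.options[symbol]
  end
end

ConfigurableBase.options[:parser] = :tidy

class BodyOnly < ConfigurableBase
  options[:root_element] = "body"
end

defaults = BodyOnly.new
override = BodyOnly.new(:root_element => nil, :timeout => 5)
```

Because the lookup tests for the presence of the key rather than the truthiness of the value, passing :root_element => nil overrides the class setting.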
#prepare(document) ⇒ Object
Called by #scrape after creating the document, but before running any processing rules.
You can override this method to do any preparation work.
# File 'lib/scraper/base.rb', line 932
def prepare(document)
end
#request(url, options) ⇒ Object
# File 'lib/scraper/base.rb', line 875
def request(url, options)
  if page = Reader.read_page(@document, options)
    @page_info.url = page.url
    @page_info.original_url = @document
    @page_info.last_modified = page.last_modified
    @page_info.etag = page.etag
    @page_info.encoding = page.encoding
    @document = page.content
  end
end
#result ⇒ Object
Returns the result of a successful scrape.
This method is called by #scrape after running all the rules on the document. You can also call it directly.
Override this method to return a specific object, perform post-scraping processing, validation, etc.
The default implementation returns self if any extractor returned true, nil otherwise.
If you override this method, implement your own logic to determine if anything was extracted and return nil otherwise. Also, make sure calling this method multiple times returns the same result.
# File 'lib/scraper/base.rb', line 957
def result()
  return self if @extracted
end
#scrape ⇒ Object
Scrapes the document and returns the result.
If the scraper was created with a URL, retrieve the page and parse it. If the scraper was created with a string, parse the page.
The result is returned by calling the #result method. The default implementation returns self if any extractor returned true, nil otherwise.
The method may raise any number of exceptions. HTTPError indicates it failed to retrieve the HTML page, and HTMLParseError that it failed to parse the page. Other exceptions come from extractors and the #result method.
See also Base#scrape.
# File 'lib/scraper/base.rb', line 747
def scrape()
  # Call prepare with the document, but before doing anything else.
  prepare document
  # Retrieve the document. This may raise HTTPError or HTMLParseError.
  case document
  when Array
    stack = @document.reverse # see below
  when HTML::Node
    # If a root element is specified, start selecting from there.
    # The stack is empty if we can't find any root element (makes
    # sense). However, the node we're going to process may be
    # a tag, or an HTML::Document.root which is the equivalent of
    # a document fragment.
    root_element = option(:root_element)
    root = root_element ? @document.find(:tag=>root_element) : @document
    stack = root ? (root.tag? ? [root] : root.children.reverse) : []
  else
    return
  end
  # @skip stores all the elements we want to skip (see #skip).
  # rules stores all the rules we want to process with this
  # scraper, based on the class definition.
  @skip = []
  @stop = false
  rules = self.class.rules.clone
  begin
    # Process the document one node at a time. We process elements
    # from the end of the stack, so each time we visit child elements,
    # we add them to the end of the stack in reverse order.
    while node = stack.pop
      break if @stop
      skip_this = false
      # Only match nodes that are elements, ignore text nodes.
      # Also ignore any element that's on the skip list, and if
      # found one, remove it from the list (since we never visit
      # the same element twice). But an element may be added twice
      # to the skip list.
      # Note: equal? is faster than == for nodes.
      next unless node.tag?
      @skip.delete_if { |s| skip_this = true if s.equal?(node) }
      next if skip_this
      # Run through all the rules until we process the element or
      # run out of rules. If skip_this=true then we processed the
      # element and we can break out of the loop. However, we might
      # process (and skip) descendants so also watch the skip list.
      rules.delete_if do |selector, extractor, rule_name, first_only|
        break if skip_this
        # The result of calling match (selected) is nil, element
        # or array of elements. We turn it into an array to
        # process one element at a time. We process all elements
        # that are not on the skip list (we haven't visited
        # them yet).
        if selected = selector.match(node, first_only)
          selected = [selected] unless selected.is_a?(Array)
          selected = [selected.first] if first_only
          selected.each do |element|
            # Do not process elements we already skipped
            # (see above). However, this time we may visit
            # an element twice, since selected elements may
            # be descendants of the current element on the
            # stack. In rare cases two elements on the stack
            # may pick the same descendants.
            next if @skip.find { |s| s.equal?(element) }
            # Call the extractor method with this element.
            # If it returns true, skip the element and if
            # the current element, don't process any more
            # rules. Again, pay attention to descendants.
            if extractor.bind(self).call(element)
              @extracted = true
            end
            if @skip.delete(true)
              if element.equal?(node)
                skip_this = true
              else
                @skip << element
              end
            end
          end
          first_only if !selected.empty?
        end
      end
      # If we did not skip the element, we're going to process its
      # children. Reverse order since we're popping from the stack.
      if !skip_this && children = node.children
        stack.concat children.reverse
      end
    end
  ensure
    @skip = nil
  end
  collect
  return result
end
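The core of the traversal above is an explicit stack that visits elements in document order and does not descend into an element once it has been processed. The following simplified sketch shows only that part, assuming a hypothetical Element struct and walk method; it omits rules, the skip list and stop handling.

```ruby
# Hypothetical element type standing in for HTML::Node.
Element = Struct.new(:name, :children)

# Visits elements in document order using an explicit stack. The block
# returns true to claim an element, which also hides its children.
def walk(root)
  visited = []
  stack = [root]
  while (node = stack.pop)
    visited << node.name
    claimed = yield(node)
    # Children are pushed in reverse so the first child is popped first.
    stack.concat(node.children.reverse) unless claimed
  end
  visited
end

tree = Element.new("body", [
  Element.new("div", [Element.new("h1", []), Element.new("p", [])]),
  Element.new("ul", [Element.new("li", [])])
])

# Claim every div: its h1 and p children are never visited.
order = walk(tree) { |element| element.name == "div" }
```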
#skip(elements = nil) ⇒ Object
:call-seq:
skip() => true
skip(element) => true
skip([element ...]) => true
Skips processing the specified element(s).
If called with a single element, that element will not be processed.
If called with an array of elements, all the elements in the array are skipped.
If called with no element, skips processing the current element. This has the same effect as returning true.
For convenience this method always returns true. For example:
process "h1" do |element|
@header = element
skip
end
# File 'lib/scraper/base.rb', line 907
def skip(elements = nil)
  case elements
  when Array then @skip.concat elements
  when HTML::Node then @skip << elements
  when nil then @skip << true
  when true, false then @skip << elements
  end
  # Calling skip(element) as the last statement is
  # redundant by design.
  return true
end
#stop ⇒ Object
Stops processing this page. You can call this early on if you discover there is no interesting information on the page, or once you are done extracting all useful information.
# File 'lib/scraper/base.rb', line 923
def stop()
  @stop = true
end