Class: Upton::Scraper

Inherits: Object

Defined in:
  lib/upton.rb,
  lib/upton/scraper.rb
Overview

Upton::Scraper can be used as-is for basic use cases by:

- specifying the pages to be scraped in `new` as an index page
  or as an Array of URLs.
- supplying a block to `scrape` or `scrape_to_csv`, or using a pre-built
  block from Upton::Utils.

For more complicated cases, subclass Upton::Scraper
(e.g. +MyScraper < Upton::Scraper+) and override various methods.
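For the basic case, a scraper can be built and run in a few lines. A minimal sketch, assuming the upton gem is installed; the URL and CSS selector are illustrative placeholders that must be adapted to the target site:

```ruby
require 'upton'

# Option 1: give new an index page URL plus a selector matching the
# anchor elements that link to each instance page.
scraper = Upton::Scraper.new("http://www.example.com/articles",
                             "section.articles a.headline")

# Option 2: give new an Array of instance URLs directly.
# scraper = Upton::Scraper.new(["http://www.example.com/a",
#                               "http://www.example.com/b"])

# The block receives each instance page's HTML (and optionally its URL
# and index); its return value becomes that page's scraped result.
results = scraper.scrape do |html, url, index|
  html.length
end
```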
Constant Summary

EMPTY_STRING = ''
Instance Attribute Summary

- #debug ⇒ Object
  Returns the value of attribute debug.
- #index_debug ⇒ Object
  Returns the value of attribute index_debug.
- #paginated ⇒ Object
  Returns the value of attribute paginated.
- #pagination_interval ⇒ Object
  Returns the value of attribute pagination_interval.
- #pagination_max_pages ⇒ Object
  Returns the value of attribute pagination_max_pages.
- #pagination_param ⇒ Object
  Returns the value of attribute pagination_param.
- #pagination_start_index ⇒ Object
  Returns the value of attribute pagination_start_index.
- #readable_filenames ⇒ Object
  Returns the value of attribute readable_filenames.
- #sleep_time_between_requests ⇒ Object
  Returns the value of attribute sleep_time_between_requests.
- #stash_folder ⇒ Object
  Returns the value of attribute stash_folder.
- #url_array ⇒ Object
  Returns the value of attribute url_array.
- #verbose ⇒ Object
  Returns the value of attribute verbose.
Instance Method Summary

- #initialize(index_url_or_array, selector = "") ⇒ Scraper (constructor)
  index_url_or_array: A list of string URLs, OR the URL of the page
  containing the list of instances.
- #next_index_page_url(url, pagination_index) ⇒ Object
  Return the next URL to scrape, given the current URL and its index.
- #next_instance_page_url(url, pagination_index) ⇒ Object
  If instance pages are paginated, you must override this method to return
  the next URL, given the current URL and its index.
- #scrape(&blk) ⇒ Object
  This is the main user-facing method for a basic scraper.
- #scrape_to_csv(filename, &blk) ⇒ Object
  Writes the scraped result to a CSV at the given filename.
- #scrape_to_tsv(filename, &blk) ⇒ Object
Constructor Details

#initialize(index_url_or_array, selector = "") ⇒ Scraper

index_url_or_array
: A list of string URLs, OR the URL of the page containing the list of
  instances.

selector
: The XPath expression or CSS selector that specifies the anchor elements
  within the page, if a URL is specified for the previous argument.

These options are a shortcut. If you plan to override get_index, you do
not need to set them. If you don't specify a selector, the first argument
will be treated as a list of URLs.
# File 'lib/upton.rb', line 65

def initialize(index_url_or_array, selector="")
  # if first arg is a valid URL, do already-written stuff;
  # if it's not (or if it's a list?) don't bother with get_index, etc.
  # e.g. Scraper.new(["http://jeremybmerrill.com"])
  # TODO: rewrite this, because it's a little silly. (i.e. should be a
  # more sensical division of how these arguments work)
  if index_url_or_array.respond_to? :each_with_index
    @url_array = index_url_or_array
  else
    @index_url = index_url_or_array
    @index_selector = selector
  end

  # If true, then Upton prints information about when it gets
  # files from the internet and when it gets them from its stash.
  @verbose = false

  # If true, then Upton fetches each instance page only once;
  # future requests for that file are responded to with the locally stashed
  # version.
  # You may want to set @debug to false for production (but maybe not).
  # You can also control stashing behavior on a per-call basis with the
  # optional second argument to get_page, if, for instance, you want to
  # stash certain instance pages, e.g. based on their modification date.
  @debug = true
  # Index debug does the same, but for index pages.
  @index_debug = false

  # In order to not hammer servers, Upton waits for, by default, 30
  # seconds between requests to the remote server.
  @sleep_time_between_requests = 30 # seconds

  # If true, then Upton will attempt to scrape paginated index pages.
  @paginated = false
  # Default query string parameter used to specify the current page.
  @pagination_param = 'page'
  # Default number of paginated pages to scrape.
  @pagination_max_pages = 2
  # Default starting number for pagination (second page is this plus 1).
  @pagination_start_index = 1
  # Default value to increment page number by.
  @pagination_interval = 1

  # Folder name for stashes, if you want them to be stored somewhere else,
  # e.g. under /tmp.
  if @stash_folder
    FileUtils.mkdir_p(@stash_folder) unless Dir.exists?(@stash_folder)
  end
end
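The defaults set in the constructor can be tuned afterwards through the class's public accessors. A sketch, assuming the upton gem is installed; the URL, selector, and values are illustrative:

```ruby
require 'upton'

scraper = Upton::Scraper.new("http://www.example.com/articles", "a.article")
scraper.verbose = true                    # log fetch/stash activity
scraper.sleep_time_between_requests = 5   # shorter than the 30s default
scraper.paginated = true                  # walk paginated index pages
scraper.pagination_param = "p"            # e.g. ?p=2 instead of ?page=2
scraper.pagination_max_pages = 10         # stop after ten index pages
```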
Instance Attribute Details
#debug ⇒ Object
Returns the value of attribute debug.
# File 'lib/upton.rb', line 37

def debug
  @debug
end
#index_debug ⇒ Object
Returns the value of attribute index_debug.
# File 'lib/upton.rb', line 37

def index_debug
  @index_debug
end
#paginated ⇒ Object
Returns the value of attribute paginated.
# File 'lib/upton.rb', line 37

def paginated
  @paginated
end
#pagination_interval ⇒ Object
Returns the value of attribute pagination_interval.
# File 'lib/upton.rb', line 37

def pagination_interval
  @pagination_interval
end
#pagination_max_pages ⇒ Object
Returns the value of attribute pagination_max_pages.
# File 'lib/upton.rb', line 37

def pagination_max_pages
  @pagination_max_pages
end
#pagination_param ⇒ Object
Returns the value of attribute pagination_param.
# File 'lib/upton.rb', line 37

def pagination_param
  @pagination_param
end
#pagination_start_index ⇒ Object
Returns the value of attribute pagination_start_index.
# File 'lib/upton.rb', line 37

def pagination_start_index
  @pagination_start_index
end
#readable_filenames ⇒ Object
Returns the value of attribute readable_filenames.
# File 'lib/upton.rb', line 37

def readable_filenames
  @readable_filenames
end
#sleep_time_between_requests ⇒ Object
Returns the value of attribute sleep_time_between_requests.
# File 'lib/upton.rb', line 37

def sleep_time_between_requests
  @sleep_time_between_requests
end
#stash_folder ⇒ Object
Returns the value of attribute stash_folder.
# File 'lib/upton.rb', line 37

def stash_folder
  @stash_folder
end
#url_array ⇒ Object
Returns the value of attribute url_array.
# File 'lib/upton.rb', line 37

def url_array
  @url_array
end
#verbose ⇒ Object
Returns the value of attribute verbose.
# File 'lib/upton.rb', line 37

def verbose
  @verbose
end
Instance Method Details
#next_index_page_url(url, pagination_index) ⇒ Object
Return the next URL to scrape, given the current URL and its index.
Recursion stops if fetching the URL returns an empty string or an error.
If @paginated is not set (the default), this method returns an empty string.
If @paginated is set, this method will return the next pagination URL to scrape using @pagination_param and the pagination_index.
If the pagination_index is greater than @pagination_max_pages, then the method will return an empty string.
Override this method to handle pagination in an alternative way, e.g. next_index_page_url("whatever.com/articles?page=1", 2) ought to return "whatever.com/articles?page=2".
# File 'lib/upton.rb', line 149

def next_index_page_url(url, pagination_index)
  return url unless @paginated

  if pagination_index > @pagination_max_pages
    puts "Exceeded pagination limit of #{@pagination_max_pages}" if @verbose
    EMPTY_STRING
  else
    uri = URI.parse(url)
    query = uri.query ? Hash[URI.decode_www_form(uri.query)] : {}
    # update the pagination query string parameter
    query[@pagination_param] = pagination_index
    uri.query = URI.encode_www_form(query)
    puts "Next index pagination url is #{uri}" if @verbose
    uri.to_s
  end
end
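As an example of handling pagination in an alternative way, an override for path-based pagination (page numbers in the URL path rather than in a query parameter) might look like the following. It is written as a bare method here so it can be read on its own; the /page/N URL scheme and the hard-coded limit of 3 (standing in for @pagination_max_pages) are assumptions:

```ruby
# Hypothetical override for sites that paginate as /articles/page/2
# instead of /articles?page=2.
def next_index_page_url(url, pagination_index)
  # Stop recursing past an assumed maximum page count.
  return "" if pagination_index > 3
  # Swap the trailing page number for the requested index.
  url.sub(%r{/page/\d+\z}, "/page/#{pagination_index}")
end
```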
#next_instance_page_url(url, pagination_index) ⇒ Object
If instance pages are paginated, you must override this method to return the next URL, given the current URL and its index.
If instance pages aren’t paginated, there’s no need to override this.
Recursion stops if fetching the URL returns an empty string or an error.
e.g. next_instance_page_url("whatever.com/article/upton-sinclairs-the-jungle?page=1", 2) ought to return "whatever.com/article/upton-sinclairs-the-jungle?page=2".
# File 'lib/upton.rb', line 127

def next_instance_page_url(url, pagination_index)
  EMPTY_STRING
end
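A sketch of what such an override might compute, written as a bare method so it is readable on its own. In real use it would be an instance method on your Upton::Scraper subclass and would return EMPTY_STRING once the last page is reached (e.g. by inspecting the page contents); the "page" parameter name is an assumption:

```ruby
require 'uri'

# Hypothetical override: set (or add) a "page" query parameter on the
# current instance URL to point at the requested page.
def next_instance_page_url(url, pagination_index)
  uri = URI.parse(url)
  query = uri.query ? Hash[URI.decode_www_form(uri.query)] : {}
  query["page"] = pagination_index.to_s
  uri.query = URI.encode_www_form(query)
  uri.to_s
end
```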
#scrape(&blk) ⇒ Object
This is the main user-facing method for a basic scraper. Call scrape with a block; this block will be called on the text of each instance page (and, optionally, its URL and its index in the list of instance URLs returned by get_index).
# File 'lib/upton.rb', line 47

def scrape(&blk)
  self.url_array = self.get_index unless self.url_array
  blk = Proc.new{|x| x} if blk.nil?
  self.scrape_from_list(self.url_array, blk)
end
#scrape_to_csv(filename, &blk) ⇒ Object
Writes the scraped result to a CSV at the given filename.
# File 'lib/upton.rb', line 169

def scrape_to_csv filename, &blk
  require 'csv'
  self.url_array = self.get_index unless self.url_array
  CSV.open filename, 'wb' do |csv|
    # this is a conscious choice: each document is a list of things,
    # either single elements or rows (as lists).
    self.scrape_from_list(self.url_array, blk).compact.each do |document|
      if document[0].respond_to? :map
        document.each{|row| csv << row }
      else
        csv << document
      end
    end
    # self.scrape_from_list(self.url_array, blk).compact.each{|document| csv << document }
  end
end
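The branch on document[0].respond_to? :map is what lets one scraped document produce either a single CSV row or several. The stdlib-only sketch below reproduces that logic with hard-coded stand-ins for scraped results (the values are illustrative):

```ruby
require 'csv'
require 'tempfile'

# Two shapes a block's return value can take: a list of rows (arrays),
# or a flat list of fields for a single row.
documents = [
  [["headline A", "2013-01-01"], ["headline B", "2013-02-01"]], # -> two rows
  ["lone headline", "2013-03-01"]                               # -> one row
]

file = Tempfile.new(["scrape", ".csv"])
CSV.open(file.path, "wb") do |csv|
  documents.compact.each do |document|
    if document[0].respond_to? :map
      document.each { |row| csv << row }   # each inner array is a row
    else
      csv << document                      # the whole list is one row
    end
  end
end
```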
#scrape_to_tsv(filename, &blk) ⇒ Object
# File 'lib/upton.rb', line 185

def scrape_to_tsv filename, &blk
  require 'csv'
  self.url_array = self.get_index unless self.url_array
  CSV.open filename, 'wb', :col_sep => "\t" do |csv|
    # this is a conscious choice: each document is a list of things,
    # either single elements or rows (as lists).
    self.scrape_from_list(self.url_array, blk).compact.each do |document|
      if document[0].respond_to? :map
        document.each{|row| csv << row }
      else
        csv << document
      end
    end
    # self.scrape_from_list(self.url_array, blk).compact.each{|document| csv << document }
  end
end