Class: Upton::Scraper

Inherits: Object

Defined in: lib/upton.rb, lib/upton/scraper.rb

Overview

For basic cases, use Upton::Scraper as-is: specify the pages to be scraped in `new`, either as an index page or as an Array of URLs, then supply a block to `scrape` or `scrape_to_csv`, or use a pre-built block from Upton::Utils.

For more complicated cases, subclass Upton::Scraper (e.g. `MyScraper < Upton::Scraper`) and override various methods.

Constant Summary

EMPTY_STRING = ''

Instance Attribute Summary

#debug ⇒ Object

Returns the value of attribute debug.
#index_debug ⇒ Object

Returns the value of attribute index_debug.
#paginated ⇒ Object

Returns the value of attribute paginated.
#pagination_interval ⇒ Object

Returns the value of attribute pagination_interval.
#pagination_max_pages ⇒ Object

Returns the value of attribute pagination_max_pages.
#pagination_param ⇒ Object

Returns the value of attribute pagination_param.
#pagination_start_index ⇒ Object

Returns the value of attribute pagination_start_index.
#readable_filenames ⇒ Object

Returns the value of attribute readable_filenames.
#sleep_time_between_requests ⇒ Object

Returns the value of attribute sleep_time_between_requests.
#stash_folder ⇒ Object

Returns the value of attribute stash_folder.
#url_array ⇒ Object

Returns the value of attribute url_array.
#verbose ⇒ Object

Returns the value of attribute verbose.

Instance Method Summary

#initialize(index_url_or_array, selector = "") ⇒ Scraper

index_url_or_array: A list of string URLs, OR the URL of the page containing the list of instances.
#next_index_page_url(url, pagination_index) ⇒ Object

Return the next URL to scrape, given the current URL and its index.
#next_instance_page_url(url, pagination_index) ⇒ Object

If instance pages are paginated, you must override this method to return the next URL, given the current URL and its index.
#scrape(&blk) ⇒ Object

This is the main user-facing method for a basic scraper.
#scrape_to_csv(filename, &blk) ⇒ Object

Writes the scraped result to a CSV at the given filename.
#scrape_to_tsv(filename, &blk) ⇒ Object

Writes the scraped result to a tab-separated (TSV) file at the given filename.

Constructor Details

#initialize(index_url_or_array, selector = "") ⇒ `Scraper`

index_url_or_array: A list of string URLs, OR the URL of the page containing the list of instances.

selector: The XPath expression or CSS selector that specifies the anchor elements within the page, if a URL is specified for the previous argument.

These options are a shortcut. If you plan to override get_index, you do not need to set them. If you don’t specify a selector, the first argument will be treated as a list of URLs.

# File 'lib/upton.rb', line 65

def initialize(index_url_or_array, selector="")

  # If the first arg is a valid URL, do the already-written stuff;
  # if it's not (or if it's a list?), don't bother with get_index, etc.
  # e.g. Scraper.new(["http://jeremybmerrill.com"])

  # TODO: rewrite this, because it's a little silly (i.e. there should be a more sensible division of how these arguments work).
  if index_url_or_array.respond_to? :each_with_index
    @url_array = index_url_or_array
  else
    @index_url = index_url_or_array
    @index_selector = selector
  end

  # If true, then Upton prints information about when it gets
  # files from the internet and when it gets them from its stash.
  @verbose = false

  # If true, then Upton fetches each instance page only once;
  # future requests for that file are served from the locally stashed
  # version.
  # You may want to set @debug to false for production (but maybe not).
  # You can also control stashing behavior on a per-call basis with the
  # optional second argument to get_page, if, for instance, you want to
  # stash certain instance pages, e.g. based on their modification date.
  @debug = true
  # Index debug does the same, but for index pages.
  @index_debug = false

  # In order to not hammer servers, Upton waits for, by default, 30
  # seconds between requests to the remote server.
  @sleep_time_between_requests = 30 #seconds

  # If true, then Upton will attempt to scrape paginated index pages
  @paginated = false
  # Default query string parameter used to specify the current page
  @pagination_param = 'page'
  # Default number of paginated pages to scrape
  @pagination_max_pages = 2
  # Default starting number for pagination (second page is this plus 1).
  @pagination_start_index = 1
  # Default value to increment page number by
  @pagination_interval = 1
 
  # Folder name for stashes, if you want them to be stored somewhere else,
  # e.g. under /tmp.
  if @stash_folder
    FileUtils.mkdir_p(@stash_folder) unless Dir.exist?(@stash_folder)
  end
end
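
The constructor decides between the two input forms by duck typing rather than by checking the selector. The following is a minimal sketch of that check in isolation; `classify_source` is a hypothetical helper written for illustration and is not part of Upton.

```ruby
# Hypothetical helper (not part of Upton) that mirrors the constructor's
# duck-typing check: anything responding to :each_with_index is treated
# as a ready-made list of instance URLs; anything else is treated as the
# URL of an index page, paired with the selector.
def classify_source(index_url_or_array, selector = "")
  if index_url_or_array.respond_to?(:each_with_index)
    { url_array: index_url_or_array }
  else
    { index_url: index_url_or_array, index_selector: selector }
  end
end

# An Array is enumerable, so it is taken as the list of URLs.
list_form = classify_source(["http://jeremybmerrill.com"])

# A String is not enumerable, so it is taken as an index page URL.
# The URL and selector here are made-up illustration values.
index_form = classify_source("http://example.com/articles", "a.article-link")
```

Because the check looks only at the first argument, a selector passed alongside an Array would simply go unused in this sketch.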

Instance Attribute Details

#debug ⇒ `Object`

Returns the value of attribute debug.



# File 'lib/upton.rb', line 37

def debug
  @debug
end

#index_debug ⇒ `Object`

Returns the value of attribute index_debug.



# File 'lib/upton.rb', line 37

def index_debug
  @index_debug
end

#paginated ⇒ `Object`

Returns the value of attribute paginated.



# File 'lib/upton.rb', line 37

def paginated
  @paginated
end

#pagination_interval ⇒ `Object`

Returns the value of attribute pagination_interval.



# File 'lib/upton.rb', line 37

def pagination_interval
  @pagination_interval
end

#pagination_max_pages ⇒ `Object`

Returns the value of attribute pagination_max_pages.



# File 'lib/upton.rb', line 37

def pagination_max_pages
  @pagination_max_pages
end

#pagination_param ⇒ `Object`

Returns the value of attribute pagination_param.



# File 'lib/upton.rb', line 37

def pagination_param
  @pagination_param
end

#pagination_start_index ⇒ `Object`

Returns the value of attribute pagination_start_index.



# File 'lib/upton.rb', line 37

def pagination_start_index
  @pagination_start_index
end

#readable_filenames ⇒ `Object`

Returns the value of attribute readable_filenames.



# File 'lib/upton.rb', line 37

def readable_filenames
  @readable_filenames
end

#sleep_time_between_requests ⇒ `Object`

Returns the value of attribute sleep_time_between_requests.



# File 'lib/upton.rb', line 37

def sleep_time_between_requests
  @sleep_time_between_requests
end

#stash_folder ⇒ `Object`

Returns the value of attribute stash_folder.



# File 'lib/upton.rb', line 37

def stash_folder
  @stash_folder
end

#url_array ⇒ `Object`

Returns the value of attribute url_array.



# File 'lib/upton.rb', line 37

def url_array
  @url_array
end

#verbose ⇒ `Object`

Returns the value of attribute verbose.



# File 'lib/upton.rb', line 37

def verbose
  @verbose
end

Instance Method Details

#next_index_page_url(url, pagination_index) ⇒ `Object`

Return the next URL to scrape, given the current URL and its index.

Recursion stops if fetching the URL returns an empty string or an error.

If @paginated is not set (the default), this method returns the URL it was given, unchanged.

If @paginated is set, this method will return the next pagination URL to scrape using @pagination_param and the pagination_index.

If the pagination_index is greater than @pagination_max_pages, then the method will return an empty string.

Override this method to handle pagination in an alternative way, e.g. next_index_page_url("whatever.com/articles?page=1", 2) ought to return "whatever.com/articles?page=2".

# File 'lib/upton.rb', line 149

def next_index_page_url(url, pagination_index)
  return url unless @paginated

  if pagination_index > @pagination_max_pages
    puts "Exceeded pagination limit of #{@pagination_max_pages}" if @verbose
    EMPTY_STRING
  else
    uri = URI.parse(url)
    query = uri.query ? Hash[URI.decode_www_form(uri.query)] : {}
    # update the pagination query string parameter
    query[@pagination_param] = pagination_index
    uri.query = URI.encode_www_form(query)
    puts "Next index pagination url is #{uri}" if @verbose
    uri.to_s
  end
end
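
The query-string rewrite in the `else` branch above can be tried on its own, without a scraper instance. This is a sketch only: `paginated_url` and its keyword arguments are hypothetical names standing in for `@pagination_param` and `@pagination_max_pages`, and the verbose logging is left out.

```ruby
require 'uri'

# Hypothetical standalone version of the pagination step (not Upton's API).
# Returns '' once pagination_index exceeds max_pages; otherwise sets the
# pagination parameter to pagination_index and leaves other parameters alone.
def paginated_url(url, pagination_index, param: 'page', max_pages: 2)
  return '' if pagination_index > max_pages

  uri = URI.parse(url)
  # Decode the existing query string (if any) into a Hash of parameters.
  query = uri.query ? Hash[URI.decode_www_form(uri.query)] : {}
  # Overwrite (or add) the pagination parameter.
  query[param] = pagination_index
  uri.query = URI.encode_www_form(query)
  uri.to_s
end

second_page = paginated_url("http://whatever.com/articles?page=1", 2)
added_param = paginated_url("http://whatever.com/articles", 2)
past_limit  = paginated_url("http://whatever.com/articles?page=2", 3)
```

Here `second_page` and `added_param` both come out as `http://whatever.com/articles?page=2`, and `past_limit` is the empty string because index 3 is beyond the limit of 2.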

#next_instance_page_url(url, pagination_index) ⇒ `Object`

If instance pages are paginated, you must override this method to return the next URL, given the current URL and its index.

If instance pages aren’t paginated, there’s no need to override this.

Recursion stops if fetching the URL returns an empty string or an error.

e.g. next_instance_page_url("whatever.com/article/upton-sinclairs-the-jungle?page=1", 2) ought to return "whatever.com/article/upton-sinclairs-the-jungle?page=2".



# File 'lib/upton.rb', line 127

def next_instance_page_url(url, pagination_index)
  EMPTY_STRING
end
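
An override for paginated instance pages might look like the sketch below. The class name `ArticleScraper`, the `MAX_INSTANCE_PAGES` cap, and the assumption that instance URLs already carry a `page` query parameter are all illustrative and not from Upton. The small stand-in base class at the top exists only so the sketch runs without the gem installed; it is skipped when Upton is already loaded.

```ruby
# Stand-in for the real base class, used only if the upton gem is not loaded.
unless defined?(Upton::Scraper)
  module Upton
    class Scraper
      EMPTY_STRING = ''

      def next_instance_page_url(url, pagination_index)
        EMPTY_STRING
      end
    end
  end
end

# Hypothetical subclass: follows ?page=N on instance pages up to a fixed cap.
class ArticleScraper < Upton::Scraper
  MAX_INSTANCE_PAGES = 3 # illustrative limit, not an Upton setting

  def next_instance_page_url(url, pagination_index)
    # Returning the empty string ends the pagination.
    return EMPTY_STRING if pagination_index > MAX_INSTANCE_PAGES
    return EMPTY_STRING unless url =~ /[?&]page=\d+/

    # Swap the existing page number for the requested index.
    url.sub(/([?&]page=)\d+/) { "#{$1}#{pagination_index}" }
  end
end
```

Returning `EMPTY_STRING` for URLs without a `page` parameter is a deliberate choice in this sketch: it avoids handing back the same URL repeatedly.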

#scrape(&blk) ⇒ `Object`

This is the main user-facing method for a basic scraper. Call scrape with a block; this block will be called on the text of each instance page (and, optionally, its URL and its index in the list of instance URLs returned by get_index).

# File 'lib/upton.rb', line 47

def scrape(&blk)
  self.url_array = self.get_index unless self.url_array
  blk = Proc.new{|x| x} if blk.nil?
  self.scrape_from_list(self.url_array, blk)
end
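
A block suitable for `scrape` can be written and exercised on its own before any pages are fetched. The sketch below assumes the argument order described above (page text, then URL, then index); `extract_title` and the sample HTML are hypothetical, and a regular expression stands in for a real HTML parser to keep the example dependency-free.

```ruby
# Hypothetical block: pulls the <title> out of an instance page and
# returns a row of [index, url, title]. A non-lambda proc is used so that
# arguments the caller does not supply are simply nil.
extract_title = proc do |html, url, index|
  title = html[/<title>(.*?)<\/title>/im, 1]
  [index, url, title && title.strip]
end

# Exercising the block directly on sample input (no network involved).
sample_html = "<html><head><title> The Jungle </title></head><body></body></html>"
row = extract_title.call(sample_html, "http://example.com/the-jungle", 0)
```

With a scraper instance in hand (called `scraper` here for illustration), the same proc would be passed as `scraper.scrape(&extract_title)`.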

#scrape_to_csv(filename, &blk) ⇒ `Object`

Writes the scraped result to a CSV at the given filename.

# File 'lib/upton.rb', line 169

def scrape_to_csv filename, &blk
  require 'csv'
  self.url_array = self.get_index unless self.url_array
  CSV.open filename, 'wb' do |csv|
    #this is a conscious choice: each document is a list of things, either single elements or rows (as lists).
    self.scrape_from_list(self.url_array, blk).compact.each do |document|
      if document[0].respond_to? :map
        document.each{|row| csv << row }
      else
        csv << document
      end
    end
    #self.scrape_from_list(self.url_array, blk).compact.each{|document| csv << document }
  end
end
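
The row-handling rule in the listing above (a document is either one flat row or a list of rows) can be seen in isolation with `CSV.generate`, which writes to a string instead of a file. `documents_to_delimited` is a hypothetical helper, and the sample documents stand in for what the scrape block would return.

```ruby
require 'csv'

# Hypothetical helper (not part of Upton) applying the same rule as the
# listing: nil documents are dropped, a document whose first element is
# itself a list is written as several rows, anything else as one row.
def documents_to_delimited(documents, col_sep: ',')
  CSV.generate(col_sep: col_sep) do |csv|
    documents.compact.each do |document|
      if document[0].respond_to?(:map)
        document.each { |row| csv << row }
      else
        csv << document
      end
    end
  end
end

documents = [
  ["a", "b"],               # one flat row
  [["c", "d"], ["e", "f"]], # two rows from a single page
  nil                       # a page that yielded nothing
]

csv_text = documents_to_delimited(documents)
tsv_text = documents_to_delimited(documents, col_sep: "\t")
```

Passing a tab as `col_sep` gives the tab-separated variant of the same output.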

#scrape_to_tsv(filename, &blk) ⇒ `Object`

Writes the scraped result to a tab-separated (TSV) file at the given filename.

# File 'lib/upton.rb', line 185

def scrape_to_tsv filename, &blk
  require 'csv'
  self.url_array = self.get_index unless self.url_array
  CSV.open filename, 'wb', :col_sep => "\t" do |csv|
    #this is a conscious choice: each document is a list of things, either single elements or rows (as lists).
    self.scrape_from_list(self.url_array, blk).compact.each do |document|
      if document[0].respond_to? :map
        document.each{|row| csv << row }
      else
        csv << document
      end
    end
    #self.scrape_from_list(self.url_array, blk).compact.each{|document| csv << document }
  end
end
