Class: Ninja2k::Scraper

Inherits:
Object
Defined in:
lib/ninja2k/scraper.rb

Overview

Scraper loads the specified resource and searches the page using a combination of your selector and any clues given. It provides a hooking mechanism so you can override the default parsing action (splitting the matched content into one row for each item found).

Examples:

clues = ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
  'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Chipset', 'Wireless',
  'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
  'Warranty', 'Software included', 'Product color']

url =  "http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1"
selector =  "//td[text()='%s']/following-sibling::td"

scraper = Ninja2k::Scraper.new(url, selector, :clues => clues)
scraper.to_xlsx('my_spreadsheet.xlsx')
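
The %s in the selector is a placeholder for each clue. The library's own interpolation code is not shown on this page, so the following is only a minimal sketch that assumes format-style substitution; xpath_for is a hypothetical helper, not part of Ninja2k.

```ruby
# Hypothetical helper (not part of Ninja2k): fills the %s placeholder in
# the selector with one clue. Assumes format-style substitution.
def xpath_for(selector, clue)
  format(selector, clue)
end

selector = "//td[text()='%s']/following-sibling::td"
clues = ['Weight', 'Warranty']

# Print the XPath expression that would be evaluated for each clue.
clues.each { |clue| puts xpath_for(selector, clue) }
```

Under this assumption, a clue that contains a single quote would produce an invalid XPath string literal, so such clues would need a differently quoted selector.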

Constructor Details

#initialize(url, selector, options = {}) ⇒ Scraper

Creates a new Scraper

Parameters:

  • url (String)

    The resource to scrape

  • selector (String)

    The XPath selector to use when searching for clues. Use %s in the selector to interpolate each clue

  • options (Hash) (defaults to: {})

    Each option key is checked against an attr_writer using respond_to?. If a writer exists, the value for the option is passed to the writer.

  • :clues (Array)

    the clues to search for; passed to #clues=

  • :hooks (Hash)

    the hooks to call while parsing; passed to #hooks=



# File 'lib/ninja2k/scraper.rb', line 35

def initialize(url, selector, options={})
  self.url = url
  self.selector = selector
  options.each do |o|
    self.send("#{o[0]}=", o[1]) if self.respond_to? "#{o[0]}="
  end
end
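
The option handling above can be tried in isolation. This is a sketch using a stand-in class (OptionDemo is hypothetical, not the real Scraper) that mirrors the same respond_to? check, to show that options without a matching writer are silently skipped.

```ruby
# Stand-in class (not the real Scraper) that mirrors the constructor's
# option handling: each key is tried as a "key=" writer.
class OptionDemo
  attr_accessor :clues

  def initialize(options = {})
    options.each do |name, value|
      writer = "#{name}="
      # Only call the writer when the object actually defines it.
      send(writer, value) if respond_to?(writer)
    end
  end
end

demo = OptionDemo.new(:clues => ['Weight'], :unknown => 42)
p demo.clues                     # the :clues option reached its writer
p demo.respond_to?('unknown=')   # no writer exists, so :unknown was skipped
```

One consequence of this design is that a misspelled option key raises no error; it is simply ignored.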

Instance Attribute Details

#selector ⇒ String

The xpath selector to use when searching for clues

Returns:

  • (String)


# File 'lib/ninja2k/scraper.rb', line 49

def selector
  @selector
end

#url ⇒ String

The url we will scrape from

Returns:

  • (String)


# File 'lib/ninja2k/scraper.rb', line 45

def url
  @url
end

Instance Method Details

#add_hook(clue, p_roc) ⇒ Object

Adds a hook to the hook hash

Parameters:

  • clue (String)

    the clue this hook will be called for

  • p_roc (Proc)

    the Proc to call when the clue is found



# File 'lib/ninja2k/scraper.rb', line 80

def add_hook(clue, p_roc)
  hooks[clue] = p_roc
end
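
The method that actually invokes hooks is not shown on this page, so the dispatch below is an assumption. It is a sketch of the idea only: a Proc stored under a clue receives the element found for that clue, and clues without a hook fall back to a default action. parse_with_hooks is hypothetical, and a plain string stands in for whatever element type the library passes.

```ruby
# Stand-in for the scraper's hooks hash.
hooks = {}

# Same shape as Scraper#add_hook: store a Proc under the clue it handles.
hooks['Weight'] = Proc.new { |element| [element.strip.upcase] }

# Hypothetical dispatcher: use the hook for this clue if one exists,
# otherwise fall back to a default action (here, just wrap the value).
def parse_with_hooks(hooks, clue, element)
  hook = hooks[clue]
  hook ? hook.call(element) : [element]
end

p parse_with_hooks(hooks, 'Weight', ' 2.5 kg ')   # handled by the hook
p parse_with_hooks(hooks, 'Warranty', '1 year')   # default action
```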

#clues ⇒ Array

The clues we are going to look for with the selector in the document returned by url

Returns:

  • (Array)


# File 'lib/ninja2k/scraper.rb', line 114

def clues
  @clues ||= []
end

#clues=(value) ⇒ Object

Sets the clues for the scraper

Parameters:

  • value (Array)

    The clues to look for.

Raises:

  • (ArgumentError)


# File 'lib/ninja2k/scraper.rb', line 121

def clues=(value)
  raise ArgumentError, 'clues must be an array of strings to search for with your selector' unless value.is_a?(Array)
  @clues = value
end
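
The type check in the writer can be exercised without the rest of the scraper. This is a sketch with a stand-in class (ClueHolder is hypothetical), assuming the intended exception is Ruby's standard ArgumentError. It shows that a rejected assignment leaves the previous clues untouched.

```ruby
# Stand-in class (not the real Scraper) with the same reader/writer pair.
class ClueHolder
  def clues
    @clues ||= []
  end

  def clues=(value)
    raise ArgumentError, 'clues must be an array of strings to search for with your selector' unless value.is_a?(Array)
    @clues = value
  end
end

holder = ClueHolder.new
holder.clues = ['Weight', 'Warranty']

begin
  holder.clues = 'Weight'   # a bare String is rejected
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end

p holder.clues   # still the array assigned earlier
```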

#hooks ⇒ Hash

A hash of Proc objects to call when parsing each item found by the selector and clue combination. The element found is passed to the member of this hash that uses the clue as its key.

Returns:

  • (Hash)

See Also:

  • example/example.rb


# File 'lib/ninja2k/scraper.rb', line 65

def hooks
  @hooks ||= {}
end

#hooks=(hash) ⇒ Object

Raises:

  • (ArgumentError)


# File 'lib/ninja2k/scraper.rb', line 70

def hooks=(hash)
  raise ArgumentError, 'Hooks must be a hash of procs to call when scraping each clue' unless hash.is_a?(Hash)
  @hooks = hash
end

#output ⇒ Array

The output from scraping, as an array. This is populated by the scrape or to_xlsx methods.

Returns:

  • (Array)


# File 'lib/ninja2k/scraper.rb', line 55

def output
  @output ||= []
end

#package ⇒ Axlsx::Package

The axlsx package used for xlsx serialization

Returns:

  • (Axlsx::Package)


# File 'lib/ninja2k/scraper.rb', line 129

def package
  @package ||= Axlsx::Package.new
end

#scrape ⇒ Array

Scrapes the resource using the clues and hooks provided

Returns:

  • (Array)


# File 'lib/ninja2k/scraper.rb', line 87

def scrape
  @package = nil
  @output = []
  clues.each do |clue|
    if detail = parse_clue(clue)
      output << [clue, detail.pop]
      detail.each { |datum| output << ['', datum] }
    end
  end
  output
end
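
The way rows are built may be easier to see with canned data. In this sketch, rows_for is a hypothetical stand-in for the loop above, and the parsed hash replaces parse_clue, which is not shown on this page. It is assumed here to return an array of strings, or nil when nothing matches, and the sample values are made up.

```ruby
# Mirrors the row-building loop in #scrape, with a lookup table standing
# in for parse_clue (assumption: array of strings, or nil for no match).
def rows_for(clues, parsed)
  output = []
  clues.each do |clue|
    detail = parsed[clue]
    next unless detail
    detail = detail.dup            # pop mutates, so work on a copy
    output << [clue, detail.pop]   # the last item shares a row with the clue
    detail.each { |datum| output << ['', datum] }
  end
  output
end

# Made-up sample data for illustration only.
parsed = {
  'Ports'  => ['2 x USB 2.0', '1 x VGA', '1 x HDMI'],
  'Weight' => ['2.5 kg']
}

rows_for(['Ports', 'Weight', 'Webcam'], parsed).each { |row| p row }
```

Because pop removes the last element, the final item found for a clue is the one that sits beside the clue label; the remaining items follow on rows with an empty first column. Clues with no match ('Webcam' here) produce no rows.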

#to_xlsx(filename = false) ⇒ Axlsx::Package

Serializes the output to xlsx. If you do not specify the filename parameter, the package will be created but not serialized to disk. This means you can use the return value to stream the data using to_xlsx(false).to_stream.read.

Parameters:

  • filename (String) (defaults to: false)

    the filename to use in output

Returns:

  • (Axlsx::Package)


# File 'lib/ninja2k/scraper.rb', line 106

def to_xlsx(filename=false)
  scrape
  serialize(filename)
end