Module: ScraperWiki

Defined in:
lib/scraperwiki.rb

Class Method Summary

  • ._convdata(unique_keys, scraper_data) ⇒ Object

    Internal function that checks a row of data and converts it to the right format.

  • .close_sqlite ⇒ Object

  • .save_sqlite(unique_keys, data, table_name = "swdata") ⇒ Object

    Saves the provided data into a local database for this scraper.

  • .scrape(url, params = nil, agent = nil) ⇒ Object

    The scrape method fetches content from a web server.

Class Method Details

._convdata(unique_keys, scraper_data) ⇒ Object

Internal function that checks a row of data and converts it to the right format.



# File 'lib/scraperwiki.rb', line 88

def ScraperWiki._convdata(unique_keys, scraper_data)
    if unique_keys
        for key in unique_keys
            if !key.kind_of?(String) and !key.kind_of?(Symbol)
                raise "unique_keys must each be a string or a symbol, this one is not: #{key}"
            end
            if !scraper_data.include?(key) and !scraper_data.include?(key.to_sym)
                raise "unique_keys must be a subset of data, this one is not: #{key}"
            end
            if scraper_data[key] == nil and scraper_data[key.to_sym] == nil
                raise "unique_key value should not be nil, this one is nil: #{key}"
            end
        end
    end

    jdata = { }
    scraper_data.each_pair do |key, value|
        raise 'key must not have blank name' if not key

        key = key.to_s if key.kind_of?(Symbol)
        raise "key must be string or symbol type: #{key}" if key.class != String
        raise "key must be simple text: #{key}" if !/\A[a-zA-Z0-9_\- ]+\z/.match(key)

        # convert formats
        if value.kind_of?(Date)
            value = value.iso8601
        end
        if value.kind_of?(Time)
            value = value.iso8601
            raise "internal error, timezone came out as non-UTC while converting to SQLite format" unless value.match(/([+-]00:00|Z)$/)
            value.gsub!(/([+-]00:00|Z)$/, '')
        end
        # Integer replaces Fixnum, which is deprecated and removed in Ruby 3.2
        if ![Integer, Float, String, TrueClass, FalseClass, NilClass].include?(value.class)
            value = value.to_s
        end

        jdata[key] = value
    end
    return jdata
end
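
A minimal sketch of what _convdata produces for a typical row (values are
hypothetical; this method is internal and is normally reached via save_sqlite):

require 'date'
require 'scraperwiki'

row = { :id => 1, :fetched => Date.new(2013, 5, 1), :name => 'example' }
ScraperWiki._convdata(['id'], row)
# => { "id" => 1, "fetched" => "2013-05-01", "name" => "example" }
# Symbol keys are stringified; Date values become ISO 8601 strings, and
# UTC Time values have their "+00:00"/"Z" suffix stripped for SQLite.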

.close_sqlite ⇒ Object

Closes the scraper's local SQLite database via SQLiteMagic.



# File 'lib/scraperwiki.rb', line 83

def ScraperWiki.close_sqlite()
    SQLiteMagic.close
end
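
Typically called once at the end of a scraper run, after all saves, so that
the underlying SQLite file is flushed and released. A usage sketch (column
names are illustrative):

require 'scraperwiki'

ScraperWiki.save_sqlite(['id'], { 'id' => 1, 'name' => 'first row' })
ScraperWiki.close_sqlite   # flush and close the database file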

.save_sqlite(unique_keys, data, table_name = "swdata") ⇒ Object

Saves the provided data into a local database for this scraper. Data is upserted into the given table (inserted if it does not exist, updated if the unique keys say it does).

Parameters

  • unique_keys = A list of column names that, taken together, should be unique

  • data = A hash of the data, where each key is a column name and each value
    is the row's value for that column.  If saving lots of data, this can be
    an array of such hashes.

  • table_name = The name that the newly created table should use.

Example

ScraperWiki::save_sqlite(['id'], {'id' => 1})



# File 'lib/scraperwiki.rb', line 60

def ScraperWiki.save_sqlite(unique_keys, data, table_name="swdata")
    raise 'unique_keys must be nil or an array' if unique_keys != nil && !unique_keys.kind_of?(Array)
    raise 'data must have a non-nil value' if data == nil

    # convert :symbols to "strings" (unique_keys may legitimately be nil)
    unique_keys = unique_keys.map { |x| x.kind_of?(Symbol) ? x.to_s : x } if unique_keys

    if data.class == Hash
        data = [ data ]
    elsif data.length == 0
        return
    end

    rjdata = [ ]
    for ldata in data
        ljdata = _convdata(unique_keys, ldata)
        rjdata.push(ljdata)
    end

    SQLiteMagic._do_save_sqlite(unique_keys, rjdata, table_name)
end
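
A usage sketch saving several rows at once into a named table (table and
column names are illustrative; Time values should be UTC, since _convdata
rejects other offsets):

require 'time'
require 'scraperwiki'

rows = [
  { 'id' => 1, 'name' => 'alice', 'seen_at' => Time.now.utc },
  { 'id' => 2, 'name' => 'bob',   'seen_at' => Time.now.utc }
]
# Upserts on 'id': rows with an existing id are updated, the rest inserted.
ScraperWiki.save_sqlite(['id'], rows, 'people')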

.scrape(url, params = nil, agent = nil) ⇒ Object

The scrape method fetches content from a web server.

Parameters

  • url = The URL to fetch

  • params = The parameters to send with a POST request

  • agent = A manually supplied user-agent string

Example

ScraperWiki::scrape('https://scraperwiki.com')



# File 'lib/scraperwiki.rb', line 19

def ScraperWiki.scrape(url, params = nil, agent = nil)
  if agent
    client = HTTPClient.new(:agent_name => agent)
  else
    client = HTTPClient.new
  end
  # note: this disables SSL certificate verification for all requests
  client.ssl_config.verify_mode = OpenSSL::SSL::VERIFY_NONE
  if client.respond_to?(:transparent_gzip_decompression=)
    client.transparent_gzip_decompression = true
  end

  if params.nil?
    html = client.get_content(url)
  else
    html = client.post_content(url, params)
  end

  unless client.respond_to?(:transparent_gzip_decompression=)
    begin
      # older httpclient versions do not gunzip for us, so try by hand and
      # fall back to the raw body if it was not gzipped
      gz = Zlib::GzipReader.new(StringIO.new(html))
      return gz.read
    rescue
      return html
    end
  end

  html
end
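
A usage sketch, assuming the httpclient gem that scrape wraps is installed
(URLs are illustrative):

require 'scraperwiki'

# GET: params is nil, so get_content is used
html = ScraperWiki.scrape('https://example.com/')

# POST: passing a params hash switches scrape to post_content
result = ScraperWiki.scrape('https://example.com/search', { 'q' => 'ruby' })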