Class: ScraperWiki::API

Inherits:
Object
Includes:
HTTParty
Defined in:
lib/scraperwiki-api.rb,
lib/scraperwiki-api/version.rb,
lib/scraperwiki-api/matchers.rb

Overview

A Ruby wrapper for the ScraperWiki API.

Defined Under Namespace

Modules: Matchers

Constant Summary

RUN_INTERVALS =
{
  never: -1,
  monthly: 2678400,
  weekly: 604800,
  daily: 86400,
  hourly: 3600,
}
VERSION =
"0.0.6"
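The `RUN_INTERVALS` constant maps scheduling names to the number of seconds between runs (`-1` meaning never). A standalone look-up, reproducing the constant from the source above:

```ruby
# Copied from the constant above: run frequencies in seconds (-1 = never).
RUN_INTERVALS = {
  never: -1,
  monthly: 2678400,
  weekly: 604800,
  daily: 86400,
  hourly: 3600,
}

# How many seconds separate weekly runs?
puts RUN_INTERVALS[:weekly]  # 604800
```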

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(apikey = nil) ⇒ API

Initializes a ScraperWiki API object.



# File 'lib/scraperwiki-api.rb', line 37

def initialize(apikey = nil)
  @apikey = apikey
end
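The constructor simply stores the key for later authenticated requests. A minimal stand-in sketch of that behavior (`APISketch` is hypothetical, not part of the gem):

```ruby
# A stand-in class (not the gem itself) mirroring the constructor shown
# above: it just stores the API key.
class APISketch
  def initialize(apikey = nil)
    @apikey = apikey
  end
  attr_reader :apikey
end

api = APISketch.new('my-key')
puts api.apikey  # my-key
```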

Class Method Details

.edit_scraper_url(shortname) ⇒ String

Returns the URL to edit the scraper.

Parameters:

  • shortname (String)

    the scraper’s shortname

Returns:

  • (String)

    the URL to edit the scraper



# File 'lib/scraperwiki-api.rb', line 31

def edit_scraper_url(shortname)
  "https://scraperwiki.com/scrapers/#{shortname}/edit/"
end

.scraper_url(shortname) ⇒ String

Returns the URL to the scraper’s overview.

Parameters:

  • shortname (String)

    the scraper’s shortname

Returns:

  • (String)

    the URL to the scraper’s overview



# File 'lib/scraperwiki-api.rb', line 23

def scraper_url(shortname)
  "https://scraperwiki.com/scrapers/#{shortname}/"
end
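Both class methods are plain string interpolations over the shortname. A standalone sketch of their behavior (top-level stand-in methods, not the gem's class methods):

```ruby
# Stand-in re-implementations of the two URL helpers shown above.
def scraper_url(shortname)
  "https://scraperwiki.com/scrapers/#{shortname}/"
end

def edit_scraper_url(shortname)
  "https://scraperwiki.com/scrapers/#{shortname}/edit/"
end

puts scraper_url('example-scraper')
# https://scraperwiki.com/scrapers/example-scraper/
puts edit_scraper_url('example-scraper')
# https://scraperwiki.com/scrapers/example-scraper/edit/
```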

Instance Method Details

#datastore_sqlite(shortname, query, opts = {}) ⇒ Array, ...

Note:

The query string parameter is name, not shortname as in the ScraperWiki docs.

Queries and extracts data via a general purpose SQL interface.

To make an RSS feed, use SQL’s AS keyword (e.g. “SELECT name AS description”) to create columns named title, link, description, guid (optional; link is used if absent) and pubDate or date.

jsondict example output:

[
  {
    "fieldA": "valueA",
    "fieldB": "valueB",
    "fieldC": "valueC"
  },
  ...
]

jsonlist example output:

{
  "keys": ["fieldA", "fieldB", "fieldC"],
  "data": [
    ["valueA", "valueB", "valueC"],
    ...
  ]
}

csv example output:

fieldA,fieldB,fieldC
valueA,valueB,valueC
...

Parameters:

  • shortname (String)

    the scraper’s shortname (as it appears in the URL)

  • query (String)

    a SQL query

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :format (String)

    one of “jsondict”, “jsonlist”, “csv”, “htmltable” or “rss2”

  • :attach (Array, String)

    “;”-delimited list of shortnames of other scrapers whose data you need to access

Returns:

  • (Array, Hash, String)

See Also:



# File 'lib/scraperwiki-api.rb', line 86

def datastore_sqlite(shortname, query, opts = {})
  if Array === opts[:attach]
    opts[:attach] = opts[:attach].join ';'
  end
  request_with_apikey '/datastore/sqlite', {name: shortname, query: query}.merge(opts)
end
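The only preprocessing the method performs is joining an Array `:attach` option with semicolons before issuing the request; a String passes through untouched. A sketch of that normalization (`normalize_attach` is a hypothetical helper mirroring the source):

```ruby
# Mirrors the :attach handling in #datastore_sqlite: an Array of
# shortnames becomes a ";"-delimited string; a String passes through.
def normalize_attach(opts)
  opts[:attach] = opts[:attach].join(';') if Array === opts[:attach]
  opts
end

opts = normalize_attach(attach: ['other-scraper', 'third-scraper'])
puts opts[:attach]  # other-scraper;third-scraper
```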

#scraper_getinfo(shortname, opts = {}) ⇒ Array

Note:

Returns an array, although it seems to always contain only one item.

Note:

The tags field seems to always be an empty array

Note:

Fields like last_run seem to follow British Summer Time.

Note:

The query string parameter is name, not shortname as in the ScraperWiki docs.

Extracts data about a scraper’s code, owner, history, etc.

  • runid is a Unix timestamp (with microseconds) followed by a UUID, joined by an underscore.

  • The value of records is the same as that of total_rows under datasummary.

  • run_interval is the number of seconds between runs. It is one of:

    • -1 (never)

    • 2678400 (monthly)

    • 604800 (weekly)

    • 86400 (daily)

    • 3600 (hourly)

  • privacy_status is one of:

    • “public” (everyone can see and edit the scraper and its data)

    • “visible” (everyone can see the scraper, but only contributors can edit it)

    • “private” (only contributors can see and edit the scraper and its data)

  • An individual runevents hash will have an exception_message key if there was an error during that run.

Example output:

[
  {
    "code": "require 'nokogiri'\n...",
    "datasummary": {
      "tables": {
        "swdata": {
          "keys": [
            "fieldA",
            ...
          ],
          "count": 42,
          "sql": "CREATE TABLE `swdata` (...)"
        },
        "swvariables": {
          "keys": [
            "value_blob",
            "type",
            "name"
          ],
          "count": 2,
          "sql": "CREATE TABLE `swvariables` (`value_blob` blob, `type` text, `name` text)"
        },
        ...
      },
      "total_rows": 44,
      "filesize": 1000000
    },
    "description": "Scrapes websites for data.",
    "language": "ruby",
    "title": "Example scraper",
    "tags": [],
    "short_name": "example-scraper",
    "userroles": {
      "owner": [
        "johndoe"
      ],
      "editor": [
        "janedoe",
        ...
      ]
    },
    "last_run": "1970-01-01T00:00:00",
    "created": "1970-01-01T00:00:00",
    "runevents": [
      {
        "still_running": false,
        "pages_scraped": 5,
        "run_started": "1970-01-01T00:00:00",
        "last_update": "1970-01-01T00:00:00",
        "runid": "1325394000.000000_xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx",
        "records_produced": 42
      },
      ...
    ],
    "records": 44,
    "wiki_type": "scraper",
    "privacy_status": "visible",
    "run_interval": 604800,
    "attachable_here": [],
    "attachables": [],
    "history": [
      ...,
      {
        "date": "1970-01-01T00:00:00",
        "version": 0,
        "user": "johndoe",
        "session": "Thu, 1 Jan 1970 00:00:08 GMT"
      }
    ]
  }
]
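Since runid combines a Unix timestamp (with microseconds) and a UUID joined by an underscore, the two parts can be split apart. An illustrative parse of the example runid above (this parsing is not provided by the gem):

```ruby
# Split a runid of the form "<unix_time.microseconds>_<uuid>" into its
# parts. Illustrative only; the gem does not provide this helper.
runid = '1325394000.000000_xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'
timestamp, uuid = runid.split('_', 2)

puts Time.at(timestamp.to_f).utc  # 2012-01-01 05:00:00 UTC
puts uuid                         # xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
```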

Parameters:

  • shortname (String)

    the scraper’s shortname (as it appears in the URL)

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :version (String)

    version number (-1 for most recent) [default -1]

  • :history_start_date (String)

    restrict history and runevents to this date or later (format: YYYY-MM-DD)

  • :quietfields (Array, String)

    “|”-delimited list of fields to exclude from the output. Must be a subset of ‘code|runevents|datasummary|userroles|history’

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 198

def scraper_getinfo(shortname, opts = {})
  if Array === opts[:quietfields]
    opts[:quietfields] = opts[:quietfields].join '|'
  end
  request_with_apikey '/scraper/getinfo', {name: shortname}.merge(opts)
end
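As with `:attach`, an Array `:quietfields` option is joined, here with “|”, before the request. A sketch of that normalization (`normalize_quietfields` is hypothetical; the subset validation is an illustrative extra based on the documented allowed values, not something the gem does):

```ruby
# Mirrors the :quietfields handling in #scraper_getinfo: an Array of
# field names becomes a "|"-delimited string. The subset check is a
# hypothetical addition based on the documented allowed values.
ALLOWED = %w(code runevents datasummary userroles history)

def normalize_quietfields(opts)
  if Array === opts[:quietfields]
    unknown = opts[:quietfields] - ALLOWED
    raise ArgumentError, "unknown fields: #{unknown.join(', ')}" unless unknown.empty?
    opts[:quietfields] = opts[:quietfields].join('|')
  end
  opts
end

opts = normalize_quietfields(quietfields: ['code', 'history'])
puts opts[:quietfields]  # code|history
```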

#scraper_getruninfo(shortname, opts = {}) ⇒ Array

Note:

Returns an array, although it seems to always contain only one item.

Note:

The query string parameter is name, not shortname as in the ScraperWiki docs.

See what the scraper did during each run.

Example output:

[
  {
    "run_ended": "1970-01-01T00:00:00",
    "first_url_scraped": "http://www.iana.org/domains/example/",
    "pages_scraped": 5,
    "run_started": "1970-01-01T00:00:00",
    "runid": "1325394000.000000_xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx",
    "domainsscraped": [
      {
        "domain": "http://example.com",
        "bytes": 1000000,
        "pages": 5
      },
      ...
    ],
    "output": "...",
    "records_produced": 42
  }
]

Parameters:

  • shortname (String)

    the scraper’s shortname (as it appears in the URL)

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :runid (String)

    a run ID

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 237

def scraper_getruninfo(shortname, opts = {})
  request_with_apikey '/scraper/getruninfo', {name: shortname}.merge(opts)
end

#scraper_getuserinfo(username) ⇒ Array

Note:

Returns an array, although it seems to always contain only one item.

Note:

The date joined field is date_joined (with underscore) on #scraper_usersearch.

Find out information about a user.

Example output:

[
  {
    "username": "johndoe",
    "profilename": "John Doe",
    "coderoles": {
      "owner": [
        "johndoe.emailer",
        "example-scraper",
        ...
      ],
      "email": [
        "johndoe.emailer"
      ],
      "editor": [
        "yet-another-scraper",
        ...
      ]
    },
    "datejoined": "1970-01-01T00:00:00"
  }
]

Parameters:

  • username (String)

    a username

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 273

def scraper_getuserinfo(username)
  request_with_apikey '/scraper/getuserinfo', username: username
end

#scraper_search(opts = {}) ⇒ Array

Search the titles and descriptions of all the scrapers.

Example output:

[
  {
    "description": "Scrapes websites for data.",
    "language": "ruby",
    "created": "1970-01-01T00:00:00",
    "title": "Example scraper",
    "short_name": "example-scraper",
    "privacy_status": "public"
  },
  ...
]

Parameters:

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :searchquery (String)

    search terms

  • :maxrows (Integer)

    number of results to return [default 5]

  • :requestinguser (String)

    the name of the user making the search, which changes the order of the matches

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 299

def scraper_search(opts = {})
  request_with_apikey '/scraper/search', opts
end

#scraper_usersearch(opts = {}) ⇒ Array

Note:

The date joined field is datejoined (without underscore) on #scraper_getuserinfo.

Search for a user by name.

Example output:

[
  {
    "username": "johndoe",
    "profilename": "John Doe",
    "date_joined": "1970-01-01T00:00:00"
  },
  ...
]

Parameters:

  • opts (Hash) (defaults to: {})

    optional arguments

Options Hash (opts):

  • :searchquery (String)

    search terms

  • :maxrows (Integer)

    number of results to return [default 5]

  • :nolist (Array, String)

    space-separated list of usernames to exclude from the output

  • :requestinguser (String)

    the name of the user making the search, which changes the order of the matches

Returns:

  • (Array)


# File 'lib/scraperwiki-api.rb', line 327

def scraper_usersearch(opts = {})
  if Array === opts[:nolist]
    opts[:nolist] = opts[:nolist].join ' '
  end
  request '/scraper/usersearch', opts
end
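Analogous to `:attach` and `:quietfields`, an Array `:nolist` option is joined with spaces before the request. A sketch of that normalization, plus the query string the options would produce (the query-string step is illustrative; the gem delegates request building to HTTParty):

```ruby
require 'uri'

# Mirrors the :nolist handling in #scraper_usersearch: an Array of
# usernames becomes a space-delimited string before the request.
opts = { searchquery: 'doe', nolist: ['johndoe', 'janedoe'] }
opts[:nolist] = opts[:nolist].join(' ') if Array === opts[:nolist]

# Show the resulting query string (spaces become "+").
puts URI.encode_www_form(opts)  # searchquery=doe&nolist=johndoe+janedoe
```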