Introduction
So you want to write an importer for the National Data Catalog (NDC)? You could integrate with the NDC API directly, but we wouldn't recommend it. Instead, we recommend using the NDC importer framework. The framework serves two major purposes:
It simplifies the task of writing an importer. In particular, the importer framework handles the API communication, so all an importer has to do is handle the external translation step (such as scraping of a Web site or integration with an API). It also provides utility functions that come in handy.
It standardizes importers. This encourages the sharing of best practices and it also makes coordination easier. The various importers are automated through the use of the National Data Catalog Importer System.
The importer framework is good at doing a few things and delegating the rest. This document will help you get started. Before long, you'll have an importer ready to liberate government data.
About the National Data Catalog
The National Data Catalog builds community around government data sets. At the core, it is a catalog of datasets, APIs, and interactive tools that provide data about government. By "government" we mean any branch of government and any level of government. By "catalog" we mean useful metadata -- how a data set was collected, how often it is updated, where to download the data, and so on.
The National Data Catalog (NDC) is powered by a decoupled collection of applications centered around a read-write API. All of the applications communicate through the API.
Walkthrough
Let's take a look at some example code in the example folder.
1. Set up the Rakefile
Begin by looking at example/rakefile.rb. In this file, you set some configuration information and call out to the importer framework. It will define some rake tasks for you.
The importer framework handles quite a few things for you, provided that you follow its design. Your importer is responsible for providing a Puller class (as defined with :puller => Puller in rakefile.rb).
2. Hide The Keys
National Data Catalog API keys are private, so please don't store them in code or anywhere else in source control. Separate them out and store them in config.yml. Make sure that your .gitignore file is set up to ignore config.yml. It is a good idea to include a config.example.yml that demonstrates the format of the file.
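As a rough illustration of keeping keys out of code, the sketch below reads settings from config.yml at run time using Ruby's standard YAML library. The key names (api_key and base_uri) and the load_config helper are hypothetical; use whatever format your config.example.yml documents.

```ruby
require "yaml"

# Load private settings from a YAML file that is kept out of source control.
# The key names checked here (api_key, base_uri) are hypothetical examples.
def load_config(path)
  unless File.exist?(path)
    raise "Missing #{path}; copy config.example.yml and fill in your values"
  end

  config = YAML.load_file(path)
  raise "#{path} must contain a YAML mapping" unless config.is_a?(Hash)

  # Fail early with a clear message if a required setting is absent.
  %w[api_key base_uri].each do |key|
    raise "#{path} is missing '#{key}'" if config[key].to_s.empty?
  end

  config
end
```

Failing early keeps a missing or half-filled config.yml from surfacing later as a confusing API error.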
3. Make the Puller
Next, let's look at the Puller class. It is responsible for defining two methods: initialize and run. (The rake tasks constructed above rely on these methods.)
Please note that the example provided here is oversimplified. It is intended to demonstrate how to use the importer framework, but it is not a practical example to borrow heavily from. If you want to steal some importer code, please visit the Sunlight Labs projects page and filter the projects by 'datacatalog-imp-'.
As you would probably expect, initialize is called once. Its main purpose is to set up the callback handler (@handler) to refer back to the importer framework.
Put the main logic / algorithm / voodoo of your importer in the run method. The key responsibility of your importer is to call @handler.source or @handler.organization each time your importer finds a data source or organization, respectively. (Historical note: the 0.1.x version of the importer framework worked a little bit differently. This is a more flexible style.)
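To make the division of labor concrete, here is a minimal, self-contained sketch of a Puller. It assumes the framework passes its handler to initialize, which this document implies but does not spell out. RecordingHandler is a stand-in written only so the sketch runs without the framework, and fetch_records is a hypothetical placeholder for your scraping or API code.

```ruby
# Stand-in for the framework's handler (NOT part of the framework).
# It simply records what the Puller reports.
class RecordingHandler
  attr_reader :sources, :organizations

  def initialize
    @sources = []
    @organizations = []
  end

  def source(data)
    @sources << data
  end

  def organization(data)
    @organizations << data
  end
end

class Puller
  # Assumption: the framework hands its callback handler to the constructor.
  def initialize(handler)
    @handler = handler
  end

  def run
    records = fetch_records

    # Report each organization once, then each data source.
    records.map { |r| r[:agency] }.uniq.each do |name|
      @handler.organization(:name => name)
    end

    records.each do |record|
      @handler.source(
        :title        => record[:title],
        :url          => record[:url],
        :source_type  => "dataset",
        :organization => { :name => record[:agency] }
      )
    end
  end

  private

  # Hypothetical placeholder: a real importer would scrape a site or call an API.
  def fetch_records
    [
      { :title => "Example Budget Data", :url => "http://example.org/budget", :agency => "Example Agency" },
      { :title => "Example Spending Data", :url => "http://example.org/spending", :agency => "Example Agency" }
    ]
  end
end

handler = RecordingHandler.new
Puller.new(handler).run
```

Swapping in a recording handler like this is also a convenient way to test an importer's run method without touching the API.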
source parameter
@handler.source() expects a hash parameter of this shape:
{
  :title             => "Budget for...",
  :description       => "Congressional budget for...",
  :source_type       => "dataset",
  :url               => "http://...",
  :documentation_url => "http://...",
  :license           => "...",
  :license_url       => "http://...",
  :released          => Kronos.parse("...").to_hash,
  :frequency         => "daily",
  :period_start      => Kronos.parse("...").to_hash,
  :period_end        => Kronos.parse("...").to_hash,
  :organization      => {
    :name => "", # organization that provides data
  },
  :downloads         => [{
    :url    => "http://...",
    :format => "xml",
  }], # include as many download formats as appropriate
  :custom            => {},
  :raw               => {},
  :catalog_name      => "...",
  :catalog_url       => "http://...",
}
Note that most of these parameters match up with the properties defined for a Source in the National Data Catalog API. These parameters are just passed along to the API, which will validate the values.
The remaining parameters (organization and downloads) are handled by the importer framework:
The organization sub-hash is used to look up or create the associated organization for the source. Then an organization_id key/value pair is sent to the API.
The downloads array is used to look up or create the associated download formats for a data source.
You may have noticed the use of Kronos.parse above. We highly recommend the use of the kronos library for the parsing of dates.
organization parameter
@handler.organization() expects a hash parameter of this shape:
{
  :name         => "",
  :acronym      => "",
  :url          => "http://...",
  :description  => "",
  :org_type     => "governmental",
  :organization => {
    :name => "", # parent organization, if any
    :url  => "",
  },
  :catalog_name => "...",
  :catalog_url  => "http://...",
}
Note that most of these parameters match up with the properties defined for an Organization in the National Data Catalog API. These parameters are just passed along to the API, which will validate the values.
The remaining parameter, organization, is handled by the importer framework. The framework looks up the parent organization using the name or url. It then sends parent_id, set to the id of that parent organization, to the API.
4. You're Done / Best Practices
That's it. But before you go hacking away, let me say a few words about best practices:
If you are scraping a web site, we highly recommend caching the raw HTML files in your importer. Our production importers are queued up using the NDC Importer System, which integrates nicely with git. It keeps a record of the raw HTML files that correspond to each run. This makes it easier to debug when things go wrong.
Take advantage of the utility functions in /lib/utility.rb. If you want to make a suggestion regarding the utility functions, please let us know.
It goes without saying, but please follow best Ruby practices and put some thought into writing clean code. Follow the conventions of the community and strive to make your code readable by other people.
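The caching advice above can be sketched roughly as follows. This is only an illustration using Ruby's standard library, not a function from /lib/utility.rb. The cached_fetch name, the cache directory layout, and the SHA1-based file names are all assumptions.

```ruby
require "digest/sha1"
require "fileutils"

# Return the raw body for url, reading from cache_dir when a copy exists.
# The block performs the actual download and is called only on a cache miss.
# (cached_fetch is a hypothetical helper, not part of the framework.)
def cached_fetch(url, cache_dir)
  FileUtils.mkdir_p(cache_dir)

  # Hash the URL so that any URL maps to a safe, stable file name.
  path = File.join(cache_dir, Digest::SHA1.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)

  body = yield(url)
  File.write(path, body)
  body
end
```

A call might look like cached_fetch(url, "cache/raw") { |u| Net::HTTP.get(URI(u)) } after requiring net/http. Because the download lives in the block, the parsing code can be re-run against the saved files when debugging a failed run.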
5. Talk to Us
Please reach out to us on our National Data Catalog Google Group. We can help you with your importer. Once it works reliably, we will want to add it to our importer system. The more up-to-date, relevant government data we bring in, the more useful our data catalog becomes.
The Team
The National Data Catalog team includes:
- David James of Sunlight Labs
- Luigi Montanez of Sunlight Labs
- Ryan Wold, a Sunlight Labs intern
- Mike Dvorscak, a Google Summer of Code Student