datapackage-rb
A Ruby library for working with Data Packages.
The library is intended to support:
- Parsing and using data package metadata and data
- Validating data packages to ensure they conform with the Data Package specification
Installation
Add the gem to your Gemfile:
gem 'datapackage'
Or:
gem install datapackage
Reading a Data Package
Require the gem, if you need to:
require 'datapackage'
Parsing a data package descriptor from a remote location:
package = DataPackage::Package.new( "http://example.org/datasets/a/datapackage.json" )
This assumes that http://example.org/datasets/a/datapackage.json exists.
Similarly, you can load a package descriptor from a local JSON file:
package = DataPackage::Package.new( "/my/data/package/datapackage.json" )
The data package descriptor, i.e. the datapackage.json file, is expected to be at the root directory of the data package, and the path attribute of each of the package's resources will be resolved relative to it.
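The relative resolution described above can be sketched with Ruby's standard library alone. The helper name resolve_resource_path is hypothetical and is not part of the gem's API; it only illustrates how a resource's path relates to the directory holding datapackage.json.

```ruby
# Hypothetical helper, not part of the datapackage gem: resolve a resource's
# 'path' against the directory that contains datapackage.json.
def resolve_resource_path(descriptor_path, resource_path)
  base_dir = File.dirname(descriptor_path)
  File.join(base_dir, resource_path)
end

resolved = resolve_resource_path('/my/data/package/datapackage.json', 'data/test.csv')
puts resolved
```

The gem may handle more cases (for example remote base locations); this sketch covers only a local descriptor with a relative path.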
You can also load a data package descriptor directly from a Hash:
descriptor = {
'resources'=> [
{
'name'=> 'example',
'profile'=> 'tabular-data-resource',
'data'=> [
['height', 'age', 'name'],
['180', '18', 'Tony'],
['192', '32', 'Jacob'],
],
'schema'=> {
'fields'=> [
{'name'=> 'height', 'type'=> 'integer'},
{'name'=> 'age', 'type'=> 'integer'},
{'name'=> 'name', 'type'=> 'string'},
],
}
}
]
}
package = DataPackage::Package.new(descriptor)
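The descriptor Hash above uses String keys, which is also the shape that JSON.parse produces by default. As a minimal sketch, assuming you already hold the descriptor as JSON text (the JSON content below is made up for illustration), you can build an equivalent Hash like this:

```ruby
require 'json'

# Illustrative descriptor text; the package and resource names are examples only.
json_text = <<~JSON
  {
    "name": "example-package",
    "resources": [
      { "name": "example", "data": [["height", "age"], ["180", "18"]] }
    ]
  }
JSON

# JSON.parse returns a Hash with String keys by default,
# matching the key style used in the descriptor Hash above.
descriptor = JSON.parse(json_text)
puts descriptor['resources'].first['name']
```

A Hash built this way should be usable wherever a hand-written descriptor Hash is, although this sketch does not exercise the gem itself.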
There is a set of helper methods for accessing data from the package, e.g.:
package.name
package.title
package.description
package.homepage
package.license
Reading Data Resources
A data package must contain an array of Data Resources.
You can access the resources in your Data Package either by their name or by their index in the resources array:
first_resource = package.resources[0]
first_resource = package.get_resource('example')
# Get info about the data source of this resource
first_resource.inline?
first_resource.local?
first_resource.remote?
first_resource.multipart?
first_resource.tabular?
first_resource.source
You can then read the source depending on its type. For example, if a resource is local and not multipart, it can be opened as a file: File.open(resource.source).
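To make the distinction between source types concrete, here is a simplified sketch that classifies a resource descriptor by looking only at its data and path properties. The describe_source helper is hypothetical and is not how the gem implements inline?, local?, remote? or multipart?; it assumes that an Array-valued path means a multipart resource and that remote paths start with http:// or https://.

```ruby
# Hypothetical classifier, not the gem's implementation.
# Assumes: 'data' means inline, an Array 'path' means multipart,
# and remote paths begin with http:// or https://.
def describe_source(descriptor)
  return { kind: :inline, multipart: false } if descriptor.key?('data')

  paths = Array(descriptor['path'])
  raise ArgumentError, "descriptor needs 'data' or 'path'" if paths.empty?

  remote = paths.all? { |p| p.match?(%r{\Ahttps?://}i) }
  { kind: remote ? :remote : :local, multipart: descriptor['path'].is_a?(Array) }
end

p describe_source({ 'name' => 'example', 'path' => 'data/test.csv' })
```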
If a resource complies with the Tabular Data Resource spec or uses the tabular-data-resource profile, you can read resource rows:
resource = package.resources[0]
resource.tabular?
resource.headers
resource.schema
# Read all rows at once
data = resource.read
data = resource.read(keyed: true)
# Or iterate through it
data = resource.iter {|row| print row}
See the TableSchema documentation for other things you can do with a tabular resource.
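To illustrate the difference between plain and keyed rows, the sketch below pairs header names with row values using only core Ruby. It assumes that keyed: true yields one Hash per row keyed by header name, and it does not perform any of the type casting that a schema would apply; the sample values are taken from the inline descriptor shown earlier.

```ruby
headers = %w[height age name]
rows = [
  ['180', '18', 'Tony'],
  ['192', '32', 'Jacob']
]

# Pair each value with its header to get one Hash per row.
keyed_rows = rows.map { |row| headers.zip(row).to_h }
keyed_rows.each { |row| p row }
```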
Creating a Data Package
package = DataPackage::Package.new
# Add package properties
package.name = 'my_sleep_duration'
# Add a resource
package.add_resource(
{
'name'=> 'sleep_durations_this_week',
'data'=> [7, 8, 5, 6, 9, 7, 8],
}
)
If the resource is valid, it will be added to the resources array of the Data Package; if it's invalid, it will not be added, and you should try creating and validating your resource to see why it fails.
# Update a resource
my_resource = package.get_resource('sleep_durations_this_week')
my_resource['schema'] = {
'fields'=> [
{'name'=> 'number_hours', 'type'=> 'integer'},
]
}
# Save the Data Package descriptor to the target file
package.save('datapackage.json')
# Remove a resource
package.remove_resource('sleep_durations_this_week')
Profiles
Data Package and Data Resource descriptors can be validated against JSON schemas that we call profiles.
By default, this gem uses the standard Data Package profile and Data Resource profile but alternative profiles are available for both.
According to the specs, the value of the profile property can be either a URL or an identifier from the registry.
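The two forms a profile value can take can be told apart with a small check. This is only a sketch: the helpers registry_identifiers and profile_kind are hypothetical, the identifier list is the one given in this README, and the gem's own lookup may work differently.

```ruby
# Hypothetical helpers, not part of the gem's API.
# The identifiers are the registry entries listed in this README.
def registry_identifiers
  %w[data-package data-resource tabular-data-package
     tabular-data-resource fiscal-data-package]
end

def profile_kind(profile)
  return :registry if registry_identifiers.include?(profile)
  return :url if profile.match?(%r{\Ahttps?://}i)

  :unknown
end

p profile_kind('tabular-data-package')
p profile_kind('https://specs.frictionlessdata.io/schemas/tabular-data-package.json')
```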
Profiles in the local cache
The profiles from the registry come bundled with the gem. You can reference them in your Data Package descriptor by their identifier in the registry:
- data-package: the default profile for a Data Package
- data-resource: the default profile for a Data Resource
- tabular-data-package: for a Tabular Data Package
- tabular-data-resource: for a Tabular Data Resource
- fiscal-data-package: for a Fiscal Data Package
{
"profile": "tabular-data-package"
}
Profiles from elsewhere
If you have a custom profile schema you can reference it by its URL:
{
"profile": "https://specs.frictionlessdata.io/schemas/tabular-data-package.json"
}
Validation
Data Resources and Data Packages are validated against their profiles to ensure they respect the expected structure.
Validating a Resource
descriptor = {
'name'=> 'incorrect name',
'path'=> 'https://cdn.rawgit.com/frictionlessdata/datapackage-rb/master/spec/fixtures/test-pkg/test.csv',
}
# The second positional argument is the base path (empty here)
resource = DataPackage::Resource.new(descriptor, '')
# Returns true if resource is valid, false otherwise
resource.valid?
# Returns true or raises DataPackage::ValidationError
resource.validate
# Iterate through validation errors
resource.iter_errors{ |err| p err}
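The descriptor above is presumably rejected because of its name: the Data Resource specification restricts names to lowercase alphanumeric characters plus ".", "-" and "_". A rough stand-alone check is sketched below. The helper valid_resource_name? is hypothetical, and its pattern approximates the rule rather than reproducing the exact schema used by the profile.

```ruby
# Hypothetical helper approximating the naming rule from the spec;
# the real profile schema may differ in detail.
def valid_resource_name?(name)
  name.is_a?(String) && name.match?(/\A[a-z0-9._-]+\z/)
end

p valid_resource_name?('incorrect name')
p valid_resource_name?('sleep_durations_this_week')
```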
Validating a Package
The same methods used to check the validity of a Resource (valid?, validate and iter_errors) are also available for a Package.
The difference is that after a Package descriptor is validated against its profile, each of its resources is also validated against its own profile.
In order for a Package to be valid all its Resources have to be valid.
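The rule can be expressed as a simple aggregate. The sketch below uses plain Ruby data as a stand-in for real validation results; the error message and the variable names are made up for illustration and do not come from the gem.

```ruby
# Stand-in data: error messages found for the package descriptor itself
# and for each of its resources (keyed by resource name).
package_errors = []
resource_errors = {
  'example' => [],
  'incorrect name' => ['name does not match the required pattern']
}

# The package is valid only when its own descriptor has no errors
# and every resource is free of errors as well.
package_valid = package_errors.empty? && resource_errors.values.all?(&:empty?)
puts package_valid
```

Here one resource carries an error, so the package as a whole is not valid even though its own descriptor is clean.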
Developer notes
These notes are intended to help people who want to contribute to this package itself. If you just want to use it, you can safely ignore them.
After checking out the repo, run bundle to install dependencies. Then, run rake spec to run the tests.
To install this gem onto your local machine, run bundle exec rake install.
To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Updating the local schemas cache
We cache the local schemas from https://specs.frictionlessdata.io/schemas/registry.json. The local schemas should be kept up to date with the remote ones using:
rake update_profiles