Gitset - A data collection and curation tool for humans

Gitset is essentially a key-value store. The key is a path to a file on disk and the value is the contents of that file. The goal is to make collaborating on creating and curating datasets as easy as it is to contribute to an open source code project.
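
To make the key-value model concrete, here is a minimal sketch in Ruby that reads a dataset directory into a Hash of relative path to file contents. The helper name `dataset_as_hash` is hypothetical and is not part of gitset; it only illustrates the idea described above.

```ruby
require "find"

# Hypothetical helper, not part of gitset: read every file under `root`
# into a Hash that maps relative path (key) to file contents (value).
def dataset_as_hash(root)
  root = File.expand_path(root) # absolute path, no trailing slash
  store = {}
  Find.find(root) do |path|
    if File.directory?(path)
      # Skip git's own metadata; it is not part of the data.
      Find.prune if File.basename(path) == ".git"
      next
    end
    store[path.delete_prefix("#{root}/")] = File.read(path)
  end
  store
end
```

Under these assumptions, `dataset_as_hash(".")["path/to/datapoint.yaml"]` would return the raw YAML text of that data point.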

Gitset is not intended to be used as a database; rather, it should be used to manage a canonical data source that is imported into a "real" database for production use. On your mark, gitset, go!
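
As an illustration of the import step, the sketch below parses each YAML data point into a record that a loader for a production database could consume. `load_records` and the record layout are hypothetical, not something gitset provides. The glob also picks up any template file stored in the project, which a real importer would probably want to skip.

```ruby
require "yaml"

# Hypothetical helper, not part of gitset: parse every *.yaml file under
# `root` and return one record per data point, keeping the relative path
# as the record's key.
def load_records(root)
  root = File.expand_path(root)
  # "**" does not descend into dot directories, so .git is ignored.
  Dir.glob(File.join(root, "**", "*.yaml")).sort.map do |path|
    {
      "key"  => path.delete_prefix("#{root}/"),
      "data" => YAML.safe_load(File.read(path))
    }
  end
end
```

Each returned record could then be handed to whatever insert routine the target database uses.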

Installation

Gitset requires git, so set that up first if you haven't already. Then install gitset from RubyGems:

gem install gitset

Usage

Create a new dataset

  1. Create a project directory:
       mkdir /path/to/project
       cd /path/to/project
  2. Create a YAML template to be used for new data points:
       # template.yaml
       ---
       name: Full Name
       emails:
         - [email protected]
         - [email protected]
         - [email protected]
  3. Initialize the project with your template:
       gitset init template.yaml

Clone an existing dataset

gitset clone git://github.com/username/datasetname.git

Working with a dataset

  1. Add a new data point using the template:
       gitset create path/to/datapoint.yaml
  2. Edit the new data point, replacing the template values:
       # path/to/datapoint.yaml
       ---
       name: John Britton
       emails:
         - [email protected]
  3. Stage your changes:
       git add path/to/datapoint.yaml
  4. Commit your changes:
       git commit -m 'Added a person'

To modify existing data points in your dataset, edit their files and commit the changes with standard git commands.

BUT WAIT, THERE'S MORE!

  • Branch the dataset
  • Contribute to a dataset
  • Merge contributions
  • Filter the dataset