Hackboxen
The Hackboxen library is designed to encapsulate data collecting and processing tasks into simple and easy to implement packages.
Any singular hackbox has the following two parts:
- An engine, which contains configuration information and data processing code.
- An output directory, which will contain the fully processed data along with a descriptive schema. This directory may be either local or remote (e.g. S3/HDFS)
A hackbox dataset is defined by a namespace
and a protocol
. The namespace
must be dot(.) separated and both the namespace
and protocol
may contain only lowercase letters, numbers and underscores.
Hackbox Engine
A hackbox engine contains:
Rakefile
: (required) Used to read and combine all the sources of config metadata and executemain
.Gemfile
: (optional) A list of gems necessary for this hackbox to run. Processed automatically by Bundler.config/
: (required) A subdirectory containing:config.yaml
(required) A hackbox specific default configuration YAML file.protocol.icss.yaml
(optional) An Icss schema file describing the output data and publishing targets.
engine/
: (required) A subdirectory containing:main
: (required) An executable data processing file. This may be written in any language.- (optional) Any other executable and support files. There is no restriction on language and complexity.
The hackbox engine lives in the coderoot
directory specified by your configuration settings. An example hackbox engine directory structure:
coderoot
└── language
└── corpora
└── word_freq
└── bnc
├── config
│ ├── config.yaml
│ └── bnc.icss.yaml
├── engine
│ ├── main
│ └── bnc_endpoint.rb
└── Rakefile
Hackbox Output Directory
The hackbox output directory is where all of the data that a hackbox acquires, reads, or creates lives. The location of the data directory is determind by the dataroot
variable specified in your configuration settings. An example hackbox output directory structure:
dataroot
└── language
└── corpora
└── word_freq
├── fixd
│ ├── code
│ │ └── bnc_endpoint.rb
│ └── data
│ └── bnc_fixd_data.tsv
├── log
│ └── bnc_run_0.log
├── rawd
│ └── bnc_data_in_process
├── ripd
│ └── bnc_download.zip
└── env
└── working_config.json
fixd/
: (required) See the output interface described below.log/
: (optional) All logging from a hackbox run goes here.rawd/
: (optional) This will contain all intermediate data processing outputs.ripd/
: (required) This will contain virginal downloaded source data adhering to the directory structure from which it was pulled.env/
: (required) This directory contains a file describing the environment in which the hackbox was run.working_config.json
: (required) All runtime config metadata used to generate the schema and output data.
Engine and output directories are generally created dynamically and are not meant to be archival.
Output Interface (fixd/)
fixd/
is the final output directory and contains the following:
protocol.icss.json
: (required) An Icss schema file describing its respective dataset.code/
: (optional) A directory containing the code assets described in the icss.data/
: (required) A directory containing a single dataset or subdirectories named for each dataset. Each contains:- (required) One or more data files that collectively adhere to the schema of this dataset.
Hackbox Configuration
Hackbox configuration may be one or more files in YAML format and, optionally, the command line. Configuration will be read in using Configliere in the following order:
/etc/hackbox/hackbox.yaml
: Machine-wide config.~/.hackbox/hackbox.yaml
: Install specific config.config/config.yaml
: Hackbox specific config.rake task -- --args=
: Command line arguments.
Later sources on this list overwrite earlier sources. The combined configuration metadata is serialized out as JSON in the fixd/env
directory as working_config.json
. This is done before any other code executes in order for a hackbox to be able to read in this file if necessary.
Getting Started
Here are the general guidelines for creating your own hackbox.
Install Hackboxen
sudo gem install hackboxen
This will create a .hackbox
directory with a hackbox.yaml
file that contains default values for coderoot
, dataroot
, s3_filesystem
, os
, and machine
. The hb-install
command has optional arguments --dataroot=
, --coderoot=
.
hb-install # optionally: hb-install --dataroot=/data/hb --coderoot=/code/hb
A default hackbox.yaml
file:
---
coderoot: /code/hb/
dataroot: /data/hb/
s3_filesystem:
access_key:
secret_key:
mini_bucket:
requires:
machine: x86_64
os: darwin
Hackboxen Dependencies
The Hackboxen library depends on the following gems: configliere, icss, swineherd, and rake.
Creating a Hackbox
Hackboxen comes with scaffold task that creates a template hackbox for you. Required arguments are --namespace=
and --hackbox_name=
. Optional arguments are --protocol=
and --targets=
. The protocol
will default to the hackbox_name
unless otherwise specified.
hb-scaffold --namespace=foo.bar.baz --hackbox_name=qux --targets=catalog,mysql
This will create the following directories and files:
coderoot
└── foo
└──
└── baz
└── qux
├── config
│ ├── config.yaml
│ └── qux.icss.yaml
├── engine
│ ├── main
│ └── qux_endpoint.rb
└── Rakefile
Running a hackbox
Externally, the execution of a hackbox appears as:
- A
Rakefile
is run withrake
from the shell with one of the following targets:get_data
: Performs only the ingest step. The input data (inripd
/rawd
) and any required metadata should exist after this step.default
: Performs the processing step,:get_data
, and executes themain
file.
Execution Results:
- If there is no failure,
rake
can be silent. - If there is a failure,
rake
ends with a thrown exception - After a successful execution, the complete output interface (
fixd
) must exist, with no additional interaction outside ofrake
.
The rough steps of hackbox internal execution are:
- The configuration sources (command line and files) are read and combined.
- The output directory structure (
fixd
) is created. - The hackbox engine is run and the “troop ready” ouput datasets are created in
fixd
.
- Note: Hackbox execution should be idempotent (when it is sensible and efficient), leveraging this behavior from
rake
.*
Hackboxen Best Practices
One should try to avoid redundant computation. In particular, idempotency of output creation should be observed. Sometimes incrementally updated information makes this hard, but should be done if not too painful.
Files read and written by the hackbox should use the Swineherd::FileSystem
abstraction. See swineherd.
Implementation of the Gorillib::Receiver
pattern is recommended. See gorillib.
Any and all output datasets must include an appropriately descriptive schema. See icss.
== Contributing to hackboxen
- Check out the latest master to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet
- Check out the issue tracker to make sure someone already hasn’t requested it and/or contributed it
- Fork the project
- Start a feature/bugfix branch
- Commit and push until you are happy with your contribution
- Make sure to add tests for it. This is important so I don’t break it in a future version unintentionally.
- Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.
== Copyright
Copyright © 2011 Infochimps. See LICENSE.txt for further details.