Wukong Deploy Pack

The Infochimps Platform is an end-to-end, managed solution for building Big Data applications. It integrates best-of-breed technologies like Hadoop, Storm, Kafka, MongoDB, Elasticsearch, HBase, &c. and provides simple interfaces for accessing these powerful tools.

Computation, analytics, scripting, &c. are all handled by Wukong within the platform. Wukong is an abstract framework for defining computations on data. Wukong processors and flows can run in many different execution contexts including:

  • locally on the command-line for testing or development purposes
  • as a Hadoop mapper or reducer for batch analytics or ETL
  • within Storm as part of a real-time data flow
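
To make the idea of one computation running in several contexts more concrete, here is a minimal plain-Ruby sketch. It does not use Wukong's actual API: the Tokenizer and LongWordFilter classes and the run_flow helper are hypothetical names invented for illustration. Each processor accepts one record and yields zero or more output records, so the same objects could in principle be fed lines from a terminal, from a Hadoop task, or from a Storm data flow.

```ruby
# Illustrative only: plain Ruby stand-ins, not Wukong's real classes.

# Splits a line of text into lowercase words, yielding each one.
class Tokenizer
  def process(line)
    line.downcase.scan(/[a-z']+/) { |word| yield word }
  end
end

# Passes through only words with at least min_length characters.
class LongWordFilter
  def initialize(min_length)
    @min_length = min_length
  end

  def process(word)
    yield word if word.length >= @min_length
  end
end

# Pushes every input record through the processors in order and
# returns whatever the last processor emitted.
def run_flow(processors, records)
  processors.reduce(records) do |inputs, processor|
    outputs = []
    inputs.each { |record| processor.process(record) { |out| outputs << out } }
    outputs
  end
end

lines = ["The quick brown fox", "jumps over the lazy dog"]
words = run_flow([Tokenizer.new, LongWordFilter.new(5)], lines)
puts words.inspect
```

Because a processor only knows about the record it receives and the records it yields, the code that drives it (here run_flow) can be swapped out without touching the processor itself.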

The Infochimps Platform uses the concept of a deploy pack: a single place where developers build all their processors, flows, and jobs. The deploy pack can be thought of as a container for all the Wukong code and plugins an Infochimps Platform application needs. It includes the following libraries:

  • wukong: The core framework for writing processors and chaining them together.
  • wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
  • wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.

Installation

The deploy pack is installed as a RubyGem:

$ sudo gem install wukong-deploy

Usage

Wukong-Deploy provides a command-line tool, wu-deploy, for creating and interacting with deploy packs.

Creating a New Deploy Pack

Create a new deploy pack:

$ wu-deploy new my_app
Within /home/user/my_app:
      create  .
      create  app/models
      create  app/processors
      ...

This creates a directory named my_app in the current directory. Passing the dry_run option prints what would happen without actually doing anything:

$ wu-deploy new my_app --dry_run
Within /home/user/my_app:
      create  .
      create  app/models
      create  app/processors
      ...

You'll be prompted if there is a conflict. You can pass the force option to always overwrite files and the skip option to never overwrite files.
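
The conflict handling described above can be sketched as a small decision function. This is a hypothetical illustration, not wu-deploy's source: the names resolve_action and plan are invented here, and the choice that force takes precedence when both options are given is an assumption the text does not confirm.

```ruby
require 'set'

# Hypothetical sketch of the decision described above; not wu-deploy's code.
# Returns :create, :overwrite, :skip, or :prompt for a single file.
def resolve_action(file_exists, force: false, skip: false)
  return :create unless file_exists # no conflict, nothing to decide
  return :overwrite if force        # assumption: force wins over skip
  return :skip if skip
  :prompt                           # default: ask the user
end

# Builds the list of [action, path] pairs without touching the disk,
# which is roughly what a dry run needs in order to report its plan.
def plan(paths, existing, **options)
  paths.map { |path| [resolve_action(existing.include?(path), **options), path] }
end

existing = Set["Gemfile"]
plan(["Gemfile", "Rakefile"], existing, force: true).each do |action, path|
  puts format("%12s  %s", action, path)
end
```

Separating the plan from its execution is one way a tool can print its intended actions for a dry run and then carry out the same list for a real run.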

Working with an Existing Deploy Pack

If your current directory is within an existing deploy pack, you can start an IRB console with the deploy pack's environment already loaded:

$ wu-deploy console
irb(main):001:0> 

File Structure

A deploy pack is a repository with the following Rails-like file structure:

├── app
│   ├── models
│   ├── processors
│   ├── flows
│   └── jobs
├── config
│   ├── environment.rb
│   ├── application.rb
│   ├── initializers
│   ├── settings.yml
│   └── environments
│       ├── development.yml
│       ├── production.yml
│       └── test.yml
├── data
├── Gemfile
├── Gemfile.lock
├── lib
├── log
├── Rakefile
├── spec
│   ├── spec_helper.rb
│   └── support
└── tmp

Let's look at it piece by piece:

  • app: The directory with all the action. It's where you define:
    • models: Your domain models or "nouns", which define and wrap the different kinds of data elements in your application. They are built using whatever framework you like (defaults to Gorillib).
    • processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them.
    • flows: Chain together processors into streaming flows for ingestion, real-time processing, or complex event processing (CEP).
    • jobs: Pair processors together to create batch jobs to run in Hadoop.
  • config: Where you place all application configuration for all environments
    • environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly.
    • application.rb: Require and configure libraries specific to your application. Choose a model framework and pick which application code gets loaded by default (vs. auto-loaded).
    • initializers: Holds any files you need to load before application.rb. Useful for requiring and configuring external libraries.
    • settings.yml: Defines application-wide settings.
    • environments: Holds environment-specific settings in YAML files named after each environment. These override config/settings.yml.
  • data: Holds sample data in flat files. You'll develop and test your application using this data.
  • Gemfile and Gemfile.lock: Define how libraries are resolved with Bundler.
  • lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.).
  • log: A good place to stash logs.
  • Rakefile: Defines Rake tasks for developing, testing, and deploying your application.
  • spec: Holds all your RSpec unit tests.
    • spec_helper.rb: Loads libraries you'll use during testing and includes spec helper libraries from Wukong.
    • support: Holds support code for your tests.
  • tmp: A good place to stash temporary files.
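
As a rough illustration of how environment-specific files can override config/settings.yml, the sketch below layers one YAML document over another. It is not the deploy pack's actual loading code: the deep_merge helper, the setting keys, and the host name are all invented for the example, and whether Wukong merges nested settings recursively (as assumed here) is not stated in this document.

```ruby
require 'yaml'

# Recursively merges override into base; values in override win.
# Assumption: nested hashes are merged key by key rather than replaced.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Stand-in for config/settings.yml (keys are hypothetical).
settings_yml = <<~YAML
  log_level: debug
  elasticsearch:
    host: localhost
    port: 9200
YAML

# Stand-in for config/environments/production.yml (keys are hypothetical).
production_yml = <<~YAML
  log_level: warn
  elasticsearch:
    host: es.internal.example.com
YAML

settings = deep_merge(YAML.safe_load(settings_yml), YAML.safe_load(production_yml))
puts settings.inspect
```

In this sketch the production file changes log_level and the Elasticsearch host, while the port falls through from the application-wide defaults because the environment file does not mention it.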