Wukong Deploy Pack

The Infochimps Platform is an end-to-end, managed solution for building Big Data applications. It integrates best-of-breed technologies like Hadoop, Storm, Kafka, MongoDB, Elasticsearch, HBase, &c. and provides simple interfaces for accessing these powerful tools.

Computation, analytics, scripting, &c. are all handled by Wukong within the platform. Wukong is an abstract framework for defining computations on data. Wukong processors and flows can run in many different execution contexts including:

  • locally on the command-line for testing or development purposes
  • as a Hadoop mapper or reducer for batch analytics or ETL
  • within Storm as part of a real-time data flow
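To make the idea of processors and flows concrete, here is a minimal plain-Ruby sketch. It deliberately does not use the Wukong API: the class names (Tokenizer, MinLengthFilter) and the run_flow helper are hypothetical and exist only for illustration. It assumes a processor is an object that receives one record and yields zero or more records, and that a flow feeds the output of each processor into the next.

```ruby
# Plain-Ruby illustration only. These classes and run_flow are
# hypothetical and are NOT part of the Wukong API.

# A processor receives one record and yields zero or more output records.
class Tokenizer
  # Splits a line of text into lowercase words (one record in, many out).
  def process(line)
    line.downcase.scan(/[a-z]+/) { |word| yield word }
  end
end

class MinLengthFilter
  def initialize(min)
    @min = min
  end

  # Passes a word through only if it is long enough (one record in, zero or one out).
  def process(word)
    yield word if word.length >= @min
  end
end

# Runs all records through each processor in turn, collecting whatever
# each processor yields and handing it to the next one.
def run_flow(records, processors)
  processors.reduce(records) do |input, processor|
    output = []
    input.each { |record| processor.process(record) { |out| output << out } }
    output
  end
end

lines = ["The quick brown fox", "A dog barks"]
words = run_flow(lines, [Tokenizer.new, MinLengthFilter.new(4)])
puts words.inspect
```

Because each processor only sees records and yields records, the same logic is independent of where the records come from, which is the property that lets a processor move between the execution contexts listed above.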

On the Infochimps Platform, developers build all of their processors, flows, and jobs inside a deploy pack. The deploy pack can be thought of as a container for all the Wukong code and plugins needed by an Infochimps Platform application. It includes the following libraries:

  • wukong: The core framework for writing processors and chaining them together.
  • wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
  • wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.

Installation

The deploy pack is installed as a RubyGem:

$ sudo gem install wukong-deploy

File Structure

A deploy pack is a repository with the following Rails-like file structure:

├── app
│   ├── models
│   ├── processors
│   ├── flows
│   └── jobs
├── config
│   ├── environment.rb
│   ├── application.rb
│   ├── initializers
│   ├── settings.yml
│   └── environments
│       ├── development.yml
│       ├── production.yml
│       └── test.yml
├── data
├── Gemfile
├── Gemfile.lock
├── lib
├── log
├── Rakefile
├── spec
│   ├── spec_helper.rb
│   └── support
└── tmp

Let's look at it piece by piece:

  • app: The directory with all the action. It's where you define:
    • models: Your domain models or "nouns", which define and wrap the different kinds of data elements in your application. You can build them with whatever model framework you like (the default is Gorillib).
    • processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them.
    • flows: Chain processors together into streaming flows for ingestion, real-time processing, or complex event processing (CEP).
    • jobs: Pair processors together to create batch jobs that run in Hadoop.
  • config: Where you place all application configuration for all environments
    • environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly.
    • application.rb: Requires and configures libraries specific to your application. This is where you choose a model framework and pick which application code gets loaded by default (vs. auto-loaded).
    • initializers: Holds any files you need to load before application.rb. Useful for requiring and configuring external libraries.
    • settings.yml: Defines application-wide settings.
    • environments: Defines environment-specific settings in YAML files named after the environment. These settings override those in config/settings.yml.
  • data: Holds sample data in flat files. You'll develop and test your application using this data.
  • Gemfile and Gemfile.lock: Define how libraries are resolved with Bundler.
  • lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.).
  • log: A good place to stash logs.
  • Rakefile: Defines Rake tasks for developing, testing, and deploying your application.
  • spec: Holds all your RSpec unit tests.
    • spec_helper.rb: Loads libraries you'll use during testing and includes spec helper libraries from Wukong.
    • support: Holds support code for your tests.
  • tmp: A good place to stash temporary files.
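
The way files in config/environments override config/settings.yml can be sketched with Ruby's standard YAML library. This is an illustration under stated assumptions, not the deploy pack's own loading code: the setting names (log_level, elasticsearch) and their values are made up, the deep_merge helper is hypothetical, and merging nested keys recursively is an assumption about how the override behaves.

```ruby
require 'yaml'

# Hypothetical contents of config/settings.yml (application-wide defaults).
SETTINGS_YML = <<~YAML
  log_level: info
  elasticsearch:
    host: localhost
    port: 9200
YAML

# Hypothetical contents of config/environments/production.yml.
PRODUCTION_YML = <<~YAML
  log_level: warn
  elasticsearch:
    host: es.example.com
YAML

# Recursively merges two hashes. Values from `override` win, except that
# nested hashes are merged key by key instead of being replaced wholesale.
# (Assumption: the real deploy pack may merge differently.)
def deep_merge(base, override)
  base.merge(override) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

defaults   = YAML.safe_load(SETTINGS_YML)
production = YAML.safe_load(PRODUCTION_YML)
settings   = deep_merge(defaults, production)

puts settings.inspect
```

In this sketch the production file only needs to list the values that differ; anything it leaves out, such as the Elasticsearch port, falls back to the application-wide default.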