Raka is a DSL (Domain Specific Language) on top of Rake for defining and running data processing workflows. Raka is specifically designed for data processing, with improved pattern matching, scopes, language extensions and many conventions to reduce verbosity.
## Why Raka
Data processing tasks can involve plenty of steps, each with its dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:
- Advanced pattern matching and template resolving to define general rules and maximize code reuse.
- Extensible and context-aware protocol architecture.
- Multilingual. Other programming languages can be easily embedded.
- Automatic dependency inference and naming by convention.
- Scopes to ease comparative studies.
- Terser syntax.
... and more.
Compared to more complex, GUI-based solutions (often classified as scientific-workflow software) like Kepler, Raka has the following advantages:
- Lightweight and easy to set up, especially on platforms with Ruby preinstalled.
- Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
- Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
- Expressive so a few lines of code can replace many manual operations.
## Installation
Raka is a library based on rake. Though rake is cross-platform, raka may not work on Windows since it relies on some shell facilities. To use raka, one has to install Ruby and rake first. Ruby is available for most *nix systems including macOS, so the only remaining step is to install raka:
```bash
gem install raka
```
## QuickStart
First, create a file named main.raka, then import and initialize the DSL:
```ruby
require 'raka'

dsl = DSL.new(self,
  output_types: [:txt, :table, :pdf, :idx],
  input_types: [:txt, :table]
)
```
Then the code below will define two simple rules:
```ruby
txt.sort.first50 = shell* "cat sort.txt | head -n 50 > $@"
txt.sort = [txt.input] | shell* "cat $< | sort -rn > $@"
```
For testing, let's prepare an input file named input.txt:

```bash
seq 1000 > input.txt
```
We can then invoke `rake first50__sort.txt`; the script will read data from input.txt, sort the numbers in descending order and keep the first 50 lines.
The workflow here is as follows:
- Try to find `first50__sort.txt`: it does not exist.
- Rule `txt.sort.first50` matched.
- For rule `txt.sort.first50`, find the input file `sort.txt` or `sort.table`. Neither exists.
- Rule `txt.sort` matched.
- Rule `txt.sort` has no input but a depended target `txt.input`. Find the file `input.txt` or `input.table`; use the former.
- Run rule `txt.sort` and create `sort.txt`.
- Run rule `txt.sort.first50` and create `first50__sort.txt`.
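The naming convention behind these steps can be sketched in plain Ruby. The `conventional_input` helper below is hypothetical and not part of Raka; it only illustrates how the input of a chained target is its name with the leading `__`-separated token dropped.

```ruby
# Illustrative sketch (not part of Raka): derive the conventional input name
# for a chained target by dropping the leading "__"-separated token.
# "first50__sort.txt" -> rule txt.sort.first50 -> input "sort.txt"
def conventional_input(target)
  stem, ext = target.split('.', 2)
  parts = stem.split('__')
  return nil if parts.length < 2 # no upstream step in the chain
  "#{parts[1..].join('__')}.#{ext}"
end

puts conventional_input('first50__sort.txt') # sort.txt
```

Applying this convention recursively is why raking `first50__sort.txt` first triggers the `txt.sort` rule to produce `sort.txt`.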
This illustrates some basic ideas but may not be particularly interesting. The following is a much more sophisticated example from real-world research which covers more features.
```ruby
SRC_DIR = File.absolute_path 'src'
USER = 'postgres'
DB = 'osm'
HOST = 'localhost'
PORT = 5432

def idx_this() [idx._('$(output_stem)')] end

dsl.scope :de

idx._ = psqlf(script_name: '$stem_idx.sql')
pdf.buildings.func['(\S+)_graph'] = r(:graph)* %[
  table_input("$(input_stem)") | draw_%{func0} | ggplot_output('$@') ]
table.buildings = [csv.admin] | psqlf(admin: '$<') | idx_this
```
Assume that we have a schema named de in the database osm, an input file admin.csv, and graph.R and buildings.sql under src/. Now further assume that graph.R contains two functions:
```r
draw_stat_snapshot <- function(d) { ... }
draw_user_trend <- function(d) { ... }
```
...and buildings.sql contains table creation code like:
```sql
DROP TABLE IF EXISTS buildings;
CREATE TABLE buildings AS ( ... );
```
We may also have a buildings_idx.sql to create an index for the table.
Then we can run either `rake de/stat_snapshot_graph__buildings.pdf` or `rake de/user_trend_graph__buildings.pdf`, which will do a number of things on the first run (taking the former as an example):
- Target file not found.
- Rule `pdf.buildings.func['(\S+)_graph']` matched: "stat_snapshot_graph" is bound to `func` and "stat_snapshot" is bound to `func0`.
- None of the four possible input files (de/buildings.table, de/buildings.txt, buildings.table, buildings.txt) can be found. Rule `table.buildings` is matched and the only dependency file admin.csv is found.
- The protocol `psqlf` finds the source file src/buildings.sql, interpolates the options with automatic variables (`$<` as "admin.csv"), runs the SQL, and creates a placeholder file de/buildings.table afterwards.
- Run the post-job `idx_this`; according to the rule `idx._`, it will find and run buildings_idx.sql, then create a placeholder file de/buildings.idx.
- For rule `pdf.buildings.func['(\S+)_graph']`, the R code in `%[]` is interpolated with several automatic variables (`$(input_stem)` as "buildings", `$@` as "de/stat_snapshot_graph__buildings.pdf") and the variables bound before (`func`, `func0`).
- Run the R code: the buildings table is piped into the function `draw_stat_snapshot` and then into `ggplot_output`, which writes the graph to the specified PDF file.
## Syntax of Rules
It is possible to use Raka with little knowledge of Ruby/Rake, though a minimal understanding is highly recommended. The formal syntax of a rule can be defined as follows (in EBNF form):
```ebnf
rule = lexpr "=" {target_list "|"} protocol {"|" target_list};
target = rexpr | template;
target_list = "[]" | "[" target {"," target} "]";
lexpr = ext "." {ltoken "."} ltoken;
rexpr = ext "." rtoken {"." rtoken};
ltoken = word | word "[" pattern "]";
rtoken = word | word "(" template ")";
word = ("_" | letter) { letter | digit | "_" };
protocol = ("shell" | "r" | "psql") ("*" template | BLOCK )
         | "psqlf" | "psqlf" "(" HASH ")";
```
The definition is concise but several details are omitted for simplicity:
- BLOCK and HASH are Ruby's block and hash objects.
- A template is just a Ruby string with some placeholders (see the next section for details).
- A pattern is just a Ruby string representing a regex (see the next section for details).
- The listed protocols are merely the ones offered now; the set can be extended.
- Nearly any concept in the syntax can be replaced by a suitable Ruby variable.
## Pattern matching and template resolving
When defining a rule like `lexpr = rexpr`, the left side represents a pattern and the right side contains specifications for extra dependencies, actions and some targets to create afterwards. When raking a target file, the left sides of the rules are examined one by one until a rule matches. The matching process is based on regexes and also supports named captures, so that variables can be bound for use on the right side.

The specifications on the right side of a rule can be incomplete in various ways, that is, they can contain templates. The "holes" in the templates are filled by automatic variables and by variables bound while matching the left side.
### Pattern matching
To match a given file against a lexpr, aside from the extension, the substrings of the file name separated by "__" are mapped to the tokens separated by `.`, in reverse order. After that, each substring is matched against the corresponding token or the regex in `[]`. For example, the rule
```ruby
pdf.buildings.indicator['\S+'].top['top_(\d+)']
```
can match "top_50__node_num__buildings.pdf". The logical process is:
- The extension `pdf` matches.
- The substrings and the tokens are paired, and they all match:
  - `buildings` ~ `buildings`
  - `'\S+'` ~ `node_num`
  - `top_(\d+)` ~ `top_50`
- Two levels of captures are made. First, 'node_num' is captured as `indicator` and 'top_50' is captured as `top`; second, '50' is captured as `top0`, since `\d+` is wrapped in parentheses and is the first group.
One can write the special token `_`, or `something[]` if the captured value is useful later, as syntax sugar for `something['\S+']`.
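The matching process above can be sketched in plain Ruby. This is a simplified illustration under assumed semantics, not Raka's actual implementation; `match_target` and the token-pair representation are made up for the sketch.

```ruby
# Simplified sketch of Raka-style target matching (not the real implementation).
# A rule like pdf.buildings.indicator['\S+'].top['top_(\d+)'] becomes an
# extension plus an ordered list of [name, pattern] token pairs.
def match_target(target, ext, tokens)
  stem, target_ext = target.split('.', 2)
  return nil unless target_ext == ext
  parts = stem.split('__').reverse # substrings map to tokens in reverse order
  return nil unless parts.length == tokens.length
  captures = {}
  tokens.zip(parts).each do |(name, pattern), part|
    m = /\A#{pattern}\z/.match(part)
    return nil unless m
    captures[name] = part # first level: the whole substring
    m.captures.each_with_index do |c, i| # second level: groups in the pattern
      captures["#{name}#{i}"] = c
    end
  end
  captures
end

tokens = [['buildings', 'buildings'], ['indicator', '\S+'], ['top', 'top_(\d+)']]
p match_target('top_50__node_num__buildings.pdf', 'pdf', tokens)
# captures indicator => "node_num", top => "top_50", top0 => "50"
```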
### Template resolving
In some places of a rexpr, templates can be written instead of plain strings, so that they can represent different values at runtime. There are two types of variables that can be used in templates. The first is automatic variables, just like `$@` in Make or `task.name` in Rake. We even preserve some Make conventions for easier migration. All automatic variables begin with `$`. The possible automatic variables are:
| symbol | meaning | symbol | meaning |
|---|---|---|---|
| \$@ | output file | \$^ | all dependencies (separated by spaces) |
| \$< | first dependency | $0, $1, … \$i | ith dependency |
| \$(scope) | scope of the current task | \$(output_stem) | stem of the output file |
| \$(input_stem) | stem of the input file | | |
The other type of variables are those bound during pattern matching, which can be referred to using `%{var}`. In the example of the pattern matching section, `%{indicator}` will be replaced by node_num, `%{top}` by top_50 and `%{top0}` by 50. In this case, a template such as 'calculate top %{top0} of %{indicator} for $@' will be resolved to 'calculate top 50 of node_num for top_50__node_num__buildings.pdf'.
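This two-stage substitution can be pictured with a few lines of Ruby. The `resolve` helper below is illustrative only; Raka's real resolver handles more cases.

```ruby
# Illustrative sketch of template resolution (not Raka's actual code):
# automatic variables ($@, $<, ...) and pattern-bound variables (%{var})
# are substituted before the template reaches a protocol.
def resolve(template, auto_vars, bound_vars)
  text = template.dup
  auto_vars.each { |sym, val| text = text.gsub(sym, val) }
  bound_vars.each { |name, val| text = text.gsub("%{#{name}}", val) }
  text
end

auto  = { '$@' => 'top_50__node_num__buildings.pdf' }
bound = { 'indicator' => 'node_num', 'top0' => '50' }
puts resolve('calculate top %{top0} of %{indicator} for $@', auto, bound)
# calculate top 50 of node_num for top_50__node_num__buildings.pdf
```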
The replacement of variables happens before any other processing of the template string, so do not include the symbols of automatic variables or `%{<anything>}` literally in templates.
Templates can appear in various places. For dependencies and post jobs, tokens with parentheses can wrap templates, like `csv._('%{indicator}')`. The symbol of a token with parentheses is of no use and is generally omitted. It is also possible to write a template literal directly, e.g. `'%{indicator}.csv'`. Where templates can be applied in actions depends on the protocol and will be explained later in the Protocols section.
## APIs
### Initialization and options
These APIs are bound to an instance of DSL; you can create the object at the top:

```ruby
dsl = DSL.new(<env>, <options>)
```

The argument `<env>` should be the `self` of a running Rakefile. In most cases you can directly write:

```ruby
dsl = DSL.new(self, <options>)
```
The options argument currently supports `output_types` and `input_types`. For each item in `output_types`, you get an extra function to bootstrap a rule. For example, with
```ruby
dsl = DSL.new(self, { output_types: [:csv, :pdf] })
```
you can write rules like:

```ruby
csv.data = ...
pdf.graph = ...
```

which will generate data.csv and graph.pdf respectively.
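One way to picture how `output_types` yields those bootstrap functions is Ruby metaprogramming. `MiniDSL` and `RuleHead` below are hypothetical names invented for this sketch; the real DSL is more involved.

```ruby
# Hypothetical sketch: each output type becomes a method that starts a rule.
# MiniDSL and RuleHead are made-up names for illustration only.
RuleHead = Struct.new(:ext)

class MiniDSL
  def initialize(output_types)
    output_types.each do |ext|
      # one bootstrap method per configured output type
      define_singleton_method(ext) { RuleHead.new(ext) }
    end
  end
end

mini = MiniDSL.new([:csv, :pdf])
p mini.csv.ext # :csv
```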
The `input_types` option determines the strategy for finding inputs. For example, raka will try to find both numbers.csv and numbers.table for a rule like `table.numbers.mean = …` if `input_types: [:csv, :table]` is given.
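The lookup convention could be sketched as follows. The `find_input` helper is hypothetical; Raka's actual search order (and its scope handling) may differ.

```ruby
# Hypothetical find_input helper illustrating the input-lookup convention;
# for a rule's input stem, try each configured input type until a file exists.
def find_input(stem, input_types, scope = nil)
  candidates = input_types.flat_map do |ext|
    scoped = scope ? ["#{scope}/#{stem}.#{ext}"] : []
    scoped + ["#{stem}.#{ext}"]
  end
  candidates.find { |path| File.exist?(path) }
end

# With input_types [:csv, :table] and no scope, the candidates for "numbers"
# are numbers.csv, then numbers.table.
```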
### Scope
### Protocols
Currently Raka supports four protocols: `shell`, `psql`, `r` and `psqlf`.
```ruby
shell(base_dir='./')* code::templ_str { |task| ... }
psql(options={})* code::templ_str { |task| ... }
r(src:str, libs=[])* code::templ_str { |task| ... }

# options = { script_name: , script_file: , params: }
psqlf(options={})
```
## Rakefile Template
## Write your own protocols
## Compare to other tools
Raka borrows some ideas from Drake, but not many (currently mainly the name "protocol"). Briefly, the two tools have different visions and perhaps different suitable scenarios.