streaming_join
Ruby classes intended to be used in Hadoop Streaming reducers. It has been tested with jruby 1.6+ and ruby 1.9.2+.
Examples (found in the examples directory) use the test data and scenarios from here:
en.wikipedia.org/wiki/Join_(SQL)
The equivalent sql for each example is listed in its directory. The supported join types are:
inner_join cross_join left_outer_join right_outer_join full_outer_join merge_rows
API
As with most map/reduce jobs, there are two basic components:
Mapper
The mapper in a streaming_join job outputs records of the form:
key TAB side_index TAB value
The “left” side uses side_index 0 while the right side would use 1.
In a simple join, here is one example mapper output:
1 0 Matsumoto 2 0 Wall 3 0 van Rossum
Here is the other side’s output:
1 1 Ruby 1995 2 1 Perl 1987 4 1 Clojure 2007
Reducer
Depending on the join style, the reducer will emit combined records. Using the above mapper output, an inner join would end up like:
1 Matsumoto Ruby 2 Wall Perl
The code would look something like this:
Mapper
require ‘streaming_join’
j = JoinMapper.new j.add_side(/_left/, 0, 1) j.add_side(/_right/, 0, 1) j.process_stream
Reducer
require ‘streaming_join’
Join.new.process_stream
See examples/job for more detail and comments.
Current Limitations
-
As each key is processed in a reducer, the left side of that single keyspace must fit into memory. So, when in doubt, put the smaller table on the left.
-
Only two tables can be joined in a single job.
Please let me know if you find this software useful!
–frank