Module: Cascading::RegexOperations
- Included in: Assembly
- Defined in: lib/cascading/regex_operations.rb
Overview
Module of pipe assemblies that wrap operations defined in Cascading's cascading.operation.regex package. These are split out only to group similar functionality.
All DSL regex pipes require an input_field, a regex, and either a single into_field or one or more into_fields. Requiring a single input field lets us raise an exception early if the wrong input is specified, and avoids the non-intuitive situation in Cascading where the first of many fields is silently taken. Requiring a regex means you don’t have to go looking for defaults in code. Requiring into_field(s) means we can propagate field names through the dataflow.
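The early single-field check can be pictured in plain Ruby. This is only a minimal sketch: `ensure_single_field` is a hypothetical helper, and plain arrays stand in for the Cascading field objects that the real methods build with `fields`.

```ruby
# Hypothetical helper (not part of the DSL): normalizes a field spec to an
# array of names and raises unless it names exactly one field.
def ensure_single_field(name, spec)
  field_names = Array(spec)
  unless field_names.size == 1
    raise ArgumentError, "#{name} must declare exactly one field, was '#{field_names.join(', ')}'"
  end
  field_names
end

ensure_single_field('input_field', 'line') # a single name passes the check

begin
  ensure_single_field('input_field', ['a', 'b'])
rescue ArgumentError => e
  puts e.message # the mistake surfaces immediately, before any data flows
end
```

Failing at this point, rather than silently taking the first field, is the design choice described above.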
Mapping of DSL pipes into Cascading regex operations:
- parse: RegexParser
- split: RegexSplitter
- split_rows: RegexSplitGenerator
- match_rows: RegexGenerator
- replace: RegexReplace
Instance Method Summary
- #match_rows(input_field, regex, into_field, options = {}) ⇒ Object (also: #regex_generator)
  Emits a new row for each regex group matched in input_field using the specified regular expression.
- #parse(input_field, regex, into_fields, options = {}) ⇒ Object (also: #regex_parser)
  Parses the given input_field using the specified regular expression to produce one output per group in that expression.
- #replace(input_field, regex, into_field, replacement, options = {}) ⇒ Object (also: #regex_replace)
  Performs a query/replace on the given input_field using the specified regular expression and replacement.
- #split(input_field, regex, into_fields, options = {}) ⇒ Object (also: #regex_splitter)
  Splits the given input_field into multiple fields using the specified regular expression.
- #split_rows(input_field, regex, into_field, options = {}) ⇒ Object (also: #regex_split_generator)
  Splits the given input_field into new rows using the specified regular expression.
Instance Method Details
#match_rows(input_field, regex, into_field, options = {}) ⇒ Object
Also known as: regex_generator
Emits a new row for each regex group matched in input_field using the specified regular expression.
Example:
match_rows 'line', /(\w+)\s+(\w+)/, 'word'
# File 'lib/cascading/regex_operations.rb', line 91
def match_rows(input_field, regex, into_field, options = {})
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1
  into_field = fields(into_field)
  raise "into_field must declare exactly one field, was '#{into_field}'" unless into_field.size == 1

  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexGenerator.new(into_field, regex.to_s),
    :output => output
  )
end
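To make the row-generating behaviour concrete, here is a plain-Ruby stand-in. It is a sketch under assumptions, not the Cascading implementation: `match_rows_sketch` is a hypothetical name, a Hash stands in for a tuple, and it assumes one output row per successive match of the pattern. How RegexGenerator treats capture groups is not modeled here, so the sketch uses a pattern without groups.

```ruby
# Hypothetical stand-in: emits one output row per successive match of regex
# in row[input_field]. Each output row keeps the input fields, mirroring the
# DSL default of :output => all_fields.
def match_rows_sketch(row, input_field, regex, into_field)
  matches = []
  # In block form, String#scan sets the last match for every hit;
  # Regexp.last_match(0) is the whole matched text.
  row[input_field].to_s.scan(regex) { matches << Regexp.last_match(0) }
  matches.map { |match| row.merge(into_field => match) }
end

rows = match_rows_sketch({ 'line' => 'the quick fox' }, 'line', /\w+/, 'word')
rows.each { |r| puts r.inspect }
```

One input row becomes three output rows here, each carrying the original 'line' value alongside a 'word' value.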
#parse(input_field, regex, into_fields, options = {}) ⇒ Object
Also known as: regex_parser
Parses the given input_field using the specified regular expression to produce one output per group in that expression.
The named options are:
- groups: Array of integers specifying which groups to capture if you want a subset of groups.
Example:
parse 'field1', /(\w+)\s+(\w+)/, ['out1', 'out2'], :groups => [1, 2]
# File 'lib/cascading/regex_operations.rb', line 30
def parse(input_field, regex, into_fields, options = {})
  groups = options[:groups].to_java(:int) if options[:groups]
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1

  parameters = [fields(into_fields), regex.to_s, groups].compact
  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexParser.new(*parameters),
    :output => output
  )
end
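The effect of the groups option can be illustrated without Cascading. The following is a rough sketch with a hypothetical `parse_sketch` method and a Hash standing in for a tuple. It assumes that omitting groups captures every group in order, and it returns nil for non-matching input purely for illustration, since that case is not described above.

```ruby
# Hypothetical stand-in: maps selected capture groups onto into_fields.
def parse_sketch(row, input_field, regex, into_fields, groups = nil)
  match = regex.match(row[input_field].to_s)
  return nil unless match # non-matching input is out of scope for this sketch

  groups ||= (1...match.size).to_a # assumed default: every capture group, in order
  values = groups.map { |index| match[index] }
  row.merge(into_fields.zip(values).to_h)
end

row = { 'field1' => 'hello world' }
puts parse_sketch(row, 'field1', /(\w+)\s+(\w+)/, ['out1', 'out2'], [1, 2]).inspect
puts parse_sketch(row, 'field1', /(\w+)\s+(\w+)/, ['second'], [2]).inspect
```

The second call shows the subset case: only group 2 is captured, so a single into field is enough.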
#replace(input_field, regex, into_field, replacement, options = {}) ⇒ Object
Also known as: regex_replace
Performs a query/replace on the given input_field using the specified regular expression and replacement.
The named options are:
- replace_all: Boolean indicating if all matches should be replaced; defaults to true (the Cascading default).
Example:
replace 'line', /[.,]*\s+/, 'tab_separated_line', "\t"
# File 'lib/cascading/regex_operations.rb', line 116
def replace(input_field, regex, into_field, replacement, options = {})
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1
  into_field = fields(into_field)
  raise "into_field must declare exactly one field, was '#{into_field}'" unless into_field.size == 1

  parameters = [into_field, regex.to_s, replacement.to_s, options[:replace_all]].compact
  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexReplace.new(*parameters),
    :output => output
  )
end
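The difference that replace_all makes can be seen with Ruby's own gsub and sub. This is an illustrative sketch only: `replace_sketch` is a hypothetical name, a Hash stands in for a tuple, and Ruby's regex engine is used in place of the Java one that Cascading runs.

```ruby
# Hypothetical stand-in: replaces every match (replace_all true, the default)
# or only the first match (replace_all false) and stores the result in into_field.
def replace_sketch(row, input_field, regex, into_field, replacement, replace_all = true)
  value = row[input_field].to_s
  replaced = replace_all ? value.gsub(regex, replacement) : value.sub(regex, replacement)
  row.merge(into_field => replaced)
end

row = { 'line' => 'Hello, world. Bye' }
puts replace_sketch(row, 'line', /[.,]*\s+/, 'tab_separated_line', "\t").inspect
puts replace_sketch(row, 'line', /[.,]*\s+/, 'tab_separated_line', "\t", false).inspect
```

With the default, both separators become tabs; with replace_all set to false, only the first one does.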
#split(input_field, regex, into_fields, options = {}) ⇒ Object
Also known as: regex_splitter
Splits the given input_field into multiple fields using the specified regular expression.
Example:
split 'line', /\s+/, ['out1', 'out2']
# File 'lib/cascading/regex_operations.rb', line 51
def split(input_field, regex, into_fields, options = {})
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1

  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexSplitter.new(fields(into_fields), regex.to_s),
    :output => output
  )
end
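Splitting into fields keeps one row and widens it. A plain-Ruby picture of that, assuming a hypothetical `split_sketch` method and a Hash in place of a tuple, might look like the following. What Cascading does when the number of pieces differs from the number of into_fields is not modeled.

```ruby
# Hypothetical stand-in: splits row[input_field] on regex and assigns the
# pieces, in order, to into_fields within the same row.
def split_sketch(row, input_field, regex, into_fields)
  values = row[input_field].to_s.split(regex)
  row.merge(into_fields.zip(values).to_h)
end

puts split_sketch({ 'line' => 'foo   bar' }, 'line', /\s+/, ['out1', 'out2']).inspect
```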
#split_rows(input_field, regex, into_field, options = {}) ⇒ Object
Also known as: regex_split_generator
Splits the given input_field into new rows using the specified regular expression.
Example:
split_rows 'line', /\s+/, 'word'
# File 'lib/cascading/regex_operations.rb', line 70
def split_rows(input_field, regex, into_field, options = {})
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1
  into_field = fields(into_field)
  raise "into_field must declare exactly one field, was '#{into_field}'" unless into_field.size == 1

  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexSplitGenerator.new(into_field, regex.to_s),
    :output => output
  )
end
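In contrast to split, which widens a single row, split_rows multiplies rows. The sketch below shows this in plain Ruby. `split_rows_sketch` is a hypothetical name and a Hash stands in for a tuple, so treat it as an approximation of the dataflow rather than the Cascading code.

```ruby
# Hypothetical stand-in: emits one output row per piece of the split value.
# Each output row keeps the input fields, mirroring :output => all_fields.
def split_rows_sketch(row, input_field, regex, into_field)
  row[input_field].to_s.split(regex).map { |value| row.merge(into_field => value) }
end

rows = split_rows_sketch({ 'line' => 'to be or not' }, 'line', /\s+/, 'word')
rows.each { |r| puts r.inspect }
```

This shape is what makes split_rows a natural first step for word-count style flows: each word ends up in its own row under a single field name.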