Module: Cascading::RegexOperations

Included in:
Assembly
Defined in:
lib/cascading/regex_operations.rb

Overview

Module of pipe assemblies that wrap operations defined in Cascading's cascading.operation.regex package. They are split out only to group similar functionality.

All DSL regex pipes require an input_field, a regex, and either a single into_field or one or more into_fields. Requiring exactly one input field lets the DSL raise an exception early when the wrong input is specified, and avoids the unintuitive Cascading behavior in which the first of many fields is silently taken. Requiring a regex means you don’t have to go looking for defaults in code. Requiring into_field(s) lets field names propagate through the dataflow.
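
The early-failure rule described above can be illustrated in plain Ruby, without JRuby or Cascading on the classpath. This is only a sketch: the helper name `ensure_single_field` is hypothetical and stands in for the `fields(...)` size check that appears in the method sources on this page.

```ruby
# Hypothetical helper (not part of the DSL): normalizes a field spec to an
# array of names and rejects anything other than exactly one field.
def ensure_single_field(label, spec)
  names = Array(spec).map(&:to_s)
  unless names.size == 1
    raise ArgumentError, "#{label} must declare exactly one field, was '#{names.join(', ')}'"
  end
  names.first
end

ensure_single_field('input_field', 'line') # accepted, returns the field name

begin
  ensure_single_field('input_field', %w[line other]) # two fields: rejected
rescue ArgumentError => e
  puts e.message
end
```

Failing at pipe-definition time, as in this sketch, surfaces a bad field spec before any data is processed.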

Mapping of DSL pipes into Cascading regex operations:

parse: RegexParser
split: RegexSplitter
split_rows: RegexSplitGenerator
match_rows: RegexGenerator
replace: RegexReplace

Instance Method Details

#match_rows(input_field, regex, into_field, options = {}) ⇒ Object

Also known as: regex_generator

Emits a new row for each regex group matched in input_field using the specified regular expression.

Example:

match_rows 'line', /(\w+)\s+(\w+)/, 'word'


# File 'lib/cascading/regex_operations.rb', line 91

def match_rows(input_field, regex, into_field, options = {})
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1
  into_field = fields(into_field)
  raise "into_field must declare exactly one field, was '#{into_field}'" unless into_field.size == 1

  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexGenerator.new(into_field, regex.to_s),
    :output => output
  )
end

#parse(input_field, regex, into_fields, options = {}) ⇒ Object

Also known as: regex_parser

Parses the given input_field using the specified regular expression to produce one output field per group in that expression.

The named options are:

groups

Array of integers specifying which groups to capture if you want a subset of groups.

Example:

parse 'field1', /(\w+)\s+(\w+)/, ['out1', 'out2'], :groups => [1, 2]


# File 'lib/cascading/regex_operations.rb', line 30

def parse(input_field, regex, into_fields, options = {})
  groups = options[:groups].to_java(:int) if options[:groups]
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1

  parameters = [fields(into_fields), regex.to_s, groups].compact
  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexParser.new(*parameters),
    :output => output
  )
end
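
To make the effect of :groups concrete, the following plain-Ruby approximation pairs selected capture groups with output field names. It is a sketch only: `parse_sketch` is a hypothetical helper, it uses Ruby's regex engine rather than the Java engine behind RegexParser, and returning nil on a failed match is an assumption of the sketch, not documented Cascading behavior.

```ruby
# Hypothetical plain-Ruby approximation of selecting capture groups.
# groups lists 1-based capture group indexes; nil means "all groups in order".
def parse_sketch(value, regex, into_fields, groups = nil)
  match = regex.match(value)
  return nil unless match # assumption of this sketch: no match yields nil

  groups ||= (1..match.size - 1).to_a
  unless groups.size == into_fields.size
    raise ArgumentError, 'one output field is needed per selected group'
  end
  # Pair each output field name with the text captured by its group.
  into_fields.zip(groups.map { |g| match[g] }).to_h
end

both = parse_sketch('hello world', /(\w+)\s+(\w+)/, %w[out1 out2], [1, 2])
puts both['out1'] # hello
puts both['out2'] # world

only_second = parse_sketch('hello world', /(\w+)\s+(\w+)/, %w[second], [2])
puts only_second['second'] # world
```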

#replace(input_field, regex, into_field, replacement, options = {}) ⇒ Object

Also known as: regex_replace

Performs a search and replace on the given input_field using the specified regular expression and replacement.

The named options are:

replace_all

Boolean indicating if all matches should be replaced; defaults to true (the Cascading default).

Example:

replace 'line', /[.,]*\s+/, 'tab_separated_line', "\t"


# File 'lib/cascading/regex_operations.rb', line 116

def replace(input_field, regex, into_field, replacement, options = {})
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1
  into_field = fields(into_field)
  raise "into_field must declare exactly one field, was '#{into_field}'" unless into_field.size == 1

  parameters = [into_field, regex.to_s, replacement.to_s, options[:replace_all]].compact
  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexReplace.new(*parameters),
    :output => output
  )
end
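
The difference that replace_all makes can be previewed in plain Ruby with gsub and sub. This is a sketch under stated assumptions: `replace_sketch` is a hypothetical helper, and it runs on Ruby's regex engine, whereas the DSL hands the pattern to Java. Replacement back-references in particular differ between the two (Ruby uses \1, Java uses $1), so the sketch sticks to a literal replacement.

```ruby
# Hypothetical plain-Ruby approximation of the replace_all option.
def replace_sketch(value, regex, replacement, replace_all = true)
  if replace_all
    value.gsub(regex, replacement) # replace every match
  else
    value.sub(regex, replacement)  # replace only the first match
  end
end

line = 'one, two.  three'
p replace_sketch(line, /[.,]*\s+/, "\t")        # "one\ttwo\tthree"
p replace_sketch(line, /[.,]*\s+/, "\t", false) # "one\ttwo.  three"
```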

#split(input_field, regex, into_fields, options = {}) ⇒ Object

Also known as: regex_splitter

Splits the given input_field into multiple fields using the specified regular expression.

Example:

split 'line', /\s+/, ['out1', 'out2']


# File 'lib/cascading/regex_operations.rb', line 51

def split(input_field, regex, into_fields, options = {})
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1

  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexSplitter.new(fields(into_fields), regex.to_s),
    :output => output
  )
end
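
As a rough illustration of splitting one value into several named fields, the plain-Ruby sketch below pairs each piece with a field name. `split_sketch` is a hypothetical helper; raising when the number of pieces differs from the number of declared fields is a choice made for the sketch, and this page does not describe what Cascading does in that case.

```ruby
# Hypothetical plain-Ruby approximation of splitting a value into named fields.
def split_sketch(value, regex, into_fields)
  parts = value.split(regex)
  unless parts.size == into_fields.size
    raise ArgumentError, "expected #{into_fields.size} values, got #{parts.size}"
  end
  # Pair each declared field name with the piece at the same position.
  into_fields.zip(parts).to_h
end

fields = split_sketch('hello world', /\s+/, %w[out1 out2])
puts fields['out1'] # hello
puts fields['out2'] # world
```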

#split_rows(input_field, regex, into_field, options = {}) ⇒ Object

Also known as: regex_split_generator

Splits the given input_field into new rows using the specified regular expression.

Example:

split_rows 'line', /\s+/, 'word'


# File 'lib/cascading/regex_operations.rb', line 70

def split_rows(input_field, regex, into_field, options = {})
  output = options[:output] || all_fields # Overrides Cascading default

  input_field = fields(input_field)
  raise "input_field must declare exactly one field, was '#{input_field}'" unless input_field.size == 1
  into_field = fields(into_field)
  raise "into_field must declare exactly one field, was '#{into_field}'" unless into_field.size == 1

  each(
    input_field,
    :function => Java::CascadingOperationRegex::RegexSplitGenerator.new(into_field, regex.to_s),
    :output => output
  )
end
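
The row fan-out can be pictured with a plain-Ruby sketch in which a row is a Hash of field names to values. `split_rows_sketch` is a hypothetical helper. It keeps the original fields on every emitted row to mirror the all_fields output default visible in the source above, and it uses Ruby's String#split rather than the Java regex engine, so edge cases such as empty pieces may differ.

```ruby
# Hypothetical plain-Ruby approximation: one output row per split piece,
# each carrying the original fields plus the new into_field.
def split_rows_sketch(row, input_field, regex, into_field)
  row.fetch(input_field).split(regex).map do |piece|
    row.merge(into_field => piece)
  end
end

rows = split_rows_sketch({ 'line' => 'the quick fox' }, 'line', /\s+/, 'word')
rows.each { |r| puts "#{r['word']} (from: #{r['line']})" }
```

One input row produces three output rows here, one per word, and each still carries the original line field.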