Module: Wukong::Hadoop::EnvMethods
- Defined in:
- lib/wukong-hadoop/hadoop_env_methods.rb
Overview
Hadoop streaming exposes several environment variables to the scripts it executes. This module provides methods that make these variables easy to access from within a processor.
Since these environment variables are ultimately set by Hadoop’s streaming jar when executing inside Hadoop, you’ll have to set them manually when testing locally.
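One way to set these variables for a local test is sketched below. `EnvMethodsSketch` is a stand-in that copies two of the methods listed on this page so the example runs without the gem; `LocalProcessor`, `with_env`, and the HDFS path are hypothetical names and values chosen for illustration, not part of Wukong.

```ruby
# Stand-in for Wukong::Hadoop::EnvMethods (assumption: copied from the
# method listings on this page so the sketch runs without the gem).
module EnvMethodsSketch
  def input_file
    ENV['map_input_file']
  end

  def map_input_start_offset
    ENV['map_input_start']
  end
end

# Hypothetical processor class, for illustration only.
class LocalProcessor
  include EnvMethodsSketch
end

# Hypothetical helper: run a block with temporary environment
# variables, then restore whatever values were there before.
def with_env(vars)
  saved = vars.keys.map { |key| [key, ENV[key]] }.to_h
  vars.each { |key, value| ENV[key] = value }
  yield
ensure
  # Assigning nil removes a variable that was previously unset.
  saved.each { |key, value| ENV[key] = value } if saved
end

seen = with_env('map_input_file'  => 'hdfs://namenode/logs/access-20100101.log',
                'map_input_start' => '0') do
  processor = LocalProcessor.new
  [processor.input_file, processor.map_input_start_offset]
end
```

Restoring the previous values in `ensure` keeps one test from leaking Hadoop-style variables into the next.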
Via @pskomoroch via @tlipcon:
"there is a little known Hadoop Streaming trick buried in this Python
script. You will notice that the date is not actually in the raw log
data itself, but is part of the filename. It turns out that Hadoop makes
job parameters you would fetch in Java with something like
job.get("mapred.input.file") available as environment variables for
streaming jobs, with periods replaced with underscores:
    filepath = os.environ["map_input_file"]
    filename = os.path.split(filepath)[-1]
Instance Method Summary
- #attempt_id ⇒ String
  ID of the current map/reduce attempt.
- #curr_task_id ⇒ String
  ID of the current map/reduce task.
- #hadoop_streaming_parameter(name) ⇒ String
  Fetch a parameter set by Hadoop streaming in the environment of the currently executing process.
- #input_dir ⇒ String
  Directory of the (data) file currently being processed.
- #input_file ⇒ String
  Path of the (data) file currently being processed.
- #map_input_length ⇒ String
  Length of the chunk currently being processed within the current input file.
- #map_input_start_offset ⇒ String
  Offset of the chunk currently being processed within the current input file.
Instance Method Details
#attempt_id ⇒ String
ID of the current map/reduce attempt.
    # File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 65
    def attempt_id
      ENV['mapred_task_id']
    end
#curr_task_id ⇒ String
ID of the current map/reduce task.
    # File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 72
    def curr_task_id
      ENV['mapred_tip_id']
    end
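Both IDs come back as plain strings. The sketch below assumes they follow the conventional Hadoop layout, such as `task_200707121733_0003_m_000005` for a task and the same with an `attempt_` prefix and a trailing attempt number for an attempt, and uses that to tell map tasks from reduce tasks. `task_phase` is a hypothetical helper, not part of this module, and the ID layout is an assumption to check against your Hadoop version.

```ruby
# Hypothetical helper: guess the phase from a task or attempt ID.
# Assumes the conventional layout
#   (task|attempt)_<cluster start time>_<job number>_<m|r>_<task number>[_<attempt number>]
def task_phase(id)
  return nil if id.nil?

  case id
  when /\A(?:task|attempt)_\d+_\d+_m_\d+/ then :map
  when /\A(?:task|attempt)_\d+_\d+_r_\d+/ then :reduce
  end
end

task_phase('attempt_200707121733_0003_m_000005_0')
task_phase('task_200707121733_0003_r_000002')
```

Anything that does not match the assumed layout, including an unset variable, yields `nil` rather than a guess.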
#hadoop_streaming_parameter(name) ⇒ String
Fetch a parameter set by Hadoop streaming in the environment of the currently executing process.
    # File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 30
    def hadoop_streaming_parameter name
      ENV[name.gsub('.', '_')]
    end
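The name translation can be tried outside Hadoop with a short sketch. `EnvMethodsSketch` is a stand-in copy of the method above, and the `mapred_job_name` value is set by hand here to simulate what the streaming jar would export for the `mapred.job.name` job parameter.

```ruby
# Stand-in for Wukong::Hadoop::EnvMethods (assumption: copied from the
# listing above so the sketch runs without the gem).
module EnvMethodsSketch
  def hadoop_streaming_parameter(name)
    # Periods in the Hadoop parameter name become underscores.
    ENV[name.gsub('.', '_')]
  end
end

lookup = Object.new.extend(EnvMethodsSketch)

# Simulate the variable Hadoop streaming would export.
ENV['mapred_job_name'] = 'example-job'

job_name = lookup.hadoop_streaming_parameter('mapred.job.name')
missing  = lookup.hadoop_streaming_parameter('no.such.parameter')
```

Here `job_name` holds the simulated value and `missing` is `nil`, since `ENV` returns `nil` for variables that are not set.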
#input_dir ⇒ String
Directory of the (data) file currently being processed.
    # File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 44
    def input_dir
      ENV['mapred_input_dir']
    end
#input_file ⇒ String
Path of the (data) file currently being processed.
    # File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 37
    def input_file
      ENV['map_input_file']
    end
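The trick described in the overview, recovering a date that appears only in the file name, might look like this in Ruby. `date_from_input_file`, the `YYYYMMDD` naming scheme, and the example path are assumptions made for illustration; pass it the value returned by `input_file`.

```ruby
require 'date'

# Hypothetical helper: pull a YYYYMMDD date out of the input file's name.
# Returns nil when the path is unset or contains no valid date.
def date_from_input_file(path)
  return nil if path.nil?

  # Only the last path component is searched, not the directories.
  match = File.basename(path).match(/(\d{4})(\d{2})(\d{2})/)
  return nil unless match

  year, month, day = match.captures.map(&:to_i)
  Date.valid_date?(year, month, day) ? Date.new(year, month, day) : nil
end

date_from_input_file('hdfs://namenode/logs/pagecounts-20100101.log')
```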
#map_input_length ⇒ String
Length of the chunk currently being processed within the current input file.
    # File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 58
    def map_input_length
      ENV['map_input_length']
    end
#map_input_start_offset ⇒ String
Offset of the chunk currently being processed within the current input file.
    # File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 51
    def map_input_start_offset
      ENV['map_input_start']
    end
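Because environment values are strings, the offset and length need converting before any arithmetic. A minimal sketch, assuming both values are decimal integers; `input_split_range` is a hypothetical helper that takes the results of `map_input_start_offset` and `map_input_length`.

```ruby
# Hypothetical helper: offsets covered by the current chunk, as an
# end-exclusive Range. Returns nil when either value is unset.
def input_split_range(start_offset, length)
  return nil if start_offset.nil? || length.nil?

  # Integer(..., 10) raises ArgumentError on non-numeric input
  # instead of silently returning 0 the way String#to_i would.
  first = Integer(start_offset, 10)
  first...(first + Integer(length, 10))
end

input_split_range('1024', '512')
```

Using `Integer` rather than `to_i` is a deliberate choice here: a malformed value fails loudly in a local test instead of producing an empty range.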