Module: Wukong::Hadoop::EnvMethods

Defined in: lib/wukong-hadoop/hadoop_env_methods.rb

Overview

Hadoop streaming exposes several environment variables to the scripts it executes. This module contains methods that make these variables easy to access from within a processor.

Since these environment variables are ultimately set by Hadoop’s streaming jar when executing inside Hadoop, you’ll have to set them manually when testing locally.
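As a rough sketch of what setting them manually can look like in a local test: the snippet below uses a stand-in module that copies two of the method bodies documented on this page, so it runs without the gem installed. The names EnvMethodsSketch and LocalProcessor, and both values, are made up for illustration.

```ruby
# Stand-in mirroring two methods documented on this page, so the sketch
# runs without wukong-hadoop installed. Not the real module.
module EnvMethodsSketch
  def input_file
    ENV['map_input_file']
  end

  def curr_task_id
    ENV['mapred_tip_id']
  end
end

# Hypothetical processor class, used only for illustration.
class LocalProcessor
  include EnvMethodsSketch
end

# Outside Hadoop nothing sets these variables, so set them by hand.
# Both values are invented for this example.
ENV['map_input_file'] = 'hdfs://namenode/logs/access-2012-01-01.log'
ENV['mapred_tip_id']  = 'task_201201010000_0001_m_000000'

processor = LocalProcessor.new
puts processor.input_file
puts processor.curr_task_id
```

The same variables can also be set in the shell before launching the script, since the methods only read from the process environment.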

Via @pskomoroch via @tlipcon:

"there is a little known Hadoop Streaming trick buried in this Python
 script. You will notice that the date is not actually in the raw log
 data itself, but is part of the filename. It turns out that Hadoop makes
 job parameters you would fetch in Java with something like
 job.get("mapred.input.file") available as environment variables for
 streaming jobs, with periods replaced with underscores:

   filepath = os.environ["map_input_file"]
   filename = os.path.split(filepath)[-1]
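
A Ruby counterpart to the quoted Python lines might look like the following. The path is invented, and the eight-digit date pattern is only an assumption about how such filenames could be laid out.

```ruby
# Made-up path; under Hadoop streaming the framework sets this variable.
ENV['map_input_file'] = 'hdfs://namenode/logs/pagecounts-20090521.gz'

filepath = ENV['map_input_file']
filename = File.basename(filepath)

# Pull an eight-digit date out of the filename; nil if none is present.
date = filename[/\d{8}/]

puts filename
puts date
```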

Instance Method Summary

#attempt_id ⇒ String

ID of the current map/reduce attempt.

#curr_task_id ⇒ String

ID of the current map/reduce task.

#hadoop_streaming_parameter(name) ⇒ String

Fetch a parameter set by Hadoop streaming in the environment of the currently executing process.

#input_dir ⇒ String

Directory of the (data) file currently being processed.

#input_file ⇒ String

Path of the (data) file currently being processed.

#map_input_length ⇒ String

Length of the chunk currently being processed within the current input file.

#map_input_start_offset ⇒ String

Offset of the chunk currently being processed within the current input file.

Instance Method Details

#attempt_id ⇒ `String`

ID of the current map/reduce attempt.

Returns:

(String)




# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 65

def attempt_id
  ENV['mapred_task_id']
end

#curr_task_id ⇒ `String`

ID of the current map/reduce task.

Returns:

(String)




# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 72

def curr_task_id
  ENV['mapred_tip_id']
end

#hadoop_streaming_parameter(name) ⇒ `String`

Fetch a parameter set by Hadoop streaming in the environment of the currently executing process.

Parameters:

name (String) —

the ‘.’ separated parameter name to fetch

Returns:

(String) —

the value from the process’ environment




# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 30

def hadoop_streaming_parameter name
  ENV[name.gsub('.', '_')]
end
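
The name translation can be illustrated with a small sketch. StreamingParameterSketch is a stand-in that copies the method body shown above, and the job name value is invented. One point worth noting: ENV[] returns nil for a variable that is not set, so callers outside Hadoop may see nil rather than a String.

```ruby
# Stand-in copying the method body shown above; not the real module.
module StreamingParameterSketch
  def hadoop_streaming_parameter(name)
    # Periods in the Hadoop parameter name become underscores.
    ENV[name.gsub('.', '_')]
  end
end

lookup = Object.new.extend(StreamingParameterSketch)

# Invented value, set by hand because we are not running inside Hadoop.
ENV['mapred_job_name'] = 'example-job'
# Make sure this illustrative variable is absent.
ENV.delete('mapred_not_set')

job_name = lookup.hadoop_streaming_parameter('mapred.job.name')
missing  = lookup.hadoop_streaming_parameter('mapred.not.set')

puts job_name
p missing
```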

#input_dir ⇒ `String`

Directory of the (data) file currently being processed.

Returns:

(String)




# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 44

def input_dir
  ENV['mapred_input_dir']
end

#input_file ⇒ `String`

Path of the (data) file currently being processed.

Returns:

(String)




# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 37

def input_file
  ENV['map_input_file']
end

#map_input_length ⇒ `String`

Length of the chunk currently being processed within the current input file.

Returns:

(String)




# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 58

def map_input_length
  ENV['map_input_length']
end

#map_input_start_offset ⇒ `String`

Offset of the chunk currently being processed within the current input file.

Returns:

(String)




# File 'lib/wukong-hadoop/hadoop_env_methods.rb', line 51

def map_input_start_offset
  ENV['map_input_start']
end
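
Because environment values are always strings, arithmetic on the offset and length needs an explicit conversion. A minimal sketch, with invented values and reading the variables directly rather than through the module:

```ruby
# Invented values; under Hadoop streaming the framework sets these.
ENV['map_input_start']  = '134217728'
ENV['map_input_length'] = '67108864'

# Integer() raises on malformed input instead of silently returning 0.
start_offset = Integer(ENV['map_input_start'])
length       = Integer(ENV['map_input_length'])

# Byte position just past the end of the chunk being processed.
end_offset = start_offset + length
puts end_offset
```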
