Class: RightAws::EmrInterface

Inherits:
RightAwsBase show all
Includes:
RightAwsBaseInterface
Defined in:
lib/emr/right_emr_interface.rb

Overview

RightAWS::EmrInterface – RightScale Amazon Elastic Map Reduce interface

The RightAws::EmrInterface class provides a complete interface to Amazon Elastic Map Reduce service.

For explanations of the semantics of each call, please refer to Amazon’s documentation at aws.amazon.com/documentation/elasticmapreduce/

Create an interface handle:

emr = RightAws::EmrInterface.new(aws_access_key_id, aws_secret_access_key)

Create a job flow:

emr.run_job_flow(
  :name => 'job flow 1',
  :master_instance_type => 'm1.large',
  :slave_instance_type => 'm1.large',
  :instance_count => 5,
  :log_uri => 's3n://bucket/path/to/logs',
  :steps => [{
    :name => 'step 1',
    :jar => 's3n://bucket/path/to/code.jar',
    :main_class => 'com.foobar.emr.Step1',
    :args => ['arg', 'arg'],
  }]) #=> "j-9K18HM82Q0AE7"

Describe a job flow:

emr.describe_job_flows('j-9K18HM82Q0AE7') #=> {...}

Terminate a job flow:

emr.terminate_job_flows('j-9K18HM82Q0AE7') #=> true

Defined Under Namespace

Classes: AddInstanceGroupsParser, DescribeJobFlowsParser, RunJobFlowParser

Constant Summary collapse

API_VERSION =

Amazon EMR API version being used

'2009-03-31'
DEFAULT_HOST =
'elasticmapreduce.amazonaws.com'
DEFAULT_PATH =
'/'
DEFAULT_PROTOCOL =
'https'
DEFAULT_PORT =
443
EMR_INSTANCES_KEY_MAPPING =

Job Flows

{                                                           # :nodoc:
  :additional_info => 'AdditionalInfo',
  :log_uri => 'LogUri',
  :name => 'Name',
  # JobFlowInstancesConfig
  :ec2_key_name          => 'Instances.Ec2KeyName',
  :hadoop_version => 'Instances.HadoopVersion',
  :instance_count        => 'Instances.InstanceCount',
  :keep_job_flow_alive_when_no_steps   => 'Instances.KeepJobFlowAliveWhenNoSteps',
  :master_instance_type  => 'Instances.MasterInstanceType',
  :slave_instance_type   => 'Instances.SlaveInstanceType',
  :termination_protected => 'Instances.TerminationProtected',
  # PlacementType
  :availability_zone     => 'Instances.Placement.AvailabilityZone',
}
BOOTSTRAP_ACTION_KEY_MAPPING =

:nodoc:

{                                                           # :nodoc:
  :name => 'Name',
  # ScriptBootstrapActionConfig
  :args => 'ScriptBootstrapAction.Args',
  :path => 'ScriptBootstrapAction.Path',
}
INSTANCE_GROUP_KEY_MAPPING =

:nodoc:

{                                                           # :nodoc:
  :bid_price => 'BidPrice',
  :instance_count => 'InstanceCount',
  :instance_role => 'InstanceRole',
  :instance_type => 'InstanceType',
  :market => 'Market',
  :name => 'Name',
}
STEP_CONFIG_KEY_MAPPING =

:nodoc:

{                                                           # :nodoc:
  :action_on_failure => 'ActionOnFailure',
  :name => 'Name',
  # HadoopJarStepConfig
  :args => 'HadoopJarStep.Args',
  :jar => 'HadoopJarStep.Jar',
  :main_class => 'HadoopJarStep.MainClass',
  :properties => 'HadoopJarStep.Properties',
}
KEY_VALUE_KEY_MAPPINGS =
{
  :key => 'Key',
  :value => 'Value',
}
MODIFY_INSTANCE_GROUP_KEY_MAPPINGS =
{
  :instance_group_id => 'InstanceGroupId',
  :instance_count => 'InstanceCount',
}
@@bench =
AwsBenchmarkingBlock.new

Constants included from RightAwsBaseInterface

RightAwsBaseInterface::BLOCK_DEVICE_KEY_MAPPING, RightAwsBaseInterface::DEFAULT_SIGNATURE_VERSION

Constants inherited from RightAwsBase

RightAwsBase::AMAZON_PROBLEMS, RightAwsBase::RAISE_ON_TIMEOUT_ON_ACTIONS

Instance Attribute Summary

Attributes included from RightAwsBaseInterface

#aws_access_key_id, #aws_secret_access_key, #cache, #connection, #last_errors, #last_request, #last_request_id, #last_response, #logger, #params, #signature_version

Class Method Summary collapse

Instance Method Summary collapse

Methods included from RightAwsBaseInterface

#amazonize_block_device_mappings, #amazonize_hash_with_key_mapping, #amazonize_list, #amazonize_list_with_key_mapping, #cache_hits?, caching, caching=, #caching?, #destroy_connection, #generate_request_impl, #get_connection, #get_connections_storage, #get_server_url, #incrementally_list_items, #init, #map_api_keys_and_values, #on_exception, #request_cache_or_info, #request_info_impl, #signed_service_params, #update_cache, #with_connection_options

Methods inherited from RightAwsBase

amazon_problems, amazon_problems=, raise_on_timeout_on_actions, raise_on_timeout_on_actions=

Constructor Details

#initialize(aws_access_key_id = nil, aws_secret_access_key = nil, params = {}) ⇒ EmrInterface

Create a new handle to a EMR service.

All handles share the same per process or per thread HTTP connection to EMR. Each handle is for a specific account. The params have the following options:

  • :endpoint_url a fully qualified url to Amazon API endpoint (this overwrites: :server, :port, :service, :protocol). Example: ‘elasticmapreduce.amazonaws.com

  • :server: EMR service host, default: DEFAULT_HOST

  • :port: EMR service port, default: DEFAULT_PORT

  • :protocol: ‘http’ or ‘https’, default: DEFAULT_PROTOCOL

  • :logger: for log messages, default: RAILS_DEFAULT_LOGGER else STDOUT

emr = RightAws::EmrInterface.new('xxxxxxxxxxxxxxxxxxxxx','xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
  {:logger => Logger.new('/tmp/x.log')}) #=> #<RightAws::EmrInterface::0xb7b3c30c>


97
98
99
100
101
102
103
104
105
106
107
# File 'lib/emr/right_emr_interface.rb', line 97

def initialize(aws_access_key_id=nil, aws_secret_access_key=nil, params={})
  init({ :name                => 'EMR',
         :default_host        => ENV['EMR_URL'] ? URI.parse(ENV['EMR_URL']).host   : DEFAULT_HOST,
         :default_port        => ENV['EMR_URL'] ? URI.parse(ENV['EMR_URL']).port   : DEFAULT_PORT,
         :default_service     => ENV['EMR_URL'] ? URI.parse(ENV['EMR_URL']).path   : DEFAULT_PATH,
         :default_protocol    => ENV['EMR_URL'] ? URI.parse(ENV['EMR_URL']).scheme : DEFAULT_PROTOCOL,
         :default_api_version => ENV['EMR_API_VERSION'] || API_VERSION },
       aws_access_key_id    || ENV['AWS_ACCESS_KEY_ID'] ,
       aws_secret_access_key|| ENV['AWS_SECRET_ACCESS_KEY'],
       params)
end

Class Method Details

.bench_serviceObject



76
77
78
# File 'lib/emr/right_emr_interface.rb', line 76

def self.bench_service
  @@bench.service
end

.bench_xmlObject



73
74
75
# File 'lib/emr/right_emr_interface.rb', line 73

def self.bench_xml
  @@bench.xml
end

Instance Method Details

#add_instance_groups(job_flow_id, *instance_groups) ⇒ Object

Adds instance groups to a running job flow.

Instance group configuration options are the same as the ones accepted by run_job_flow.

Only task instance groups may be added at runtime. Instance groups cannot be added to job flows that have only a master instance (i.e. 1 instance in total).

emr.add_instance_groups('j-2QE0KHA1LP4GS', {
  :bid_price => '0.1',
  :instance_count => '2',
  :instance_role => 'TASK',
  :instance_type => 'm1.small',
  :market => 'SPOT',
  :name => 'core group',
}) #=> true


459
460
461
462
463
464
465
466
# File 'lib/emr/right_emr_interface.rb', line 459

def add_instance_groups(job_flow_id, *instance_groups)
  request_hash = amazonize_instance_groups(instance_groups, 'InstanceGroups')
  request_hash['JobFlowId'] = job_flow_id
  link = generate_request("AddInstanceGroups", request_hash)
  request_info(link, AddInstanceGroupsParser.new(:logger => @logger))
rescue
  on_exception
end

#add_job_flow_steps(job_flow_id, *steps) ⇒ Object

Adds steps to a running job flow.

A maximum of 256 steps are allowed in a job flow. Steps can only be added to job flows that are starting, bootstrapping, running or waiting.

Step configuration options are the same as the ones accepted by run_job_flow.

emr.add_job_flow_steps('j-2QE0KHA1LP4GS', {
  :name => 'step 1',
  :jar => 's3n://bucket/path/to/code.jar',
  :main_class => 'com.foobar.emr.Step1',
  :args => ['arg', 'arg'],
  :properties => {
    'property' => 'value',
  },
  :action_on_failure => 'TERMINATE_JOB_FLOW',
}) #=> true


428
429
430
431
432
433
434
435
# File 'lib/emr/right_emr_interface.rb', line 428

def add_job_flow_steps(job_flow_id, *steps)
  request_hash = amazonize_steps(steps)
  request_hash['JobFlowId'] = job_flow_id
  link = generate_request("AddJobFlowSteps", request_hash)
  request_info(link, RightHttp2xxParser.new(:logger => @logger))
rescue
  on_exception
end

#describe_job_flows(*job_flow_ids_and_options) ⇒ Object

Returns a list of job flows that match all of supplied parameters.

Without parameters, returns job flows started in the last two weeks or running job flows started in the last two months.

Regardless of parameters, only jobs started in the last two months are returned.

# default list:
emr.describe_job_flows #=> [
  {:keep_job_flow_alive_when_no_steps=>false,
    :log_uri=>"s3n://bucket/path/to/logs",
    :master_instance_type=>"m1.small",
    :availability_zone=>"us-east-1d",
    :last_state_change_reason=>"Steps completed",
    :termination_protected=>false,
    :master_instance_id=>"i-1fe51278",
    :instance_count=>1,
    :ready_date_time=>"2011-08-31T18:58:58Z",
    :bootstrap_actions=>[],
    :master_public_dns_name=>"ec2-184-78-29-127.compute-1.amazonaws.com",
    :instance_groups=>
     [{:instance_request_count=>1,
       :last_state_change_reason=>"Job flow terminated",
       :instance_role=>"MASTER",
       :ready_date_time=>"2011-08-31T18:58:56Z",
       :instance_running_count=>0,
       :start_date_time=>"2011-08-31T18:58:19Z",
       :market=>"ON_DEMAND",
       :creation_date_time=>"2011-08-31T18:55:36Z",
       :name=>"master",
       :instance_group_id=>"ig-1D91GQR7A9H2K",
       :state=>"ENDED",
       :instance_type=>"m1.small",
       :end_date_time=>"2011-08-31T19:01:09Z"}],
    :start_date_time=>"2011-08-31T18:58:58Z",
    :steps=>
     [{:jar=>"s3n://bucket/path/to/code.jar",
       :main_class=>"com.foobar.emr.Step1",
       :start_date_time=>"2011-08-31T18:58:58Z",
       :properties=>{},
       :args=>[],
       :creation_date_time=>"2011-08-31T18:55:36Z",
       :action_on_failure=>"TERMINATE_JOB_FLOW",
       :name=>"step 1",
       :state=>"COMPLETED",
       :end_date_time=>"2011-08-31T19:00:34Z"}],
    :normalized_instance_hours=>1,
    :ami_version=>"1.0",
    :creation_date_time=>"2011-08-31T18:55:36Z",
    :name=>"jobflow 1",
    :hadoop_version=>"0.18",
    :job_flow_id=>"j-9K18HM82Q0AE7",
    :state=>"COMPLETED",
    :end_date_time=>"2011-08-31T19:01:09Z"}]

# describe specific job flows:
emr.describe_job_flows('j-9K18HM82Q0AE7', 'j-2QE0KHA1LP4GS') #=> [...]

# specify parameters:
emr.describe_job_flows(
  :created_after => Time.now - 86400,
  :created_before => Time.now - 3600,
  :job_flow_ids => ['j-9K18HM82Q0AE7', 'j-2QE0KHA1LP4GS'],
  :job_flow_states => ['RUNNING']
) #=> [...]

# combined job flow list and parameters syntax:
emr.describe_job_flows('j-9K18HM82Q0AE7', 'j-2QE0KHA1LP4GS',
  :job_flow_states => ['RUNNING']
) #=> [...]


326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
# File 'lib/emr/right_emr_interface.rb', line 326

def describe_job_flows(*job_flow_ids_and_options)
  job_flow_ids, options = AwsUtils::split_items_and_params(job_flow_ids_and_options)
  # merge job flow ids passed in as arguments and in options
  unless job_flow_ids.empty?
    # do not modify passed in options
    options = options.dup
    if job_flow_ids_in_options = options[:job_flow_ids]
      # allow the same ids to be passed in either location;
      # remove duplicates
      options[:job_flow_ids] = (job_flow_ids_in_options + job_flow_ids).uniq
    else
      options[:job_flow_ids] = job_flow_ids
    end
  end
  request_hash = {}
  unless (job_flow_ids = options[:job_flow_ids]).right_blank?
    request_hash.update(amazonize_list("JobFlowIds.member", job_flow_ids))
  end
  unless (job_flow_states = options[:job_flow_states]).right_blank?
    request_hash = amazonize_list("JobFlowStates.member", job_flow_states)
  end
  request_hash['CreatedAfter'] = AwsUtils::utc_iso8601(options[:created_after]) unless options[:created_after].right_blank?
  request_hash['CreatedBefore'] = AwsUtils::utc_iso8601(options[:created_before]) unless options[:created_before].right_blank?
  link = generate_request("DescribeJobFlows", request_hash)
  request_cache_or_info(:describe_job_flows, link,  DescribeJobFlowsParser, @@bench, nil)
rescue
  on_exception
end

#generate_request(action, params = {}) ⇒ Object

:nodoc:



109
110
111
# File 'lib/emr/right_emr_interface.rb', line 109

def generate_request(action, params={}) #:nodoc:
  generate_request_impl(:get, action, params )
end

#modify_instance_groups(*args) ⇒ Object

Modifies instance groups.

The only modifiable parameter is instance count.

An instance group may only be modified when the job flow is running or waiting. Additionally, hadoop 0.20 is required to resize job flows.

# general syntax
emr.modify_instance_groups(
  {:instance_group_id => 'ig-P2OPM2L9ZQ4P', :instance_count => 5},
  {:instance_group_id => 'ig-J82ML0M94A7E', :instance_count => 1}
) #=> true

# shortcut syntax
emr.modify_instance_groups('ig-P2OPM2L9ZQ4P', 5) #=> true

Shortcut syntax supports modifying only one instance group at a time.



491
492
493
494
495
496
497
498
499
500
501
502
503
# File 'lib/emr/right_emr_interface.rb', line 491

def modify_instance_groups(*args)
  unless args.first.is_a?(Hash)
    if args.length != 2
      raise ArgumentError, "Must be given two arguments if arguments are not hashes"
    end
    args = [{:instance_group_id => args.first, :instance_count => args.last}]
  end
  request_hash = amazonize_list_with_key_mapping('InstanceGroups.member', MODIFY_INSTANCE_GROUP_KEY_MAPPINGS, args)
  link = generate_request("ModifyInstanceGroups", request_hash)
  request_info(link, RightHttp2xxParser.new(:logger => @logger))
rescue
  on_exception
end

#request_info(request, parser) ⇒ Object

Sends request to Amazon and parses the response Raises AwsError if any banana happened



115
116
117
# File 'lib/emr/right_emr_interface.rb', line 115

def request_info(request, parser)  #:nodoc:
  request_info_impl(:emr_connection, @@bench, request, parser)
end

#run_job_flow(options = {}) ⇒ Object

Creates and starts running a new job flow.

The job flow will run the steps specified and terminate (unless keep alive option is set).

A maximum of 256 steps are allowed in a job flow.

At least the name, instance types, instance count and one step must be specified.

# simple usage:
emr.run_job_flow(
  :name => 'job flow 1',
  :master_instance_type => 'm1.large',
  :slave_instance_type => 'm1.large',
  :instance_count => 5,
  :log_uri => 's3n://bucket/path/to/logs',
  :steps => [{
    :name => 'step 1',
    :jar => 's3n://bucket/path/to/code.jar',
    :main_class => 'com.foobar.emr.Step1',
    :args => ['arg', 'arg'],
  }]) #=> "j-9K18HM82Q0AE7"

# advanced usage:
emr.run_job_flow(
  :name => 'job flow 1',
  :ec2_key_name => 'gsg-keypair',
  :hadoop_version => '0.20',
  :instance_groups => [{
    :bid_price => '0.1',
    :instance_count => '1',
    :instance_role => 'MASTER',
    :instance_type => 'm1.small',
    :market => 'SPOT',
    :name => 'master group',
  }, {
    :bid_price => '0.1',
    :instance_count => '2',
    :instance_role => 'CORE',
    :instance_type => 'm1.small',
    :market => 'SPOT',
    :name => 'core group',
  }, {
    :bid_price => '0.1',
    :instance_count => '2',
    :instance_role => 'TASK',
    :instance_type => 'm1.small',
    :market => 'SPOT',
    :name => 'task group',
  }],
  :keep_job_flow_alive_when_no_steps => true,
  :availability_zone => 'us-east-1a',
  :termination_protected => true,
  :log_uri => 's3n://bucket/path/to/logs',
  :steps => [{
    :name => 'step 1',
    :jar => 's3n://bucket/path/to/code.jar',
    :main_class => 'com.foobar.emr.Step1',
    :args => ['arg', 'arg'],
    :properties => {
      'property' => 'value',
    },
    :action_on_failure => 'TERMINATE_JOB_FLOW',
  }],
  :additional_info => '',
  :bootstrap_actions => [{
    :name => 'bootstrap action 1',
    :path => 's3n://bucket/path/to/bootstrap',
    :args => ['hello', 'world'],
  }],
) #=> "j-9K18HM82Q0AE7"


243
244
245
246
247
248
249
250
251
252
# File 'lib/emr/right_emr_interface.rb', line 243

def run_job_flow(options={})
  request_hash = amazonize_run_job_flow(options)
  request_hash.update(amazonize_bootstrap_actions(options[:bootstrap_actions]))
  request_hash.update(amazonize_instance_groups(options[:instance_groups]))
  request_hash.update(amazonize_steps(options[:steps]))
  link = generate_request("RunJobFlow", request_hash)
  request_info(link, RunJobFlowParser.new(:logger => @logger))
rescue
  on_exception
end

#set_termination_protection(*job_flow_ids_and_options) ⇒ Object

Locks a job flow so the EC2 instances in the cluster cannot be terminated by user intervention, an API call, or in the event of a job flow error. Cluster will still terminate upon successful completion of the job flow.

emr.set_termination_protection(
  'j-9K18HM82Q0AE7', 'j-2QE0KHA1LP4GS', :termination_protected => true
) #=> true

Protection can be enabled using the shortcut syntax:

emr.set_termination_protection('j-9K18HM82Q0AE7') #=> true


379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
# File 'lib/emr/right_emr_interface.rb', line 379

def set_termination_protection(*job_flow_ids_and_options)
  job_flow_ids, options = AwsUtils::split_items_and_params(job_flow_ids_and_options)
  request_hash = amazonize_list('JobFlowIds.member', job_flow_ids)
  request_hash['TerminationProtected'] = case value = options[:termination_protected]
  when true
    'true'
  when false
    'false'
  when nil
    # if :termination_protected => nil was given, then unprotect;
    # if no :termination_protected option was given, protect
    if options.has_key?(:termination_protected)
      'false'
    else
      'true'
    end
  else
    # pass value through
    value
  end
  link = generate_request("SetTerminationProtection", request_hash)
  request_info(link, RightHttp2xxParser.new(:logger => @logger))
rescue
  on_exception
end

#terminate_job_flows(*job_flow_ids) ⇒ Object

Terminates specified job flows.

emr.terminate_job_flows('j-9K18HM82Q0AE7') #=> true


359
360
361
362
363
364
# File 'lib/emr/right_emr_interface.rb', line 359

def terminate_job_flows(*job_flow_ids)
  link = generate_request("TerminateJobFlows", amazonize_list('JobFlowIds.member', job_flow_ids))
  request_info(link, RightHttp2xxParser.new(:logger => @logger))
rescue
  on_exception
end