Class: Hadupils::Extensions::Hive

Inherits:
Object
Includes:
AuxJarsPath
Defined in:
lib/hadupils/extensions/hive.rb

Overview

Hive-targeted extensions derived from filesystem layout

Concept

There are a few ways to “extend” one’s hive session:

  • Adding files, archives, jars to it (+ADD …+).

  • Setting variables and whatnot (+SET …+).

  • Registering your own UDFs.

  • Specifying paths to jars to make available within the session’s classpath (HIVE_AUX_JARS_PATH env. var.).

All of these things can be done through the use of initialization files (via hive’s -i option), except for the auxiliary jar libs environment variable (which is… wait for it… in the environment).

This class provides an abstraction to enable the following:

  • lay your files out according to its expectations

  • wrap that layout with an instance of this class

  • it’ll give an interface for accessing initialization files (#hivercs) that make the stuff available in a hive session

  • it’ll dynamically assemble the initialization file necessary to ensure appropriate assets are made available in the session

  • if you provide your own initialization file in the expected place, it’ll ensure that the dynamic stuff is applied first and the static one second, such that your static one can assume the neighboring assets are already in the session.

  • it’ll give you a list of jars to make available as auxiliary_jars in the session based on contents of aux-jars.

You lay it down, the object makes sense of it, nothing other than file organization required.

Filesystem Layout

Suppose you have the following stuff (denoting symlinks with ->):

/etc/foo/
    an.archive.tar.gz
    another.archive.tar.gz
    aux-jars/
        aux-only.jar
        ignored.archive.tar.gz
        ignored.file.txt
        jarry.jar -> ../jarry.jar
    dist-only.jar
    hiverc
    jarry.jar
    textie.txt
    yummy.yaml

Now you create an instance:

ext = Hadupils::Extensions::Hive.new('/etc/foo')

You could get the hive command-line options for using this stuff via:

ext.hivercs

It’ll give you objects for two initialization files:

  1. A dynamic one that has the appropriate commands for adding an.archive.tar.gz, another.archive.tar.gz, dist-only.jar, jarry.jar, textie.txt, and yummy.yaml to the session.

  2. The hiverc one that’s in there.
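As a rough illustration of what the dynamic initialization file might contain, here is a minimal sketch that emits one Hive ADD command per asset. The helper names (add_command_for, dynamic_init_lines) and the extension-to-command mapping are assumptions made for this example; the mapping hadupils actually applies may differ.

```ruby
# Hypothetical helper: choose the Hive ADD command for a file.
# The extension-to-kind mapping below is an assumption for illustration.
def add_command_for(path)
  kind = case File.basename(path)
         when /\.jar\z/ then 'JAR'
         when /\.tar\.gz\z/, /\.tgz\z/, /\.zip\z/ then 'ARCHIVE'
         else 'FILE'
         end
  "ADD #{kind} #{path};"
end

# Hypothetical helper: one ADD line per asset name, in sorted order.
def dynamic_init_lines(dir, names)
  names.sort.map { |name| add_command_for(File.join(dir, name)) }
end

# Asset names taken from the layout above (hiverc and aux-jars excluded).
puts dynamic_init_lines('/etc/foo', %w[
  an.archive.tar.gz another.archive.tar.gz dist-only.jar
  jarry.jar textie.txt yummy.yaml
])
```

The point of the sketch is only that every asset in the main directory turns into a statement that a hive session can run before the static hiverc is read.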

The ext.auxiliary_jars accessor will return a list of paths to the jars (only the jars) contained within the aux-jars path; a caller to hive would use this to construct the HIVE_AUX_JARS_PATH variable.

Notice that jarry.jar is common to the distributed usage (it’ll be added to the session and associated distributed cache) and to the auxiliary path. That’s because it appears in the main directory and in the aux-jars subdirectory. There’s nothing magical about the use of a symlink; that just saves disk space. 10 MB ought to be enough for anyone.

If there were no hiverc file, you would get only the initialization file object that loads the assets in the main directory. Conversely, if there were no such assets but there was a hiverc file, you would get only the object for that file. If neither were present, #hivercs would return an empty list.

If there is no aux-jars directory, or that directory has no jars, ext.auxiliary_jars will be an empty list. Only jars are included in that list; files without a .jar extension are ignored.
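The jar-only filtering and the resulting environment value can be sketched in plain Ruby. The helpers below (jar_paths_in, aux_jars_env) are hypothetical stand-ins, not hadupils methods, and joining the paths with commas is an assumption about the format a caller would use for HIVE_AUX_JARS_PATH.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in for the jar discovery described above:
# returns full paths of entries ending in .jar, or [] if the directory is absent.
def jar_paths_in(aux_dir)
  return [] unless File.directory?(aux_dir)
  Dir.children(aux_dir).sort
     .select { |name| File.extname(name) == '.jar' }
     .map { |name| File.join(aux_dir, name) }
end

# Assumption: the caller joins the jar paths with commas.
def aux_jars_env(jars)
  jars.join(',')
end

# Rebuild the aux-jars directory from the layout above in a scratch location.
Dir.mktmpdir do |root|
  aux = File.join(root, 'aux-jars')
  Dir.mkdir(aux)
  %w[aux-only.jar jarry.jar ignored.file.txt ignored.archive.tar.gz].each do |name|
    FileUtils.touch(File.join(aux, name))
  end
  puts aux_jars_env(jar_paths_in(aux))
end
```

Only aux-only.jar and jarry.jar survive the filter here; the text file and the archive are skipped because their extension is not .jar.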

Defined Under Namespace

Modules: AuxJarsPath

Constant Summary

AUX_PATH =
'aux-jars'
HIVERC_PATH =
'hiverc'

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from AuxJarsPath

#hive_aux_jars_path

Constructor Details

#initialize(path) ⇒ Hive

Returns a new instance of Hive.



# File 'lib/hadupils/extensions/hive.rb', line 112

def initialize(path)
  @path = ::File.expand_path(path)
  @auxiliary_jars = self.class.find_auxiliary_jars(@path)
  @dynamic_ext = self.class.assemble_dynamic_extension(@path)
  @static_ext = self.class.assemble_static_extension(@path)
end

Instance Attribute Details

#auxiliary_jars ⇒ Object (readonly)

Returns the value of attribute auxiliary_jars.



# File 'lib/hadupils/extensions/hive.rb', line 109

def auxiliary_jars
  @auxiliary_jars
end

#path ⇒ Object (readonly)

Returns the value of attribute path.



# File 'lib/hadupils/extensions/hive.rb', line 110

def path
  @path
end

Class Method Details

.assemble_dynamic_extension(path) ⇒ Object



# File 'lib/hadupils/extensions/hive.rb', line 157

def self.assemble_dynamic_extension(path)
  Flat.new(path) do
    assets do |list|
      list.reject {|asset| [AUX_PATH, HIVERC_PATH].include? asset.name }
    end
  end
end

.assemble_static_extension(path) ⇒ Object



# File 'lib/hadupils/extensions/hive.rb', line 165

def self.assemble_static_extension(path)
  Static.new(path)
end

.build_archive(io, dist_assets, aux_jars = nil) ⇒ Object

Writes a gzipped tar archive to io, the contents of which are structured appropriately for use with this class.

Provide the static hiverc and any other distributed cache-bound assets in dist_assets, and any auxiliary jars to include in aux_jars.

This utilizes a system call to tar under the hood, which requires that it be installed and on your PATH.

You can use any file-like writable thing for io, so files, pipes, etc.

See this example:

File.open('foo.tar.gz', 'wb') do |f|  # binary mode: the archive is gzipped data
  Hadupils::Extensions::Hive.build_archive f,
                                           ['/tmp/here/blah.jar',
                                            '/tmp/there/hiverc'],
                                           ['/tmp/elsewhere/foo.jar']
end

The example above would produce an archive named “foo.tar.gz”, the contents of which would be:

aux-jars/foo.jar
blah.jar
hiverc

Note that it collapses things into two distinct directories, such that basename collisions are possible. That’s on you to handle sanely.
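One way to handle collisions is to check for them before building the archive. The sketch below is a hypothetical pre-flight helper (basename_collisions is not part of hadupils); it groups source paths by basename and reports any name claimed by more than one source. Since dist_assets and aux_jars land in different directories, each list would be checked on its own.

```ruby
# Hypothetical pre-flight check, not part of hadupils:
# returns { basename => [source paths] } for every basename used more than once.
def basename_collisions(paths)
  paths.group_by { |path| File.basename(path) }
       .select { |_name, sources| sources.length > 1 }
end

dist_assets = ['/tmp/here/blah.jar', '/tmp/there/hiverc', '/tmp/other/blah.jar']

basename_collisions(dist_assets).each do |name, sources|
  warn "#{name} is provided by more than one source: #{sources.join(', ')}"
end
```

A caller could refuse to build the archive when the returned hash is non-empty, since the copies would otherwise overwrite one another in the working directory.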



# File 'lib/hadupils/extensions/hive.rb', line 201

def self.build_archive(io, dist_assets, aux_jars=nil)
  dist, aux = [dist_assets, (aux_jars || [])].collect do |files|
    files.collect do |asset|
      path = ::File.expand_path(asset)
      raise "Cannot include directory '#{path}'." if ::File.directory? path
      path
    end
  end

  ::Dir.mktmpdir do |workdir|
    basenames = dist.collect do |src|
      FileUtils.cp src, File.join(workdir, File.basename(src))
      File.basename src
    end

    if aux.length > 0
      basenames << AUX_PATH
      aux_dir = File.join(workdir, AUX_PATH)
      Dir.mkdir aux_dir
      aux.each do |src|
        FileUtils.cp src, File.join(aux_dir, File.basename(src))
      end
    end

    ::Dir.chdir(workdir) do |p|
      Open3.popen2('tar', 'cz', *basenames) do |i, o|
        stdout = o.read
        io << stdout
      end
    end
  end
  true
end

.find_auxiliary_jars(path) ⇒ Object



# File 'lib/hadupils/extensions/hive.rb', line 145

def self.find_auxiliary_jars(path)
  target = ::File.join(path, AUX_PATH)
  if ::File.directory? target
    jars = Hadupils::Assets.assets_in(target).find_all do |asset|
      asset.kind_of? Hadupils::Assets::Jar
    end
    jars.collect {|asset| asset.path}
  else
    []
  end
end

Instance Method Details

#dynamic_hivercs ⇒ Object

An array of dynamic, managed hive initialization objects (Hadupils::Extensions::HiveRC::Dynamic) based on the assets found within the #path. May be an empty list.



# File 'lib/hadupils/extensions/hive.rb', line 130

def dynamic_hivercs
  if @dynamic_ext.assets.length > 0
    @dynamic_ext.hivercs
  else
    []
  end
end

#hivercs ⇒ Object

An array of hive initialization objects derived from dynamic and static sets. May be an empty list. Dynamic are guaranteed to come before static, so a static hiverc can count on the other assets being available.



# File 'lib/hadupils/extensions/hive.rb', line 123

def hivercs
  dynamic_hivercs + static_hivercs
end
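The ordering guarantee matters once the list is turned into command-line options, because hive processes its -i files in the order given. The following is a minimal sketch of that step, using a hypothetical InitFile struct in place of the real initialization objects (only a path attribute is assumed) and a hypothetical hive_init_args helper.

```ruby
# Hypothetical stand-in for an initialization file object; only #path is assumed.
InitFile = Struct.new(:path)

# Hypothetical helper: dynamic entries first, then static, each as an "-i <path>" pair.
def hive_init_args(dynamic, static)
  (dynamic + static).flat_map { |rc| ['-i', rc.path] }
end

# Both paths are made up for the example.
args = hive_init_args(
  [InitFile.new('/tmp/dynamic.hiverc')],
  [InitFile.new('/etc/foo/hiverc')]
)
puts (['hive'] + args).join(' ')
```

Because the dynamic file comes first, a static hiverc that refers to jarry.jar or textie.txt can assume those assets were already added to the session.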

#static_hivercs ⇒ Object

An array of static hive initialization objects (Hadupils::Extensions::HiveRC::Static) based on the presence of a hiverc file within the #path. May be an empty list.



# File 'lib/hadupils/extensions/hive.rb', line 141

def static_hivercs
  @static_ext.hivercs
end