Module: IMW
- Defined in:
- lib/imw.rb,
lib/imw/boot.rb,
lib/imw/tools.rb,
lib/imw/runner.rb,
lib/imw/dataset.rb,
lib/imw/formats.rb,
lib/imw/parsers.rb,
lib/imw/schemes.rb,
lib/imw/archives.rb,
lib/imw/resource.rb,
lib/imw/utils/log.rb,
lib/imw/repository.rb,
lib/imw/schemes/s3.rb,
lib/imw/utils/misc.rb,
lib/imw/utils/error.rb,
lib/imw/utils/paths.rb,
lib/imw/archives/rar.rb,
lib/imw/archives/tar.rb,
lib/imw/archives/zip.rb,
lib/imw/formats/json.rb,
lib/imw/formats/sgml.rb,
lib/imw/formats/yaml.rb,
lib/imw/schemes/hdfs.rb,
lib/imw/schemes/http.rb,
lib/imw/dataset/paths.rb,
lib/imw/formats/excel.rb,
lib/imw/schemes/local.rb,
lib/imw/utils/version.rb,
lib/imw/archives/targz.rb,
lib/imw/schemes/remote.rb,
lib/imw/tools/archiver.rb,
lib/imw/archives/tarbz2.rb,
lib/imw/compressed_files.rb,
lib/imw/dataset/workflow.rb,
lib/imw/tools/summarizer.rb,
lib/imw/tools/transferer.rb,
lib/imw/utils/extensions.rb,
lib/imw/formats/delimited.rb,
lib/imw/compressed_files/gz.rb,
lib/imw/parsers/html_parser.rb,
lib/imw/parsers/line_parser.rb,
lib/imw/compressed_files/bz2.rb,
lib/imw/parsers/regexp_parser.rb,
lib/imw/parsers/html_parser/matchers.rb,
lib/imw/compressed_files/compressible.rb
Overview
The Infinite Monkeywrench (IMW) is a Ruby library for ripping, extracting, parsing, munging, and packaging datasets. It allows you to handle different data formats transparently as well as organize transformations of data as a network of dependencies (a la Make or Rake).
IMW has a few central concepts: resources, datasets, workflows, and repositories.
Resources represent individual data resources like local files, websites, databases, &c. Resources are typically instantiated via IMW.open, with IMW doing the work of figuring out what to return based on the URI passed in.
Datasets represent collections of related data resources. An IMW::Dataset comes with a pre-defined (but customizable) workflow that takes data resources through several steps: rip, parse, munge, and package. The workflow leverages Rake and so the various tasks that are necessary to process the data till it is nice and pretty can all be linked with dependencies.
Repositories are collections of datasets, and it is on these collections that the imw command-line tool operates.
Defined Under Namespace
Modules: Archives, CompressedFiles, Config, Formats, Parsers, Paths, Schemes, Tools, VERSION, Workflow
Classes: Counter, Dataset, Repository, Resource, Runner, SystemCallError
Constant Summary
- RunnerError =
Class.new(IMW::Error)
- USER_DEFINED_HANDLERS =
Define this constant in your configuration file to add your own URI handlers to IMW.
[]
- LOG_FILE_DESTINATION =
Default log file.
STDERR
- LOG_TIMEFORMAT =
Default log file time format
"%Y-%m-%d %H:%M:%S "
- VERBOSE =
Default verbosity
false
- PROGRESS_TRACKERS =
{}
- PROGRESS_COUNTERS =
{}
- Error =
Base error class which all IMW errors subclass.
Class.new(StandardError)
- NoMethodError =
Method undefined.
Class.new(Error)
- TypeError =
Type error.
Class.new(Error)
- NotImplementedError =
Not implemented (typically because user needs to define a method when subclassing a base class).
Class.new(Error)
- ParseError =
Error during parsing.
Class.new(Error)
- PathError =
Error with a non-existing, invalid, or inaccessible path.
Class.new(Error)
- NetworkError =
Error communicating with a remote entity.
Class.new(Error)
- ArgumentError =
Error with the arguments passed to a method.
Class.new(Error)
- DEFAULT_PATHS =
Default paths for the IMW. Chosen to make sense on most *NIX distributions.
{
  :home         => ENV['HOME'],
  :data_root    => "/var/lib/imw",
  :log_root     => "/var/log/imw",
  :scripts_root => "/usr/share/imw",
  :tmp_root     => "/tmp/imw",
  # the imw library
  :imw_root     => File.(File.dirname(__FILE__) + "/../../.."),
  :imw_bin      => [:imw_root, 'bin'],
  :imw_etc      => [:imw_root, 'etc'],
  :imw_lib      => [:imw_root, 'lib'],
  # workflow
  :ripd_root    => [:data_root, 'ripd'],
  :rawd_root    => [:data_root, 'rawd'],
  :fixd_root    => [:data_root, 'fixd'],
  :pkgd_root    => [:data_root, 'pkgd']
}
- Task =
An IMW version of Rake::Task
Class.new(Rake::Task)
- FileTask =
An IMW subclass of Rake::FileTask
Class.new(Rake::FileTask)
- FileCreationTask =
An IMW subclass of Rake::FileCreationTask
Class.new(Rake::FileCreationTask)
- COMPRESSION_SETTINGS =
Default settings used when compressing files. :program defines the name of the command-line program to use, :compress gives the flags to use when compressing, and :extension gives the extension (without the ‘.’) added by the program after compressing.
{ :program => 'bzip2', :compress => '', :extension => 'bz2' }
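The error constants in this summary are all built with Class.new on top of a single base class, which lets callers rescue either one specific failure or every library error at once. The following is a minimal sketch of that pattern; the ErrorSketch module and the classify helper are hypothetical names used only for illustration and are not part of IMW.

```ruby
# Hypothetical stand-in for the IMW error hierarchy; not the library's code.
module ErrorSketch
  Error        = Class.new(StandardError) # base class for everything below
  ParseError   = Class.new(Error)
  PathError    = Class.new(Error)
  NetworkError = Class.new(Error)
end

# Raise the given error class, then show which rescue clause catches it.
def classify(error_class)
  raise error_class, "something went wrong"
rescue ErrorSketch::PathError => e
  # Most specific clause first: only path errors land here.
  "path problem: #{e.message}"
rescue ErrorSketch::Error => e
  # Catch-all for every other error in the hierarchy.
  "other library error (#{e.class.name.split('::').last}): #{e.message}"
end

puts classify(ErrorSketch::PathError)
puts classify(ErrorSketch::ParseError)
```

Because IMW reuses names such as TypeError and ArgumentError, code written lexically inside the IMW module that refers to the bare name would, under Ruby's constant lookup rules, get the IMW version rather than the core class.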
Class Attribute Summary
-
.log ⇒ Object
Returns the value of attribute log.
-
.verbose ⇒ Object
Returns the value of attribute verbose.
Class Method Summary
-
.add_path(sym, *pathsegs) ⇒ String
Adds a symbolic path for expansion by path_to.
- .announce(*events) ⇒ Object
- .announce_if_verbose(*events) ⇒ Object
- .banner(*events) ⇒ Object
-
.dataset(handle, options = {}, &block) ⇒ IMW::Dataset
Create a dataset and put it in the default IMW repository.
-
.instantiate_logger! ⇒ Object
Create a Logger and point it at IMW::LOG_FILE_DESTINATION which is set in ~/.imwrc and defaults to STDERR.
-
.open(obj, options = {}) ⇒ IMW::Resource
Open a resource at the given uri.
-
.open!(uri, options = {}) ⇒ IMW::Resource
Works the same way as IMW.open except opens the resource for writing.
-
.path_to(*pathsegs) ⇒ String
Expands a shorthand workflow path specification to an actual file path.
-
.remove_path(sym) ⇒ Object
Removes a symbolic path for expansion by path_to.
-
.repository ⇒ IMW::Repository
The default repository in which to place datasets.
-
.system(*commands) ⇒ Object
A replacement for the standard system call that raises an IMW::SystemCallError if the command fails; the error prints better debugging info.
-
.verbose? ⇒ nil, ...
Is IMW operating in verbose mode?
- .warn(*events) ⇒ Object
- .warn_if_verbose(*events) ⇒ Object
Instance Method Summary
-
#track_count(tracker, every = 1000) ⇒ Object
Log repetitions in a given context.
-
#track_progress(tracker, val) ⇒ Object
When the slowly-changing tracked variable val changes value, announce its new value.
Class Attribute Details
.log ⇒ Object
Returns the value of attribute log.
# File 'lib/imw/utils/log.rb', line 14
def log
  @log
end
.verbose ⇒ Object
Returns the value of attribute verbose.
# File 'lib/imw/utils/log.rb', line 14
def verbose
  @verbose
end
Class Method Details
.add_path(sym, *pathsegs) ⇒ String
Adds a symbolic path for expansion by path_to.
IMW.add_path :foo, '~/whoa'
IMW.add_path :bar, :foo, 'baz'
IMW.path_to :bar
=> '~/whoa/baz'
# File 'lib/imw/utils/paths.rb', line 122
def self.add_path sym, *pathsegs
  IMW::PATHS[sym] = pathsegs.flatten
  path_to[sym]
end
.announce(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 36
def self.announce *events
   = events.flatten.
  .reverse_merge! :level => Logger::INFO
  IMW.log.add [:level], events.join("\n")
end
.announce_if_verbose(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 41
def self.announce_if_verbose *events
  announce(*events) if IMW.verbose?
end
.banner(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 45
def self.banner *events
   = events.flatten.
  .reverse_merge! :level => Logger::INFO
  announce(["*"*75, events, "*"*75], )
end
.dataset(handle, options = {}, &block) ⇒ IMW::Dataset
Create a dataset and put it in the default IMW repository. The block, if given, is evaluated in the context of the new dataset so you can define its workflow:
IMW.dataset :my_dataset do
  # Define some paths we're going to use
  add_path :raw_data, :ripd, 'raw_data.csv'
  add_path :fixd_data, :fixd, 'fixed_data.csv'
  # Copy a file from a website to this dataset's ripd directory.
  rip do
    IMW.open('http://mysite.com/data_archives/2010/03/03.csv').cp(path_to(:raw_data))
  end
  # Filter the raw data to those values which match some criterion defined by accept?
  munge do
    IMW.open(path_to(:raw_data)).map do |row|
      row if accept?(row)
    end.compact.dump(path_to(:fixd_data))
  end
  # Compress this new data
  package do
    IMW.open(path_to(:fixd_data)).compress.mv(path_to(:pkgd))
  end
end
# File 'lib/imw.rb', line 108
def self.dataset handle, options={}, &block
  d = IMW::Dataset.new(handle, options.merge(:repository => IMW.repository))
  d.instance_eval(&block) if block_given?
  d
end
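The source above hands the block to instance_eval, which is why rip, munge, and package can be called inside it without a receiver. The sketch below illustrates that block-evaluation pattern in isolation. DatasetSketch, dataset_sketch, and run! are hypothetical names, and the real IMW::Dataset builds on Rake tasks rather than the simple ordered loop assumed here.

```ruby
# Hypothetical stand-in for IMW::Dataset; the real class drives Rake tasks.
class DatasetSketch
  STEPS = [:rip, :parse, :munge, :package].freeze

  attr_reader :handle, :log

  def initialize(handle)
    @handle = handle
    @blocks = {}
    @log    = []
  end

  # Define rip/parse/munge/package so that each one stores the block it is given.
  STEPS.each do |step|
    define_method(step) { |&block| @blocks[step] = block }
  end

  # Run the stored blocks in workflow order, skipping steps that were never defined.
  def run!
    STEPS.each do |step|
      next unless @blocks[step]
      instance_eval(&@blocks[step])
      @log << step
    end
    @log
  end
end

# Mirrors the shape of IMW.dataset: build the object, evaluate the block in its context.
def dataset_sketch(handle, &block)
  dataset = DatasetSketch.new(handle)
  dataset.instance_eval(&block) if block_given?
  dataset
end

results = []
d = dataset_sketch(:my_dataset) do
  package { results << "packaged #{handle}" } # declared first, still runs last
  rip     { results << "ripped #{handle}" }
end
d.run!
puts results
```

Declaring the steps out of order and still running them in workflow order is the point of separating definition from execution; in IMW that ordering comes from task dependencies.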
.instantiate_logger! ⇒ Object
Create a Logger and point it at IMW::LOG_FILE_DESTINATION which is set in ~/.imwrc and defaults to STDERR.
# File 'lib/imw/utils/log.rb', line 30
def self.instantiate_logger!
  IMW.log ||= Logger.new(LOG_FILE_DESTINATION)
  IMW.log.datetime_format = "%Y%m%d-%H:%M:%S "
  IMW.log.level = Logger::INFO
end
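The logger set up here is a plain Ruby Logger, so its behavior can be tried without IMW. A minimal sketch, assuming a StringIO destination in place of STDERR so the output can be inspected; the variable names are illustrative only.

```ruby
require 'logger'
require 'stringio'

# Write to an in-memory buffer instead of STDERR so the result can be examined.
destination = StringIO.new

log = Logger.new(destination)
log.datetime_format = "%Y-%m-%d %H:%M:%S " # same pattern as LOG_TIMEFORMAT
log.level = Logger::INFO                   # DEBUG messages are dropped

# Logger#add takes a severity and a message; several events are joined with newlines.
log.add(Logger::INFO, ["first event", "second event"].join("\n"))
log.add(Logger::WARN, "something looks off")
log.add(Logger::DEBUG, "never written at INFO level")

puts destination.string
```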
.open(obj, options = {}) ⇒ IMW::Resource
Open a resource at the given uri. The resource will automatically be extended by modules which make sense given the uri.
See the documentation for IMW::Resource and the various modules within IMW::Resources for more information and options.
Passing in an IMW::Resource will simply return it.
# File 'lib/imw.rb', line 53
def self.open obj, options={}
  return obj if obj.is_a?(IMW::Resource)
  options[:use_modules] ||= (options[:as] || [])
  options[:skip_modules] ||= (options[:without] || [])
  IMW::Resource.new(obj, options)
end
.open!(uri, options = {}) ⇒ IMW::Resource
Works the same way as IMW.open except opens the resource for writing.
# File 'lib/imw.rb', line 65
def self.open! uri, options={}
  IMW::Resource.new(uri, options.merge(:mode => 'w'))
end
.path_to(*pathsegs) ⇒ String
Expands a shorthand workflow path specification to an actual file path. Strings are interpreted literally but symbols are first resolved to the paths they represent.
IMW.add_path :foo, '~/whoa'
IMW.path_to :foo, 'my_thing'
=> '~/whoa/my_thing'
# File 'lib/imw/utils/paths.rb', line 107
def self.path_to *pathsegs
  path = Pathname.new IMW.path_to_helper(*pathsegs)
  path.absolute? ? File.(path) : path.to_s
end
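The rule that strings are taken literally while symbols are resolved first can be sketched without IMW. PathSketch and its expand helper below are hypothetical; they only illustrate the recursive lookup and are not the library's path_to_helper.

```ruby
# Hypothetical stand-in for IMW's symbolic path registry; not the library's code.
module PathSketch
  PATHS = {}

  # Register a symbolic path made of strings and/or other symbols.
  def self.add_path(sym, *pathsegs)
    PATHS[sym] = pathsegs.flatten
    path_to(sym)
  end

  # Join the segments after resolving every symbol to the segments it stands for.
  def self.path_to(*pathsegs)
    File.join(*expand(pathsegs))
  end

  # Recursively replace symbols with their registered segments.
  def self.expand(pathsegs)
    pathsegs.flatten.flat_map do |seg|
      case seg
      when Symbol
        raise KeyError, "unknown path #{seg.inspect}" unless PATHS.key?(seg)
        expand(PATHS[seg])
      else
        [seg.to_s]
      end
    end
  end
end

PathSketch.add_path :foo, '~/whoa'
PathSketch.add_path :bar, :foo, 'baz'
puts PathSketch.path_to(:bar)             # ~/whoa/baz
puts PathSketch.path_to(:foo, 'my_thing') # ~/whoa/my_thing
```

Storing the unexpanded segments and resolving them on every call means that redefining :foo later also changes what :bar expands to.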
.remove_path(sym) ⇒ Object
Removes a symbolic path for expansion by path_to.
# File 'lib/imw/utils/paths.rb', line 130
def self.remove_path sym
  IMW::PATHS.delete sym if IMW::PATHS.include? sym
end
.repository ⇒ IMW::Repository
The default repository in which to place datasets. See the documentation for IMW::Repository for more information on how datasets and repositories fit together.
# File 'lib/imw.rb', line 74
def self.repository
  @@repository ||= IMW::Repository.new
end
.system(*commands) ⇒ Object
A replacement for the standard system call that raises an IMW::SystemCallError if the command fails; the error prints better debugging info.
This function relies upon Kernel.system and obeys the same rules:
- If commands has only a single element then no shell characters or spaces are escaped; you have to do it yourself or you get to use shell characters, depending on your perspective.
- If commands is a list of elements then the second and further elements in the list have their shell characters and spaces escaped.
But it also has its own rules:
- When one of the commands is an empty or blank string, Kernel.system honors it and escapes it properly and sends it along for evaluation. This can be a problem for some programs and so IMW.system excludes blank (as in blank?) elements of commands.
- commands will be flattened (see the gotcha below).
Calling out to the shell like this is often brittle. Imagine defining
prog = 'some_prog'
flags = '-v -f'
args = 'file.txt'
and later calling
IMW.system prog, flags, args
The space in the second argument (‘-v -f’) will be escaped and will therefore not be properly parsed by some_prog. Instead try
prog = 'some_prog'
flags = ['-v', '-f']
args = ['file.txt']
IMW.system prog, flags, *args
which will work fine since flags will automatically be flattened.
# File 'lib/imw/utils/extensions.rb', line 58
def self.system *commands
  stripped_commands = commands.flatten.map { |command| command.to_s unless command.blank? }.compact
  IMW.announce_if_verbose(stripped_commands.join(" "))
  exit_code = Kernel.system(*stripped_commands)
  raise IMW::SystemCallError.new($?.dup, commands.join(' ')) unless $?.success?
  exit_code
end
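The flatten-and-drop-blanks behavior can be tried without IMW or ActiveSupport. In the sketch below, clean_commands, run_checked, and CommandFailed are hypothetical names, blankness is approximated with strip.empty? rather than blank?, and the current Ruby interpreter is used as the external program so the example does not depend on any particular shell tool.

```ruby
require 'rbconfig'

# Hypothetical error class standing in for IMW::SystemCallError.
class CommandFailed < StandardError; end

# Flatten nested argument lists and drop nil, empty, or whitespace-only entries.
def clean_commands(*commands)
  commands.flatten.map { |c| c.to_s }.reject { |c| c.strip.empty? }
end

# Run the cleaned command with Kernel.system and raise if it does not succeed.
def run_checked(*commands)
  argv = clean_commands(*commands)
  ok = Kernel.system(*argv)
  raise CommandFailed, "command failed: #{argv.join(' ')}" unless ok
  ok
end

prog  = RbConfig.ruby        # path to the running Ruby interpreter
flags = ['-e', 'exit 0']     # kept as separate elements, not one 'a b' string
extra = ''                   # blank entry that would otherwise be passed along

p clean_commands('some_prog', ['-v', '-f'], '', nil, 'file.txt')
run_checked(prog, flags, extra)

begin
  run_checked(prog, ['-e', 'exit 1'])
rescue CommandFailed => e
  puts "raised as expected (#{e.class})"
end
```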
.verbose? ⇒ nil, ...
Is IMW operating in verbose mode?
Calls to IMW.warn_if_verbose and friends utilize this method. Verbosity is controlled on the command line (see IMW::Runner) or by setting IMW::VERBOSE in your configuration file.
# File 'lib/imw/utils/log.rb', line 24
def self.verbose?
  VERBOSE || verbose
end
.warn(*events) ⇒ Object
# File 'lib/imw/utils/log.rb', line 51
def self.warn *events
   = events.flatten.
  .reverse_merge! :level => Logger::WARN
  announce events,
end
Instance Method Details
#track_count(tracker, every = 1000) ⇒ Object
Log repetitions in a given context.
At every n’th call (n is the every argument, default 1000), announce progress in IMW.log.
# File 'lib/imw/utils/log.rb', line 84
def track_count tracker, every=1000
  PROGRESS_COUNTERS[tracker] ||= 0
  PROGRESS_COUNTERS[tracker] += 1
  chunk = every * (PROGRESS_COUNTERS[tracker]/every).to_i
  track_progress "count_of_#{tracker}", chunk
end
#track_progress(tracker, val) ⇒ Object
When the slowly-changing tracked variable val changes value, announce its new value. Always announces on first call.
Ex:
track_progress :indexing_names, name[0..0] # announce at each initial letter
track_progress :files, (i / 1000) # announce once every 1,000 iterations
# File 'lib/imw/utils/log.rb', line 69
def track_progress tracker, val
  unless (IMW::PROGRESS_TRACKERS.include?(tracker)) &&
      (IMW::PROGRESS_TRACKERS[tracker] == val)
    announce "#{tracker.to_s.gsub(/_/,' ')}: #{val}"
    IMW::PROGRESS_TRACKERS[tracker] = val
  end
end
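The interplay between the two tracking methods is easiest to see by running them. The sketch below re-creates the logic of the two listings in a small self-contained class; ProgressSketch is a hypothetical name, and announcements are collected in an array instead of being sent to IMW.log.

```ruby
# Hypothetical re-creation of the tracking logic; collects messages instead of logging.
class ProgressSketch
  attr_reader :announcements

  def initialize
    @trackers      = {}          # last announced value per tracker
    @counters      = Hash.new(0) # number of calls per tracker
    @announcements = []
  end

  # Announce only when the tracked value differs from the last one seen.
  def track_progress(tracker, val)
    return if @trackers.key?(tracker) && @trackers[tracker] == val
    @announcements << "#{tracker.to_s.tr('_', ' ')}: #{val}"
    @trackers[tracker] = val
  end

  # Count calls and round the count down to a multiple of +every+,
  # so the tracked value changes once per +every+ calls.
  def track_count(tracker, every = 1000)
    @counters[tracker] += 1
    chunk = every * (@counters[tracker] / every)
    track_progress("count_of_#{tracker}", chunk)
  end
end

progress = ProgressSketch.new
2500.times { progress.track_count(:rows, 1000) }
puts progress.announcements
# count of rows: 0
# count of rows: 1000
# count of rows: 2000
```

The first line appears because the very first call always announces, with the count still rounded down to 0.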