Class: CorrectHorseBatteryStaple::Corpus::Isam
- Inherits:
-
Base
- Object
- CorrectHorseBatteryStaple::Corpus
- Base
- CorrectHorseBatteryStaple::Corpus::Isam
- Includes:
- Backend::Isam, Memoize
- Defined in:
- lib/correct_horse_battery_staple/corpus/isam.rb
Overview
Format of header:
0..3 - OB - offset of body start in bytes; network byte order 4..7 - LP - length of prelude in network byte order 8..OB-1 - P - JSON-encoded prelude hash and space padding OB..EOF - array of fixed size records as described in prelude
Contents of Prelude (after JSON decoding):
P - length of word part of record P - length of frequency part of record (always 4 bytes) P - length of total part of record P - number of records P - field name sorted by (word or frequency) P - corpus statistics
Format of record:
2 bytes - LW - actual length of word within field P bytes - LW bytes of word (W) + P-LW bytes of padding P (4) bytes - frequency as network byte order long
Constant Summary collapse
- INITIAL_PRELUDE_LENGTH =
4096
Constants included from Backend::Isam
Backend::Isam::F_PRELUDE_AT_END
Instance Attribute Summary
Attributes inherited from Base
#frequency_mean, #frequency_stddev, #original_size, #probability_mean, #probability_stddev, #weighted_size
Class Method Summary collapse
-
.read(filename) ⇒ Object
factory-ish constructor.
Instance Method Summary collapse
- #_pick(string, count, length_range, max_iterations) ⇒ Object
-
#initialize(filename, stats = nil) ⇒ Isam
constructor
A new instance of Isam.
-
#pick(count, options = {}) ⇒ Object
optimized pick - does NOT support :filter, though.
Methods included from Memoize
Methods included from Backend::Isam
Methods inherited from Base
#candidates, #compose_filters, #count, #count_by_options, #count_candidates, #each, #entropy_per_word, #entropy_per_word_by_filter, #filter, #filter_for_options, #frequencies, #inspect, #load_stats_from_hash, #precache, #recalculate, #reset, #result, #sorted_entries, #stats, #words
Methods included from CorrectHorseBatteryStaple::Common
#array_sample, #logger, #random_in_range, #random_number, #set_sample
Methods inherited from CorrectHorseBatteryStaple::Corpus
Constructor Details
#initialize(filename, stats = nil) ⇒ Isam
Returns a new instance of Isam.
36 37 38 39 40 41 |
# File 'lib/correct_horse_battery_staple/corpus/isam.rb', line 36 def initialize(filename, stats = nil) super @filename = filename @file = CorrectHorseBatteryStaple::Util.open_binary(filename, "r") parse_prelude end |
Class Method Details
.read(filename) ⇒ Object
factory-ish constructor
44 45 46 |
# File 'lib/correct_horse_battery_staple/corpus/isam.rb', line 44 def self.read(filename) self.new filename end |
Instance Method Details
#_pick(string, count, length_range, max_iterations) ⇒ Object
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 |
# File 'lib/correct_horse_battery_staple/corpus/isam.rb', line 71 def _pick(string, count, length_range, max_iterations) result = [] iterations = 0 # don't bother reading already read words skip_cache = Set.new range_size = string.length / @entry_length # don't cons! entry = CorrectHorseBatteryStaple::Word.new :word => "" while result.length < count && iterations < max_iterations i = random_number(range_size) unless skip_cache.include? i pr = parse_record(string, i, entry, length_range) if pr result << pr.dup else skip_cache << i end end iterations += 1 end result end |
#pick(count, options = {}) ⇒ Object
optimized pick - does NOT support :filter, though
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 |
# File 'lib/correct_horse_battery_staple/corpus/isam.rb', line 49 def pick(count, = {}) # incompat check raise NotImplementedError, "ISAM does not support :filter option" if [:filter] # options parsing string = record_percentile_range_read([:percentile] || (0..100)) range_size = string.length / @entry_length max_iterations = [[:max_iterations] || 1000, count*10].max if range_size < count raise ArgumentError, "Percentile range contains fewer words than requested count: p=#{[:percentile].inspect}, l=#{string.length}, el=#{@entry_length}, range_size = #{range_size}" end # the real work result = _pick(string, count, [:word_length], max_iterations) # validate that we succeeded raise "Cannot find #{count} words matching criteria" if result.length < count result end |