Class: CorrectHorseBatteryStaple::Corpus::Isam

Inherits:
Base show all
Includes:
Backend::Isam, Memoize
Defined in:
lib/correct_horse_battery_staple/corpus/isam.rb

Overview

Format of header:

0..3 - OB - offset of body start in bytes; network byte order 4..7 - LP - length of prelude in network byte order 8..OB-1 - P - JSON-encoded prelude hash and space padding OB..EOF - array of fixed size records as described in prelude

Contents of Prelude (after JSON decoding):

P - length of word part of record P - length of frequency part of record (always 4 bytes) P - length of total part of record P - number of records P - field name sorted by (word or frequency) P - corpus statistics

Format of record:

2 bytes - LW - actual length of word within field P bytes - LW bytes of word (W) + P-LW bytes of padding P (4) bytes - frequency as network byte order long

Constant Summary collapse

INITIAL_PRELUDE_LENGTH =
4096

Constants included from Backend::Isam

Backend::Isam::F_PRELUDE_AT_END

Instance Attribute Summary

Attributes inherited from Base

#frequency_mean, #frequency_stddev, #original_size, #probability_mean, #probability_stddev, #weighted_size

Class Method Summary collapse

Instance Method Summary collapse

Methods included from Memoize

included

Methods included from Backend::Isam

included

Methods inherited from Base

#candidates, #compose_filters, #count, #count_by_options, #count_candidates, #each, #entropy_per_word, #entropy_per_word_by_filter, #filter, #filter_for_options, #frequencies, #inspect, #load_stats_from_hash, #precache, #recalculate, #reset, #result, #sorted_entries, #stats, #words

Methods included from CorrectHorseBatteryStaple::Common

#array_sample, #logger, #random_in_range, #random_number, #set_sample

Methods inherited from CorrectHorseBatteryStaple::Corpus

format_for

Constructor Details

#initialize(filename, stats = nil) ⇒ Isam

Returns a new instance of Isam.



36
37
38
39
40
41
# File 'lib/correct_horse_battery_staple/corpus/isam.rb', line 36

def initialize(filename, stats = nil)
  super
  @filename = filename
  @file = CorrectHorseBatteryStaple::Util.open_binary(filename, "r")
  parse_prelude
end

Class Method Details

.read(filename) ⇒ Object

factory-ish constructor



44
45
46
# File 'lib/correct_horse_battery_staple/corpus/isam.rb', line 44

def self.read(filename)
  self.new filename
end

Instance Method Details

#_pick(string, count, length_range, max_iterations) ⇒ Object



71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
# File 'lib/correct_horse_battery_staple/corpus/isam.rb', line 71

def _pick(string, count, length_range, max_iterations)
  result = []
  iterations = 0

  # don't bother reading already read words
  skip_cache = Set.new
  range_size = string.length / @entry_length

  # don't cons!
  entry = CorrectHorseBatteryStaple::Word.new :word => ""
  while result.length < count && iterations < max_iterations
    i = random_number(range_size)
    unless skip_cache.include? i
      pr = parse_record(string, i, entry, length_range)
      if pr
        result << pr.dup
      else
        skip_cache << i
      end
    end
    iterations += 1
  end
  result
end

#pick(count, options = {}) ⇒ Object

optimized pick - does NOT support :filter, though

Raises:

  • (NotImplementedError)


49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
# File 'lib/correct_horse_battery_staple/corpus/isam.rb', line 49

def pick(count, options = {})
  # incompat check
  raise NotImplementedError, "ISAM does not support :filter option" if options[:filter]

  # options parsing
  string         = record_percentile_range_read(options[:percentile] || (0..100))
  range_size     = string.length / @entry_length
  max_iterations = [options[:max_iterations] || 1000, count*10].max

  if range_size < count
    raise ArgumentError, "Percentile range contains fewer words than requested count: p=#{options[:percentile].inspect}, l=#{string.length}, el=#{@entry_length}, range_size = #{range_size}"
  end

  # the real work
  result         = _pick(string, count, options[:word_length], max_iterations)

  # validate that we succeeded
  raise "Cannot find #{count} words matching criteria" if result.length < count

  result
end