Class: Classifier::Streaming::LineReader

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/classifier/streaming/line_reader.rb

Overview

Memory-efficient line reader for large files and IO streams. Reads lines one at a time and can yield in configurable batches.

Examples:

Reading line by line

File.open('large_corpus.txt') do |file|
  reader = LineReader.new(file)
  reader.each { |line| process(line) }
end

Reading in batches

reader = LineReader.new(io, batch_size: 100)
reader.each_batch { |batch| process_batch(batch) }

Constructor Details

#initialize(io, batch_size: 100) ⇒ LineReader

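Creates a new LineReader that wraps io, which can be any object that responds to each_line (such as a File or StringIO). batch_size sets how many lines #each_batch yields at a time and defaults to 100.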



# File 'lib/classifier/streaming/line_reader.rb', line 26

def initialize(io, batch_size: 100)
  @io = io
  @batch_size = batch_size
end

Instance Attribute Details

#batch_size ⇒ Object (readonly)

Returns the value of attribute batch_size: the number of lines per batch, as passed to the constructor.



# File 'lib/classifier/streaming/line_reader.rb', line 21

def batch_size
  @batch_size
end

Instance Method Details

#each ⇒ Object

Iterates over each line in the IO stream. Lines are chomped (the trailing line terminator is removed) before they are yielded. Returns an Enumerator when called without a block.



# File 'lib/classifier/streaming/line_reader.rb', line 36

def each
  return enum_for(:each) unless block_given?

  @io.each_line do |line|
    yield line.chomp
  end
end
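
Because the class includes Enumerable and #each returns an Enumerator when no block is given, the usual collection methods should work on a reader. The sketch below illustrates this with a hypothetical stand-in class, DemoLineReader, that copies only the #each shown above so the snippet runs without the gem installed; it is not the library class itself.

```ruby
require 'stringio'

# Hypothetical stand-in that mirrors the #each listing above.
class DemoLineReader
  include Enumerable

  def initialize(io)
    @io = io
  end

  def each
    return enum_for(:each) unless block_given?

    @io.each_line { |line| yield line.chomp }
  end
end

io = StringIO.new("alpha\nbeta\ngamma\n")
reader = DemoLineReader.new(io)

first_two = reader.first(2)      # => ["alpha", "beta"]; stops reading after two lines

io.rewind                        # #each consumes the stream, so rewind before a second pass
lengths = reader.map(&:length)   # => [5, 4, 5]

io.rewind
enum = reader.each               # no block given, so an Enumerator is returned
```

Note that #each reads from the stream's current position and does not rewind it, so a second pass over the same IO needs an explicit rewind (or a fresh IO object).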

#each_batch {|batch| ... } ⇒ Object

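Iterates over batches of lines. Each batch is an array of up to batch_size chomped lines; the final batch may be shorter. Returns an Enumerator when called without a block.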

Yields:

  • (batch)


# File 'lib/classifier/streaming/line_reader.rb', line 49

def each_batch
  return enum_for(:each_batch) unless block_given?

  batch = [] #: Array[String]
  each do |line|
    batch << line
    if batch.size >= @batch_size
      yield batch
      batch = []
    end
  end
  yield batch unless batch.empty?
end
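
To make the batching behavior concrete, here is a minimal sketch that mirrors the loop above in a hypothetical free-standing helper, batches_of, and compares it with Enumerable#each_slice from the standard library. The helper name and the sample input are assumptions for illustration only.

```ruby
require 'stringio'

# Hypothetical helper that mirrors the batching loop in #each_batch,
# collecting the batches into an array instead of yielding them.
def batches_of(io, batch_size)
  result = []
  batch = []
  io.each_line do |line|
    batch << line.chomp
    if batch.size >= batch_size
      result << batch
      batch = []          # fresh array, so earlier batches are never mutated
    end
  end
  result << batch unless batch.empty?   # trailing partial batch
  result
end

text = "l1\nl2\nl3\nl4\nl5\n"

batches = batches_of(StringIO.new(text), 2)
# => [["l1", "l2"], ["l3", "l4"], ["l5"]]

# The same grouping using only the standard library:
sliced = StringIO.new(text).each_line.map(&:chomp).each_slice(2).to_a
```

Five lines with a batch size of 2 give two full batches and one final batch of a single line. Since a new array is assigned after every yield, a block can safely keep a reference to a batch it has received.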

#estimate_line_count(sample_size: 100) ⇒ Object

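Estimates the total number of lines in the IO stream. The method reads up to sample_size lines from the start of the stream, computes their average size in bytes, and divides the stream's total size by that average; the stream position is restored after sampling. This is a rough estimate and is only as accurate as the sampled lines are representative of the rest. Returns nil if the stream does not respond to size and rewind, if no lines could be sampled, or if an IOError or Errno::ESPIPE is raised (for example on a non-seekable stream).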



# File 'lib/classifier/streaming/line_reader.rb', line 68

def estimate_line_count(sample_size: 100)
  return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)

  begin
    original_pos = @io.pos
    @io.rewind

    sample_bytes = 0
    sample_lines = 0

    sample_size.times do
      line = @io.gets
      break unless line

      sample_bytes += line.bytesize
      sample_lines += 1
    end

    @io.seek(original_pos)

    return nil if sample_lines.zero?

    avg_line_size = sample_bytes.to_f / sample_lines
    io_size = @io.__send__(:size) #: Integer
    (io_size / avg_line_size).round
  rescue IOError, Errno::ESPIPE
    nil
  end
end
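
The arithmetic is easier to follow with numbers. The sketch below repeats the same calculation in a hypothetical free-standing function, estimate_lines (not part of the library, and without the capability checks or rescue clause), and runs it on a StringIO whose first ten lines are much shorter than the last ten.

```ruby
require 'stringio'

# Hypothetical free-function version of the estimate, using the same arithmetic
# as the listing above.
def estimate_lines(io, sample_size: 100)
  original_pos = io.pos
  io.rewind

  sample_bytes = 0
  sample_lines = 0
  sample_size.times do
    line = io.gets
    break unless line

    sample_bytes += line.bytesize   # includes the trailing newline
    sample_lines += 1
  end

  io.seek(original_pos)             # put the stream back where it was
  return nil if sample_lines.zero?

  (io.size / (sample_bytes.to_f / sample_lines)).round
end

# 10 lines of 3 bytes followed by 10 lines of 9 bytes: 120 bytes, 20 lines.
io = StringIO.new(("ab\n" * 10) + ("abcdefgh\n" * 10))

short_sample = estimate_lines(io, sample_size: 10)   # 120 / 3.0 => 40
full_sample  = estimate_lines(io, sample_size: 20)   # 120 / 6.0 => 20
empty        = estimate_lines(StringIO.new(''))      # nothing to sample => nil
```

Sampling only the short lines doubles the estimate (40 instead of the true 20), which shows why the result should be treated as approximate, for instance to size a progress bar rather than to allocate exact storage.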