Class: BufferedTokenizer

Inherits:
Object
Defined in:
lib/buftok.rb

Overview

Statefully split input data by a specifiable token

BufferedTokenizer takes a delimiter upon instantiation, or acts line-based by default. It allows input to be spoon-fed from an outside source that delivers arbitrary-length datagrams which may or may not contain the token by which entities are delimited.

Examples:

tokenizer = BufferedTokenizer.new("\n")
tokenizer.extract("foo\nbar")  #=> ["foo"]
tokenizer.extract("baz\n")     #=> ["barbaz"]
tokenizer.flush                 #=> ""

Constant Summary

SPLIT_LIMIT =

Limit passed to String#split to preserve trailing empty fields

-1
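
As a quick illustration (not part of the gem's API), the -1 limit keeps the trailing empty field that String#split would otherwise drop, which is how the tokenizer can tell a chunk ending exactly on a delimiter apart from one that does not:

```ruby
# Default split drops trailing empty fields:
"foo\nbar\n".split("\n")      #=> ["foo", "bar"]

# With a limit of -1 the trailing empty field survives, signalling
# that the chunk ended exactly on a delimiter:
"foo\nbar\n".split("\n", -1)  #=> ["foo", "bar", ""]
```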

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(delimiter = "\n") ⇒ BufferedTokenizer

Create a new BufferedTokenizer

Operates on input delimited by the given delimiter, which defaults to "\n".

The input buffer is stored as an array. This is by far the most efficient approach given language constraints (in C a linked list would be a more appropriate data structure). Segments of input data are stored in a list which is only joined when a token is reached, substantially reducing the number of objects required for the operation.
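
The deferred join can be sketched in isolation (the variable names here are illustrative, not the gem's internals): segments are appended cheaply and only concatenated once a token boundary is reached.

```ruby
# Appending to an array is cheap per segment; the single join at the end
# allocates one result string instead of one per concatenation.
segments = []
segments << "foo"
segments << "ba"
segments << "r"
segments.join  #=> "foobar"
```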

Examples:

tokenizer = BufferedTokenizer.new("<>")

Parameters:

  • delimiter (String) (defaults to: "\n")

    the token delimiter (default: "\n")



# File 'lib/buftok.rb', line 50

def initialize(delimiter = "\n")
  @delimiter = delimiter
  @input = []
  @tail = +""
  @overlap = @delimiter.length - 1
end

Instance Attribute Details

#overlap ⇒ Integer (readonly)

Return the delimiter overlap length

The number of characters at the end of a chunk that may contain a partial delimiter, equal to delimiter.length - 1.

Examples:

BufferedTokenizer.new("<>").overlap  #=> 1

Returns:

  • (Integer)

    delimiter.length - 1



# File 'lib/buftok.rb', line 30

def overlap
  @overlap
end
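
To see why the overlap matters, consider a multi-character delimiter arriving split across two chunks (a hedged sketch; `chunk` and `suspect` are illustrative names, not gem internals):

```ruby
delimiter = "<>"
overlap = delimiter.length - 1  #=> 1

# A chunk may end mid-delimiter; its last `overlap` characters must be
# re-examined once the next chunk arrives.
chunk = "abc<"
suspect = chunk[-overlap, overlap]  #=> "<"
delimiter.start_with?(suspect)      #=> true
```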

Instance Method Details

#extract(data) ⇒ Array<String>

Extract tokenized entities from the input data

Extract takes an arbitrary string of input data and returns an array of tokenized entities, provided there were any available to extract. This makes for easy processing of datagrams using a pattern like:

tokenizer.extract(data).map { |entity| Decode(entity) }.each { ... }

Using a limit of -1 makes split return "" when the token is at the end of the string, so the last element is always the start of the next chunk.

Examples:

tokenizer = BufferedTokenizer.new
tokenizer.extract("foo\nbar")  #=> ["foo"]

Parameters:

  • data (String)

    a chunk of input data

Returns:

  • (Array<String>)

    complete tokens extracted from the input



# File 'lib/buftok.rb', line 94

def extract(data)
  data = rejoin_split_delimiter(data)

  @input << @tail
  entities = data.split(@delimiter, SPLIT_LIMIT)
  @tail = entities.shift # : String

  consolidate_input(entities) if entities.length.positive?

  entities
end
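
The tail bookkeeping above can be traced with plain String#split (a simplified sketch: it joins the tail eagerly, whereas the implementation defers the join via its private helpers, and it assumes the default "\n" delimiter):

```ruby
tail = +""    # partial token carried over between chunks
tokens = []   # completed entities

["foo\nbar", "baz\n"].each do |data|
  entities = (tail + data).split("\n", -1)
  tail = entities.pop   # last element is the start of the next token
  tokens.concat(entities)
end

tokens  #=> ["foo", "barbaz"]
tail    #=> ""
```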

#flush ⇒ String

Flush the contents of the input buffer

Return the contents of the input buffer even though a token has not yet been encountered, then reset the buffer.

Examples:

tokenizer = BufferedTokenizer.new
tokenizer.extract("foo\nbar")
tokenizer.flush  #=> "bar"

Returns:

  • (String)

    the buffered input



# File 'lib/buftok.rb', line 119

def flush
  @input << @tail
  buffer = @input.join
  @input.clear
  @tail = +""
  buffer
end

#size ⇒ Integer

Return the byte size of the internal buffer

Size is not cached; it is computed on each call, so that extract pays no bookkeeping cost on its hot path.

Examples:

tokenizer = BufferedTokenizer.new
tokenizer.extract("foo")
tokenizer.size  #=> 3

Returns:

  • (Integer)


# File 'lib/buftok.rb', line 70

def size
  @tail.length + @input.sum(&:length)
end
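
A hand computation over a hypothetical buffer state (the local variables stand in for @input and @tail):

```ruby
input = ["foo", "ba"]  # buffered segments
tail  = +"r"           # partial token after the last delimiter seen
tail.length + input.sum(&:length)  #=> 6
```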