Class: BufferedTokenizer
- Inherits: Object
- Defined in: lib/buftok.rb
Overview
Statefully split input data by a specifiable token.
BufferedTokenizer takes a delimiter upon instantiation, or acts line-based by default. It allows input to be spoon-fed from an outside source that delivers arbitrary-length datagrams, which may or may not contain the token by which entities are delimited.
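The buffering behaviour can be sketched with a minimal stand-in (a hypothetical LineTokenizer, for illustration only; it is not the library code and omits the multi-character-delimiter handling described later):

```ruby
# Minimal sketch of the documented behaviour: complete entities are
# returned, and the trailing partial entity is buffered for the next call.
class LineTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @tail = +""
  end

  # Returns complete entities; buffers the trailing partial entity.
  def extract(data)
    entities = (@tail + data).split(@delimiter, -1)
    @tail = entities.pop
    entities
  end

  # Returns whatever is buffered, then resets.
  def flush
    buffer = @tail
    @tail = +""
    buffer
  end
end

t = LineTokenizer.new
t.extract("foo\nba")   # => ["foo"]
t.extract("r\nbaz")    # => ["bar"]
t.flush                # => "baz"
```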
Constant Summary collapse
- SPLIT_LIMIT = -1
Limit passed to String#split to preserve trailing empty fields.
Instance Attribute Summary collapse
-
#overlap ⇒ Integer
readonly
Return the delimiter overlap length.
Instance Method Summary collapse
-
#extract(data) ⇒ Array<String>
Extract tokenized entities from the input data.
-
#flush ⇒ String
Flush the contents of the input buffer.
-
#initialize(delimiter = "\n") ⇒ BufferedTokenizer
constructor
Create a new BufferedTokenizer.
-
#size ⇒ Integer
Return the byte size of the internal buffer.
Constructor Details
#initialize(delimiter = "\n") ⇒ BufferedTokenizer
Create a new BufferedTokenizer.
Operates on lines delimited by a delimiter, which is "\n" by default.
The input buffer is stored as an array. This is by far the most efficient approach given language constraints (in C a linked list would be a more appropriate data structure). Segments of input data are stored in a list which is only joined when a token is reached, substantially reducing the number of objects required for the operation.
# File 'lib/buftok.rb', line 50

def initialize(delimiter = "\n")
  @delimiter = delimiter
  @input = []
  @tail = +""
  @overlap = @delimiter.length - 1
end
Instance Attribute Details
#overlap ⇒ Integer (readonly)
Return the delimiter overlap length.
The number of characters at the end of a chunk that may contain a partial delimiter, equal to delimiter.length - 1.
# File 'lib/buftok.rb', line 30

def overlap
  @overlap
end
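For example, with a two-character delimiter such as "\r\n", overlap is 1: a chunk ending in "\r" may hold the first half of a delimiter that only completes in the next chunk. A small sketch of why that matters:

```ruby
delimiter = "\r\n"
overlap = delimiter.length - 1   # one trailing character may be a partial delimiter

# Two datagrams that split the delimiter across the chunk boundary:
chunk_a = "foo\r"
chunk_b = "\nbar"

# Neither chunk alone contains the whole delimiter, but rejoined it is
# whole again and splitting works:
(chunk_a + chunk_b).split(delimiter, -1)   # => ["foo", "bar"]
```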
Instance Method Details
#extract(data) ⇒ Array<String>
Extract tokenized entities from the input data
Extract takes an arbitrary string of input data and returns an array of tokenized entities, provided there were any available to extract. This makes for easy processing of datagrams using a pattern like:
tokenizer.extract(data).map { |entity| Decode(entity) }.each { ... }
Passing SPLIT_LIMIT (-1) makes split return "" when the token is at the end of the string, meaning the last element is always the start of the next chunk.
# File 'lib/buftok.rb', line 94

def extract(data)
  data = rejoin_split_delimiter(data)
  @input << @tail
  entities = data.split(@delimiter, SPLIT_LIMIT)
  @tail = entities.pop #: String
  consolidate_input(entities) if entities.length.positive?
  entities
end
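The effect of the -1 limit can be seen with plain String#split:

```ruby
"foo\nbar".split("\n", -1)    # => ["foo", "bar"]     "bar" starts the next chunk
"foo\nbar\n".split("\n", -1)  # => ["foo", "bar", ""]  "" means nothing is pending
"foo\nbar\n".split("\n")      # => ["foo", "bar"]     default drops the empty field
```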
#flush ⇒ String
Flush the contents of the input buffer
Return the contents of the input buffer even though a token has not yet been encountered, then reset the buffer.
# File 'lib/buftok.rb', line 119

def flush
  @input << @tail
  buffer = @input.join
  @input.clear
  @tail = +""
  buffer
end
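The join-and-reset pattern described above, shown in isolation (illustrative variable names, not the library internals):

```ruby
input = ["par", "tial "]   # segments accumulated since the last token
tail  = +"line"            # start of the entity still in progress

# Flush: append the tail, join everything, then reset both.
input << tail
buffer = input.join        # => "partial line"
input.clear
tail = +""

buffer                     # the caller receives everything that was buffered
```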
#size ⇒ Integer
Return the byte size of the internal buffer
Size is not cached and is determined every time this method is called in order to optimize throughput for extract.
# File 'lib/buftok.rb', line 70

def size
  @tail.length + @input.sum(&:length)
end