Class: BufferedTokenizer

Inherits:
Object
Defined in:
lib/buftok.rb

Overview

Statefully split input data by a specifiable token

BufferedTokenizer takes a delimiter upon instantiation, or acts line-based by default. It allows input to be spoon-fed from an outside source that delivers arbitrary-length datagrams which may or may not contain the token by which entities are delimited.

Examples:

tokenizer = BufferedTokenizer.new("\n")
tokenizer.extract("foo\nbar")  #=> ["foo"]
tokenizer.extract("baz\n")     #=> ["barbaz"]
tokenizer.flush                 #=> ""

Constant Summary

SPLIT_LIMIT =

Limit passed to String#split to preserve trailing empty fields

-1
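
As a quick illustration (not part of the gem's API), the -1 limit keeps the trailing empty field that String#split would otherwise drop, which is how the tokenizer can tell a chunk ending exactly on a delimiter apart from one that does not:

```ruby
# Default split drops trailing empty fields:
"foo\nbar\n".split("\n")      #=> ["foo", "bar"]

# With a limit of -1 the trailing empty field survives, signalling
# that the chunk ended exactly on a delimiter:
"foo\nbar\n".split("\n", -1)  #=> ["foo", "bar", ""]
```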

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(delimiter = "\n") ⇒ BufferedTokenizer

Create a new BufferedTokenizer

Operates on input delimited by the given delimiter, which defaults to "\n".

The input buffer is stored as an array. This is by far the most efficient approach given language constraints (in C a linked list would be a more appropriate data structure). Segments of input data are stored in a list which is only joined when a token is reached, substantially reducing the number of objects required for the operation.
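
The deferred join can be sketched in isolation (the variable names here are illustrative, not the gem's internals): segments are appended cheaply and only concatenated once a token boundary is reached.

```ruby
# Appending to an array is cheap per segment; the single join at the end
# allocates one result string instead of one per concatenation.
segments = []
segments << "foo"
segments << "ba"
segments << "r"
segments.join  #=> "foobar"
```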

Examples:

tokenizer = BufferedTokenizer.new("<>")

Parameters:

  • delimiter (String) (defaults to: "\n")

    the token delimiter (default: "\n")



# File 'lib/buftok.rb', line 50

def initialize(delimiter = "\n")
  @delimiter = delimiter
  @input = []
  @tail = +""
  @overlap = @delimiter.length - 1
end

Instance Attribute Details

#overlap ⇒ Integer (readonly)

Return the delimiter overlap length

The number of characters at the end of a chunk that may contain a partial delimiter, equal to delimiter.length - 1.

Examples:

BufferedTokenizer.new("<>").overlap  #=> 1

Returns:

  • (Integer)

    delimiter.length - 1



# File 'lib/buftok.rb', line 30

def overlap
  @overlap
end
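
To see why the overlap matters, consider a multi-character delimiter arriving split across two chunks (a hedged sketch; `chunk` and `suspect` are illustrative names, not gem internals):

```ruby
delimiter = "<>"
overlap = delimiter.length - 1  #=> 1

# A chunk may end mid-delimiter; its last `overlap` characters must be
# re-examined once the next chunk arrives.
chunk = "abc<"
suspect = chunk[-overlap, overlap]  #=> "<"
delimiter.start_with?(suspect)      #=> true
```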

Instance Method Details

#extract(data) ⇒ Array<String>

Extract tokenized entities from the input data

Extract takes an arbitrary string of input data and returns an array of tokenized entities, provided there were any available to extract. This makes for easy processing of datagrams using a pattern like:

tokenizer.extract(data).map { |entity| Decode(entity) }.each { ... }

Using a limit of -1 makes split return "" when the token is at the end of the string, so the last element is always the start of the next chunk.

Examples:

tokenizer = BufferedTokenizer.new
tokenizer.extract("foo\nbar")  #=> ["foo"]

Parameters:

  • data (String)

    a chunk of input data

Returns:

  • (Array<String>)

    complete tokens extracted from the input



# File 'lib/buftok.rb', line 94

def extract(data)
  data = rejoin_split_delimiter(data)

  @input << @tail
  entities = data.split(@delimiter, SPLIT_LIMIT)
  @tail = entities.shift # : String

  consolidate_input(entities) if entities.length.positive?

  entities
end
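
The tail bookkeeping above can be traced with plain String#split (a simplified sketch: it joins the tail eagerly, whereas the implementation defers the join via its private helpers, and it assumes the default "\n" delimiter):

```ruby
tail = +""    # partial token carried over between chunks
tokens = []   # completed entities

["foo\nbar", "baz\n"].each do |data|
  entities = (tail + data).split("\n", -1)
  tail = entities.pop   # last element is the start of the next token
  tokens.concat(entities)
end

tokens  #=> ["foo", "barbaz"]
tail    #=> ""
```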

#flush ⇒ String

Flush the contents of the input buffer

Return the contents of the input buffer even though a token has not yet been encountered, then reset the buffer.

Examples:

tokenizer = BufferedTokenizer.new
tokenizer.extract("foo\nbar")
tokenizer.flush  #=> "bar"

Returns:

  • (String)

    the buffered input



# File 'lib/buftok.rb', line 119

def flush
  @input << @tail
  buffer = @input.join
  @input.clear
  @tail = +""
  buffer
end

#size ⇒ Integer

Return the byte size of the internal buffer

Size is not cached; it is computed on each call, so that extract pays no bookkeeping cost on its hot path.

Examples:

tokenizer = BufferedTokenizer.new
tokenizer.extract("foo")
tokenizer.size  #=> 3

Returns:

  • (Integer)


# File 'lib/buftok.rb', line 70

def size
  @tail.length + @input.sum(&:length)
end
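
A hand computation over a hypothetical buffer state (the local variables stand in for @input and @tail):

```ruby
input = ["foo", "ba"]  # buffered segments
tail  = +"r"           # partial token after the last delimiter seen
tail.length + input.sum(&:length)  #=> 6
```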