Class: UV::BufferedTokenizer

Inherits:
Object
  • Object
Defined in:
lib/uv-rays/buffered_tokenizer.rb

Constructor Details

#initialize(options) ⇒ BufferedTokenizer

Returns a new instance of BufferedTokenizer.

Parameters:

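  • options (Hash): :delimiter is required; :indicator, :size_limit and :verbose are optional (:verbose is only stored when :size_limit is set)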

Raises:

  • (ArgumentError) if options[:delimiter] is missing


# File 'lib/uv-rays/buffered_tokenizer.rb', line 22

def initialize(options)
    @delimiter  = options[:delimiter]
    @indicator  = options[:indicator]
    @size_limit = options[:size_limit]
    @verbose    = options[:verbose] if @size_limit

    raise ArgumentError, 'no delimiter provided' unless @delimiter

    @input = ''
end
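
The option handling in the initializer can be exercised on its own. What follows is a minimal sketch, not the library itself: SketchTokenizer is a hypothetical stand-in that copies only the initializer listed above so the example runs without the uv-rays gem, and build_error_message is a helper invented for illustration.

```ruby
# Hypothetical stand-in copying the initializer shown above.
class SketchTokenizer
  attr_reader :delimiter, :indicator, :size_limit, :verbose

  def initialize(options)
    @delimiter  = options[:delimiter]
    @indicator  = options[:indicator]
    @size_limit = options[:size_limit]
    @verbose    = options[:verbose] if @size_limit

    raise ArgumentError, 'no delimiter provided' unless @delimiter

    @input = +''
  end
end

# Returns the ArgumentError message, or nil when construction succeeds.
def build_error_message(options)
  SketchTokenizer.new(options)
  nil
rescue ArgumentError => e
  e.message
end

p build_error_message(delimiter: "\n")   # nil
p build_error_message(size_limit: 64)    # "no delimiter provided"

# :verbose is only stored when :size_limit is also given.
p SketchTokenizer.new(delimiter: "\n", verbose: true).verbose                  # nil
p SketchTokenizer.new(delimiter: "\n", size_limit: 64, verbose: true).verbose  # true
```

With the gem installed, the same option hashes would presumably be passed to UV::BufferedTokenizer.new.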

Instance Attribute Details

#delimiter ⇒ Object

Returns the value of attribute delimiter.



# File 'lib/uv-rays/buffered_tokenizer.rb', line 19

def delimiter
  @delimiter
end

#indicator ⇒ Object

Returns the value of attribute indicator.



# File 'lib/uv-rays/buffered_tokenizer.rb', line 19

def indicator
  @indicator
end

#size_limit ⇒ Object

Returns the value of attribute size_limit.



# File 'lib/uv-rays/buffered_tokenizer.rb', line 19

def size_limit
  @size_limit
end

#verbose ⇒ Object

Returns the value of attribute verbose.



# File 'lib/uv-rays/buffered_tokenizer.rb', line 19

def verbose
  @verbose
end

Instance Method Details

#empty? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/uv-rays/buffered_tokenizer.rb', line 93

def empty?
    @input.empty?
end

#extract(data) ⇒ Object

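Extract appends an arbitrary string of input data to the internal buffer and returns an array of the complete, delimiter-terminated entities now available (an empty array if there are none). Any trailing partial entity stays in the buffer until a later call completes it.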

Examples:


tokenizer.extract(data).
    map { |entity| Decode(entity) }.each { ... }

Parameters:

  • data (String)


# File 'lib/uv-rays/buffered_tokenizer.rb', line 42

def extract(data)
    @input << data

    # Extract token-delimited entities from the input string with the split command.
    # There's a bit of craftiness here with the -1 parameter. Normally split would
    # behave no differently regardless of whether the token lies at the very end of
    # the input buffer or not (i.e. a literal edge case). Specifying -1 forces split
    # to return "" in this case, meaning that the last entry in the list represents
    # a new segment of data where the token has not been encountered.
    messages = @input.split(@delimiter, -1)

    if @indicator
        @input = messages.pop
        entities = []
        messages.each do |msg|
            res = msg.split(@indicator, -1)
            entities << res.last if res.length > 1
        end
    else
        entities = messages
        @input = entities.pop
    end

    # Check to see if the buffer has exceeded capacity, if we're imposing a limit
    if @size_limit && @input.size > @size_limit
        if @indicator && @indicator.respond_to?(:length) # String indicator (a Regexp has no length)
            # Keep just enough of the buffer that, if only the last character of the
            # indicator were missing, the next extract would still match. This is very
            # much an edge case, but it is the best we can do with a full buffer.
            # Being one character short of a delimiter would simply be unfortunate.
            @input = @input[-(@indicator.length - 1)..-1]
        else
            @input = ''
        end
        raise 'input buffer exceeded limit' if @verbose
    end

    return entities
end
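
The effect of the -1 limit and of the indicator handling in the listing above can be checked in isolation. This is a minimal sketch using only String#split; split_buffer and strip_to_indicator are hypothetical helper names that are not part of uv-rays, and the "\x02" byte is just an illustrative indicator.

```ruby
# Without a limit, split drops trailing empty strings; with -1 it keeps them,
# so the last element is always the not-yet-terminated remainder.
p "one\ntwo\n".split("\n")       # ["one", "two"]
p "one\ntwo\n".split("\n", -1)   # ["one", "two", ""]
p "one\ntw".split("\n", -1)      # ["one", "tw"]

# Mirrors the delimiter step of extract: complete pieces plus the remainder
# that would be kept in the buffer.
def split_buffer(buffer, delimiter)
  pieces = buffer.split(delimiter, -1)
  remainder = pieces.pop
  [pieces, remainder]
end

# Mirrors the indicator step: keep only the text after the last indicator,
# and discard (nil) any message that contains no indicator at all.
def strip_to_indicator(message, indicator)
  parts = message.split(indicator, -1)
  parts.length > 1 ? parts.last : nil
end

p split_buffer("one\ntw", "\n")                    # [["one"], "tw"]
p strip_to_indicator("junk\x02payload", "\x02")    # "payload"
p strip_to_indicator("payload", "\x02")            # nil
```

Because the remainder is always the last element, a delimiter that arrives at the very end of a chunk leaves an empty string in the buffer rather than losing track of the boundary.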

#flush ⇒ String

Flush the contents of the input buffer, i.e. return the input buffer even though a token has not yet been encountered.

Returns:

  • (String)


# File 'lib/uv-rays/buffered_tokenizer.rb', line 86

def flush
    buffer = @input
    @input = ''
    buffer
end
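
To see how extract, empty? and flush interact when data arrives in arbitrary chunks, here is a minimal sketch. SketchTokenizer is a hypothetical stand-in that reproduces only the delimiter-only paths of the listings above (no indicator, no size limit) so the example runs without the gem; feed is a helper invented for illustration.

```ruby
# Hypothetical stand-in with the delimiter-only behaviour listed above.
class SketchTokenizer
  def initialize(options)
    @delimiter = options[:delimiter]
    raise ArgumentError, 'no delimiter provided' unless @delimiter
    @input = +''
  end

  def extract(data)
    @input << data
    entities = @input.split(@delimiter, -1)
    @input = entities.pop   # the unterminated tail stays buffered
    entities
  end

  def flush
    buffer = @input
    @input = +''
    buffer
  end

  def empty?
    @input.empty?
  end
end

# Feeds each chunk through extract, then reports the buffer state before and
# after flushing whatever is left over.
def feed(chunks)
  tokenizer = SketchTokenizer.new(delimiter: "\r\n")
  lines = chunks.flat_map { |chunk| tokenizer.extract(chunk) }
  [lines, tokenizer.empty?, tokenizer.flush, tokenizer.empty?]
end

lines, empty_before, leftover, empty_after = feed(["GET /\r", "\nHost: a\r\nUser-", "Agent"])
p lines          # ["GET /", "Host: a"]
p empty_before   # false
p leftover       # "User-Agent"
p empty_after    # true
```

Note that the delimiter itself may be split across two chunks, as with "\r" and "\n" here; the entity is only returned once the full delimiter has been seen.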