Class: HexaPDF::Content::Tokenizer

Inherits:
Tokenizer
  • Object
Defined in:
lib/hexapdf/content/parser.rb

Overview

More efficient tokenizer for content streams. This tokenizer class works directly on a string and not on an IO.

Changes compared to HexaPDF::Tokenizer:

  • Since a content stream is usually parsed front to back, the tokenizer can raise a StopIteration error instead of returning NO_MORE_TOKENS once the end of the string is reached. This avoids a costly check in each iteration. To enable this behaviour, pass “raise_on_eos: true” to the constructor.

  • Indirect object references are not supported by this tokenizer!

See: PDF2.0 s7.2
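
The difference between the two end-of-string conventions can be sketched with a toy tokenizer. This is a minimal illustration, not HexaPDF code: ToyTokenizer, tokens_with_sentinel and tokens_with_stop_iteration are hypothetical names, and the toy only splits on whitespace. It relies on Kernel#loop rescuing StopIteration, so the caller needs no per-token comparison against a sentinel.

```ruby
require 'strscan'

# Hypothetical stand-in for the real tokenizer: it only returns
# whitespace-separated words and does not understand PDF syntax.
class ToyTokenizer
  NO_MORE_TOKENS = :no_more_tokens

  def initialize(string, raise_on_eos: false)
    @ss = StringScanner.new(string)
    @raise_on_eos = raise_on_eos
  end

  def next_token
    @ss.skip(/\s+/)
    token = @ss.scan(/\S+/)
    return token if token
    # End of string: either raise or return the sentinel value.
    @raise_on_eos ? raise(StopIteration) : NO_MORE_TOKENS
  end
end

# Sentinel style: every iteration compares the token with NO_MORE_TOKENS.
def tokens_with_sentinel(string)
  tokenizer = ToyTokenizer.new(string)
  result = []
  while (token = tokenizer.next_token) != ToyTokenizer::NO_MORE_TOKENS
    result << token
  end
  result
end

# Exception style: Kernel#loop rescues StopIteration and ends the loop.
def tokens_with_stop_iteration(string)
  tokenizer = ToyTokenizer.new(string, raise_on_eos: true)
  result = []
  loop { result << tokenizer.next_token }
  result
end
```

Both helpers return the same list of tokens; only the way the end of the input is signalled differs.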

Constant Summary

Constants inherited from Tokenizer

Tokenizer::DELIMITER, Tokenizer::NO_MORE_TOKENS, Tokenizer::TOKEN_ARRAY_END, Tokenizer::TOKEN_ARRAY_START, Tokenizer::TOKEN_DICT_END, Tokenizer::TOKEN_DICT_START, Tokenizer::WHITESPACE, Tokenizer::WHITESPACE_MULTI_RE, Tokenizer::WHITESPACE_OR_DELIMITER_RE

Instance Attribute Summary

Attributes inherited from Tokenizer

#io

Instance Method Summary

Methods inherited from Tokenizer

#next_byte, #next_integer_or_keyword, #next_object, #next_xref_entry, #peek_token, #skip_whitespace

Constructor Details

#initialize(string, raise_on_eos: false) ⇒ Tokenizer

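Creates a new tokenizer for the given string. If raise_on_eos is true, a StopIteration error is raised at the end of the string instead of NO_MORE_TOKENS being returned.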



# File 'lib/hexapdf/content/parser.rb', line 63

def initialize(string, raise_on_eos: false)
  @ss = StringScanner.new(string)
  @string = string
  @raise_on_eos = raise_on_eos
end

Instance Attribute Details

#string ⇒ Object (readonly)

The string that is tokenized.



# File 'lib/hexapdf/content/parser.rb', line 60

def string
  @string
end

Instance Method Details

#next_token ⇒ Object

See: HexaPDF::Tokenizer#next_token



# File 'lib/hexapdf/content/parser.rb', line 85

def next_token
  @ss.skip(WHITESPACE_MULTI_RE)
  case (byte = @ss.scan_byte || -1)
  when 43, 45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57 # + - . 0..9
    @ss.pos -= 1
    parse_number
  when 47 # /
    parse_name
  when 40 # (
    parse_literal_string
  when 60 # <
    if @ss.peek_byte == 60
      @ss.pos += 1
      TOKEN_DICT_START
    else
      parse_hex_string
    end
  when 62 # >
    unless @ss.scan_byte == 62
      raise HexaPDF::MalformedPDFError.new("Delimiter '>' found at invalid position", pos: pos - 1)
    end
    TOKEN_DICT_END
  when 91 # [
    TOKEN_ARRAY_START
  when 93 # ]
    TOKEN_ARRAY_END
  when 41 # )
    raise HexaPDF::MalformedPDFError.new("Delimiter ')' found at invalid position", pos: pos - 1)
  when 123, 125 # { }
    Token.new(byte.chr.b)
  when 37 # %
    unless @ss.skip_until(/(?=[\r\n])/)
      (@raise_on_eos ? (raise StopIteration) : (return NO_MORE_TOKENS))
    end
    next_token
  when -1
    @raise_on_eos ? raise(StopIteration) : NO_MORE_TOKENS
  else
    @ss.pos -= 1
    parse_keyword
  end
end
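
The method above decides what to parse from the first non-whitespace byte alone. The following sketch mirrors that dispatch table in isolation so that the mapping is easier to read. The names token_kind and first_token_kind and the returned symbols are hypothetical and not part of HexaPDF; the whitespace pattern is an assumption based on the six PDF whitespace characters.

```ruby
# Maps the first byte of a token to the kind of token it introduces,
# following the branches of the case statement in next_token.
def token_kind(byte)
  case byte
  when 43, 45, 46, 48..57 then :number          # + - . 0..9
  when 47 then :name                             # /
  when 40 then :literal_string                   # (
  when 60 then :hex_string_or_dict_start         # < or <<
  when 62 then :dict_end                         # > (only valid as >>)
  when 91 then :array_start                      # [
  when 93 then :array_end                        # ]
  when 41 then :invalid                          # ) outside of a string
  when 123, 125 then :brace                      # { }
  when 37 then :comment                          # %
  when nil then :end_of_string
  else :keyword
  end
end

# Skips leading PDF whitespace (assumed: NUL, TAB, LF, FF, CR, space)
# and classifies the byte that follows; nil means the string is exhausted.
def first_token_kind(string)
  token_kind(string.sub(/\A[\0\t\n\f\r ]+/, '').getbyte(0))
end
```

For example, first_token_kind("  /F1 12 Tf") yields :name and first_token_kind("BT") yields :keyword, because any byte without its own branch is treated as the start of a keyword.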

#pos ⇒ Object

See: HexaPDF::Tokenizer#pos



# File 'lib/hexapdf/content/parser.rb', line 70

def pos
  @ss.pos
end

#pos=(pos) ⇒ Object

See: HexaPDF::Tokenizer#pos=



# File 'lib/hexapdf/content/parser.rb', line 75

def pos=(pos)
  @ss.pos = pos
end
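
Since #pos and #pos= simply delegate to the underlying StringScanner, one plausible use is to look ahead and then rewind. The sketch below shows that save-and-restore pattern on a bare StringScanner. The helper peek_word is a hypothetical name, and the sketch does not claim to reproduce how the inherited #peek_token is implemented.

```ruby
require 'strscan'

# Returns the next whitespace-separated word without consuming it:
# the scan position is saved first and restored afterwards.
def peek_word(scanner)
  saved_pos = scanner.pos
  scanner.skip(/\s+/)
  word = scanner.scan(/\S+/)
  scanner.pos = saved_pos
  word
end
```

Calling peek_word twice in a row on the same scanner returns the same word, because the position is unchanged after each call.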

#scan_until(re) ⇒ Object

See: HexaPDF::Tokenizer#scan_until



# File 'lib/hexapdf/content/parser.rb', line 80

def scan_until(re)
  @ss.scan_until(re)
end
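
Because #scan_until is a thin wrapper around StringScanner#scan_until, it returns everything from the current position up to and including the match and advances the position, or returns nil and leaves the position untouched when nothing matches. A minimal sketch of that behaviour on a bare StringScanner follows. The helper read_until_marker and the sample marker are illustrative assumptions, not HexaPDF API.

```ruby
require 'strscan'

# Starts scanning at start_pos and reads up to and including the first
# match of marker_re. Returns the scanned text (or nil) and the new position.
def read_until_marker(string, start_pos, marker_re)
  scanner = StringScanner.new(string)
  scanner.pos = start_pos
  data = scanner.scan_until(marker_re)
  [data, scanner.pos]
end
```

A caller that needs to skip over raw data embedded in a stream can use the returned position to continue tokenizing after the marker.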