Class: HexaPDF::Tokenizer

Inherits:
Object
Defined in:
lib/hexapdf/tokenizer.rb

Overview

Tokenizes the content of an IO object following the PDF rules.

See: PDF2.0 s7.2

Direct Known Subclasses

Content::Tokenizer

Defined Under Namespace

Classes: Token

Constant Summary

TOKEN_DICT_START = Token.new('<<'.b)
TOKEN_DICT_END = Token.new('>>'.b)
TOKEN_ARRAY_START = Token.new('['.b)
TOKEN_ARRAY_END = Token.new(']'.b)

NO_MORE_TOKENS = ::Object.new

This object is returned when there are no more tokens to read.

WHITESPACE = " \n\r\0\t\f"

Characters defined as whitespace.

See: PDF2.0 s7.2.2

DELIMITER = "()<>{}/[]%"

Characters defined as delimiters.

See: PDF2.0 s7.2.2

WHITESPACE_MULTI_RE = /[#{WHITESPACE}]+/
WHITESPACE_OR_DELIMITER_RE = /(?=[#{Regexp.escape(WHITESPACE + DELIMITER)}])/
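
The two regular expressions can be tried out without HexaPDF. The following is a minimal sketch using only the standard library's StringScanner; the constant values are copied from the summary above, while the helper name regular_run is hypothetical and not part of the library.

```ruby
require 'strscan'

# Values copied from the constant summary above.
WHITESPACE = " \n\r\0\t\f"
DELIMITER = "()<>{}/[]%"
WHITESPACE_MULTI_RE = /[#{WHITESPACE}]+/
WHITESPACE_OR_DELIMITER_RE = /(?=[#{Regexp.escape(WHITESPACE + DELIMITER)}])/

# Hypothetical helper: skips leading whitespace, then returns the run of
# regular characters up to (but not including) the next whitespace or
# delimiter character. Falls back to the remaining input if no boundary
# character follows.
def regular_run(scanner)
  scanner.skip(WHITESPACE_MULTI_RE)
  scanner.scan_until(WHITESPACE_OR_DELIMITER_RE) || scanner.scan(/.+/m)
end

scanner = StringScanner.new("  endobj<</Type /Page>>")
p regular_run(scanner) # "endobj"
p scanner.rest         # "<</Type /Page>>"
```

Because WHITESPACE_OR_DELIMITER_RE is a zero-width lookahead, the boundary character itself is not consumed and is still available to whatever reads the next token.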

Constructor Details

#initialize(io, on_correctable_error: nil) ⇒ Tokenizer

Creates a new tokenizer for the given IO stream.

If on_correctable_error is set to an object responding to call(msg, pos), errors for correctable situations are only raised if the return value of calling the object is true.
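
The callback contract can be illustrated with a small standalone sketch. ProblemCollector and report_correctable are hypothetical names; report_correctable is only a simplified stand-in for the decision described above (raise only if the callable returns true) and is not HexaPDF's own code.

```ruby
# Hypothetical collector: records every correctable problem and returns
# false so that, per the description above, no error would be raised.
class ProblemCollector
  attr_reader :problems

  def initialize
    @problems = []
  end

  def call(msg, pos)
    @problems << [pos, msg]
    false
  end
end

# Simplified stand-in for the tokenizer's decision (an assumption modelled
# on the description, not the library's implementation): raise only when
# the handler returns a true value, otherwise carry on.
def report_correctable(handler, msg, pos)
  raise ArgumentError, "#{msg} (at #{pos})" if handler.call(msg, pos)
  :corrected
end

collector = ProblemCollector.new
report_correctable(collector, "Invalid object", 42)
p collector.problems # [[42, "Invalid object"]]
```

With HexaPDF loaded, an object like this would be passed as the on_correctable_error keyword argument of the constructor; a handler such as proc { true } would instead make every correctable problem raise.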



# File 'lib/hexapdf/tokenizer.rb', line 83

def initialize(io, on_correctable_error: nil)
  @io = io
  @io_chunk = String.new(''.b)
  @ss = StringScanner.new(''.b)
  @original_pos = -1
  @on_correctable_error = on_correctable_error || proc { false }
  self.pos = 0
end

Instance Attribute Details

#io ⇒ Object (readonly)

The IO object from which the tokens are read.



# File 'lib/hexapdf/tokenizer.rb', line 77

def io
  @io
end

Instance Method Details

#next_byte ⇒ Object

Reads the byte (an integer) at the current position and advances the scan pointer.



# File 'lib/hexapdf/tokenizer.rb', line 225

def next_byte
  prepare_string_scanner(1)
  @ss.scan_byte
end

#next_integer_or_keyword ⇒ Object

Returns a single integer or keyword token read from the current position and advances the scan pointer. If the current position doesn’t contain such a token, nil is returned without advancing the scan pointer. The value NO_MORE_TOKENS is returned if there are no more tokens available.

Initial runs of whitespace characters are ignored.

Note: This is a special method meant for use with reconstructing the cross-reference table!
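
The three possible outcomes (a token, nil, or NO_MORE_TOKENS) depend only on the first byte after the whitespace. The sketch below mirrors that dispatch on a plain string; classify_start is a hypothetical helper, and the symbols it returns merely stand in for the real return values.

```ruby
# Hypothetical helper mirroring the dispatch on the first non-whitespace
# byte: digits start an integer, ASCII letters start a keyword, end of
# input means there are no more tokens, and anything else yields nil.
def classify_start(data)
  rest = data.sub(/\A[ \n\r\0\t\f]+/, '')
  case rest.getbyte(0) || -1
  when 48..57 then :integer          # 0..9
  when 97..122, 65..90 then :keyword # a..z, A..Z
  when -1 then :no_more_tokens       # nothing left after the whitespace
  else nil
  end
end

p classify_start("  12 0 obj") # :integer
p classify_start("xref")       # :keyword
p classify_start(" \n")        # :no_more_tokens
p classify_start("/Name")      # nil
```

Note that a leading sign or decimal point does not count as the start of an integer here, so an input such as "-5" falls into the nil branch.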



# File 'lib/hexapdf/tokenizer.rb', line 209

def next_integer_or_keyword
  skip_whitespace
  byte = @ss.peek_byte || -1
  case byte
  when 48, 49, 50, 51, 52, 53, 54, 55, 56, 57
    parse_number
  when 97..122, 65..90
    parse_keyword
  when -1 # we reached the end of the file
    NO_MORE_TOKENS
  else
    nil
  end
end

#next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object

Returns the PDF object at the current position. This is different from #next_token because references, arrays and dictionaries consist of multiple tokens.

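If the allow_end_array_token argument is true, the ‘]’ token is permitted to facilitate the use of this method during array parsing.

If the allow_keyword argument is true, keyword tokens are returned unchanged. Otherwise a keyword is reported as an invalid object and, if no error is raised for it, replaced by 0.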

See: PDF2.0 s7.3



# File 'lib/hexapdf/tokenizer.rb', line 177

def next_object(allow_end_array_token: false, allow_keyword: false)
  token = next_token

  if token.kind_of?(Token)
    case token
    when TOKEN_DICT_START
      token = parse_dictionary
    when TOKEN_ARRAY_START
      token = parse_array
    when TOKEN_ARRAY_END
      unless allow_end_array_token
        raise HexaPDF::MalformedPDFError.new("Found invalid end array token ']'", pos: pos)
      end
    else
      unless allow_keyword
        maybe_raise("Invalid object, got token #{token}", force: token !~ /^-?(nan|inf)$/i)
        token = 0
      end
    end
  end

  token
end

#next_token ⇒ Object

Returns a single token read from the current position and advances the scan pointer.

Comments and a run of whitespace characters are ignored. The value NO_MORE_TOKENS is returned if there are no more tokens available.
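
How comments and whitespace are passed over can be sketched with a plain StringScanner. This is an illustration under simplifying assumptions (the whole input is already in memory, and the helper name skip_comments_and_whitespace is hypothetical); the real method additionally refills its buffer from the IO object.

```ruby
require 'strscan'

# Hypothetical helper: advances the scanner past any mix of whitespace
# runs and '%' comments. A comment extends up to, but not including, the
# next carriage return or line feed.
def skip_comments_and_whitespace(scanner)
  loop do
    scanner.skip(/[ \n\r\0\t\f]+/)
    break unless scanner.peek(1) == "%"
    # A comment without a trailing line ending runs to the end of input.
    scanner.skip_until(/(?=[\r\n])/) || scanner.terminate
  end
  scanner
end

scanner = StringScanner.new("  % a comment\n% another\r\n /Name")
skip_comments_and_whitespace(scanner)
p scanner.rest # "/Name"
```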



# File 'lib/hexapdf/tokenizer.rb', line 118

def next_token
  prepare_string_scanner(20)
  prepare_string_scanner(20) while @ss.skip(WHITESPACE_MULTI_RE)
  case (byte = @ss.scan_byte || -1)
  when 43, 45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57 # + - . 0..9
    @ss.pos -= 1
    parse_number
  when 47 # /
    parse_name
  when 40 # (
    parse_literal_string
  when 60 # <
    if @ss.peek_byte == 60
      @ss.pos += 1
      TOKEN_DICT_START
    else
      parse_hex_string
    end
  when 62 # >
    unless @ss.scan_byte == 62
      raise HexaPDF::MalformedPDFError.new("Delimiter '>' found at invalid position", pos: pos - 1)
    end
    TOKEN_DICT_END
  when 91 # [
    TOKEN_ARRAY_START
  when 93 # ]
    TOKEN_ARRAY_END
  when 41 # )
    raise HexaPDF::MalformedPDFError.new("Delimiter ')' found at invalid position", pos: pos - 1)
  when 123, 125 # { }
    Token.new(byte.chr.b)
  when 37 # %
    until @ss.skip_until(/(?=[\r\n])/)
      return NO_MORE_TOKENS unless prepare_string_scanner
    end
    next_token
  when -1 # we reached the end of the file
    NO_MORE_TOKENS
  else # everything else consisting of regular characters
    @ss.pos -= 1
    parse_keyword
  end
end

#next_xref_entry ⇒ Object

Reads the cross-reference subsection entry at the current position and advances the scan pointer.

If a problem is detected, yields to the caller with the argument recoverable, which is truthy if the problem is recoverable.

See: PDF2.0 s7.5.4
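
The entry format and the meaning of "recoverable" can be explored with the regular expression from the source listing of this method. The sketch below is standalone and uses only StringScanner; the constant XREF_ENTRY_RE, the helper parse_xref_entry and the :recoverable/:fatal symbols are hypothetical names chosen for illustration.

```ruby
require 'strscan'

# Pattern taken from the source listing of this method: a 10-digit offset,
# a 5-digit generation number, the type (n or f), and a two-byte line
# ending. Capture group 4 only matches non-standard line endings.
XREF_ENTRY_RE = /(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|(\r\r|\r|\n))/

# Hypothetical helper: returns [entry, problem] where problem is nil for a
# well-formed entry, :recoverable for an entry with a non-standard line
# ending, and :fatal if the input does not look like an entry at all.
def parse_xref_entry(scanner)
  problem = nil
  if !scanner.skip(XREF_ENTRY_RE) || scanner[4]
    problem = scanner[4] ? :recoverable : :fatal
  end
  [[scanner[1].to_i, scanner[2].to_i, scanner[3]], problem]
end

scanner = StringScanner.new("0000000017 00000 n \n0000000000 65535 f\n")
p parse_xref_entry(scanner) # [[17, 0, "n"], nil]
p parse_xref_entry(scanner) # [[0, 65535, "f"], :recoverable]
```

The second entry is one byte short because it ends in a bare line feed, which is why it is flagged as recoverable rather than rejected.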



# File 'lib/hexapdf/tokenizer.rb', line 237

def next_xref_entry #:yield: recoverable
  prepare_string_scanner(20)
  if !@ss.skip(/(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|(\r\r|\r|\n))/) || @ss[4]
    yield(@ss[4])
  end
  [@ss[1].to_i, @ss[2].to_i, @ss[3]]
end

#peek_token ⇒ Object

Returns the next token but does not advance the scan pointer.
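
Peeking amounts to remembering the position, reading, and restoring the position. A minimal sketch of that pattern on a plain StringScanner, with the hypothetical helper peek_word standing in for the real token reader:

```ruby
require 'strscan'

# Hypothetical helper: reads the next word but restores the scan pointer
# afterwards, so the same word is returned again by the next real read.
def peek_word(scanner)
  saved = scanner.pos
  word = scanner.scan(/\w+/)
  scanner.pos = saved
  word
end

scanner = StringScanner.new("obj endobj")
p peek_word(scanner)    # "obj"
p scanner.pos           # 0
p scanner.scan(/\w+/)   # "obj"
```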



# File 'lib/hexapdf/tokenizer.rb', line 163

def peek_token
  pos = self.pos
  tok = next_token
  self.pos = pos
  tok
end

#pos ⇒ Object

Returns the current position of the tokenizer inside the IO object.

Note that this position might be different from io.pos since the latter could have been changed somewhere else.
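
Why the two positions can differ is easiest to see with a buffered reader. The following is a simplified sketch, not the library's implementation: WindowedCursor and its methods fill and advance are hypothetical, and only the formula for pos is taken from the source listing of this method.

```ruby
require 'strscan'
require 'stringio'

# Hypothetical buffered cursor: a chunk of the IO is copied into a
# StringScanner, and the logical position is the chunk's starting offset
# plus the scan pointer inside the chunk.
class WindowedCursor
  def initialize(io)
    @io = io
    @original_pos = 0
    @ss = StringScanner.new(String.new)
  end

  # Loads length bytes starting at the absolute offset start.
  def fill(start, length)
    @io.pos = start
    @original_pos = start
    @ss.string = @io.read(length) || String.new
  end

  # Moves the scan pointer n bytes forward inside the buffer.
  def advance(n)
    @ss.pos += n
  end

  def pos
    @original_pos + @ss.pos
  end
end

io = StringIO.new("%PDF-2.0\n1 0 obj\n".b)
cursor = WindowedCursor.new(io)
cursor.fill(9, 5)  # buffer now holds "1 0 o"
cursor.advance(2)
p cursor.pos       # 11
p io.pos           # 14
```

The IO object has already been read to the end of the buffered chunk, while the cursor only counts the bytes that were actually consumed.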



# File 'lib/hexapdf/tokenizer.rb', line 96

def pos
  @original_pos + @ss.pos
end

#pos=(pos) ⇒ Object

Sets the position at which the next token should be read.

Note that this does not set io.pos directly (at the moment of invocation)!



# File 'lib/hexapdf/tokenizer.rb', line 103

def pos=(pos)
  if pos >= @original_pos && pos <= @original_pos + @ss.string.size
    @ss.pos = pos - @original_pos
  else
    @original_pos = pos
    @next_read_pos = pos
    @ss.string.clear
    @ss.reset
  end
end

#scan_until(re) ⇒ Object

Utility method for scanning until the given regular expression matches.

If the end of the file is reached in the process, nil is returned. Otherwise the matched string is returned.
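
The retry loop behind this behaviour can be sketched without HexaPDF. In the example below, chunked_scan_until and its chunk_size parameter are hypothetical; the refill step is a simplified replacement for the tokenizer's internal buffer handling.

```ruby
require 'strscan'
require 'stringio'

# Hypothetical helper: keeps appending chunks from the IO to the scanner's
# buffer until the pattern matches. Returns the text from the current scan
# position through the end of the match, or nil once the IO is exhausted.
def chunked_scan_until(io, scanner, re, chunk_size: 8)
  until (data = scanner.scan_until(re))
    chunk = io.read(chunk_size)
    return nil if chunk.nil? # end of file reached without a match
    scanner << chunk
  end
  data
end

io = StringIO.new("abc endstream xyz".b)
scanner = StringScanner.new(String.new)
p chunked_scan_until(io, scanner, /endstream/) # "abc endstream"
p chunked_scan_until(io, scanner, /endobj/)    # nil
```

A failed scan_until leaves the scan pointer where it was, so retrying after each refill rescans from the same starting point and a match that straddles two chunks is still found.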



# File 'lib/hexapdf/tokenizer.rb', line 257

def scan_until(re)
  until (data = @ss.scan_until(re))
    return nil unless prepare_string_scanner
  end
  data
end

#skip_whitespace ⇒ Object

Skips all whitespace at the current position.

See: PDF2.0 s7.2.2



# File 'lib/hexapdf/tokenizer.rb', line 248

def skip_whitespace
  prepare_string_scanner
  prepare_string_scanner while @ss.skip(WHITESPACE_MULTI_RE)
end