Class: HexaPDF::Tokenizer
- Inherits: Object
- Defined in: lib/hexapdf/tokenizer.rb
Overview
Tokenizes the content of an IO object following the PDF rules.
See: PDF2.0 s7.2
Defined Under Namespace
Classes: Token
Constant Summary
- TOKEN_DICT_START = Token.new('<<'.b)   # :nodoc:
- TOKEN_DICT_END = Token.new('>>'.b)   # :nodoc:
- TOKEN_ARRAY_START = Token.new('['.b)   # :nodoc:
- TOKEN_ARRAY_END = Token.new(']'.b)   # :nodoc:
- NO_MORE_TOKENS = ::Object.new
  This object is returned when there are no more tokens to read.
- WHITESPACE = " \n\r\0\t\f"
  Characters defined as whitespace. See: PDF2.0 s7.2.2
- DELIMITER = "()<>{}/[]%"
  Characters defined as delimiters. See: PDF2.0 s7.2.2
- WHITESPACE_MULTI_RE = /[#{WHITESPACE}]+/   # :nodoc:
- WHITESPACE_OR_DELIMITER_RE = /(?=[#{Regexp.escape(WHITESPACE + DELIMITER)}])/   # :nodoc:
Instance Attribute Summary
- #io ⇒ Object (readonly)
  The IO object from which the tokens are read.
Instance Method Summary
- #initialize(io, on_correctable_error: nil) ⇒ Tokenizer (constructor)
  Creates a new tokenizer for the given IO stream.
- #next_byte ⇒ Object
  Reads the byte (an integer) at the current position and advances the scan pointer.
- #next_integer_or_keyword ⇒ Object
  Returns a single integer or keyword token read from the current position and advances the scan pointer.
- #next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object
  Returns the PDF object at the current position.
- #next_token ⇒ Object
  Returns a single token read from the current position and advances the scan pointer.
- #next_xref_entry ⇒ Object
  Reads the cross-reference subsection entry at the current position and advances the scan pointer.
- #peek_token ⇒ Object
  Returns the next token but does not advance the scan pointer.
- #pos ⇒ Object
  Returns the current position of the tokenizer inside the IO object.
- #pos=(pos) ⇒ Object
  Sets the position at which the next token should be read.
- #scan_until(re) ⇒ Object
  Utility method for scanning until the given regular expression matches.
- #skip_whitespace ⇒ Object
  Skips all whitespace at the current position.
Constructor Details
#initialize(io, on_correctable_error: nil) ⇒ Tokenizer
Creates a new tokenizer for the given IO stream.
If on_correctable_error is set to an object responding to call(msg, pos), errors for correctable situations are only raised if the return value of calling the object is true.
# File 'lib/hexapdf/tokenizer.rb', line 83

def initialize(io, on_correctable_error: nil)
  @io = io
  @io_chunk = String.new(''.b)
  @ss = StringScanner.new(''.b)
  @original_pos = -1
  @on_correctable_error = on_correctable_error || proc { false }
  self.pos = 0
end
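Example (an illustrative sketch, not part of the generated documentation; a StringIO stands in for the PDF file IO and the handler name strict is hypothetical; later examples assume the same requires):

require 'hexapdf'
require 'stringio'

io = StringIO.new("<< /Type /Catalog >>".b)
# Hypothetical handler: always returning true means every correctable
# situation is treated as an error and raised.
strict = lambda {|_msg, _pos| true }
tokenizer = HexaPDF::Tokenizer.new(io, on_correctable_error: strict)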
Instance Attribute Details
#io ⇒ Object (readonly)
The IO object from which the tokens are read.
# File 'lib/hexapdf/tokenizer.rb', line 77

def io
  @io
end
Instance Method Details
#next_byte ⇒ Object
Reads the byte (an integer) at the current position and advances the scan pointer.
# File 'lib/hexapdf/tokenizer.rb', line 225

def next_byte
  prepare_string_scanner(1)
  @ss.pos += 1
  @ss.string.getbyte(@ss.pos - 1)
end
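Example (illustrative sketch using a StringIO):

tok = HexaPDF::Tokenizer.new(StringIO.new("AB".b))
tok.next_byte   #=> 65  (the byte value of 'A')
tok.next_byte   #=> 66  (the byte value of 'B')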
#next_integer_or_keyword ⇒ Object
Returns a single integer or keyword token read from the current position and advances the scan pointer. If the current position doesn't contain such a token, nil is returned without advancing the scan pointer. The value NO_MORE_TOKENS is returned if there are no more tokens available.
Initial runs of whitespace characters are ignored.
Note: This is a special method meant for use with reconstructing the cross-reference table!
# File 'lib/hexapdf/tokenizer.rb', line 210

def next_integer_or_keyword
  skip_whitespace
  byte = @ss.string.getbyte(@ss.pos) || -1
  if 48 <= byte && byte <= 57
    parse_number
  elsif (97 <= byte && byte <= 122) || (65 <= byte && byte <= 90)
    parse_keyword
  elsif byte == -1 # we reached the end of the file
    NO_MORE_TOKENS
  else
    nil
  end
end
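Example (illustrative sketch; the shown return values are assumptions derived from the behaviour described above):

tok = HexaPDF::Tokenizer.new(StringIO.new("0000000123 n /Name".b))
tok.next_integer_or_keyword   #=> 123 (an integer)
tok.next_integer_or_keyword   #=> a Token wrapping "n" (a keyword)
tok.next_integer_or_keyword   #=> nil, '/' starts neither an integer nor a keyword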
#next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object
Returns the PDF object at the current position. This is different from #next_token because references, arrays and dictionaries consist of multiple tokens.
If the allow_end_array_token argument is true, the ']' token is permitted to facilitate the use of this method during array parsing.
See: PDF2.0 s7.3
# File 'lib/hexapdf/tokenizer.rb', line 178

def next_object(allow_end_array_token: false, allow_keyword: false)
  token = next_token
  if token.kind_of?(Token)
    case token
    when TOKEN_DICT_START
      token = parse_dictionary
    when TOKEN_ARRAY_START
      token = parse_array
    when TOKEN_ARRAY_END
      unless allow_end_array_token
        raise HexaPDF::MalformedPDFError.new("Found invalid end array token ']'", pos: pos)
      end
    else
      unless allow_keyword
        maybe_raise("Invalid object, got token #{token}", force: token !~ /^-?(nan|inf)$/i)
        token = 0
      end
    end
  end
  token
end
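Example (illustrative sketch; the hash shown assumes HexaPDF's usual mapping of PDF names to Ruby symbols):

tok = HexaPDF::Tokenizer.new(StringIO.new("<< /Type /Page /Count 5 >>".b))
# A dictionary is assembled from multiple tokens and returned as one object.
tok.next_object   #=> {Type: :Page, Count: 5}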
#next_token ⇒ Object
Returns a single token read from the current position and advances the scan pointer.
Comments and a run of whitespace characters are ignored. The value NO_MORE_TOKENS is returned if there are no more tokens available.
# File 'lib/hexapdf/tokenizer.rb', line 118

def next_token
  prepare_string_scanner(20)
  prepare_string_scanner(20) while @ss.skip(WHITESPACE_MULTI_RE)
  byte = @ss.string.getbyte(@ss.pos) || -1
  if (48 <= byte && byte <= 57) || byte == 45 || byte == 43 || byte == 46 # 0..9 - + .
    parse_number
  elsif byte == 47 # /
    parse_name
  elsif byte == 40 # (
    parse_literal_string
  elsif byte == 60 # <
    if @ss.string.getbyte(@ss.pos + 1) == 60
      @ss.pos += 2
      TOKEN_DICT_START
    else
      parse_hex_string
    end
  elsif byte == 62 # >
    unless @ss.string.getbyte(@ss.pos + 1) == 62
      raise HexaPDF::MalformedPDFError.new("Delimiter '>' found at invalid position", pos: pos)
    end
    @ss.pos += 2
    TOKEN_DICT_END
  elsif byte == 91 # [
    @ss.pos += 1
    TOKEN_ARRAY_START
  elsif byte == 93 # ]
    @ss.pos += 1
    TOKEN_ARRAY_END
  elsif byte == 41 # )
    raise HexaPDF::MalformedPDFError.new("Delimiter ')' found at invalid position", pos: pos)
  elsif byte == 123 || byte == 125 # { }
    Token.new(@ss.get_byte)
  elsif byte == 37 # %
    until @ss.skip_until(/(?=[\r\n])/)
      return NO_MORE_TOKENS unless prepare_string_scanner
    end
    next_token
  elsif byte == -1 # we reached the end of the file
    NO_MORE_TOKENS
  else # everything else consisting of regular characters
    parse_keyword
  end
end
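Example (illustrative sketch; the return values are assumptions based on the constants and parsing rules described above):

tok = HexaPDF::Tokenizer.new(StringIO.new("[ 1 2.5 /Name ] % comment".b))
tok.next_token   #=> TOKEN_ARRAY_START
tok.next_token   #=> 1
tok.next_token   #=> 2.5
tok.next_token   #=> :Name
tok.next_token   #=> TOKEN_ARRAY_END
tok.next_token   #=> NO_MORE_TOKENS  (the comment is skipped)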
#next_xref_entry ⇒ Object
Reads the cross-reference subsection entry at the current position and advances the scan pointer.
If a problem is detected, yields to the caller where the argument recoverable is truthy if the problem is recoverable.
See: PDF2.0 s7.5.4
# File 'lib/hexapdf/tokenizer.rb', line 238

def next_xref_entry #:yield: recoverable
  prepare_string_scanner(20)
  if !@ss.skip(/(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|(\r\r|\r|\n))/) || @ss[4]
    yield(@ss[4])
  end
  [@ss[1].to_i, @ss[2].to_i, @ss[3]]
end
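Example (illustrative sketch; the block is only called when a problem with the entry is detected):

tok = HexaPDF::Tokenizer.new(StringIO.new("0000000017 00000 n \n".b))
tok.next_xref_entry {|recoverable| raise "invalid xref entry" unless recoverable }
#=> [17, 0, "n"]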
#peek_token ⇒ Object
Returns the next token but does not advance the scan pointer.
# File 'lib/hexapdf/tokenizer.rb', line 164

def peek_token
  pos = self.pos
  tok = next_token
  self.pos = pos
  tok
end
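Example (illustrative sketch):

tok = HexaPDF::Tokenizer.new(StringIO.new("42 /Answer".b))
tok.peek_token   #=> 42
tok.next_token   #=> 42  (peek_token did not advance the scan pointer)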
#pos ⇒ Object
Returns the current position of the tokenizer inside the IO object. Note that this position might be different from io.pos since the latter could have been changed somewhere else.
# File 'lib/hexapdf/tokenizer.rb', line 96

def pos
  @original_pos + @ss.pos
end
#pos=(pos) ⇒ Object
Sets the position at which the next token should be read.
Note that this does not set io.pos directly (at the moment of invocation)!
# File 'lib/hexapdf/tokenizer.rb', line 103

def pos=(pos)
  if pos >= @original_pos && pos <= @original_pos + @ss.string.size
    @ss.pos = pos - @original_pos
  else
    @original_pos = pos
    @next_read_pos = pos
    @ss.string.clear
    @ss.reset
  end
end
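Example (illustrative sketch showing how a position can be saved and restored; the Token return values are assumptions):

tok = HexaPDF::Tokenizer.new(StringIO.new("foo bar".b))
start = tok.pos
tok.next_token   #=> a Token wrapping "foo"
tok.pos = start  # rewind; reading starts again at the saved position
tok.next_token   #=> a Token wrapping "foo" again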
#scan_until(re) ⇒ Object
Utility method for scanning until the given regular expression matches.
If the end of the file is reached in the process, nil is returned. Otherwise the matched string is returned.
# File 'lib/hexapdf/tokenizer.rb', line 258

def scan_until(re)
  until (data = @ss.scan_until(re))
    return nil unless prepare_string_scanner
  end
  data
end
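Example (illustrative sketch; the returned string includes the matched text):

tok = HexaPDF::Tokenizer.new(StringIO.new("some data endstream more".b))
tok.scan_until(/endstream/)   #=> "some data endstream"
tok.scan_until(/endstream/)   #=> nil  (end of file reached without another match)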
#skip_whitespace ⇒ Object
Skips all whitespace at the current position.
See: PDF2.0 s7.2.2
# File 'lib/hexapdf/tokenizer.rb', line 249

def skip_whitespace
  prepare_string_scanner
  prepare_string_scanner while @ss.skip(WHITESPACE_MULTI_RE)
end