Class: HexaPDF::Tokenizer
- Inherits: Object
- Defined in: lib/hexapdf/tokenizer.rb
Overview
Tokenizes the content of an IO object following the PDF rules.
See: PDF2.0 s7.2
Defined Under Namespace
Classes: Token
Constant Summary
- TOKEN_DICT_START = Token.new('<<'.b)
- TOKEN_DICT_END = Token.new('>>'.b)
- TOKEN_ARRAY_START = Token.new('['.b)
- TOKEN_ARRAY_END = Token.new(']'.b)
- NO_MORE_TOKENS = ::Object.new
  This object is returned when there are no more tokens to read.
- WHITESPACE = " \n\r\0\t\f"
  Characters defined as whitespace. See: PDF2.0 s7.2.2
- DELIMITER = "()<>{}/[]%"
  Characters defined as delimiters. See: PDF2.0 s7.2.2
- WHITESPACE_MULTI_RE = /[#{WHITESPACE}]+/
- WHITESPACE_OR_DELIMITER_RE = /(?=[#{Regexp.escape(WHITESPACE + DELIMITER)}])/
Instance Attribute Summary
- #io ⇒ Object (readonly)
  The IO object from which the tokens are read.
Instance Method Summary
- #initialize(io, on_correctable_error: nil) ⇒ Tokenizer (constructor)
  Creates a new tokenizer for the given IO stream.
- #next_byte ⇒ Object
  Reads the byte (an integer) at the current position and advances the scan pointer.
- #next_integer_or_keyword ⇒ Object
  Returns a single integer or keyword token read from the current position and advances the scan pointer.
- #next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object
  Returns the PDF object at the current position.
- #next_token ⇒ Object
  Returns a single token read from the current position and advances the scan pointer.
- #next_xref_entry ⇒ Object
  Reads the cross-reference subsection entry at the current position and advances the scan pointer.
- #peek_token ⇒ Object
  Returns the next token but does not advance the scan pointer.
- #pos ⇒ Object
  Returns the current position of the tokenizer inside the IO object.
- #pos=(pos) ⇒ Object
  Sets the position at which the next token should be read.
- #scan_until(re) ⇒ Object
  Utility method for scanning until the given regular expression matches.
- #skip_whitespace ⇒ Object
  Skips all whitespace at the current position.
Constructor Details
#initialize(io, on_correctable_error: nil) ⇒ Tokenizer
Creates a new tokenizer for the given IO stream.
If on_correctable_error is set to an object responding to call(msg, pos), errors for correctable situations are only raised if the return value of calling the object is true.
    # File 'lib/hexapdf/tokenizer.rb', line 83
    def initialize(io, on_correctable_error: nil)
      @io = io
      @io_chunk = String.new(''.b)
      @ss = StringScanner.new(''.b)
      @original_pos = -1
      @on_correctable_error = on_correctable_error || proc { false }
      self.pos = 0
    end
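A minimal usage sketch (not taken from the HexaPDF documentation): it assumes that require 'hexapdf' makes the class available and feeds it a StringIO wrapping raw PDF syntax; the handler name and message text are illustrative only.

    require 'hexapdf'
    require 'stringio'

    io = StringIO.new("42 0 obj << /Type /Example >> endobj".b)
    # The handler is called with the error message and the file position.
    handler = lambda do |msg, pos|
      warn("Correctable problem at offset #{pos}: #{msg}")
      false  # a falsy return value lets the tokenizer apply its correction instead of raising
    end
    tokenizer = HexaPDF::Tokenizer.new(io, on_correctable_error: handler)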
Instance Attribute Details
#io ⇒ Object (readonly)
The IO object from which the tokens are read.
    # File 'lib/hexapdf/tokenizer.rb', line 77
    def io
      @io
    end
Instance Method Details
#next_byte ⇒ Object
Reads the byte (an integer) at the current position and advances the scan pointer.
    # File 'lib/hexapdf/tokenizer.rb', line 225
    def next_byte
      prepare_string_scanner(1)
      @ss.scan_byte
    end
#next_integer_or_keyword ⇒ Object
Returns a single integer or keyword token read from the current position and advances the scan pointer. If the current position doesn't contain such a token, nil is returned without advancing the scan pointer. The value NO_MORE_TOKENS is returned if there are no more tokens available.
Initial runs of whitespace characters are ignored.
Note: This is a special method meant for use with reconstructing the cross-reference table!
    # File 'lib/hexapdf/tokenizer.rb', line 209
    def next_integer_or_keyword
      skip_whitespace
      byte = @ss.peek_byte || -1
      case byte
      when 48, 49, 50, 51, 52, 53, 54, 55, 56, 57
        parse_number
      when 97..122, 65..90
        parse_keyword
      when -1 # we reached the end of the file
        NO_MORE_TOKENS
      else
        nil
      end
    end
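A hedged sketch of how this might look when scanning loose object data during cross-reference reconstruction; the input string is made up, and the commented return values are assumptions based on the source above (digit runs become Integers, alphabetic runs become keyword tokens).

    require 'hexapdf'
    require 'stringio'

    tokenizer = HexaPDF::Tokenizer.new(StringIO.new("12 0 obj".b))
    tokenizer.next_integer_or_keyword  # => 12
    tokenizer.next_integer_or_keyword  # => 0
    tokenizer.next_integer_or_keyword  # => a keyword token for 'obj' (assumed)
    tokenizer.next_integer_or_keyword  # => NO_MORE_TOKENS at end of input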
#next_object(allow_end_array_token: false, allow_keyword: false) ⇒ Object
Returns the PDF object at the current position. This is different from #next_token because references, arrays and dictionaries consist of multiple tokens.
If the allow_end_array_token argument is true, the ']' token is permitted to facilitate the use of this method during array parsing.
See: PDF2.0 s7.3
    # File 'lib/hexapdf/tokenizer.rb', line 177
    def next_object(allow_end_array_token: false, allow_keyword: false)
      token = next_token
      if token.kind_of?(Token)
        case token
        when TOKEN_DICT_START
          token = parse_dictionary
        when TOKEN_ARRAY_START
          token = parse_array
        when TOKEN_ARRAY_END
          unless allow_end_array_token
            raise HexaPDF::MalformedPDFError.new("Found invalid end array token ']'", pos: pos)
          end
        else
          unless allow_keyword
            maybe_raise("Invalid object, got token #{token}", force: token !~ /^-?(nan|inf)$/i)
            token = 0
          end
        end
      end
      token
    end
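An illustrative sketch, assuming require 'hexapdf' and a hand-written dictionary string; the exact return classes (a Hash with Symbol keys, a reference object for '2 0 R') follow the usual HexaPDF conventions and are not spelled out on this page.

    require 'hexapdf'
    require 'stringio'

    io = StringIO.new("<< /Type /Catalog /Pages 2 0 R /Count 3 >>".b)
    tokenizer = HexaPDF::Tokenizer.new(io)
    dict = tokenizer.next_object
    dict[:Type]   # => :Catalog
    dict[:Count]  # => 3
    dict[:Pages]  # => reference to object 2, generation 0 (assumed representation)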
#next_token ⇒ Object
Returns a single token read from the current position and advances the scan pointer.
Comments and a run of whitespace characters are ignored. The value NO_MORE_TOKENS is returned if there are no more tokens available.
    # File 'lib/hexapdf/tokenizer.rb', line 118
    def next_token
      prepare_string_scanner(20)
      prepare_string_scanner(20) while @ss.skip(WHITESPACE_MULTI_RE)
      case (byte = @ss.scan_byte || -1)
      when 43, 45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57 # + - . 0..9
        @ss.pos -= 1
        parse_number
      when 47 # /
        parse_name
      when 40 # (
        parse_literal_string
      when 60 # <
        if @ss.peek_byte == 60
          @ss.pos += 1
          TOKEN_DICT_START
        else
          parse_hex_string
        end
      when 62 # >
        unless @ss.scan_byte == 62
          raise HexaPDF::MalformedPDFError.new("Delimiter '>' found at invalid position", pos: pos - 1)
        end
        TOKEN_DICT_END
      when 91 # [
        TOKEN_ARRAY_START
      when 93 # ]
        TOKEN_ARRAY_END
      when 41 # )
        raise HexaPDF::MalformedPDFError.new("Delimiter ')' found at invalid position", pos: pos - 1)
      when 123, 125 # { }
        Token.new(byte.chr.b)
      when 37 # %
        until @ss.skip_until(/(?=[\r\n])/)
          return NO_MORE_TOKENS unless prepare_string_scanner
        end
        next_token
      when -1 # we reached the end of the file
        NO_MORE_TOKENS
      else # everything else consisting of regular characters
        @ss.pos -= 1
        parse_keyword
      end
    end
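A small sketch of token-level reading, assuming require 'hexapdf'; the commented results reflect the byte dispatch shown above (names as Symbols, numbers as Integers/Floats, literal strings as binary Strings, comments skipped) and are assumptions, not quoted output.

    require 'hexapdf'
    require 'stringio'

    tokenizer = HexaPDF::Tokenizer.new(StringIO.new("/Name 12.5 (text) % comment".b))
    tokenizer.next_token  # => :Name
    tokenizer.next_token  # => 12.5
    tokenizer.next_token  # => "text"
    tokenizer.next_token  # => NO_MORE_TOKENS (the trailing comment is ignored)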
#next_xref_entry ⇒ Object
Reads the cross-reference subsection entry at the current position and advances the scan pointer.
If a problem is detected, yields to the caller where the argument recoverable is truthy if the problem is recoverable.
See: PDF2.0 s7.5.4
    # File 'lib/hexapdf/tokenizer.rb', line 237
    def next_xref_entry #:yield: recoverable
      prepare_string_scanner(20)
      if !@ss.skip(/(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n|(\r\r|\r|\n))/) || @ss[4]
        yield(@ss[4])
      end
      [@ss[1].to_i, @ss[2].to_i, @ss[3]]
    end
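A sketch of reading one cross-reference subsection entry, assuming require 'hexapdf'; the entry string is made up, and the block only runs when the entry does not match the expected format.

    require 'hexapdf'
    require 'stringio'

    tokenizer = HexaPDF::Tokenizer.new(StringIO.new("0000000017 00000 n \n".b))
    offset, gen, type = tokenizer.next_xref_entry do |recoverable|
      raise "Invalid xref entry" unless recoverable
    end
    # offset => 17, gen => 0, type => "n"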
#peek_token ⇒ Object
Returns the next token but does not advance the scan pointer.
    # File 'lib/hexapdf/tokenizer.rb', line 163
    def peek_token
      pos = self.pos
      tok = next_token
      self.pos = pos
      tok
    end
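A short sketch of the look-ahead behaviour, assuming require 'hexapdf'; the input is illustrative.

    require 'hexapdf'
    require 'stringio'

    tokenizer = HexaPDF::Tokenizer.new(StringIO.new("123 endstream".b))
    tokenizer.peek_token  # => 123, scan pointer unchanged
    tokenizer.next_token  # => 123, scan pointer now advanced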
#pos ⇒ Object
Returns the current position of the tokenizer inside the IO object.
Note that this position might be different from io.pos since the latter could have been changed somewhere else.
    # File 'lib/hexapdf/tokenizer.rb', line 96
    def pos
      @original_pos + @ss.pos
    end
#pos=(pos) ⇒ Object
Sets the position at which the next token should be read.
Note that this does not set io.pos directly (at the moment of invocation)!
    # File 'lib/hexapdf/tokenizer.rb', line 103
    def pos=(pos)
      if pos >= @original_pos && pos <= @original_pos + @ss.string.size
        @ss.pos = pos - @original_pos
      else
        @original_pos = pos
        @next_read_pos = pos
        @ss.string.clear
        @ss.reset
      end
    end
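A sketch of saving and restoring the read position, assuming require 'hexapdf'; the concrete offsets in the comments are assumptions for this particular input.

    require 'hexapdf'
    require 'stringio'

    tokenizer = HexaPDF::Tokenizer.new(StringIO.new("123 456".b))
    saved = tokenizer.pos    # => 0
    tokenizer.next_token     # => 123
    tokenizer.pos            # => 3 (just after the first token)
    tokenizer.pos = saved
    tokenizer.next_token     # => 123 again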
#scan_until(re) ⇒ Object
Utility method for scanning until the given regular expression matches.
If the end of the file is reached in the process, nil is returned. Otherwise the matched string is returned.
    # File 'lib/hexapdf/tokenizer.rb', line 257
    def scan_until(re)
      until (data = @ss.scan_until(re))
        return nil unless prepare_string_scanner
      end
      data
    end
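A sketch that reuses the WHITESPACE_OR_DELIMITER_RE constant from above to scan up to the next whitespace or delimiter, assuming require 'hexapdf'; the expected return value is an assumption based on StringScanner#scan_until with a lookahead pattern.

    require 'hexapdf'
    require 'stringio'

    tokenizer = HexaPDF::Tokenizer.new(StringIO.new("endstream\nmore".b))
    tokenizer.scan_until(HexaPDF::Tokenizer::WHITESPACE_OR_DELIMITER_RE)  # => "endstream"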
#skip_whitespace ⇒ Object
Skips all whitespace at the current position.
See: PDF2.0 s7.2.2
    # File 'lib/hexapdf/tokenizer.rb', line 248
    def skip_whitespace
      prepare_string_scanner
      prepare_string_scanner while @ss.skip(WHITESPACE_MULTI_RE)
    end