Class: PDF::Reader::Buffer
- Inherits: Object
  - Object
  - PDF::Reader::Buffer
- Defined in: lib/pdf/reader/buffer.rb
Overview
A string tokeniser that recognises PDF grammar. When passed an IO stream or a string, repeated calls to token() will return the next token from the source.
This is very low level, and getting the raw tokens is not very useful in itself.
This will usually be used in conjunction with PDF::Reader::Parser, which converts the raw tokens into objects we can work with (strings, ints, arrays, etc).
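To make the "repeated calls to token()" contract concrete, here is a minimal sketch of a tokeniser with the same calling pattern. ToyTokeniser is a hypothetical class, not part of pdf-reader, and it only understands whitespace and single-byte delimiters (the byte values are copied from the constants listed on this page). The real Buffer also handles strings, names, comments, dictionaries and streams.

```ruby
require "stringio"

# Byte values copied from TOKEN_WHITESPACE and TOKEN_DELIMITER on this page.
WHITESPACE_BYTES = [0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20]
DELIMITER_BYTES  = [0x25, 0x3C, 0x3E, 0x28, 0x5B, 0x7B, 0x29, 0x5D, 0x7D, 0x2F]

# Hypothetical toy class for illustration only.
class ToyTokeniser
  def initialize(io)
    @io = io
  end

  # Returns the next token as a String, or nil when the source is exhausted.
  def token
    tok = String.new
    while (byte = @io.getbyte)
      if WHITESPACE_BYTES.include?(byte)
        return tok unless tok.empty?   # whitespace ends the current token
      elsif DELIMITER_BYTES.include?(byte)
        return byte.chr if tok.empty?  # a delimiter standing on its own
        @io.ungetbyte(byte)            # leave the delimiter for the next call
        return tok
      else
        tok << byte.chr
      end
    end
    tok.empty? ? nil : tok
  end
end

buffer = ToyTokeniser.new(StringIO.new("[ 1 0.5 true]"))
tokens = []
while (t = buffer.token)
  tokens << t
end
p tokens # => ["[", "1", "0.5", "true", "]"]
```

The loop stops when token returns nil, which is the same end-of-input signal documented for #token below.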
Constant Summary
- TOKEN_WHITESPACE =
: Array
[0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20]
- TOKEN_DELIMITER =
: Array
[0x25, 0x3C, 0x3E, 0x28, 0x5B, 0x7B, 0x29, 0x5D, 0x7D, 0x2F]
- LEFT_PAREN =
Some strings for comparisons. Declaring them here avoids creating new strings that need GC over and over.
"("
- LESS_THAN =
: String
"<"
- STREAM =
: String
"stream"
- ID =
: String
"ID"
- FWD_SLASH =
: String
"/"
- NULL_BYTE =
: String
"\x00"
- CR =
: String
"\r"
- LF =
: String
"\n"
- CRLF =
: String
"\r\n"
- WHITE_SPACE =
: Array
["\n", "\r", ' ']
- TRAILING_BYTECOUNT =
Quite a few PDFs have trailing junk. This can be several kilobytes of nulls in some cases. Allow for this here.
5000
- DIGITS_ONLY =
must match whole tokens
%r{\A\d+\z}
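The "must match whole tokens" note on DIGITS_ONLY comes down to the choice of anchors. The sketch below contrasts it with a line-anchored pattern; LINE_ANCHORED is a hypothetical name used only for comparison and does not exist in the library.

```ruby
DIGITS_ONLY   = %r{\A\d+\z} # copied from the constant summary above
LINE_ANCHORED = /^\d+$/     # hypothetical contrast, not in the library

p DIGITS_ONLY.match?("12")          # => true  (the whole token is digits)
p DIGITS_ONLY.match?("12a")         # => false
p DIGITS_ONLY.match?("12\nobj")     # => false (\A and \z span the whole string)
p LINE_ANCHORED.match?("12\nobj")   # => true  (^ and $ match per line in Ruby)
```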
Instance Attribute Summary
-
#pos ⇒ Object
readonly
: Integer.
Instance Method Summary
-
#empty? ⇒ Boolean
return true if there are no more tokens left.
-
#find_first_xref_offset ⇒ Object
return the byte offset where the first XRef table in the source can be found.
-
#initialize(io, opts = {}) ⇒ Buffer
constructor
Creates a new buffer.
-
#read(bytes, opts = {}) ⇒ Object
return raw bytes from the underlying IO stream.
-
#token ⇒ Object
return the next token from the source.
Constructor Details
#initialize(io, opts = {}) ⇒ Buffer
Creates a new buffer.
Params:
io - an IO stream (usually a StringIO) with the raw data to tokenise
options:
:seek - a byte offset to seek to before starting to tokenise
:content_stream - set to true if buffer will be tokenising a
content stream. Defaults to false
: ((StringIO | Tempfile | IO), ?Hash[Symbol, untyped]) -> void
# File 'lib/pdf/reader/buffer.rb', line 81
def initialize(io, opts = {})
  @io = io
  @tokens = [] #: Array[String | PDF::Reader::Reference]
  @in_content_stream = opts[:content_stream] #: bool
  @io.seek(opts[:seek]) if opts[:seek]
  @pos = @io.pos #: Integer
end
Instance Attribute Details
#pos ⇒ Object (readonly)
: Integer
# File 'lib/pdf/reader/buffer.rb', line 66
def pos
  @pos
end
Instance Method Details
#empty? ⇒ Boolean
return true if there are no more tokens left
: () -> bool
# File 'lib/pdf/reader/buffer.rb', line 93
def empty?
  prepare_tokens if @tokens.size < 3

  @tokens.empty?
end
#find_first_xref_offset ⇒ Object
return the byte offset where the first XRef table in the source can be found.
: () -> Integer
# File 'lib/pdf/reader/buffer.rb', line 150
def find_first_xref_offset
  check_size_is_non_zero
  @io.seek(-TRAILING_BYTECOUNT, IO::SEEK_END) rescue @io.seek(0)
  data = @io.read(TRAILING_BYTECOUNT)

  raise MalformedPDFError, "PDF does not contain EOF marker" if data.nil?

  # the PDF 1.7 spec (section #3.4) says that EOL markers can be either \r, \n, or both.
  lines = data.split(/[\n\r]+/).reverse
  eof_index = lines.index { |l| l.strip[/^%%EOF/] }

  raise MalformedPDFError, "PDF does not contain EOF marker" if eof_index.nil?
  raise MalformedPDFError, "PDF EOF marker does not follow offset" if eof_index >= lines.size-1
  offset = lines[eof_index+1].to_i

  # a byte offset < 0 doesn't make much sense. This is unlikely to happen, but in theory some
  # corrupted PDFs might have a line that looks like a negative int preceding the `%%EOF`
  raise MalformedPDFError, "invalid xref offset" if offset < 0
  offset
end
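The search can be tried without the library. The following is a standalone sketch of the same steps against a StringIO. first_xref_offset is a hypothetical helper name, ArgumentError stands in for MalformedPDFError, and the sample tail (including its trailing nulls) is made up for illustration.

```ruby
require "stringio"

TRAILING_BYTECOUNT = 5000 # same value as the constant on this page

# Hypothetical helper mirroring the steps of find_first_xref_offset.
def first_xref_offset(io)
  begin
    io.seek(-TRAILING_BYTECOUNT, IO::SEEK_END)
  rescue SystemCallError
    io.seek(0) # the source is shorter than TRAILING_BYTECOUNT
  end
  data = io.read(TRAILING_BYTECOUNT)
  raise ArgumentError, "no EOF marker" if data.nil?

  # Walk the lines from the end; the offset is the line just before %%EOF.
  lines = data.split(/[\n\r]+/).reverse
  eof_index = lines.index { |l| l.strip[/^%%EOF/] }
  raise ArgumentError, "no EOF marker" if eof_index.nil?
  raise ArgumentError, "EOF marker does not follow offset" if eof_index >= lines.size - 1

  offset = lines[eof_index + 1].to_i
  raise ArgumentError, "invalid xref offset" if offset < 0
  offset
end

# Made-up file tail with mixed EOL styles and trailing null bytes.
tail = "trailer\n<< /Size 4 >>\nstartxref\r\n1234\r\n%%EOF\n" + ("\x00" * 20)
p first_xref_offset(StringIO.new(tail)) # => 1234
```

Reading only the last TRAILING_BYTECOUNT bytes keeps the search cheap on large files while still tolerating junk after the %%EOF marker.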
#read(bytes, opts = {}) ⇒ Object
return raw bytes from the underlying IO stream.
bytes - the number of bytes to read
options:
:skip_eol - if true, the IO stream is advanced past a CRLF, CR or LF
that is sitting under the io cursor.
Note:
Skipping a bare CR is not spec-compliant.
This is because the data may start with LF.
However we check for CRLF first, so the ambiguity is avoided.
: (Integer, ?Hash[Symbol, untyped]) -> String?
# File 'lib/pdf/reader/buffer.rb', line 112
def read(bytes, opts = {})
  reset_pos

  if opts[:skip_eol]
    @io.seek(-1, IO::SEEK_CUR)
    str = @io.read(2)
    if str.nil?
      return nil
    elsif str == CRLF # This MUST be done before checking for CR alone
      # do nothing
    elsif str[0, 1] == LF || str[0, 1] == CR # LF or CR alone
      @io.seek(-1, IO::SEEK_CUR)
    else
      @io.seek(-2, IO::SEEK_CUR)
    end
  end

  bytes = @io.read(bytes)
  save_pos
  bytes
end
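The :skip_eol branch is easier to follow on concrete input. This sketch repeats the peek-two-bytes logic as a standalone method. read_after_eol is a hypothetical helper, and it assumes the cursor sits one byte past the first EOL byte, which is what the initial one-byte step back in the listing implies.

```ruby
require "stringio"

CR   = "\r"
LF   = "\n"
CRLF = "\r\n"

# Hypothetical helper: `io` is assumed to sit one byte past a possible EOL byte.
def read_after_eol(io, count)
  io.seek(-1, IO::SEEK_CUR)
  pair = io.read(2)
  return nil if pair.nil?

  if pair == CRLF
    # both EOL bytes are already consumed; this check must come before CR alone
  elsif pair[0, 1] == LF || pair[0, 1] == CR
    io.seek(-1, IO::SEEK_CUR) # only one EOL byte; give the second byte back
  else
    io.seek(-2, IO::SEEK_CUR) # not an EOL; undo the two-byte peek
  end
  io.read(count)
end

["stream\r\nBT", "stream\nBT", "stream\rBT"].each do |data|
  io = StringIO.new(data)
  io.seek(7) # one byte past the first EOL byte after "stream"
  p read_after_eol(io, 2) # => "BT" for all three inputs
end
```

If the CR check ran before the CRLF check, the first input would return "\nB" instead, which is the ambiguity described in the note above.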
#token ⇒ Object
return the next token from the source. Returns a String (or a PDF::Reader::Reference when the token is an indirect reference) if a token is found, nil if there are no tokens left.
: () -> (nil | String | PDF::Reader::Reference)
# File 'lib/pdf/reader/buffer.rb', line 138
def token
  reset_pos
  prepare_tokens if @tokens.size < 3
  merge_indirect_reference
  prepare_tokens if @tokens.size < 3

  @tokens.shift
end
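The body of merge_indirect_reference is not shown on this page, so the following is only an assumption about what such a step might look like: collapsing a leading "id gen R" triple into a single reference object. It would also explain why token keeps at least three tokens queued. Ref and merge_reference! are hypothetical names; Ref stands in for PDF::Reader::Reference.

```ruby
DIGITS_ONLY = %r{\A\d+\z}   # copied from the constant summary
Ref = Struct.new(:id, :gen) # hypothetical stand-in for PDF::Reader::Reference

# Assumed behaviour: replace a leading "<id> <gen> R" triple with one Ref.
def merge_reference!(tokens)
  id, gen, marker = tokens[0, 3]
  return tokens unless marker == "R"
  return tokens unless id.is_a?(String) && gen.is_a?(String)
  return tokens unless DIGITS_ONLY.match?(id) && DIGITS_ONLY.match?(gen)

  tokens.shift(3)
  tokens.unshift(Ref.new(id.to_i, gen.to_i))
end

queue = ["10", "0", "R", "/Type"]
merge_reference!(queue)
p queue.shift # => #<struct Ref id=10, gen=0>
p queue       # => ["/Type"]

p merge_reference!(["10", "0", "obj"]) # => ["10", "0", "obj"] (left untouched)
```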