Class: PDF::Reader::Buffer

Inherits:
Object
Defined in:
lib/pdf/reader/buffer.rb

Overview

A string tokeniser that recognises PDF grammar. When passed an IO stream or a string, repeated calls to token() will return the next token from the source.

This is very low level, and getting the raw tokens is not very useful in itself.

This will usually be used in conjunction with PDF::Reader::Parser, which converts the raw tokens into objects we can work with (strings, ints, arrays, etc.).
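As a rough illustration of the token() contract described above, the sketch below builds a toy tokeniser. ToyBuffer is a hypothetical stand-in written for this example, not PDF::Reader::Buffer itself: it only splits on the PDF whitespace bytes and knows nothing about delimiters, strings, or streams.

```ruby
require "stringio"

# Hypothetical toy class, for illustration only.
class ToyBuffer
  # Same byte values as TOKEN_WHITESPACE in the real class.
  WHITESPACE = [0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20].freeze

  def initialize(io)
    @io = io.is_a?(String) ? StringIO.new(io) : io
  end

  # Return the next whitespace-delimited token, or nil when the input is exhausted.
  def token
    tok = String.new
    while (byte = @io.getbyte)
      if WHITESPACE.include?(byte)
        break unless tok.empty? # a token has ended
      else
        tok << byte.chr
      end
    end
    tok.empty? ? nil : tok
  end
end

buf = ToyBuffer.new("1 0 obj\n42\nendobj")
tokens = []
while (t = buf.token)
  tokens << t
end
```

The loop stops on the first nil, which mirrors how a caller would drain the real buffer.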

Constant Summary

TOKEN_WHITESPACE =
[0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20]
LEFT_PAREN =

Some strings for comparisons. Declaring them here avoids creating new strings that need to be garbage collected over and over.

"("
LESS_THAN =
"<"
STREAM =
"stream"
ID =
"ID"
FWD_SLASH =
"/"

Constructor Details

#initialize(io, opts = {}) ⇒ Buffer

Creates a new buffer.

Params:

io - an IO stream or string with the raw data to tokenise

options:

:seek - a byte offset to seek to before starting to tokenise
:content_stream - set to true if buffer will be tokenising a
                  content stream. Defaults to false


# File 'lib/pdf/reader/buffer.rb', line 63

def initialize(io, opts = {})
  @io = io
  @tokens = []
  @in_content_stream = opts[:content_stream]

  @io.seek(opts[:seek]) if opts[:seek]
  @pos = @io.pos
end
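To make the effect of the :seek option concrete, here is a minimal sketch that performs the same two IO calls as the constructor on a StringIO. The sample data and the offset 9 are assumptions chosen for this example.

```ruby
require "stringio"

io = StringIO.new("%PDF-1.4\n1 0 obj\n")

seek_offset = 9        # hypothetical offset: the byte just after the header line
io.seek(seek_offset)   # what the constructor does when opts[:seek] is given
start_pos = io.pos     # the value the constructor would store in @pos

rest = io.read         # tokenising would start from here
```

Anything before the offset is never seen by the tokeniser, which is how a caller can jump straight to a known object position.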

Instance Attribute Details

#pos ⇒ Object (readonly)

Returns the value of attribute pos.



# File 'lib/pdf/reader/buffer.rb', line 49

def pos
  @pos
end

Instance Method Details

#empty? ⇒ Boolean

Returns true if there are no more tokens left.

Returns:

  • (Boolean)


# File 'lib/pdf/reader/buffer.rb', line 74

def empty?
  prepare_tokens if @tokens.size < 3

  @tokens.empty?
end

#find_first_xref_offset ⇒ Object

Returns the byte offset where the first XRef table in the source can be found.

Raises:

  • (MalformedPDFError) if the %%EOF marker is missing or is not preceded by an offset

# File 'lib/pdf/reader/buffer.rb', line 125

def find_first_xref_offset
  check_size_is_non_zero
  @io.seek(-1024, IO::SEEK_END) rescue @io.seek(0)
  data = @io.read(1024)

  # the PDF 1.7 spec (section #3.4) says that EOL markers can be either \r, \n, or both.
  lines = data.split(/[\n\r]+/).reverse
  eof_index = lines.index { |l| l.strip[/^%%EOF/] }

  raise MalformedPDFError, "PDF does not contain EOF marker" if eof_index.nil?
  raise MalformedPDFError, "PDF EOF marker does not follow offset" if eof_index >= lines.size-1
  lines[eof_index+1].to_i
end
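The lookup above can be tried in isolation. The following sketch reproduces the same steps as a standalone function on a StringIO; first_xref_offset is a hypothetical helper name, ArgumentError stands in for MalformedPDFError, and the trailer text is made-up sample data.

```ruby
require "stringio"

# Hypothetical standalone version of the steps above, for illustration only.
def first_xref_offset(io)
  begin
    io.seek(-1024, IO::SEEK_END)   # look only at the last 1024 bytes
  rescue Errno::EINVAL
    io.seek(0)                     # source is shorter than 1024 bytes
  end
  data = io.read(1024).to_s

  # Split on any run of CR/LF, then scan from the end of the file backwards.
  lines = data.split(/[\n\r]+/).reverse
  eof_index = lines.index { |l| l.strip.start_with?("%%EOF") }

  raise ArgumentError, "no EOF marker" if eof_index.nil?
  raise ArgumentError, "EOF marker does not follow offset" if eof_index >= lines.size - 1

  # The line written just before %%EOF holds the offset.
  lines[eof_index + 1].to_i
end

tail = "trailer\n<< /Size 6 >>\nstartxref\n416\n%%EOF\n"
offset = first_xref_offset(StringIO.new(tail))
```

Because the list of lines is reversed, the entry after the %%EOF marker in the array is the line that precedes it in the file.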

#read(bytes, opts = {}) ⇒ Object

Returns raw bytes from the underlying IO stream.

bytes - the number of bytes to read

options:

:skip_eol - if true, the IO stream is advanced past a CRLF or LF that
            is sitting under the io cursor.


# File 'lib/pdf/reader/buffer.rb', line 89

def read(bytes, opts = {})
  reset_pos

  if opts[:skip_eol]
    @io.seek(-1, IO::SEEK_CUR)
    str = @io.read(2)
    if str.nil?
      return nil
    elsif str == "\r\n"
      # do nothing
    elsif str[0,1] == "\n"
      @io.seek(-1, IO::SEEK_CUR)
    else
      @io.seek(-2, IO::SEEK_CUR)
    end
  end

  bytes = @io.read(bytes)
  save_pos
  bytes
end
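The :skip_eol branch is easier to follow with concrete cursor positions. This sketch copies that branch into a hypothetical helper, skip_eol!, and runs it on two StringIO samples; the starting position of 7 is an assumption that places the cursor one byte past the end of the "stream" keyword.

```ruby
require "stringio"

# Hypothetical helper mirroring the :skip_eol branch above.
def skip_eol!(io)
  io.seek(-1, IO::SEEK_CUR)        # step back onto the byte under the cursor
  str = io.read(2)
  return nil if str.nil?

  if str == "\r\n"
    # both bytes belong to the EOL marker, so stay where we are
  elsif str[0, 1] == "\n"
    io.seek(-1, IO::SEEK_CUR)      # only the first byte was an EOL, give one back
  else
    io.seek(-2, IO::SEEK_CUR)      # no EOL here, undo both bytes just read
  end
  io.pos
end

crlf = StringIO.new("stream\r\nDATA")
crlf.seek(7)                       # cursor sits between "\r" and "\n"
skip_eol!(crlf)
crlf_rest = crlf.read(4)

lf = StringIO.new("stream\nDATA")
lf.seek(7)                         # cursor sits just after "\n"
skip_eol!(lf)
lf_rest = lf.read(4)
```

In both samples the following read starts at the first byte of the data rather than inside the line ending.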

#token ⇒ Object

Returns the next token from the source. Returns a string if a token is found, or nil if there are no tokens left.



# File 'lib/pdf/reader/buffer.rb', line 114

def token
  reset_pos
  prepare_tokens if @tokens.size < 3
  merge_indirect_reference
  prepare_tokens if @tokens.size < 3

  @tokens.shift
end