Class: PDF::Reader::Buffer

Inherits:

Object

Defined in: lib/pdf/reader/buffer.rb

Overview

A string tokeniser that recognises PDF grammar. When passed an IO stream or a string, repeated calls to token() will return the next token from the source.

This is very low level, and getting the raw tokens is not very useful in itself.

This will usually be used in conjunction with PDF::Reader::Parser, which converts the raw tokens into objects we can work with (strings, ints, arrays, etc).
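As a rough illustration of what tokenising on PDF grammar involves, the sketch below labels each byte of a fragment using the byte values listed under TOKEN_WHITESPACE and TOKEN_DELIMITER in the Constant Summary. The helper `classify_byte` is hypothetical and is not part of PDF::Reader; the real tokeniser does considerably more than this.

```ruby
# Byte values copied from the Constant Summary of this page.
TOKEN_WHITESPACE = [0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20]
TOKEN_DELIMITER  = [0x25, 0x3C, 0x3E, 0x28, 0x5B, 0x7B, 0x29, 0x5D, 0x7D, 0x2F]

# Hypothetical helper (not part of PDF::Reader): label a single byte.
def classify_byte(byte)
  if TOKEN_WHITESPACE.include?(byte)
    :whitespace
  elsif TOKEN_DELIMITER.include?(byte)
    :delimiter
  else
    :regular
  end
end

# Label every byte of a short PDF fragment.
labels = "/Type [1]".bytes.map { |b| classify_byte(b) }
```

Whitespace separates tokens and is discarded, while a delimiter ends the current token and begins a new one, which is why the two byte lists are kept apart.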

Constant Summary

TOKEN_WHITESPACE = : Array

[0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20]

TOKEN_DELIMITER = : Array

[0x25, 0x3C, 0x3E, 0x28, 0x5B, 0x7B, 0x29, 0x5D, 0x7D, 0x2F]

LEFT_PAREN = some strings for comparisons. Declaring them here avoids repeatedly creating new strings that need garbage collection.

"("

LESS_THAN = : String

"<"

STREAM = : String

"stream"

ID = : String

"ID"

FWD_SLASH = : String

"/"

NULL_BYTE = : String

"\x00"

CR = : String

"\r"

LF = : String

"\n"

CRLF = : String

"\r\n"

WHITE_SPACE = : Array

["\n", "\r", ' ']

TRAILING_BYTECOUNT = Quite a few PDFs have trailing junk, in some cases several kilobytes of NUL bytes. Allow for this here.

DIGITS_ONLY = must match whole tokens

%r{\A\d+\z}
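The "must match whole tokens" note on DIGITS_ONLY refers to its `\A` and `\z` anchors, which bind to the start and end of the string rather than to line boundaries. A small comparison, where `LINE_ANCHORED` is a hypothetical pattern added only for contrast:

```ruby
# Pattern copied from the constant listing above.
DIGITS_ONLY = %r{\A\d+\z}

# Hypothetical line-anchored pattern for comparison (not part of PDF::Reader).
LINE_ANCHORED = /^\d+$/

whole   = DIGITS_ONLY.match?("42")          # the whole string is digits
mixed   = DIGITS_ONLY.match?("42a")         # trailing letter
newline = DIGITS_ONLY.match?("42\nobj")     # digits only on the first line
loose   = LINE_ANCHORED.match?("42\nobj")   # ^ and $ anchor per line in Ruby
```

In Ruby, `^` and `$` match at every line boundary, so only the `\A...\z` form rejects a token that merely contains a line of digits.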

Instance Attribute Summary

#pos ⇒ Object readonly

: Integer.

Instance Method Summary

#empty? ⇒ Boolean

return true if there are no more tokens left.

#find_first_xref_offset ⇒ Object

return the byte offset where the first XRef table in the source can be found.

#initialize(io, opts = {}) ⇒ Buffer constructor

Creates a new buffer.

#read(bytes, opts = {}) ⇒ Object

return raw bytes from the underlying IO stream.

#token ⇒ Object

return the next token from the source.

Constructor Details

#initialize(io, opts = {}) ⇒ `Buffer`

Creates a new buffer.

Params:

io - an IO stream (usually a StringIO) with the raw data to tokenise

options:

:seek - a byte offset to seek to before starting to tokenise
:content_stream - set to true if buffer will be tokenising a
                  content stream. Defaults to false

: ((StringIO | Tempfile | IO), ?Hash[Symbol, untyped]) -> void

# File 'lib/pdf/reader/buffer.rb', line 81

def initialize(io, opts = {})
  @io = io
  @tokens = [] #: Array[String | PDF::Reader::Reference]
  @in_content_stream = opts[:content_stream] #: bool

  @io.seek(opts[:seek]) if opts[:seek]
  @pos = @io.pos #: Integer
end

Instance Attribute Details

#pos ⇒ `Object` (readonly)

: Integer

# File 'lib/pdf/reader/buffer.rb', line 66

def pos
  @pos
end

Instance Method Details

#empty? ⇒ `Boolean`

return true if there are no more tokens left

: () -> bool

Returns:

(Boolean)

# File 'lib/pdf/reader/buffer.rb', line 93

def empty?
  prepare_tokens if @tokens.size < 3

  @tokens.empty?
end

#find_first_xref_offset ⇒ `Object`

return the byte offset where the first XRef table in the source can be found.

: () -> Integer

Raises:

(MalformedPDFError)

# File 'lib/pdf/reader/buffer.rb', line 150

def find_first_xref_offset
  check_size_is_non_zero
  @io.seek(-TRAILING_BYTECOUNT, IO::SEEK_END) rescue @io.seek(0)
  data = @io.read(TRAILING_BYTECOUNT)

  raise MalformedPDFError, "PDF does not contain EOF marker" if data.nil?

  # the PDF 1.7 spec (section #3.4) says that EOL markers can be either \r, \n, or both.
  lines = data.split(/[\n\r]+/).reverse
  eof_index = lines.index { |l| l.strip[/^%%EOF/] }

  raise MalformedPDFError, "PDF does not contain EOF marker" if eof_index.nil?
  raise MalformedPDFError, "PDF EOF marker does not follow offset" if eof_index >= lines.size-1
  offset = lines[eof_index+1].to_i

  # a byte offset < 0 doesn't make much sense. This is unlikely to happen, but in theory some
  # corrupted PDFs might have a line that looks like a negative int preceding the `%%EOF`
  raise MalformedPDFError, "invalid xref offset" if offset < 0
  offset
end
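The search in the listing above can be tried on a plain string. The following is a minimal sketch, not the library's method: `xref_offset_from_tail` is a hypothetical stand-alone helper that takes the tail of a file as a String and returns nil where the real method raises MalformedPDFError.

```ruby
# Hypothetical stand-alone version of the search shown above.
# Returns the xref byte offset, or nil when no usable offset is found.
def xref_offset_from_tail(tail)
  # EOL markers may be \r, \n, or both, so split on any run of them.
  lines = tail.split(/[\n\r]+/).reverse
  eof_index = lines.index { |line| line.strip.start_with?("%%EOF") }
  return nil if eof_index.nil? || eof_index >= lines.size - 1

  # In the reversed list, the entry after %%EOF is the line that precedes it.
  offset = lines[eof_index + 1].to_i
  offset.negative? ? nil : offset
end

tail = "trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"
offset = xref_offset_from_tail(tail)
```

Reversing the lines means the search starts from the end of the data, so trailing junk after the `%%EOF` marker is skipped rather than mistaken for the offset.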

#read(bytes, opts = {}) ⇒ `Object`

return raw bytes from the underlying IO stream.

bytes - the number of bytes to read

options:

:skip_eol - if true, the IO stream is advanced past a CRLF, CR or LF
            that is sitting under the io cursor.

Note: Skipping a bare CR is not spec-compliant, because the data may start with LF. However, we check for CRLF first, so the ambiguity is avoided.

: (Integer, ?Hash[Symbol, untyped]) -> String?

# File 'lib/pdf/reader/buffer.rb', line 112

def read(bytes, opts = {})
  reset_pos

  if opts[:skip_eol]
    @io.seek(-1, IO::SEEK_CUR)
    str = @io.read(2)
    if str.nil?
      return nil
    elsif str == CRLF # This MUST be done before checking for CR alone
      # do nothing
    elsif str[0, 1] == LF || str[0, 1] == CR # LF or CR alone
      @io.seek(-1, IO::SEEK_CUR)
    else
      @io.seek(-2, IO::SEEK_CUR)
    end
  end

  bytes = @io.read(bytes)
  save_pos
  bytes
end
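To see the :skip_eol branch in isolation, here is a sketch that replays the same seek and read steps on a StringIO. The helpers `read_after_eol` and `stream_body` are hypothetical, and the sketch assumes the cursor starts one byte past the first EOL byte, which is how the initial `seek(-1, IO::SEEK_CUR)` in the listing reads.

```ruby
require "stringio"

CRLF = "\r\n"
CR   = "\r"
LF   = "\n"

# Hypothetical helper mirroring the :skip_eol branch of the listing above.
def read_after_eol(io, count)
  io.seek(-1, IO::SEEK_CUR)   # step back onto the first EOL byte
  pair = io.read(2)
  return nil if pair.nil?

  if pair == CRLF
    # both EOL bytes are consumed, nothing to undo
  elsif pair[0, 1] == LF || pair[0, 1] == CR
    io.seek(-1, IO::SEEK_CUR) # a lone LF or CR: give back the second byte
  else
    io.seek(-2, IO::SEEK_CUR) # no EOL here: give back both bytes
  end
  io.read(count)
end

# Hypothetical driver: place the cursor one byte past the first EOL byte.
def stream_body(source, count)
  io = StringIO.new(source)
  io.seek(source.index(/[\r\n]/) + 1)
  read_after_eol(io, count)
end

after_crlf = stream_body("stream\r\nDATA", 4)
after_lf   = stream_body("stream\nDATA", 4)
after_cr   = stream_body("stream\rDATA", 4)
```

Checking for CRLF before the single-byte cases is what keeps a `\r\n` pair from being treated as a lone CR followed by data that begins with LF.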

#token ⇒ `Object`

return the next token from the source. Returns a String (or a PDF::Reader::Reference, per the type signature below) if a token is found, nil if there are no tokens left.

: () -> (nil | String | PDF::Reader::Reference)

# File 'lib/pdf/reader/buffer.rb', line 138

def token
  reset_pos
  prepare_tokens if @tokens.size < 3
  merge_indirect_reference
  prepare_tokens if @tokens.size < 3

  @tokens.shift
end
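The body of `merge_indirect_reference` is not shown on this page. Going by the return type, which includes PDF::Reader::Reference, and by the PDF syntax for indirect references (`<id> <gen> R`), one plausible reading is that three queued tokens are collapsed into a single reference object, which would explain why at least three tokens are kept buffered. The sketch below illustrates that idea under those assumptions; `Ref`, `WHOLE_NUMBER`, `merge_leading_reference` and `drain` are all hypothetical names.

```ruby
# Hypothetical stand-in for PDF::Reader::Reference.
Ref = Struct.new(:id, :gen)

# Same pattern as DIGITS_ONLY, under a different name to keep this self-contained.
WHOLE_NUMBER = /\A\d+\z/

# Collapse a leading "<id> <gen> R" triple into a single Ref, in place.
def merge_leading_reference(tokens)
  id, gen, marker = tokens[0], tokens[1], tokens[2]
  return tokens unless marker == "R"
  return tokens unless id.is_a?(String) && gen.is_a?(String)
  return tokens unless id.match?(WHOLE_NUMBER) && gen.match?(WHOLE_NUMBER)

  tokens.shift(3)
  tokens.unshift(Ref.new(id.to_i, gen.to_i))
end

# Drain a token queue, merging before each shift.
def drain(tokens)
  out = []
  until tokens.empty?
    merge_leading_reference(tokens)
    out << tokens.shift
  end
  out
end

result = drain(["/Parent", "3", "0", "R", "/Count", "2"])
```

The merge has to look two tokens ahead before the first one can be returned, since a bare `3` cannot be told apart from the start of `3 0 R` until the `R` is seen.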