Class: PDF::Reader::Page

Inherits: Object

Defined in: lib/pdf/reader/page.rb

Overview

A high-level representation of a single PDF page. It ties together the various low-level classes in PDF::Reader and provides access to the components of the page (text, images, fonts, etc.) in convenient formats.

If you require access to the raw PDF objects for this page, you can reach the Page dictionary via the page_object accessor. To walk the page dictionary in any useful way, you will also need the objects accessor.

Instance Attribute Summary

#objects ⇒ Object readonly

Low-level hash-like access to all objects in the underlying PDF.

#page_object ⇒ Object readonly

The raw PDF object that defines this page.

Instance Method Summary

#attributes ⇒ Object

Returns the attributes that accompany this page.

#fonts ⇒ Object

Returns a hash of fonts used on this page.

#initialize(objects, pagenum) ⇒ Page constructor

Creates a new page wrapper.

#inspect ⇒ Object

Returns a friendly string representation of this page.

#number ⇒ Object

Returns the number of this page within the full document.

#raw_content ⇒ Object

Returns the raw content stream for this page.

#resources ⇒ Object

Returns the resources that accompany this page.

#text ⇒ Object (also: #to_s)

Returns the plain text content of this page encoded as UTF-8.

#walk(*receivers) ⇒ Object

Processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.

#xobjects ⇒ Object

Returns the XObjects that are available to this page.

Constructor Details

#initialize(objects, pagenum) ⇒ `Page`

Creates a new page wrapper.

objects - an ObjectHash instance that wraps a PDF file
pagenum - an integer specifying the page number to expose (1-indexed)

# File 'lib/pdf/reader/page.rb', line 27

def initialize(objects, pagenum)
  @objects, @pagenum = objects, pagenum
  @page_object = objects.deref(objects.page_references[pagenum - 1])

  unless @page_object.is_a?(::Hash)
    raise ArgumentError, "invalid page: #{pagenum}"
  end
end
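
The 1-indexed lookup in the constructor can be illustrated in isolation. The following is a minimal sketch, not the library's code: `PAGE_REFS` and `page_ref_for` are hypothetical names standing in for `objects.page_references` and the lookup step, and the explicit range check is an addition for illustration. It is there because, in plain Ruby, a bare `refs[pagenum - 1]` with a page number of 0 reads index -1, which is the last element.

```ruby
# Hypothetical stand-in for the list returned by objects.page_references.
PAGE_REFS = [:page_a, :page_b, :page_c].freeze

# Convert a 1-indexed page number into a 0-indexed array lookup.
def page_ref_for(pagenum)
  # Reject anything outside 1..N before indexing; without this check,
  # pagenum 0 would become index -1 and select the last element.
  unless pagenum.is_a?(Integer) && pagenum.between?(1, PAGE_REFS.size)
    raise ArgumentError, "invalid page: #{pagenum}"
  end

  PAGE_REFS[pagenum - 1]
end

FIRST_PAGE = page_ref_for(1)
LAST_PAGE  = page_ref_for(PAGE_REFS.size)
```

Callers that accept page numbers from user input may want a similar check before constructing a Page.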

Instance Attribute Details

#objects ⇒ `Object` (readonly)

Low-level hash-like access to all objects in the underlying PDF.

# File 'lib/pdf/reader/page.rb', line 17

def objects
  @objects
end

#page_object ⇒ `Object` (readonly)

The raw PDF object that defines this page.

# File 'lib/pdf/reader/page.rb', line 20

def page_object
  @page_object
end

Instance Method Details

#attributes ⇒ `Object`

Returns the attributes that accompany this page. Includes attributes inherited from parents.

# File 'lib/pdf/reader/page.rb', line 51

def attributes
  hash = {}
  page_with_ancestors.reverse.each do |obj|
    hash.merge!(@objects.deref(obj))
  end
  hash
end
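
The merge order above is what makes inheritance work: ancestors are merged first, so values set closer to the page overwrite them. Here is a minimal sketch using plain hashes; the node contents and the `merge_attributes` helper are hypothetical, and the input is assumed to be ordered from the page up to the root, as `page_with_ancestors` appears to be.

```ruby
# Merge page-tree nodes root-first so that values nearer the page win.
# `nodes` is ordered from the page itself up to the root.
def merge_attributes(nodes)
  merged = {}
  nodes.reverse.each do |node|
    merged.merge!(node)
  end
  merged
end

# Hypothetical page-tree nodes, already dereferenced into plain hashes.
page   = { Type: :Page,  Rotate: 90 }
parent = { Type: :Pages, MediaBox: [0, 0, 612, 792] }
root   = { Type: :Pages, MediaBox: [0, 0, 595, 842], Rotate: 0 }

ATTRIBUTES = merge_attributes([page, parent, root])
# The page's own Rotate replaces the root's value, and the parent's
# MediaBox replaces the root's, because both are merged later.
```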

#fonts ⇒ `Object`

Returns a hash of fonts used on this page.

The keys are the font labels used within the page content stream.

The values are PDF::Reader::Font instances that provide access to most available metrics for each font.

# File 'lib/pdf/reader/page.rb', line 79

def fonts
  raw_fonts = objects.deref(resources[:Font] || {})
  ::Hash[raw_fonts.map { |label, font|
    [label, PDF::Reader::Font.new(objects, objects.deref(font))]
  }]
end

#inspect ⇒ `Object`

Returns a friendly string representation of this page.

# File 'lib/pdf/reader/page.rb', line 44

def inspect
  "<PDF::Reader::Page page: #{@pagenum}>"
end

#number ⇒ `Object`

Returns the number of this page within the full document.

# File 'lib/pdf/reader/page.rb', line 38

def number
  @pagenum
end

#raw_content ⇒ `Object`

Returns the raw content stream for this page. This is plumbing, nothing to see here unless you’re a PDF nerd like me.

# File 'lib/pdf/reader/page.rb', line 123

def raw_content
  contents = objects.deref(@page_object[:Contents])
  [contents].flatten.compact.map { |obj|
    objects.deref(obj)
  }.map { |obj|
    obj.unfiltered_data
  }.join
end
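
The `[contents].flatten.compact` chain lets the method handle a page whose Contents entry is missing, a single stream, or an array of streams. A minimal sketch of that normalisation follows; `FakeStream` and `join_contents` are hypothetical stand-ins, and the dereferencing step is left out because the sketch has no indirect references.

```ruby
# Hypothetical stand-in for a PDF stream; only unfiltered_data is modelled.
FakeStream = Struct.new(:unfiltered_data)

# Normalise a Contents value (nil, one stream, or an array of streams)
# into one string, using the same wrap/flatten/compact/join chain.
def join_contents(contents)
  [contents].flatten.compact.map(&:unfiltered_data).join
end

NO_CONTENT  = join_contents(nil)
ONE_STREAM  = join_contents(FakeStream.new("BT (Hi) Tj ET"))
TWO_STREAMS = join_contents([FakeStream.new("BT "), FakeStream.new("ET")])
```

Wrapping the value in an array first means a single stream and an array of streams go through the same code path, and `compact` drops the `nil` produced by a missing entry.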

#resources ⇒ `Object`

Returns the resources that accompany this page. Includes resources inherited from parents.




# File 'lib/pdf/reader/page.rb', line 62

def resources
  @resources ||= @objects.deref(attributes[:Resources]) || {}
end

#text ⇒ `Object` Also known as: to_s

Returns the plain text content of this page encoded as UTF-8. Any characters that can’t be translated will be returned as a ▯.

# File 'lib/pdf/reader/page.rb', line 89

def text
  receiver = PageTextReceiver.new
  walk(receiver)
  receiver.content
end

#walk(*receivers) ⇒ `Object`

Processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.

This is mostly low-level and you can probably ignore it unless you need access to something like the raw encoded text. For an example of how this can be used as a basis for higher-level functionality, see the text() method.

This method is intended to provide all the data required to faithfully render the entire page, should someone be motivated enough to try. If you find some required data isn’t available, it’s a bug; let me know.

Many operators that generate callbacks will reference resources stored in the page header, such as images and fonts. To facilitate these operators, the first available callback is page=. If your receiver accepts that callback, it will be passed the current PDF::Reader::Page object. Use the Page#resources method to grab any required resources.

# File 'lib/pdf/reader/page.rb', line 115

def walk(*receivers)
  callback(receivers, :page=, [self])
  content_stream(receivers, raw_content)
end
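
The receiver protocol described above can be sketched without a PDF at hand. In the example below, `OperatorLog` and `dispatch` are hypothetical: `dispatch` stands in for the private callback helper and assumes it only sends a callback to receivers that respond to it, and `show_text` and `move_to` are used purely as illustrative callback names.

```ruby
# Hypothetical receiver: accepts page= plus one content callback.
class OperatorLog
  attr_reader :page, :calls

  def initialize
    @calls = []
  end

  # Delivered first, so later callbacks can look up page resources.
  def page=(page)
    @page = page
  end

  def show_text(string)
    @calls << [:show_text, string]
  end
end

# Assumed dispatch rule: skip receivers that do not define the callback.
def dispatch(receivers, name, params = [])
  receivers.each do |receiver|
    receiver.send(name, *params) if receiver.respond_to?(name)
  end
end

log = OperatorLog.new
dispatch([log], :page=, [:fake_page])    # a real walk would pass the Page
dispatch([log], :show_text, ["Hello"])
dispatch([log], :move_to, [10, 20])      # ignored: OperatorLog has no move_to
```

A receiver written this way only needs to define the callbacks it cares about. With a real page, the equivalent call would be `page.walk(log)`.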

#xobjects ⇒ `Object`

Returns the XObjects that are available to this page.

# File 'lib/pdf/reader/page.rb', line 68

def xobjects
  resources[:XObject] || {}
end
