Class: PDF::Reader::Page

Inherits:
Object
Extended by:
Forwardable
Defined in:
lib/pdf/reader/page.rb

Overview

High-level representation of a single PDF page. Ties together the various low-level classes in PDF::Reader and provides access to the components of the page (text, images, fonts, etc.) in convenient formats.

If you require access to the raw PDF objects for this page, you can access the Page dictionary via the page_object accessor. You will need to use the objects accessor to help walk the page dictionary in any useful way.

Constructor Details

#initialize(objects, pagenum, options = {}) ⇒ Page

Creates a new page wrapper.

  • objects - an ObjectHash instance that wraps a PDF file

  • pagenum - an Integer specifying the page number to expose. 1-indexed.

  • options - an optional Hash. The :cache key supplies a Hash-like object used for caching.

: (PDF::Reader::ObjectHash, Integer, ?Hash[Symbol, untyped]) -> void



# File 'lib/pdf/reader/page.rb', line 50

def initialize(objects, pagenum, options = {})
  @objects = objects
  @pagenum = pagenum
  @page_ref = objects.page_references[pagenum - 1] #: (Reference | Hash[Symbol, untyped])?
  @page_object = objects.deref_hash(@page_ref) || {} #: Hash[Symbol, untyped]
  @cache       = options[:cache] || {} #: PDF::Reader::ObjectCache | Hash[untyped, untyped]
  @attributes = nil #: Hash[Symbol, untyped] | nil
  @root = nil #: Hash[Symbol, untyped] | nil
  @resources = nil #: PDF::Reader::Resources | nil

  if @page_object.empty?
    raise InvalidPageError, "Invalid page: #{pagenum}"
  end
end
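
The 1-indexed lookup and the empty-dictionary check can be sketched without the gem. This is a minimal illustration, not the library's code: `lookup_page` and the `page_objects` array are hypothetical stand-ins for the dereferenced page list, and the error class is redefined locally so the snippet runs on its own.

```ruby
# Local stand-in for PDF::Reader::InvalidPageError (assumption: any
# StandardError subclass is enough to illustrate the behaviour).
class InvalidPageError < StandardError; end

# page_objects is a hypothetical, already-dereferenced list of page
# dictionaries. pagenum is 1-indexed, as in the constructor above.
def lookup_page(page_objects, pagenum)
  page_object = page_objects[pagenum - 1] || {}
  raise InvalidPageError, "Invalid page: #{pagenum}" if page_object.empty?
  page_object
end

pages = [{ Type: :Page, Rotate: 0 }, { Type: :Page, Rotate: 90 }]
second = lookup_page(pages, 2)   # the dictionary with Rotate: 90

begin
  lookup_page(pages, 3)          # past the end of the list
rescue InvalidPageError => e
  puts e.message
end
```

A page number past the end of the document yields an empty dictionary, which is what triggers the error.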

Instance Attribute Details

#cache ⇒ Object (readonly)

A Hash-like object for storing cached data. Generally this is scoped to the current document and is used to avoid repeating expensive operations.

: PDF::Reader::ObjectCache | Hash[untyped, untyped]



# File 'lib/pdf/reader/page.rb', line 33

def cache
  @cache
end

#objects ⇒ Object (readonly)

Low-level hash-like access to all objects in the underlying PDF.

: PDF::Reader::ObjectHash



# File 'lib/pdf/reader/page.rb', line 23

def objects
  @objects
end

#page_object ⇒ Object (readonly)

The raw PDF object that defines this page.

: Hash[Symbol, untyped]



# File 'lib/pdf/reader/page.rb', line 27

def page_object
  @page_object
end

Instance Method Details

#attributes ⇒ Object

Returns the attributes that accompany this page, including attributes inherited from parents.

: () -> Hash[Symbol, untyped]



# File 'lib/pdf/reader/page.rb', line 83

def attributes
  @attributes ||= {}.tap { |hash|
    page_with_ancestors.reverse.each do |obj|
      hash.merge!(@objects.deref_hash(obj) || {})
    end
  }
  # This shouldn't be necessary, but some non-compliant PDFs leave MediaBox
  # out. Assuming 8.5" x 11" is what Acrobat does, so we do it too.
  @attributes[:MediaBox] ||= [0,0,612,792]
  @attributes
end
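
The inheritance merge can be illustrated with plain hashes. This is a sketch under assumptions: `merged_attributes` is a hypothetical helper, and the chain is assumed to be ordered page first and root last (which is why the method above reverses it before merging, so values closer to the page win).

```ruby
# Merge a chain of dictionaries ordered page-first, root-last.
# Reversing means the root is merged first and the page last,
# so the page's own values override inherited ones.
def merged_attributes(chain)
  attrs = {}
  chain.reverse.each { |dict| attrs.merge!(dict) }
  attrs[:MediaBox] ||= [0, 0, 612, 792]   # US Letter fallback, as above
  attrs
end

root  = { Type: :Pages, MediaBox: [0, 0, 595, 842], Rotate: 0 }
inner = { Type: :Pages, Rotate: 90 }
page  = { Type: :Page }

attrs = merged_attributes([page, inner, root])
# attrs[:Rotate] comes from the nearest ancestor that sets it (inner),
# attrs[:MediaBox] is inherited from the root.
```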

#boxes ⇒ Object

Returns the “boxes” that define the page object. Values are defaulted according to section 7.7.3.3 of the PDF Spec 1.7.

DEPRECATED. Use Page#rectangles instead.

: () -> Hash[Symbol, Array]



# File 'lib/pdf/reader/page.rb', line 215

def boxes
  # In ruby 2.4+ we could use Hash#transform_values
  Hash[rectangles.map{ |k,rect| [k,rect.to_a] } ]
end
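
The comment in the method mentions Hash#transform_values as an alternative. The following sketch shows that the two forms produce the same result. `Rect` is a hypothetical Struct standing in for PDF::Reader::Rectangle; only its `to_a` method matters here.

```ruby
# Stand-in for a rectangle type; Struct#to_a returns the member values.
Rect = Struct.new(:x1, :y1, :x2, :y2)

# The form used in the method above.
def boxes_via_map(rects)
  Hash[rects.map { |key, rect| [key, rect.to_a] }]
end

# The equivalent on Ruby 2.4 and later.
def boxes_via_transform(rects)
  rects.transform_values(&:to_a)
end

rects = { MediaBox: Rect.new(0, 0, 612, 792), CropBox: Rect.new(10, 10, 602, 782) }
same = boxes_via_map(rects) == boxes_via_transform(rects)
```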

#height ⇒ Object

: () -> Numeric



# File 'lib/pdf/reader/page.rb', line 96

def height
  rect = Rectangle.new(*attributes[:MediaBox])
  rect.apply_rotation(rotate) if rotate > 0
  rect.height
end

#inspect ⇒ Object

Returns a friendly string representation of this page.

: () -> String



# File 'lib/pdf/reader/page.rb', line 75

def inspect
  "<PDF::Reader::Page page: #{@pagenum}>"
end

#number ⇒ Object

Returns the number of this page within the full document.

: () -> Integer



# File 'lib/pdf/reader/page.rb', line 68

def number
  @pagenum
end

#orientation ⇒ Object

Convenience method to identify the page’s orientation.

: () -> String



# File 'lib/pdf/reader/page.rb', line 120

def orientation
  if height > width
    "portrait"
  else
    "landscape"
  end
end
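
How rotation feeds into the orientation check can be sketched with plain arrays. This assumes that rotating by 90 or 270 degrees swaps the effective width and height, which is what Rectangle#apply_rotation is taken to do here; `rotated_size` and `page_orientation` are hypothetical helper names.

```ruby
# media_box is [x1, y1, x2, y2]; rotate is one of 0, 90, 180, 270.
# Assumption: a quarter-turn swaps the effective width and height.
def rotated_size(media_box, rotate)
  x1, y1, x2, y2 = media_box
  width  = (x2 - x1).abs
  height = (y2 - y1).abs
  [90, 270].include?(rotate) ? [height, width] : [width, height]
end

def page_orientation(media_box, rotate = 0)
  width, height = rotated_size(media_box, rotate)
  height > width ? "portrait" : "landscape"
end

letter = [0, 0, 612, 792]
upright = page_orientation(letter)        # taller than wide
turned  = page_orientation(letter, 90)    # dimensions swapped
square  = page_orientation([0, 0, 500, 500])
```

Because the comparison is a strict `height > width`, a square page falls through to the else branch and is reported as landscape.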

#origin ⇒ Object

: () -> Array



# File 'lib/pdf/reader/page.rb', line 110

def origin
  rect = Rectangle.new(*attributes[:MediaBox])
  rect.apply_rotation(rotate) if rotate > 0

  rect.bottom_left
end

#raw_content ⇒ Object

Returns the raw content stream for this page. This is plumbing, nothing to see here unless you’re a PDF nerd like me.

: () -> String



# File 'lib/pdf/reader/page.rb', line 187

def raw_content
  contents = objects.deref_stream_or_array(@page_object[:Contents])
  [contents].flatten.compact.map { |obj|
    objects.deref_stream(obj)
  }.compact.map { |obj|
    obj.unfiltered_data
  }.join(" ")
end
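
A page's Contents entry may be a single stream or an array of streams, and the method above normalises both into one string. A minimal sketch of that normalisation, using a hypothetical `FakeStream` Struct in place of real stream objects and skipping the dereferencing step:

```ruby
# Stand-in for a stream object; only unfiltered_data is needed.
FakeStream = Struct.new(:unfiltered_data)

# Accepts nil, one stream, or an array of streams (possibly with nils)
# and joins the decoded data with a single space.
def join_contents(contents)
  [contents].flatten.compact.map(&:unfiltered_data).join(" ")
end

single = join_contents(FakeStream.new("BT ET"))
many   = join_contents([FakeStream.new("q"), nil, FakeStream.new("Q")])
empty  = join_contents(nil)
```

Wrapping the value in an array before flattening is what lets the single-stream and multi-stream cases share one code path.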

#rectangles ⇒ Object

Returns the “boxes” that define the page object. Values are defaulted according to section 7.7.3.3 of the PDF Spec 1.7.

: () -> Hash[Symbol, PDF::Reader::Rectangle]



# File 'lib/pdf/reader/page.rb', line 224

def rectangles
  # attributes[:MediaBox] can never be nil, but I have no easy way to tell sorbet that atm
  mediabox = objects.deref_array_of_numbers(attributes[:MediaBox]) || []
  cropbox = objects.deref_array_of_numbers(attributes[:CropBox]) || mediabox
  bleedbox = objects.deref_array_of_numbers(attributes[:BleedBox]) || cropbox
  trimbox = objects.deref_array_of_numbers(attributes[:TrimBox]) || cropbox
  artbox = objects.deref_array_of_numbers(attributes[:ArtBox]) || cropbox

  begin
    mediarect = Rectangle.from_array(mediabox)
    croprect = Rectangle.from_array(cropbox)
    bleedrect = Rectangle.from_array(bleedbox)
    trimrect = Rectangle.from_array(trimbox)
    artrect = Rectangle.from_array(artbox)
  rescue ArgumentError => e
    raise MalformedPDFError, e.message
  end

  if rotate > 0
    mediarect.apply_rotation(rotate)
    croprect.apply_rotation(rotate)
    bleedrect.apply_rotation(rotate)
    trimrect.apply_rotation(rotate)
    artrect.apply_rotation(rotate)
  end

  {
    MediaBox: mediarect,
    CropBox: croprect,
    BleedBox: bleedrect,
    TrimBox: trimrect,
    ArtBox: artrect,
  }
end
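
The defaulting chain in the method above is: CropBox falls back to MediaBox, and BleedBox, TrimBox and ArtBox each fall back to CropBox. A sketch with plain arrays, leaving out dereferencing, Rectangle construction and rotation; `box_defaults` is a hypothetical helper name.

```ruby
# attrs is a Hash of already-dereferenced box arrays keyed by name.
def box_defaults(attrs)
  media = attrs.fetch(:MediaBox)
  crop  = attrs[:CropBox] || media
  {
    MediaBox: media,
    CropBox:  crop,
    BleedBox: attrs[:BleedBox] || crop,
    TrimBox:  attrs[:TrimBox]  || crop,
    ArtBox:   attrs[:ArtBox]   || crop,
  }
end

boxes = box_defaults(MediaBox: [0, 0, 612, 792], CropBox: [10, 10, 602, 782])
# BleedBox, TrimBox and ArtBox all take the CropBox value here.
```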

#rotate ⇒ Object

Returns the angle to rotate the page clockwise. Always 0, 90, 180 or 270.

: () -> Integer



# File 'lib/pdf/reader/page.rb', line 199

def rotate
  value = attributes[:Rotate].to_i
  case value
  when 0, 90, 180, 270
    value
  else
    0
  end
end
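
The normalisation can be tried in isolation. In this sketch `normalize_rotation` is a hypothetical standalone version of the same logic, taking the raw attribute value as an argument.

```ruby
# Coerce the raw /Rotate value to an Integer and accept only the four
# supported angles; anything else is treated as no rotation.
def normalize_rotation(raw)
  value = raw.to_i
  [0, 90, 180, 270].include?(value) ? value : 0
end

normalize_rotation(nil)   # a missing attribute, nil.to_i is 0
normalize_rotation(270)
normalize_rotation(45)    # not a multiple of 90
normalize_rotation(-90)   # negative angles are not mapped to 270
normalize_rotation(360)   # full turns are not reduced modulo 360
```

Note that values such as -90 and 360 are not wrapped into the 0..270 range; they fall back to 0 like any other unsupported value.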

#runs(opts = {}) ⇒ Object

: (?Hash[Symbol, untyped]) -> Array



# File 'lib/pdf/reader/page.rb', line 145

def runs(opts = {})
  receiver = PageTextReceiver.new
  walk(receiver)
  receiver.runs(opts)
end

#text(opts = {}) ⇒ Object Also known as: to_s

Returns the plain text content of this page encoded as UTF-8. Any characters that can’t be translated will be returned as a ▯.

: (?Hash[Symbol, untyped]) -> String



# File 'lib/pdf/reader/page.rb', line 132

def text(opts = {})
  receiver = PageTextReceiver.new
  walk(receiver)
  runs = receiver.runs(opts)

  # rectangles[:MediaBox] can never be nil, but I have no easy way to tell sorbet that atm
  mediabox = rectangles[:MediaBox] || Rectangle.new(0, 0, 0, 0)

  PageLayout.new(runs, mediabox).to_s
end

#walk(*receivers) ⇒ Object

Processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.

This is mostly low-level and you can probably ignore it unless you need access to something like the raw encoded text. For an example of how this can be used as a basis for higher-level functionality, see the text() method.

This method is intended to provide all the data required to faithfully render the entire page, should someone be motivated enough to do so. If you find some required data isn’t available, it’s a bug, so let me know.

Many operators that generate callbacks will reference resources stored in the page header, such as images and fonts. To facilitate these operators, the first available callback is page=. If your receiver accepts that callback it will be passed the current PDF::Reader::Page object. Use the Page#resources method to grab any required resources.

It may help to think of each page as a self-contained program made up of a set of instructions and associated resources. Calling walk() executes the program in the correct order and calls out to your implementation.

: (*untyped) -> untyped



# File 'lib/pdf/reader/page.rb', line 175

def walk(*receivers)
  receivers = receivers.map { |receiver|
    ValidatingReceiver.new(receiver)
  }
  callback(receivers, :page=, [self])
  content_stream(receivers, raw_content)
end
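
The receiver pattern can be sketched without the gem. Everything here is illustrative: `TextCollector` is a hypothetical receiver, `show_text` is assumed as a callback name for the example, and `dispatch` is a simplified stand-in for the library's internal callback machinery (it forwards a callback only to receivers that implement it).

```ruby
class TextCollector
  attr_reader :page, :strings

  def initialize
    @strings = []
  end

  # The first callback sent by walk; receives the current page.
  def page=(page)
    @page = page
  end

  # Assumed callback name, used here only for illustration.
  def show_text(string)
    @strings << string
  end
end

# Simplified stand-in dispatcher: skip receivers that do not
# respond to the callback.
def dispatch(receivers, name, params = [])
  receivers.each do |receiver|
    receiver.public_send(name, *params) if receiver.respond_to?(name)
  end
end

collector = TextCollector.new
dispatch([collector], :page=, [:current_page])   # page arrives first
dispatch([collector], :show_text, ["Hello"])
dispatch([collector], :begin_text_object)        # not implemented, skipped
```

A receiver only needs to define the callbacks it cares about; with the real library, an instance would be passed to walk in place of the manual dispatch calls.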

#width ⇒ Object

: () -> Numeric



# File 'lib/pdf/reader/page.rb', line 103

def width
  rect = Rectangle.new(*attributes[:MediaBox])
  rect.apply_rotation(rotate) if rotate > 0
  rect.width
end