Class: PDF::Reader::Page
- Inherits: Object
- Extended by: Forwardable
- Defined in: lib/pdf/reader/page.rb
Overview
A high-level representation of a single PDF page. It ties together the various low-level classes in PDF::Reader and provides access to the components of the page (text, images, fonts, etc.) in convenient formats.
If you require access to the raw PDF objects for this page, you can access the Page dictionary via the page_object accessor. You will need to use the objects accessor to help walk the page dictionary in any useful way.
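As a rough illustration of the convenience readers listed below, the following sketch builds a one-line summary from any object that responds to #number, #width, #height and #orientation. The `page_summary` helper and the `FakePage` struct are hypothetical names introduced here so the example runs without a PDF file; they are not part of the library.

```ruby
# Builds a one-line summary from any object that responds to the Page
# readers documented on this page (#number, #width, #height, #orientation).
# page_summary is a hypothetical helper, not a library method.
def page_summary(page)
  "page #{page.number}: #{page.width}x#{page.height} (#{page.orientation})"
end

# Hypothetical stand-in so the sketch runs without a PDF file on disk.
# With the real library, pages would typically come from
# PDF::Reader.new(path).pages.
FakePage = Struct.new(:number, :width, :height, :orientation)

puts page_summary(FakePage.new(1, 612, 792, "portrait"))
```

Because the helper only relies on duck typing, a real PDF::Reader::Page should be usable in place of the stand-in.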
Instance Attribute Summary
-
#cache ⇒ Object
readonly
a Hash-like object for storing cached data.
-
#objects ⇒ Object
readonly
low-level hash-like access to all objects in the underlying PDF : PDF::Reader::ObjectHash.
-
#page_object ⇒ Object
readonly
the raw PDF object that defines this page : Hash[Symbol, untyped].
Instance Method Summary
-
#attributes ⇒ Object
Returns the attributes that accompany this page, including attributes inherited from parents.
-
#boxes ⇒ Object
returns the “boxes” that define the page object, as plain arrays. Deprecated in favour of #rectangles.
-
#height ⇒ Object
: () -> Numeric.
-
#initialize(objects, pagenum, options = {}) ⇒ Page
constructor
creates a new page wrapper.
-
#inspect ⇒ Object
return a friendly string representation of this page.
-
#number ⇒ Object
return the number of this page within the full document.
-
#orientation ⇒ Object
Convenience method to identify the page’s orientation.
-
#origin ⇒ Object
: () -> Array.
-
#raw_content ⇒ Object
returns the raw content stream for this page.
-
#rectangles ⇒ Object
returns the “boxes” that define the page object.
-
#rotate ⇒ Object
returns the angle to rotate the page clockwise.
-
#runs(opts = {}) ⇒ Object
: (?Hash[Symbol, untyped]) -> Array.
-
#text(opts = {}) ⇒ Object
(also: #to_s)
returns the plain text content of this page encoded as UTF-8.
-
#walk(*receivers) ⇒ Object
processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.
-
#width ⇒ Object
: () -> Numeric.
Constructor Details
#initialize(objects, pagenum, options = {}) ⇒ Page
creates a new page wrapper.
- objects: an ObjectHash instance that wraps a PDF file
- pagenum: an Integer specifying the page number to expose (1-indexed)
: (PDF::Reader::ObjectHash, Integer, ?Hash[Symbol, untyped]) -> void
# File 'lib/pdf/reader/page.rb', line 50
def initialize(objects, pagenum, options = {})
  @objects = objects
  @pagenum = pagenum
  @page_ref = objects.page_references[pagenum - 1] #: (Reference | Hash[Symbol, untyped])?
  @page_object = objects.deref_hash(@page_ref) || {} #: Hash[Symbol, untyped]
  @cache = options[:cache] || {} #: PDF::Reader::ObjectCache | Hash[untyped, untyped]
  @attributes = nil #: Hash[Symbol, untyped] | nil
  @root = nil #: Hash[Symbol, untyped] | nil
  @resources = nil #: PDF::Reader::Resources | nil

  if @page_object.empty?
    raise InvalidPageError, "Invalid page: #{pagenum}"
  end
end
Instance Attribute Details
#cache ⇒ Object (readonly)
a Hash-like object for storing cached data. Generally this is scoped to the current document and is used to avoid repeating expensive operations.
: PDF::Reader::ObjectCache | Hash[untyped, untyped]
# File 'lib/pdf/reader/page.rb', line 33
def cache
  @cache
end
#objects ⇒ Object (readonly)
low-level hash-like access to all objects in the underlying PDF.
: PDF::Reader::ObjectHash
# File 'lib/pdf/reader/page.rb', line 23
def objects
  @objects
end
#page_object ⇒ Object (readonly)
the raw PDF object that defines this page.
: Hash[Symbol, untyped]
# File 'lib/pdf/reader/page.rb', line 27
def page_object
  @page_object
end
Instance Method Details
#attributes ⇒ Object
Returns the attributes that accompany this page, including attributes inherited from parents.
: () -> Hash[Symbol, untyped]
# File 'lib/pdf/reader/page.rb', line 83
def attributes
  @attributes ||= {}.tap { |hash|
    page_with_ancestors.reverse.each do |obj|
      hash.merge!(@objects.deref_hash(obj) || {})
    end
  }
  # This shouldn't be necessary, but some non-compliant PDFs leave MediaBox
  # out. Assuming 8.5" x 11" is what Acrobat does, so we do it too.
  @attributes[:MediaBox] ||= [0,0,612,792]
  @attributes
end
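The merge order in #attributes is what makes inheritance work: ancestors are merged first, so values set closer to the page win. Below is a minimal sketch of that idea using plain hashes. It assumes the chain is ordered from the page up to the root (which the `.reverse` call suggests); `merge_inherited` and the sample dictionaries are hypothetical.

```ruby
# Merges a chain of dictionaries ordered from the page up to the root of the
# page tree (an assumption about page_with_ancestors). Reversing first means
# ancestors are merged first, so values nearer the page overwrite them.
def merge_inherited(page_with_ancestors)
  merged = {}
  page_with_ancestors.reverse.each { |dict| merged.merge!(dict) }
  # Same fallback as #attributes: 8.5" x 11" at 72 points per inch.
  merged[:MediaBox] ||= [0, 0, 612, 792]
  merged
end

# Hypothetical sample dictionaries, not taken from a real PDF.
page   = { Type: :Page, Rotate: 90 }
parent = { Type: :Pages, Rotate: 0, Resources: { Font: {} } }
root   = { Type: :Pages }

p merge_inherited([page, parent, root])
```

The page's own :Rotate overrides the parent's, :Resources is inherited from the parent, and :MediaBox falls back to the default because no dictionary in the chain defines it.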
#boxes ⇒ Object
Returns the “boxes” that define the page object. Values default according to section 7.7.3.3 of the PDF Spec 1.7.
DEPRECATED. Use Page#rectangles instead.
: () -> Hash[Symbol, Array]
# File 'lib/pdf/reader/page.rb', line 215
def boxes
  # In ruby 2.4+ we could use Hash#transform_values
  Hash[rectangles.map{ |k,rect| [k,rect.to_a] } ]
end
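The comment in #boxes mentions Hash#transform_values as an alternative. The sketch below compares the two forms on sample data. `Rect` is a hypothetical stand-in for PDF::Reader::Rectangle (only #to_a matters here), and both helper names are made up for illustration.

```ruby
# Hypothetical stand-in for PDF::Reader::Rectangle; Struct provides #to_a.
Rect = Struct.new(:x1, :y1, :x2, :y2)

SAMPLE_RECTS = {
  MediaBox: Rect.new(0, 0, 612, 792),
  CropBox: Rect.new(10, 10, 602, 782),
}

# The form used in the listing above.
def boxes_via_map(rects)
  Hash[rects.map { |key, rect| [key, rect.to_a] }]
end

# The Ruby 2.4+ alternative mentioned in the source comment.
def boxes_via_transform(rects)
  rects.transform_values(&:to_a)
end

p boxes_via_map(SAMPLE_RECTS) == boxes_via_transform(SAMPLE_RECTS)
```

Both forms keep the keys and replace each rectangle with its array form, so they should be interchangeable on Ruby versions that support Hash#transform_values.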
#height ⇒ Object
: () -> Numeric
# File 'lib/pdf/reader/page.rb', line 96
def height
  rect = Rectangle.new(*attributes[:MediaBox])
  rect.apply_rotation(rotate) if rotate > 0
  rect.height
end
#inspect ⇒ Object
return a friendly string representation of this page
: () -> String
# File 'lib/pdf/reader/page.rb', line 75
def inspect
  "<PDF::Reader::Page page: #{@pagenum}>"
end
#number ⇒ Object
return the number of this page within the full document
: () -> Integer
# File 'lib/pdf/reader/page.rb', line 68
def number
  @pagenum
end
#orientation ⇒ Object
Convenience method to identify the page’s orientation.
: () -> String
# File 'lib/pdf/reader/page.rb', line 120
def orientation
  if height > width
    "portrait"
  else
    "landscape"
  end
end
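One consequence of the comparison in #orientation is that a square page is reported as "landscape", since only a strictly greater height counts as portrait. A small standalone sketch of the same comparison (`orientation_for` is a hypothetical helper, not a library method):

```ruby
# Mirrors the comparison in #orientation: only height > width is portrait,
# so equal dimensions fall through to "landscape".
def orientation_for(width, height)
  height > width ? "portrait" : "landscape"
end

puts orientation_for(612, 792)
puts orientation_for(792, 612)
puts orientation_for(500, 500)
```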
#origin ⇒ Object
: () -> Array
# File 'lib/pdf/reader/page.rb', line 110
def origin
  rect = Rectangle.new(*attributes[:MediaBox])
  rect.apply_rotation(rotate) if rotate > 0

  rect.bottom_left
end
#raw_content ⇒ Object
returns the raw content stream for this page. This is plumbing, nothing to see here unless you’re a PDF nerd like me.
: () -> String
# File 'lib/pdf/reader/page.rb', line 187
def raw_content
  contents = objects.deref_stream_or_array(@page_object[:Contents])
  [contents].flatten.compact.map { |obj|
    objects.deref_stream(obj)
  }.compact.map { |obj|
    obj.unfiltered_data
  }.join(" ")
end
#rectangles ⇒ Object
Returns the “boxes” that define the page object. Values default according to section 7.7.3.3 of the PDF Spec 1.7.
: () -> Hash[Symbol, PDF::Reader::Rectangle]
# File 'lib/pdf/reader/page.rb', line 224
def rectangles
  # attributes[:MediaBox] can never be nil, but I have no easy way to tell sorbet that atm
  mediabox = objects.deref_array_of_numbers(attributes[:MediaBox]) || []
  cropbox = objects.deref_array_of_numbers(attributes[:CropBox]) || mediabox
  bleedbox = objects.deref_array_of_numbers(attributes[:BleedBox]) || cropbox
  trimbox = objects.deref_array_of_numbers(attributes[:TrimBox]) || cropbox
  artbox = objects.deref_array_of_numbers(attributes[:ArtBox]) || cropbox

  begin
    mediarect = Rectangle.from_array(mediabox)
    croprect = Rectangle.from_array(cropbox)
    bleedrect = Rectangle.from_array(bleedbox)
    trimrect = Rectangle.from_array(trimbox)
    artrect = Rectangle.from_array(artbox)
  rescue ArgumentError => e
    raise MalformedPDFError, e.message
  end

  if rotate > 0
    mediarect.apply_rotation(rotate)
    croprect.apply_rotation(rotate)
    bleedrect.apply_rotation(rotate)
    trimrect.apply_rotation(rotate)
    artrect.apply_rotation(rotate)
  end

  {
    MediaBox: mediarect,
    CropBox: croprect,
    BleedBox: bleedrect,
    TrimBox: trimrect,
    ArtBox: artrect,
  }
end
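The defaulting chain in #rectangles can be easier to follow in isolation: CropBox falls back to MediaBox, and the bleed, trim and art boxes each fall back to CropBox. The sketch below reproduces only that chain with plain arrays. It skips dereferencing, Rectangle construction and rotation, and `default_boxes` is a hypothetical name.

```ruby
# Reproduces only the fallback chain from #rectangles, using raw arrays
# instead of Rectangle objects. Not the library's implementation.
def default_boxes(attrs)
  media = attrs[:MediaBox] || []
  crop  = attrs[:CropBox] || media
  {
    MediaBox: media,
    CropBox: crop,
    BleedBox: attrs[:BleedBox] || crop,
    TrimBox: attrs[:TrimBox] || crop,
    ArtBox: attrs[:ArtBox] || crop,
  }
end

# Only MediaBox and TrimBox are given; the rest are filled in by fallback.
p default_boxes(MediaBox: [0, 0, 612, 792], TrimBox: [18, 18, 594, 774])
```

Note that the bleed, trim and art boxes default to the crop box rather than the media box, so a page that sets only CropBox gets that value for all three.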
#rotate ⇒ Object
returns the angle to rotate the page clockwise. Always 0, 90, 180 or 270
: () -> Integer
# File 'lib/pdf/reader/page.rb', line 199
def rotate
  value = attributes[:Rotate].to_i
  case value
  when 0, 90, 180, 270
    value
  else
    0
  end
end
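The listing shows that any value other than 0, 90, 180 or 270 is treated as 0, including a missing :Rotate entry (nil.to_i is 0), negative angles, and multiples such as 360. A standalone sketch of the same normalisation, with `normalize_rotation` as a hypothetical helper name:

```ruby
# Same rule as #rotate: coerce to an Integer, then accept only the four
# supported angles and treat everything else as 0.
def normalize_rotation(raw)
  value = raw.to_i
  [0, 90, 180, 270].include?(value) ? value : 0
end

[nil, 90, "270", 45, 360, -90].each do |raw|
  puts "#{raw.inspect} -> #{normalize_rotation(raw)}"
end
```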
#runs(opts = {}) ⇒ Object
: (?Hash[Symbol, untyped]) -> Array
# File 'lib/pdf/reader/page.rb', line 145
def runs(opts = {})
  receiver = PageTextReceiver.new
  walk(receiver)
  receiver.runs(opts)
end
#text(opts = {}) ⇒ Object Also known as: to_s
returns the plain text content of this page encoded as UTF-8. Any characters that can’t be translated will be returned as a ▯
: (?Hash[Symbol, untyped]) -> String
# File 'lib/pdf/reader/page.rb', line 132
def text(opts = {})
  receiver = PageTextReceiver.new
  walk(receiver)
  runs = receiver.runs(opts)

  # rectangles[:MediaBox] can never be nil, but I have no easy way to tell sorbet that atm
  mediabox = rectangles[:MediaBox] || Rectangle.new(0, 0, 0, 0)

  PageLayout.new(runs, mediabox).to_s
end
#walk(*receivers) ⇒ Object
processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.
This is mostly low level and you can probably ignore it unless you need access to something like the raw encoded text. For an example of how this can be used as a basis for higher level functionality, see the text() method.
If someone was motivated enough, this method is intended to provide all the data required to faithfully render the entire page. If you find some required data isn’t available it’s a bug - let me know.
Many operators that generate callbacks will reference resources stored in the page header - think images, fonts, etc. To facilitate these operators, the first available callback is page=. If your receiver accepts that callback it will be passed the current PDF::Reader::Page object. Use the Page#resources method to grab any required resources.
It may help to think of each page as a self contained program made up of a set of instructions and associated resources. Calling walk() executes the program in the correct order and calls out to your implementation.
: (*untyped) -> untyped
# File 'lib/pdf/reader/page.rb', line 175
def walk(*receivers)
  receivers = receivers.map { |receiver|
    ValidatingReceiver.new(receiver)
  }
  callback(receivers, :page=, [self])
  content_stream(receivers, raw_content)
end
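To make the receiver model more concrete, here is a sketch of a receiver that accepts the page= callback and collects strings from a text-showing callback. Several things are assumptions: the callback name `show_text`, the `TextCollector` class, and the `dispatch` helper, which is a simplified stand-in for how callbacks might be delivered and is not the library's implementation.

```ruby
# Hypothetical receiver: remembers the page it was given and every string
# passed to a show_text callback (callback name assumed for illustration).
class TextCollector
  attr_reader :page, :strings

  def initialize
    @strings = []
  end

  # First callback described in the text above; receives the current page.
  def page=(page)
    @page = page
  end

  def show_text(string)
    @strings << string
  end
end

# Simplified stand-in for callback delivery: send each callback only to
# receivers that implement it. An assumption, not the library's code.
def dispatch(receivers, name, args)
  receivers.each do |receiver|
    receiver.public_send(name, *args) if receiver.respond_to?(name)
  end
end

collector = TextCollector.new
dispatch([collector], :page=, [:fake_page])      # symbol stands in for a Page
dispatch([collector], :show_text, ["Hello"])
dispatch([collector], :begin_text_object, [])    # not implemented, so skipped
p collector.strings
```

With the real library the receiver would be passed to the page directly, along the lines of `page.walk(collector)`, and the page would drive the callbacks from its content stream.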