Class: PDF::Reader::Page
- Inherits: Object
- Hierarchy: Object > PDF::Reader::Page
- Defined in: lib/pdf/reader/page.rb
Overview
High-level representation of a single PDF page. Ties together the various low-level classes in PDF::Reader and provides access to the various components of the page (text, images, fonts, etc.) in convenient formats.
If you require access to the raw PDF objects for this page, you can access the Page dictionary via the page_object accessor. You will need to use the objects accessor to help walk the page dictionary in any useful way.
Instance Attribute Summary
- #objects ⇒ Object (readonly)
  low-level hash-like access to all objects in the underlying PDF.
- #page_object ⇒ Object (readonly)
  the raw PDF object that defines this page.
Instance Method Summary
- #attributes ⇒ Object
  Returns the attributes that accompany this page.
- #fonts ⇒ Object
  return a hash of fonts used on this page.
- #initialize(objects, pagenum) ⇒ Page (constructor)
  creates a new page wrapper.
- #inspect ⇒ Object
  return a friendly string representation of this page.
- #number ⇒ Object
  return the number of this page within the full document.
- #raw_content ⇒ Object
  returns the raw content stream for this page.
- #resources ⇒ Object
  Returns the resources that accompany this page.
- #text ⇒ Object (also: #to_s)
  returns the plain text content of this page encoded as UTF-8.
- #walk(*receivers) ⇒ Object
  processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.
- #xobjects ⇒ Object
  Returns the XObjects that are available to this page.
Constructor Details
#initialize(objects, pagenum) ⇒ Page
creates a new page wrapper.
-
objects - an ObjectHash instance that wraps a PDF file
-
pagenum - an int specifying the page number to expose. 1 indexed.
    # File 'lib/pdf/reader/page.rb', line 27
    def initialize(objects, pagenum)
      @objects, @pagenum = objects, pagenum
      @page_object = objects.deref(objects.page_references[pagenum - 1])

      unless @page_object.is_a?(::Hash)
        raise ArgumentError, "invalid page: #{pagenum}"
      end
    end
Instance Attribute Details
#objects ⇒ Object (readonly)
low-level hash-like access to all objects in the underlying PDF.
    # File 'lib/pdf/reader/page.rb', line 17
    def objects
      @objects
    end
#page_object ⇒ Object (readonly)
the raw PDF object that defines this page
    # File 'lib/pdf/reader/page.rb', line 20
    def page_object
      @page_object
    end
Instance Method Details
#attributes ⇒ Object
Returns the attributes that accompany this page. Includes attributes inherited from parents.
    # File 'lib/pdf/reader/page.rb', line 51
    def attributes
      hash = {}
      page_with_ancestors.reverse.each do |obj|
        hash.merge!(@objects.deref(obj))
      end
      hash
    end
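The merge order in the method above determines which value wins when a page and one of its ancestors define the same key. A minimal sketch of that behaviour follows, using plain hashes in place of dereferenced PDF dictionaries. The helper name `merge_attributes` and the sample data are hypothetical, and the sketch assumes the ancestor list is ordered page first, as the name `page_with_ancestors` suggests.

```ruby
# Sketch only: plain hashes stand in for dereferenced PDF dictionaries.
# `merge_attributes` is a hypothetical helper, not part of PDF::Reader.
def merge_attributes(page_with_ancestors)
  hash = {}
  # Assumed order is page first, so reverse the list to apply the root
  # first and let the page's own entries override inherited ones.
  page_with_ancestors.reverse.each do |obj|
    hash.merge!(obj)
  end
  hash
end

# Hypothetical sample data: a root node and a page that overrides :Rotate.
root = { :Type => :Pages, :MediaBox => [0, 0, 612, 792], :Rotate => 0 }
page = { :Type => :Page, :Rotate => 90 }

merged = merge_attributes([page, root])
```

Here `merged[:MediaBox]` comes from the root node because the page does not define it, while `merged[:Rotate]` is the page's own value.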
#fonts ⇒ Object
return a hash of fonts used on this page.
The keys are the font labels used within the page content stream.
The values are PDF::Reader::Font instances that provide access to most available metrics for each font.
    # File 'lib/pdf/reader/page.rb', line 79
    def fonts
      raw_fonts = objects.deref(resources[:Font] || {})
      ::Hash[raw_fonts.map { |label, font|
        [label, PDF::Reader::Font.new(objects, objects.deref(font))]
      }]
    end
#inspect ⇒ Object
return a friendly string representation of this page.
    # File 'lib/pdf/reader/page.rb', line 44
    def inspect
      "<PDF::Reader::Page page: #{@pagenum}>"
    end
#number ⇒ Object
return the number of this page within the full document.
    # File 'lib/pdf/reader/page.rb', line 38
    def number
      @pagenum
    end
#raw_content ⇒ Object
returns the raw content stream for this page. This is plumbing, nothing to see here unless you’re a PDF nerd like me.
    # File 'lib/pdf/reader/page.rb', line 123
    def raw_content
      contents = objects.deref(@page_object[:Contents])
      [contents].flatten.compact.map { |obj|
        objects.deref(obj)
      }.map { |obj|
        obj.unfiltered_data
      }.join
    end
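The `[contents].flatten.compact` idiom in the method above lets one code path handle a page whose Contents entry is a single stream, an array of streams, or missing. A minimal sketch of that normalisation follows. Strings stand in for unfiltered stream data, the `deref` and `unfiltered_data` steps are left out, and `join_contents` is a hypothetical helper name.

```ruby
# Sketch only: strings stand in for unfiltered stream data, and
# `join_contents` is a hypothetical helper, not part of PDF::Reader.
def join_contents(contents)
  # Wrap, flatten, and compact so that a single stream, an array of
  # streams, and a missing entry all become one flat array before joining.
  [contents].flatten.compact.join
end

single  = join_contents("BT (Hello) Tj ET")
several = join_contents(["BT (Hel", "lo) Tj ET"])
missing = join_contents(nil)
```

The first two calls produce the same string, and a missing entry yields an empty string instead of raising.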
#resources ⇒ Object
Returns the resources that accompany this page. Includes resources inherited from parents.
    # File 'lib/pdf/reader/page.rb', line 62
    def resources
      @resources ||= @objects.deref(attributes[:Resources]) || {}
    end
#text ⇒ Object (also: #to_s)
returns the plain text content of this page encoded as UTF-8. Any characters that can’t be translated will be returned as a ▯.
    # File 'lib/pdf/reader/page.rb', line 89
    def text
      receiver = PageTextReceiver.new
      walk(receiver)
      receiver.content
    end
#walk(*receivers) ⇒ Object
processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.
This is mostly low level and you can probably ignore it unless you need access to something like the raw encoded text. For an example of how this can be used as a basis for higher level functionality, see the text() method.
This method is intended to provide all the data required to faithfully render the entire page, should someone be motivated enough to try. If you find some required data isn’t available, it’s a bug; let me know.
Many operators that generate callbacks will reference resources stored in the page header - think images, fonts, etc. To facilitate these operators, the first available callback is page=. If your receiver accepts that callback it will be passed the current PDF::Reader::Page object. Use the Page#resources method to grab any required resources.
    # File 'lib/pdf/reader/page.rb', line 115
    def walk(*receivers)
      callback(receivers, :page=, [self])
      content_stream(receivers, raw_content)
    end
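A receiver is an ordinary object that defines methods named after the callbacks it wants. The sketch below shows one that stores the page passed to `page=` and collects strings from a text callback. It runs without a PDF file, so a small stand-in takes the place of the library's dispatch. `dispatch` and `fake_page` are hypothetical, and `show_text` is assumed to be the callback name for text-showing operators; check the callback list for your version.

```ruby
# Sketch only: a receiver that records the page it is given and any
# strings passed to show_text (an assumed callback name).
class TextSnippetReceiver
  attr_reader :page, :snippets

  def initialize
    @snippets = []
  end

  # The first callback sent by Page#walk; keep the page so that later
  # callbacks can look up fonts, images, and other resources.
  def page=(page)
    @page = page
  end

  def show_text(string)
    @snippets << string
  end
end

# Hypothetical stand-in for the library's dispatch: send a callback only
# to receivers that accept it.
def dispatch(receivers, name, args = [])
  receivers.each do |receiver|
    receiver.send(name, *args) if receiver.respond_to?(name)
  end
end

# Hypothetical stand-in for a PDF::Reader::Page.
fake_page = Struct.new(:number).new(1)

receiver = TextSnippetReceiver.new
dispatch([receiver], :page=, [fake_page])
dispatch([receiver], :show_text, ["Hello"])
dispatch([receiver], :begin_text_object) # skipped: the receiver does not define it
```

With the real library, the same receiver would be handed to a page as `page.walk(receiver)`, and the callbacks would come from the page's content stream instead of the manual `dispatch` calls.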
#xobjects ⇒ Object
Returns the XObjects that are available to this page.
    # File 'lib/pdf/reader/page.rb', line 68
    def xobjects
      resources[:XObject] || {}
    end