Class: PDF::Reader::Turtletext
- Inherits: Object
  - Object
  - PDF::Reader::Turtletext
- Defined in: lib/pdf/reader/turtletext.rb, lib/pdf/reader/turtletext/version.rb
Overview
Class for reading structured text content
Typical usage:
reader = PDF::Reader::Turtletext.new(pdf_filename)
page = 1
heading_position = reader.text_position(/transaction table/i, page)
next_section = reader.text_position(/transaction summary/i, page)
# y grows towards the top of the page, so the heading (higher up) gives ymax
# and the following section gives ymin.
transaction_rows = reader.text_in_region(
  heading_position[:x], 900,
  next_section[:y] + 1, heading_position[:y] - 1,
  page
)
Defined Under Namespace
Instance Attribute Summary
-
#options ⇒ Object
readonly
Returns the value of attribute options.
-
#reader ⇒ Object
readonly
Returns the value of attribute reader.
Instance Method Summary
-
#bounding_box(&block) ⇒ Object
Returns a text region definition using a descriptive block.
-
#content(page = 1) ⇒ Object
Returns positional text content (with fuzzed y positioning) as an array of rows: [ fuzzed_y_position, [[x_position,content]] ].
-
#fuzzed_y(input) ⇒ Object
Returns an Array with fuzzed positioning, ordered by decreasing y position.
-
#initialize(source, options = {}) ⇒ Turtletext
constructor
source is a file name or stream-like object. Supported options include: :y_precision.
-
#precise_content(page = 1) ⇒ Object
Returns positional text content collection as a hash with precise x,y positioning: { y_position: { x_position: content}}.
-
#text_in_region(xmin, xmax, ymin, ymax, page = 1) ⇒ Object
Returns an array of text elements found within the x,y limits: x ranges from xmin (left of page) to xmax (right of page), and y ranges from ymin (bottom of page) to ymax (top of page). Each line of text found is returned as an array element.
-
#text_position(text, page = 1) ⇒ Object
Returns the position of text on page as {x: val, y: val }. text may be a string (exact match required) or a Regexp.
-
#y_precision ⇒ Object
Returns the precision required in y positions.
Constructor Details
#initialize(source, options = {}) ⇒ Turtletext
source is a file name or stream-like object. Supported options include:
- :y_precision
# File 'lib/pdf/reader/turtletext.rb', line 21
def initialize(source, options = {})
  @options = options
  @reader = PDF::Reader.new(source)
end
Instance Attribute Details
#options ⇒ Object (readonly)
Returns the value of attribute options.
# File 'lib/pdf/reader/turtletext.rb', line 16
def options
  @options
end
#reader ⇒ Object (readonly)
Returns the value of attribute reader.
# File 'lib/pdf/reader/turtletext.rb', line 15
def reader
  @reader
end
Instance Method Details
#bounding_box(&block) ⇒ Object
Returns a text region definition using a descriptive block.
Usage:
textangle = reader.bounding_box do
page 1
below /electricity/i
above 10
right_of 240.0
left_of "Total ($)"
end
textangle.text
Alternatively, an explicit block parameter may be used:
textangle = reader.bounding_box do |r|
r.page 1
r.below /electricity/i
r.above 10
r.right_of 240.0
r.left_of "Total ($)"
end
textangle.text
=> [['string','string'],['string']] # array of rows; each row is an array of column text elements
# File 'lib/pdf/reader/turtletext.rb', line 147
def bounding_box(&block)
  PDF::Reader::Turtletext::Textangle.new(self,&block)
end
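The two block styles shown in the usage above (implicit receiver, or an explicit block parameter) are a common Ruby DSL pattern. The following is a minimal sketch of how such a pattern can be supported. It is not the Textangle source, which is not listed on this page; `RegionSketch` and its `settings` hash are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for a region builder; NOT the library's Textangle class.
class RegionSketch
  attr_reader :settings

  def initialize(&block)
    @settings = {}
    return unless block
    # A block that declares a parameter is called with the region itself;
    # a parameterless block is evaluated with the region as self.
    block.arity == 1 ? block.call(self) : instance_eval(&block)
  end

  # Each DSL word simply records its argument under its own name.
  %i[page below above left_of right_of].each do |name|
    define_method(name) { |value| @settings[name] = value }
  end
end

implicit = RegionSketch.new do
  page 1
  below(/electricity/i)
  above 10
end

explicit = RegionSketch.new do |r|
  r.page 1
  r.below(/electricity/i)
  r.above 10
end

puts implicit.settings == explicit.settings
```

The explicit form is useful when the block needs to call methods of the surrounding object, since a block run through `instance_eval` changes `self`.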
#content(page = 1) ⇒ Object
Returns positional text content (with fuzzed y positioning) as an array of rows:
[ fuzzed_y_position, [[x_position,content]] ]
# File 'lib/pdf/reader/turtletext.rb', line 37
def content(page=1)
  @content ||= []
  if @content[page]
    @content[page]
  else
    @content[page] = fuzzed_y(precise_content(page))
  end
end
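The listing caches the fuzzed content per page, so repeated calls for the same page do not redo the work. A small standalone sketch of that caching pattern follows. `PageCacheSketch`, its loader block, and the `loads` counter are hypothetical and exist only to make the behaviour observable.

```ruby
# Hypothetical illustration of per-page memoization; not part of the library.
class PageCacheSketch
  attr_reader :loads

  def initialize(&loader)
    @loader = loader
    @loads = 0
    @cache = []
  end

  # Returns the cached value for the page, loading it on first access only.
  def content(page = 1)
    @cache[page] ||= begin
      @loads += 1
      @loader.call(page)
    end
  end
end

cache = PageCacheSketch.new { |page| [[700.0, [[50.0, "page #{page} heading"]]]] }
cache.content(1)
cache.content(1) # served from the cache
cache.content(2)
puts cache.loads
```

Because the cache lives on the instance, a reader object built this way would need to be recreated to pick up changes to the underlying file.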
#fuzzed_y(input) ⇒ Object
Returns an Array with fuzzed positioning, ordered by decreasing y position. Row content keeps the input order for each y position; elements merged in from a nearby y position are appended to the row.
[ fuzzed_y_position, [[x_position,content]] ]
Given input as a hash:
{ y_position: { x_position: content}}
Fuzz factors: y_precision
# File 'lib/pdf/reader/turtletext.rb', line 51
def fuzzed_y(input)
  output = []
  input.keys.sort.reverse.each do |precise_y|
    matching_y = output.map(&:first).select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
    y_index = output.index{|y| y.first == matching_y }
    new_row_content = input[precise_y].to_a
    if y_index
      row_content = output[y_index].last
      row_content += new_row_content
      output[y_index] = [matching_y,row_content]
    else
      output << [matching_y,new_row_content]
    end
  end
  output
end
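To see the effect of the fuzzing on concrete data, here is a standalone sketch of the same merging idea. `fuzz_rows` and its `y_precision:` keyword are names chosen for this example, and the sample positions are made up; the logic follows the listing above but is written independently of the library.

```ruby
# Standalone sketch of y-fuzzing; `fuzz_rows` is a hypothetical helper.
def fuzz_rows(input, y_precision: 3)
  output = []
  # Visit precise y positions from the top of the page (largest y) downwards.
  input.keys.sort.reverse.each do |precise_y|
    # Reuse an existing row whose y lies within y_precision of this one.
    row = output.find { |fuzzed_y, _| (fuzzed_y - precise_y).abs < y_precision }
    if row
      row[1].concat(input[precise_y].to_a)
    else
      output << [precise_y, input[precise_y].to_a]
    end
  end
  output
end

# Made-up sample: "Description" sits 1.5 units below the other headings.
precise = {
  700.0 => { 50.0 => "Date", 200.0 => "Amount" },
  698.5 => { 120.0 => "Description" },
  650.0 => { 50.0 => "2013-01-01" }
}

fuzz_rows(precise).each { |y, cells| puts "#{y}: #{cells.inspect}" }
```

With the default precision of 3, the rows at 700.0 and 698.5 collapse into one row keyed by 700.0, while 650.0 stays separate. The merged cells are appended rather than re-sorted by x.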
#precise_content(page = 1) ⇒ Object
Returns positional text content collection as a hash with precise x,y positioning:
{ y_position: { x_position: content}}
# File 'lib/pdf/reader/turtletext.rb', line 70
def precise_content(page=1)
  @precise_content ||= []
  if @precise_content[page]
    @precise_content[page]
  else
    @precise_content[page] = load_content(page)
  end
end
#text_in_region(xmin, xmax, ymin, ymax, page = 1) ⇒ Object
Returns an array of text elements found within the x,y limits: x ranges from xmin (left of page) to xmax (right of page), and y ranges from ymin (bottom of page) to ymax (top of page). Both limits are inclusive. Each line of text found is returned as an array element. Each line of text is an array of the separate text elements found on that line, ordered by x position.
[["first line first text", "first line last text"],["second line text"]]
# File 'lib/pdf/reader/turtletext.rb', line 85
def text_in_region(xmin,xmax,ymin,ymax,page=1)
  text_map = content(page)
  box = []
  text_map.each do |y,text_row|
    if y >= ymin && y<= ymax
      row = []
      text_row.each do |x,element|
        if x >= xmin && x<= xmax
          row << [x,element]
        end
      end
      box << row.sort{|a,b| a.first <=> b.first }.map(&:last) unless row.empty?
    end
  end
  box
end
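The region filter can be tried without a PDF by running it over a hand-written content array. The sketch below assumes the row shape documented for #content; `rows_in_region` is a hypothetical helper and the sample data is invented.

```ruby
# Sketch of the region filter over already-fuzzed content (hypothetical helper).
def rows_in_region(content, xmin, xmax, ymin, ymax)
  content.each_with_object([]) do |(y, cells), box|
    next unless y.between?(ymin, ymax)
    row = cells.select { |x, _| x.between?(xmin, xmax) }
    # Sort the surviving cells left to right and keep only their text.
    box << row.sort_by(&:first).map(&:last) unless row.empty?
  end
end

content = [
  [700.0, [[200.0, "Amount"], [50.0, "Date"]]],
  [650.0, [[50.0, "2013-01-01"], [200.0, "10.00"]]],
  [600.0, [[50.0, "Total"]]]
]

p rows_in_region(content, 0, 300, 640, 710)
# => [["Date", "Amount"], ["2013-01-01", "10.00"]]
```

Rows whose cells all fall outside the x limits are dropped entirely rather than returned as empty arrays.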
#text_position(text, page = 1) ⇒ Object
Returns the position of text on page as {x: val, y: val }, or nil when no match is found.
text may be a string (exact match required) or a Regexp.
# File 'lib/pdf/reader/turtletext.rb', line 106
def text_position(text,page=1)
  item = if text.class <= Regexp
    content(page).map do |k,v|
      if x = v.reduce(nil){|memo,vv| memo = (vv[1] =~ text) ? vv[0] : memo }
        [k,x]
      end
    end
  else
    content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
  end
  item = item.compact.flatten
  unless item.empty?
    { :x => item[1], :y => item[0] }
  end
end
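The difference between string and Regexp lookups is easiest to see on sample data. The following is a simplified sketch, not the library's code: `find_position` is a hypothetical helper that returns the first matching element in the topmost matching row, whereas the Regexp branch of the listing above keeps the last matching element within a row.

```ruby
# Simplified position lookup over fuzzed content (hypothetical helper).
def find_position(content, text)
  content.each do |y, cells|
    x, _ = cells.find do |_, element|
      # Regexp: pattern match; String: exact equality.
      text.is_a?(Regexp) ? element =~ text : element == text
    end
    return { x: x, y: y } if x
  end
  nil
end

content = [
  [700.0, [[50.0, "Transaction Table"], [200.0, "Amount"]]],
  [600.0, [[50.0, "Total"]]]
]

p find_position(content, /transaction table/i)
p find_position(content, "Total")
p find_position(content, "Tot") # partial strings do not match
```

Since a missing match yields nil, callers that index into the result (for example with `[:y]`) may want to check for nil first.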
#y_precision ⇒ Object
Returns the precision required in y positions. This is the fuzz range for interpreting y positions: lines whose y positions differ by less than y_precision are merged together. This helps align text that visually appears on the same line but is actually offset by a small amount. Defaults to 3 unless the :y_precision option is set.
# File 'lib/pdf/reader/turtletext.rb', line 31
def y_precision
  options[:y_precision] ||= 3
end
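The default handling can be illustrated with a tiny stand-in class. `PrecisionSketch` is hypothetical and mirrors only the options handling, assuming the accessor reads from the options hash as the constructor documentation suggests.

```ruby
# Hypothetical stand-in showing how an options default of 3 can be applied.
class PrecisionSketch
  attr_reader :options

  def initialize(options = {})
    @options = options
  end

  # Uses the caller's value when given; otherwise stores and returns 3.
  def y_precision
    options[:y_precision] ||= 3
  end
end

puts PrecisionSketch.new.y_precision
puts PrecisionSketch.new(y_precision: 5).y_precision
```

With the library itself, the equivalent would presumably be passing `y_precision: 5` as the options argument to `PDF::Reader::Turtletext.new`. A larger value merges lines that are further apart vertically.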