Class: PDF::Reader::Turtletext
- Inherits: Object
  - Object
    - PDF::Reader::Turtletext
- Defined in: lib/pdf/reader/turtletext.rb, lib/pdf/reader/turtletext/version.rb
Overview
Class for reading structured text content
Typical usage:
reader = PDF::Reader::Turtletext.new(pdf_filename)
page = 1
heading_position = reader.text_position(/transaction table/i, page)
next_section = reader.text_position(/transaction summary/i, page)
transaction_rows = reader.text_in_region(
  heading_position[:x], 900,
  heading_position[:y] + 1, next_section[:y] - 1,
  page
)
Defined Under Namespace
Instance Attribute Summary
-
#options ⇒ Object
readonly
Returns the value of attribute options.
-
#reader ⇒ Object
readonly
Returns the value of attribute reader.
Instance Method Summary
-
#bounding_box(&block) ⇒ Object
WIP - not using Textangle yet for text extraction.
-
#content(page = 1) ⇒ Object
Returns positional (with fuzzed y positioning) text content collection as a hash: { y_position: { x_position: content}}.
-
#fuzzed_y(input) ⇒ Object
Returns a hash with fuzzed positioning: { fuzzed_y_position: { x_position: content}}, given input as a hash: { y_position: { x_position: content}}. Fuzz factor: y_precision.
-
#initialize(source, options = {}) ⇒ Turtletext
constructor
source is a file name or stream-like object.
-
#precise_content(page = 1) ⇒ Object
Returns positional text content collection as a hash with precise x,y positioning: { y_position: { x_position: content}}.
-
#text_in_region(xmin, xmax, ymin, ymax, page = 1) ⇒ Object
Returns an array of text elements found within the x,y limits. Each line of text found is returned as an array element.
-
#text_position(text, page = 1) ⇒ Object
Returns the position of text on page as {x: val, y: val }. text may be a string (exact match required) or a Regexp.
-
#y_precision ⇒ Object
Returns the precision required in y positions.
Constructor Details
#initialize(source, options = {}) ⇒ Turtletext
source is a file name or stream-like object.
# File 'lib/pdf/reader/turtletext.rb', line 19
def initialize(source, options = {})
  @options = options
  @reader = PDF::Reader.new(source)
end
Instance Attribute Details
#options ⇒ Object (readonly)
Returns the value of attribute options.
# File 'lib/pdf/reader/turtletext.rb', line 16
def options
  @options
end
#reader ⇒ Object (readonly)
Returns the value of attribute reader.
# File 'lib/pdf/reader/turtletext.rb', line 15
def reader
  @reader
end
Instance Method Details
#bounding_box(&block) ⇒ Object
WIP - not using Textangle yet for text extraction. Ideal usage is something like this:
textangle = reader.bounding_box do
  page 1
  below "Electricity Services"
  above "Gas Services by City Gas Pte Ltd"
  right_of 240.0
  left_of "Total ($)"
end
textangle.text
# File 'lib/pdf/reader/turtletext.rb', line 119
def bounding_box(&block)
  PDF::Reader::Turtletext::Textangle.new(self,&block)
end
#content(page = 1) ⇒ Object
Returns positional (with fuzzed y positioning) text content collection as a hash:
{ y_position: { x_position: content}}
# File 'lib/pdf/reader/turtletext.rb', line 35
def content(page=1)
  @content ||= []
  if @content[page]
    @content[page]
  else
    @content[page] = fuzzed_y(precise_content(page))
  end
end
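The per-page caching in #content can be illustrated without a PDF. The following is a minimal sketch, not the gem's code: PageCache and its loads counter are hypothetical names introduced here, and the block passed to it stands in for fuzzed_y(precise_content(page)).

```ruby
# Hypothetical stand-in for the per-page memoization used by #content.
class PageCache
  attr_reader :loads

  def initialize(&loader)
    @loader = loader   # block that builds the content hash for a page
    @pages  = []       # cache indexed by page number
    @loads  = 0        # counts how often the loader actually ran
  end

  def content(page = 1)
    # Build the page content once, then serve it from the cache.
    @pages[page] ||= begin
      @loads += 1
      @loader.call(page)
    end
  end
end

cache  = PageCache.new { |page| { 700.0 => { 10.0 => "page #{page}" } } }
first  = cache.content(1)
second = cache.content(1)
puts cache.loads
```

Because the cached hash is returned directly, callers that mutate it would also change what later calls see.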
#fuzzed_y(input) ⇒ Object
Returns a hash with fuzzed positioning:
{ fuzzed_y_position: { x_position: content}}
Given input as a hash:
{ y_position: { x_position: content}}
Fuzz factors: y_precision
# File 'lib/pdf/reader/turtletext.rb', line 49
def fuzzed_y(input)
  output = {}
  input.keys.sort.each do |precise_y|
    # matching_y = (precise_y / 5.0).truncate * 5.0
    matching_y = output.keys.select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
    output[matching_y] ||= {}
    output[matching_y].merge!(input[precise_y])
  end
  output
end
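To see what the fuzzing does to a concrete hash, here is a standalone sketch of the same merge step. fuzz_rows is a hypothetical helper name (not part of the gem), the precision is passed as an argument rather than read from options, and the sample coordinates are invented for illustration.

```ruby
# Standalone sketch of the y-fuzzing step; not the gem's method.
def fuzz_rows(input, y_precision = 3)
  output = {}
  input.keys.sort.each do |precise_y|
    # Reuse an already-emitted row whose y is within y_precision,
    # otherwise start a new row at this y.
    matching_y = output.keys.find { |new_y| (new_y - precise_y).abs < y_precision } || precise_y
    output[matching_y] ||= {}
    output[matching_y].merge!(input[precise_y])
  end
  output
end

precise = {
  700.0 => { 10.0 => "Date" },
  701.5 => { 120.0 => "Amount" },  # 1.5 units above "Date": same visual line
  650.0 => { 10.0 => "1 Jan" }
}

fuzzed = fuzz_rows(precise)
p fuzzed
```

Keys are visited in ascending order, so a merged row keeps the lowest y of the group: here "Amount" is folded into the row at 700.0.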
#precise_content(page = 1) ⇒ Object
Returns positional text content collection as a hash with precise x,y positioning:
{ y_position: { x_position: content}}
# File 'lib/pdf/reader/turtletext.rb', line 62
def precise_content(page=1)
  @precise_content ||= []
  if @precise_content[page]
    @precise_content[page]
  else
    @precise_content[page] = load_content(page)
  end
end
#text_in_region(xmin, xmax, ymin, ymax, page = 1) ⇒ Object
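Returns an array of text elements found within the x,y limits. Each line of text found is returned as an array element, and each line is itself an array of the separate text elements found on that line. Lines are ordered from highest y to lowest, and elements within a line from lowest x to highest. Limits are inclusive.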
[["first line first text", "first line last text"],["second line text"]]
# File 'lib/pdf/reader/turtletext.rb', line 75
def text_in_region(xmin,xmax,ymin,ymax,page=1)
  text_map = content(page)
  box = []
  text_map.keys.sort.reverse.each do |y|
    if y >= ymin && y<= ymax
      row = []
      text_map[y].keys.sort.each do |x|
        if x >= xmin && x<= xmax
          row << text_map[y][x]
        end
      end
      box << row unless row.empty?
    end
  end
  box
end
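The region filter can be tried on a hand-built content hash. This is a sketch under assumptions: region_text is a hypothetical free-standing helper that takes the content hash directly instead of calling content(page), and the sample page data is made up.

```ruby
# Standalone sketch of the region filter; not the gem's method.
def region_text(text_map, xmin, xmax, ymin, ymax)
  # Walk rows top-down (highest y first), keeping only rows inside the y limits.
  text_map.keys.sort.reverse.each_with_object([]) do |y, box|
    next unless y.between?(ymin, ymax)
    # Within a row, keep elements inside the x limits, left to right.
    row = text_map[y].keys.sort
                     .select { |x| x.between?(xmin, xmax) }
                     .map { |x| text_map[y][x] }
    box << row unless row.empty?
  end
end

page_content = {
  700.0 => { 10.0 => "Date", 120.0 => "Amount" },
  680.0 => { 10.0 => "1 Jan", 120.0 => "42.00" },
  100.0 => { 10.0 => "Page 1" }
}

rows = region_text(page_content, 0, 200, 650, 710)
p rows
```

Narrowing the x limits to 100..200 would keep only the "Amount" column, which is how a single table column can be picked out.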
#text_position(text, page = 1) ⇒ Object
Returns the position of text on page:
{x: val, y: val }
text may be a string (exact match required) or a Regexp. Returns nil if no match is found.
# File 'lib/pdf/reader/turtletext.rb', line 95
def text_position(text,page=1)
  item = if text.class <= Regexp
    content(page).map {|k,v| if x = v.reduce(nil){|memo,vv| memo = (vv[1] =~ text) ? vv[0] : memo } ; [k,x] ; end }
  else
    content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
  end
  item = item.compact.flatten
  unless item.empty?
    { :x => item[1], :y => item[0] }
  end
end
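The difference between string and Regexp lookups is easiest to see on sample data. The sketch below is a simplified stand-in rather than the gem's implementation: find_position is a hypothetical helper that works on a content hash directly and returns the first match it meets in hash order, whereas the method above picks the last Regexp match within a row.

```ruby
# Simplified, standalone sketch of a position lookup; not the gem's method.
def find_position(text_map, text)
  text_map.each do |y, row|
    row.each do |x, content|
      # Strings must match the element exactly; Regexps may match any part.
      hit = text.is_a?(Regexp) ? content =~ text : content == text
      return { x: x, y: y } if hit
    end
  end
  nil # no match on this page
end

page_content = {
  700.0 => { 10.0 => "Date", 120.0 => "Amount" },
  680.0 => { 10.0 => "1 Jan", 120.0 => "42.00" }
}

exact   = find_position(page_content, "Amount")
pattern = find_position(page_content, /jan/i)
partial = find_position(page_content, "Amoun")
p exact, pattern, partial
```

The partial string "Amoun" finds nothing, so callers should check for nil before indexing the result with :x or :y.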
#y_precision ⇒ Object
Returns the precision required in y positions. This is the fuzz range for interpreting y positions: lines whose y positions fall within +/- y_precision of each other are merged together. This helps align text that visually appears on the same line but is actually off by a few pixels. Defaults to 3 unless a :y_precision option is set.
# File 'lib/pdf/reader/turtletext.rb', line 29
def y_precision
  options[:y_precision] ||= 3
end
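The default-or-override behaviour of the `||=` idiom can be checked with plain hashes standing in for the reader's options. This is a sketch only; default_opts and custom_opts are illustrative names, not part of the gem.

```ruby
# Plain-hash sketch of how the y_precision default is resolved.
default_opts = {}
default_precision = (default_opts[:y_precision] ||= 3)  # no option set: falls back to 3

custom_opts = { y_precision: 5 }
custom_precision = (custom_opts[:y_precision] ||= 3)    # option set: kept as-is

puts default_precision
puts custom_precision
p default_opts  # the default is written back into the hash
```

Given the constructor signature above, the option would presumably be supplied as the second argument, for example PDF::Reader::Turtletext.new(pdf_filename, y_precision: 5).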