Class: PDF::Reader::Turtletext

Inherits:
Object
  • Object
show all
Defined in:
lib/pdf/reader/turtletext.rb,
lib/pdf/reader/turtletext/version.rb

Overview

Class for reading structured text content

Typical usage:

reader = PDF::Reader::Turtletext.new(pdf_filename)
page = 1
heading_position = reader.text_position(/transaction table/i)
next_section = reader.text_position(/transaction summary/i)
transaction_rows = reader.text_in_region(
  heading_position[x], 900,
  heading_position[y] + 1,next_section[:y] -1
)

Defined Under Namespace

Classes: Textangle, Version

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(source, options = {}) ⇒ Turtletext

source is a file name or stream-like object Supported options include:

  • :y_precision



21
22
23
24
# File 'lib/pdf/reader/turtletext.rb', line 21

def initialize(source, options={})
  @options = options
  @reader = PDF::Reader.new(source)
end

Instance Attribute Details

#optionsObject (readonly)

Returns the value of attribute options.



16
17
18
# File 'lib/pdf/reader/turtletext.rb', line 16

def options
  @options
end

#readerObject (readonly)

Returns the value of attribute reader.



15
16
17
# File 'lib/pdf/reader/turtletext.rb', line 15

def reader
  @reader
end

Instance Method Details

#bounding_box(&block) ⇒ Object

Returns a text region definition using a descriptive block.

Usage:

textangle = reader.bounding_box do
  page 1
  below /electricity/i
  above 10
  right_of 240.0
  left_of "Total ($)"
end
textangle.text

Alternatively, an explicit block parameter may be used:

textangle = reader.bounding_box do |r|
  r.page 1
  r.below /electricity/i
  r.above 10
  r.right_of 240.0
  r.left_of "Total ($)"
end
textangle.text
=> [['string','string'],['string']] # array of rows, each row is an array of column text element


147
148
149
# File 'lib/pdf/reader/turtletext.rb', line 147

def bounding_box(&block)
  PDF::Reader::Turtletext::Textangle.new(self,&block)
end

#content(page = 1) ⇒ Object

Returns positional (with fuzzed y positioning) text content collection as a hash:

[ fuzzed_y_position, [[x_position,content]] ]


37
38
39
40
41
42
43
44
# File 'lib/pdf/reader/turtletext.rb', line 37

def content(page=1)
  @content ||= []
  if @content[page]
    @content[page]
  else
    @content[page] = fuzzed_y(precise_content(page))
  end
end

#fuzzed_y(input) ⇒ Object

Returns an Array with fuzzed positioning, ordered by decreasing y position. Row content order by x position.

[ fuzzed_y_position, [[x_position,content]] ]

Given input as a hash:

{ y_position: { x_position: content}}

Fuzz factors: y_precision



51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
# File 'lib/pdf/reader/turtletext.rb', line 51

def fuzzed_y(input)
  output = []
  input.keys.sort.reverse.each do |precise_y|
    matching_y = output.map(&:first).select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
    y_index = output.index{|y| y.first == matching_y }
    new_row_content = input[precise_y].to_a
    if y_index
      row_content = output[y_index].last
      row_content += new_row_content
      output[y_index] = [matching_y,row_content.sort{|a,b| a.first <=> b.first }]
    else
      output << [matching_y,new_row_content.sort{|a,b| a.first <=> b.first }]
    end
  end
  output
end

#precise_content(page = 1) ⇒ Object

Returns positional text content collection as a hash with precise x,y positioning:

{ y_position: { x_position: content}}


70
71
72
73
74
75
76
77
# File 'lib/pdf/reader/turtletext.rb', line 70

def precise_content(page=1)
  @precise_content ||= []
  if @precise_content[page]
    @precise_content[page]
  else
    @precise_content[page] = load_content(page)
  end
end

#text_in_region(xmin, xmax, ymin, ymax, page = 1) ⇒ Object

Returns an array of text elements found within the x,y limits, x ranges from xmin (left of page) to xmax (right of page) y ranges from ymin (bottom of page) to ymax (top of page) Each line of text found is returned as an array element. Each line of text is an array of the seperate text elements found on that line.

[["first line first text", "first line last text"],["second line text"]]


85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
# File 'lib/pdf/reader/turtletext.rb', line 85

def text_in_region(xmin,xmax,ymin,ymax,page=1)
  text_map = content(page)
  box = []

  text_map.each do |y,text_row|
    if y >= ymin && y<= ymax
      row = []
      text_row.each do |x,element|
        if x >= xmin && x<= xmax
          row << element
        end
      end
      box << row unless row.empty?
    end
  end
  box
end

#text_position(text, page = 1) ⇒ Object

Returns the position of text on page

{x: val, y: val }

text may be a string (exact match required) or a Regexp



106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
# File 'lib/pdf/reader/turtletext.rb', line 106

def text_position(text,page=1)
  item = if text.class <= Regexp
    content(page).map do |k,v|
      if x = v.reduce(nil){|memo,vv|  memo = (vv[1] =~ text) ? vv[0] : memo  }
        [k,x]
      end
    end
  else
    content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
  end
  item = item.compact.flatten
  unless item.empty?
    { :x => item[1], :y => item[0] }
  end
end

#y_precisionObject

Returns the precision required in y positions. This is the fuzz range for interpreting y positions. Lines with y positions +/- y_precision will be merged together. This helps align text correctly which may visually appear on the same line, but is actually off by a few pixels.



31
32
33
# File 'lib/pdf/reader/turtletext.rb', line 31

def y_precision
  options[:y_precision] ||= 3
end