Class: PDF::Reader::Turtletext

Inherits:
Object
Defined in:
lib/pdf/reader/turtletext.rb,
lib/pdf/reader/turtletext/version.rb

Overview

Class for reading structured text content

Typical usage:

reader = PDF::Reader::Turtletext.new(pdf_filename)
page = 1
heading_position = reader.text_position(/transaction table/i, page)
next_section = reader.text_position(/transaction summary/i, page)
transaction_rows = reader.text_in_region(
  heading_position[:x], 900,
  heading_position[:y] + 1, next_section[:y] - 1,
  page
)

Defined Under Namespace

Classes: Textangle, Version

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(source, options = {}) ⇒ Turtletext

source is a file name or stream-like object



# File 'lib/pdf/reader/turtletext.rb', line 19

def initialize(source, options={})
  @options = options
  @reader = PDF::Reader.new(source)
end

Instance Attribute Details

#options ⇒ Object (readonly)

Returns the value of attribute options.



# File 'lib/pdf/reader/turtletext.rb', line 16

def options
  @options
end

#reader ⇒ Object (readonly)

Returns the value of attribute reader.



# File 'lib/pdf/reader/turtletext.rb', line 15

def reader
  @reader
end

Instance Method Details

#bounding_box(&block) ⇒ Object

Work in progress: Textangle is not yet used for text extraction. The intended usage is something like this:

textangle = reader.bounding_box do
  page 1
  below "Electricity Services"
  above "Gas Services by City Gas Pte Ltd"
  right_of 240.0
  left_of "Total ($)"
end
textangle.text



# File 'lib/pdf/reader/turtletext.rb', line 119

def bounding_box(&block)
  PDF::Reader::Turtletext::Textangle.new(self,&block)
end

#content(page = 1) ⇒ Object

Returns the text content of page as a hash keyed by position, with fuzzed y positions:

{ y_position: { x_position: content}}


# File 'lib/pdf/reader/turtletext.rb', line 35

def content(page=1)
  @content ||= []
  if @content[page]
    @content[page]
  else
    @content[page] = fuzzed_y(precise_content(page))
  end
end

#fuzzed_y(input) ⇒ Object

Returns a hash with fuzzed positioning:

{ fuzzed_y_position: { x_position: content}}

Given input as a hash:

{ y_position: { x_position: content}}

Fuzz factors: y_precision



# File 'lib/pdf/reader/turtletext.rb', line 49

def fuzzed_y(input)
  output = {}
  input.keys.sort.each do |precise_y|
    # matching_y = (precise_y / 5.0).truncate * 5.0
    matching_y = output.keys.select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
    output[matching_y] ||= {}
    output[matching_y].merge!(input[precise_y])
  end
  output
end
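
The merging step can be tried out without a PDF. The following is a minimal standalone sketch, assuming a hand-written position hash; the helper name merge_close_rows and the sample data are hypothetical and are not part of the gem.

```ruby
# Standalone sketch of the row-merging idea behind fuzzed_y.
# `merge_close_rows` and the sample data are hypothetical, not part of the gem.
def merge_close_rows(input, y_precision)
  output = {}
  input.keys.sort.each do |precise_y|
    # Reuse an existing row whose y lies within y_precision, else start a new row.
    matching_y = output.keys.find { |new_y| (new_y - precise_y).abs < y_precision } || precise_y
    output[matching_y] ||= {}
    output[matching_y].merge!(input[precise_y])
  end
  output
end

precise = {
  700.0 => { 10.0 => "Date" },
  701.5 => { 120.0 => "Amount" }, # 1.5 away from 700.0, so it joins that row
  650.0 => { 10.0 => "2013-01-01" }
}

fuzzed = merge_close_rows(precise, 3)
puts fuzzed.inspect
```

Because the keys are visited in ascending order, a merged row keeps the lowest y value of the rows that were folded into it.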

#precise_content(page = 1) ⇒ Object

Returns the text content of page as a hash keyed by precise x,y position:

{ y_position: { x_position: content}}


# File 'lib/pdf/reader/turtletext.rb', line 62

def precise_content(page=1)
  @precise_content ||= []
  if @precise_content[page]
    @precise_content[page]
  else
    @precise_content[page] = load_content(page)
  end
end

#text_in_region(xmin, xmax, ymin, ymax, page = 1) ⇒ Object

Returns an array of the text elements found within the x,y limits (inclusive). Each line of text found is returned as an array element, and each line is itself an array of the separate text elements found on that line.

[["first line first text", "first line last text"],["second line text"]]


# File 'lib/pdf/reader/turtletext.rb', line 75

def text_in_region(xmin,xmax,ymin,ymax,page=1)
  text_map = content(page)
  box = []
  text_map.keys.sort.reverse.each do |y|
    if y >= ymin && y<= ymax
      row = []
      text_map[y].keys.sort.each do |x|
        if x >= xmin && x<= xmax
          row << text_map[y][x]
        end
      end
      box << row unless row.empty?
    end
  end
  box
end
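
To see the shape of the result, the same filtering can be run against a hand-written position hash. This is only a sketch: rows_in_region and the sample text_map are hypothetical stand-ins, not the gem's API.

```ruby
# Standalone sketch of the region filter; `rows_in_region` and the sample
# data are hypothetical and only mirror the logic shown above.
def rows_in_region(text_map, xmin, xmax, ymin, ymax)
  # Visit rows from the largest y to the smallest.
  text_map.keys.sort.reverse.each_with_object([]) do |y, box|
    next unless y.between?(ymin, ymax)
    # Within a row, keep the elements whose x lies inside the limits, left to right.
    row = text_map[y].keys.sort.select { |x| x.between?(xmin, xmax) }.map { |x| text_map[y][x] }
    box << row unless row.empty?
  end
end

text_map = {
  700.0 => { 10.0 => "Date", 120.0 => "Amount" },
  680.0 => { 10.0 => "2013-01-01", 120.0 => "42.00" },
  100.0 => { 10.0 => "Page 1" }
}

rows = rows_in_region(text_map, 0, 200, 650, 710)
puts rows.inspect
```

Rows come back ordered from the largest y to the smallest. Assuming the usual PDF coordinate system, where y grows upward from the bottom of the page, that corresponds to top-to-bottom reading order.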

#text_position(text, page = 1) ⇒ Object

Returns the position of text on page as a hash, or nil if the text is not found:

{x: val, y: val }

text may be a string (exact match required) or a Regexp



# File 'lib/pdf/reader/turtletext.rb', line 95

def text_position(text,page=1)
  item = if text.class <= Regexp
    content(page).map {|k,v| if x = v.reduce(nil){|memo,vv|  memo = (vv[1] =~ text) ? vv[0] : memo  } ; [k,x] ; end }
  else
    content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
  end
  item = item.compact.flatten
  unless item.empty?
    { :x => item[1], :y => item[0] }
  end
end
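
The difference between string and Regexp lookups can be illustrated with a simplified standalone sketch. find_position and the sample data are hypothetical; the sketch returns the first match it encounters, which may differ from the method above when several elements match.

```ruby
# Simplified standalone sketch of a position lookup; `find_position` is
# hypothetical and returns the first match in iteration order.
def find_position(text_map, text)
  text_map.each do |y, row|
    row.each do |x, content|
      # A Regexp may match part of an element; a String must equal it exactly.
      matched = text.is_a?(Regexp) ? text.match?(content) : content == text
      return { x: x, y: y } if matched
    end
  end
  nil
end

text_map = {
  700.0 => { 10.0 => "Transaction Table" },
  400.0 => { 10.0 => "Transaction Summary" }
}

by_regexp = find_position(text_map, /transaction table/i)
by_string = find_position(text_map, "Transaction Summary")
no_match  = find_position(text_map, "Transaction") # partial strings do not match
puts by_regexp.inspect, by_string.inspect, no_match.inspect
```

Since a lookup can come back as nil, callers would typically check the result before indexing into it with :x or :y.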

#y_precision ⇒ Object

Returns the precision required in y positions, that is, the fuzz range for interpreting them. Lines whose y positions differ by less than y_precision are merged together. This helps align text that visually appears on the same line but is actually offset by a small amount. Defaults to 3 unless options[:y_precision] is set.



# File 'lib/pdf/reader/turtletext.rb', line 29

def y_precision
  options[:y_precision] ||= 3
end
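
The defaulting behaviour of ||= can be checked in isolation. This is a small sketch using a plain hash; y_precision_for is a hypothetical helper that mirrors the one-line method above.

```ruby
# Standalone sketch of the `||=` default; `y_precision_for` is hypothetical.
def y_precision_for(options)
  # Returns the existing value, or stores and returns 3 when it is missing or nil.
  options[:y_precision] ||= 3
end

defaults = {}
custom   = { y_precision: 5 }

default_value = y_precision_for(defaults)
custom_value  = y_precision_for(custom)
puts default_value, custom_value, defaults.inspect
```

Note that ||= writes the default back into the hash, so the options hash holds the key :y_precision after the first call. Given the constructor signature initialize(source, options = {}), a larger fuzz range would presumably be requested by passing y_precision in the options hash when creating the reader.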