Class: PDF::Reader::Turtletext

Inherits: Object

Defined in: lib/pdf/reader/turtletext.rb,
lib/pdf/reader/turtletext/version.rb

Overview

Class for reading structured text content

Typical usage:

reader = PDF::Reader::Turtletext.new(pdf_filename)
page = 1
heading_position = reader.text_position(/transaction table/i)
next_section = reader.text_position(/transaction summary/i)
transaction_rows = reader.text_in_region(
  heading_position[:x], 900,
  heading_position[:y] + 1, next_section[:y] - 1
)

Defined Under Namespace

Classes: Textangle, Version

Instance Attribute Summary

#options ⇒ Object readonly

Returns the value of attribute options.
#reader ⇒ Object readonly

Returns the value of attribute reader.

Instance Method Summary

#bounding_box(&block) ⇒ Object

WIP - not using Textangle yet for text extraction.
#content(page = 1) ⇒ Object

Returns the page's positional text content, with fuzzed y positioning, as a hash: { y_position: { x_position: content } }.
#fuzzed_y(input) ⇒ Object

Given input as a hash { y_position: { x_position: content } }, returns a hash with fuzzed y positioning: { fuzzed_y_position: { x_position: content } }. Fuzz factor: y_precision.
#initialize(source, options = {}) ⇒ Turtletext constructor

source is a file name or stream-like object.
#precise_content(page = 1) ⇒ Object

Returns positional text content collection as a hash with precise x,y positioning: { y_position: { x_position: content}}.
#text_in_region(xmin, xmax, ymin, ymax, page = 1) ⇒ Object

Returns an array of text elements found within the x,y limits. Each line of text found is returned as an array element.
#text_position(text, page = 1) ⇒ Object

Returns the position of text on the page as { x: val, y: val }. text may be a String (exact match required) or a Regexp.
#y_precision ⇒ Object

Returns the precision required in y positions.

Constructor Details

#initialize(source, options = {}) ⇒ `Turtletext`

source is a file name or stream-like object

# File 'lib/pdf/reader/turtletext.rb', line 19

def initialize(source, options={})
  @options = options
  @reader = PDF::Reader.new(source)
end

Instance Attribute Details

#options ⇒ `Object` (readonly)

Returns the value of attribute options.




# File 'lib/pdf/reader/turtletext.rb', line 16

def options
  @options
end

#reader ⇒ `Object` (readonly)

Returns the value of attribute reader.




# File 'lib/pdf/reader/turtletext.rb', line 15

def reader
  @reader
end

Instance Method Details

#bounding_box(&block) ⇒ `Object`

Work in progress: Textangle is not yet used for text extraction. The intended usage is something like this:

textangle = reader.bounding_box do
  page 1
  below "Electricity Services"
  above "Gas Services by City Gas Pte Ltd"
  right_of 240.0
  left_of "Total ($)"
end
textangle.text




# File 'lib/pdf/reader/turtletext.rb', line 119

def bounding_box(&block)
  PDF::Reader::Turtletext::Textangle.new(self,&block)
end

#content(page = 1) ⇒ `Object`

Returns positional (with fuzzed y positioning) text content collection as a hash:

{ y_position: { x_position: content}}

# File 'lib/pdf/reader/turtletext.rb', line 35

def content(page=1)
  @content ||= []
  if @content[page]
    @content[page]
  else
    @content[page] = fuzzed_y(precise_content(page))
  end
end

#fuzzed_y(input) ⇒ `Object`

Returns a hash with fuzzed positioning:

{ fuzzed_y_position: { x_position: content}}

Given input as a hash:

{ y_position: { x_position: content}}

Fuzz factors: y_precision

# File 'lib/pdf/reader/turtletext.rb', line 49

def fuzzed_y(input)
  output = {}
  input.keys.sort.each do |precise_y|
    # matching_y = (precise_y / 5.0).truncate * 5.0
    matching_y = output.keys.select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
    output[matching_y] ||= {}
    output[matching_y].merge!(input[precise_y])
  end
  output
end
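
To make the merging rule concrete, here is a standalone sketch that applies the same logic to a hand-built hash, with no PDF involved. The method name `fuzz_rows`, its explicit `y_precision` argument, and the sample coordinates are assumptions for illustration only; they are not part of the library.

```ruby
# Standalone sketch of the y-fuzzing rule shown above.
# `fuzz_rows` is a hypothetical helper, not part of PDF::Reader::Turtletext.
def fuzz_rows(input, y_precision = 3)
  output = {}
  input.keys.sort.each do |precise_y|
    # Reuse an existing row if one lies within y_precision, else start a new row.
    matching_y = output.keys.find { |y| (y - precise_y).abs < y_precision } || precise_y
    output[matching_y] ||= {}
    output[matching_y].merge!(input[precise_y])
  end
  output
end

# Made-up positions: "Date" and "Amount" sit 1.5 units apart vertically.
precise = {
  100.0 => { 10.0 => "Date" },
  101.5 => { 200.0 => "Amount" },
  80.0  => { 10.0 => "01/02" }
}

fuzzed = fuzz_rows(precise)
# => { 80.0 => { 10.0 => "01/02" }, 100.0 => { 10.0 => "Date", 200.0 => "Amount" } }
```

Because the keys are processed in ascending order, a merged row is stored under the lowest y position in its group.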

#precise_content(page = 1) ⇒ `Object`

Returns positional text content collection as a hash with precise x,y positioning:

{ y_position: { x_position: content}}

# File 'lib/pdf/reader/turtletext.rb', line 62

def precise_content(page=1)
  @precise_content ||= []
  if @precise_content[page]
    @precise_content[page]
  else
    @precise_content[page] = load_content(page)
  end
end

#text_in_region(xmin, xmax, ymin, ymax, page = 1) ⇒ `Object`

Returns an array of text elements found within the x,y limits. Each line of text found is returned as an array element, and each line is itself an array of the separate text elements found on that line.

[["first line first text", "first line last text"],["second line text"]]

# File 'lib/pdf/reader/turtletext.rb', line 75

def text_in_region(xmin,xmax,ymin,ymax,page=1)
  text_map = content(page)
  box = []
  text_map.keys.sort.reverse.each do |y|
    if y >= ymin && y<= ymax
      row = []
      text_map[y].keys.sort.each do |x|
        if x >= xmin && x<= xmax
          row << text_map[y][x]
        end
      end
      box << row unless row.empty?
    end
  end
  box
end
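
The filtering above can be tried out without a PDF. The following is a minimal sketch that runs the same inclusive x,y test over a hand-built hash; `rows_in_region` and the sample data are hypothetical and exist only for illustration.

```ruby
# Standalone sketch of the region filter shown above.
# `rows_in_region` is a hypothetical helper, not part of PDF::Reader::Turtletext.
def rows_in_region(text_map, xmin, xmax, ymin, ymax)
  rows = []
  # Visit rows from the highest y to the lowest.
  text_map.keys.sort.reverse.each do |y|
    next unless y.between?(ymin, ymax)
    # Within a row, keep elements inside the x limits, ordered left to right.
    xs = text_map[y].keys.sort.select { |x| x.between?(xmin, xmax) }
    row = xs.map { |x| text_map[y][x] }
    rows << row unless row.empty?
  end
  rows
end

# Made-up page content in the { y_position: { x_position: content } } shape.
text_map = {
  700.0 => { 50.0 => "Transaction Table" },
  680.0 => { 50.0 => "01/02", 120.0 => "Coffee", 400.0 => "3.50" },
  660.0 => { 50.0 => "02/02", 120.0 => "Books", 400.0 => "42.00" },
  640.0 => { 50.0 => "Transaction Summary" }
}

rows = rows_in_region(text_map, 50.0, 900.0, 641.0, 699.0)
# => [["01/02", "Coffee", "3.50"], ["02/02", "Books", "42.00"]]
```

Note that ymin is the lower bound and ymax the upper bound, and that both limits are inclusive. Rows come back from highest y to lowest, which reads top to bottom when y increases up the page.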

#text_position(text, page = 1) ⇒ `Object`

Returns the position of text on the page, or nil if no match is found:

{x: val, y: val }

text may be a string (exact match required) or a Regexp

# File 'lib/pdf/reader/turtletext.rb', line 95

def text_position(text,page=1)
  item = if text.class <= Regexp
    content(page).map {|k,v| if x = v.reduce(nil){|memo,vv|  memo = (vv[1] =~ text) ? vv[0] : memo  } ; [k,x] ; end }
  else
    content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
  end
  item = item.compact.flatten
  unless item.empty?
    { :x => item[1], :y => item[0] }
  end
end
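
The difference between String and Regexp lookups is easier to see on a small hash. The sketch below is a simplified stand-in, not the library's implementation: `find_position` is a hypothetical helper that returns the first match in iteration order, and the sample data is made up.

```ruby
# Simplified, standalone sketch of a position lookup.
# `find_position` is a hypothetical helper, not part of PDF::Reader::Turtletext.
def find_position(text_map, text)
  text_map.each do |y, row|
    row.each do |x, content|
      # A Regexp may match anywhere in the element; a String must equal it exactly.
      matched = text.is_a?(Regexp) ? text.match?(content) : content == text
      return { x: x, y: y } if matched
    end
  end
  nil
end

text_map = {
  700.0 => { 50.0 => "Transaction Table" },
  640.0 => { 50.0 => "Transaction Summary" }
}

find_position(text_map, "Transaction Table")     # => { x: 50.0, y: 700.0 }
find_position(text_map, /transaction summary/i)  # => { x: 50.0, y: 640.0 }
find_position(text_map, "Transaction")           # => nil (strings must match exactly)
```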

#y_precision ⇒ `Object`

Returns the precision required in y positions. This is the fuzz range for interpreting y positions: lines whose y positions differ by less than y_precision are merged together. This helps align text that visually appears on the same line but is actually off by a few pixels. Defaults to 3 unless options[:y_precision] is set.




# File 'lib/pdf/reader/turtletext.rb', line 29

def y_precision
  options[:y_precision] ||= 3
end
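
The `||=` in the method body both supplies the default and stores it back into the options hash. A minimal sketch of that behaviour, using plain hashes as stand-ins for `options` (the values are illustrative):

```ruby
# No precision given: the default of 3 is returned and written into the hash.
options = {}
precision = (options[:y_precision] ||= 3)
# precision is 3, and options is now { y_precision: 3 }

# A caller-supplied precision is left untouched.
custom = { y_precision: 5 }
custom_precision = (custom[:y_precision] ||= 3)
# custom_precision is 5
```

Since the constructor stores the hash it is given, a hash passed in by variable would presumably gain the `:y_precision` key after the first call.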
