Class: DerivativeRodeo::Services::ExtractWordCoordinatesFromHocrSgmlService::DocStream

Inherits:
Nokogiri::XML::SAX::Document
  • Object
Defined in:
lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb

Overview

SAX document stream class that gathers plain text and word tokens from hOCR.


Constructor Details

#initialize ⇒ DocStream

Returns a new instance of DocStream.


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 70

def initialize
  super()
  # plain text buffer:
  @text = ''
  # list of word hash, containing word+coord:
  @words = []
  # page width and height to be found in hOCR for `div.ocr_page`
  @width = nil
  @height = nil
  # to hold current word data state across #start_element, #characters,
  #   and #end_element methods (to associate word with coordinates).
  @current = nil
  # to preserve element classname from start to use by #end_element
  @element_class_name = nil
end

Instance Attribute Details

#height ⇒ Object

Returns the value of attribute height.


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 68

def height
  @height
end

#text ⇒ Object

Returns the value of attribute text.


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 68

def text
  @text
end

#width ⇒ Object

Returns the value of attribute width.


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 68

def width
  @width
end

#words ⇒ Object

Returns the value of attribute words.


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 68

def words
  @words
end

Instance Method Details

#characters(value) ⇒ Object


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 164

def characters(value)
  return if @current.nil?
  return if @current[:coordinates].nil?
  @current[:word] ||= ''
  @current[:word] += value
  @text += value
end
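
SAX parsers may deliver the text of a single element in more than one `characters` call, which is why the method above appends to `@current[:word]` rather than assigning it. A minimal sketch of that accumulation, using local variables in place of the instance state (the chunk boundaries and coordinates are made-up sample values):

```ruby
# Local stand-ins for @current and @text; sample values are illustrative.
current = { word: nil, coordinates: [10, 20, 40, 20] }
text = ''

# Pretend the parser split the word "hOCR" into two chunks.
%w[hO CR].each do |chunk|
  current[:word] ||= ''   # first chunk replaces the nil placeholder
  current[:word] += chunk # later chunks are appended
  text += chunk
end

p current[:word]
p text
```

Both the word and the plain-text buffer end up holding the full string, regardless of how the parser chunked it.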

#consider?(name, class_name) ⇒ Boolean

Consider element for processing?

- `div.ocr_page` — to get page width/height
- `span.ocr_line` — to help make plain text readable
- `span.ocrx_word` — for word-coordinate JSON and plain text word

Parameters:

  • name (String)

    Element name

  • class_name (String)

    HTML class name

Returns:

  • (Boolean)

    true if element should be processed; otherwise false


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 108

def consider?(name, class_name)
  selector = "#{name}.#{class_name}"
  ['div.ocr_page', 'span.ocr_line', 'span.ocrx_word'].include?(selector)
end
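
As a quick illustration of the selector check, the sketch below copies the method into a standalone script. The constant name `CONSIDERED_SELECTORS` and the sample inputs are assumptions made for this example and are not part of the service.

```ruby
# Standalone copy of the selector check, for illustration only.
# CONSIDERED_SELECTORS is a hypothetical name introduced for this sketch.
CONSIDERED_SELECTORS = ['div.ocr_page', 'span.ocr_line', 'span.ocrx_word'].freeze

def consider?(name, class_name)
  CONSIDERED_SELECTORS.include?("#{name}.#{class_name}")
end

word_ok = consider?('span', 'ocrx_word') # word element: processed
line_ok = consider?('span', 'ocr_line')  # line element: processed
par_ok  = consider?('p', 'ocr_par')      # paragraph element: ignored
bare_ok = consider?('div', nil)          # no class: selector is "div.", ignored

p [word_ok, line_ok, par_ok, bare_ok]
```

An element without a `class` attribute interpolates to a selector such as `div.`, which matches none of the three entries, so it is skipped without a separate nil check.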

#end_documentObject

Callback for completion of parsing hOCR, used to normalize generated text content (strip unneeded whitespace incidental to output).

# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 186

def end_document
  # postprocess @text to remove trailing spaces on lines
  @text = @text.split("\n").map(&:strip).join("\n")
  # remove excess line break
  @text.gsub!(/\n+/, "\n")
  # remove carriage returns (delete! mutates in place; delete would discard its result)
  @text.delete!("\r")
  # remove trailing whitespace at end of buffer
  @text.strip!
end
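
To make the intent of the normalization concrete, here is a minimal sketch of the same steps as a pure function. The helper name `normalize_text` and the sample buffers are hypothetical. Non-mutating String methods are used so that each step's result is explicit.

```ruby
# Hypothetical helper mirroring the normalization steps; not part of DocStream.
def normalize_text(raw)
  text = raw.split("\n").map(&:strip).join("\n") # trim whitespace on each line
  text = text.gsub(/\n+/, "\n")                  # collapse runs of line breaks
  text = text.delete("\r")                       # drop any remaining carriage returns
  text.strip                                     # trim the buffer as a whole
end

raw = "Hello world \n\n\nsecond line \n"
p normalize_text(raw)
```

For the sample buffer this yields two lines, `Hello world` and `second line`, with no trailing spaces or blank lines.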

#end_element(name) ⇒ Object

Callback for element end; at this time, flush word coordinate state for current word, and append line endings to plain text.

Parameters:

  • name (String)

    element name.


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 176

def end_element(name)
  if name == 'span'
    end_word if @element_class_name == 'ocrx_word'
    @text += "\n" if @element_class_name.nil?
  end
  @element_class_name = nil
end
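
The line-break rule is easier to follow when the callbacks are replayed by hand. The sketch below uses a hypothetical stand-in class, `MiniStream`, that copies the state handling but does not subclass `Nokogiri::XML::SAX::Document`, so it runs without any gems. As a simplification, it stores the raw `title` string instead of parsed coordinates and skips the completeness check.

```ruby
# Hypothetical stand-in that replays the callback sequence by hand.
class MiniStream
  attr_reader :text, :words

  def initialize
    @text = ''
    @words = []
    @current = nil
    @element_class_name = nil
  end

  def start_element(_name, attrs = [])
    attributes = attrs.to_h
    @element_class_name = attributes['class']
    return unless @element_class_name == 'ocrx_word'
    # Simplification: keep the raw title rather than parsed coordinates.
    @current = { word: nil, coordinates: attributes['title'] }
  end

  def characters(value)
    return if @current.nil?
    @current[:word] = @current[:word].to_s + value
    @text += value
  end

  def end_element(name)
    if name == 'span'
      if @element_class_name == 'ocrx_word'
        @text += ' '
        @words.push(@current)
        @current = nil
      end
      # The class name was reset by the previous end tag, so nil here means
      # this closing span is an enclosing one (for example the line).
      @text += "\n" if @element_class_name.nil?
    end
    @element_class_name = nil
  end
end

stream = MiniStream.new
stream.start_element('span', [['class', 'ocr_line']])
stream.start_element('span', [['class', 'ocrx_word'], ['title', 'bbox 10 20 50 40']])
stream.characters('Hello')
stream.end_element('span') # closes the first word
stream.start_element('span', [['class', 'ocrx_word'], ['title', 'bbox 60 20 110 40']])
stream.characters('world')
stream.end_element('span') # closes the second word
stream.end_element('span') # closes the line and appends the line break

p stream.text
p stream.words.map { |w| w[:word] }
```

Each word's end tag appends a space and resets the remembered class name, so the next closing `span` sees `nil` and is treated as the end of the enclosing line.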

#end_line ⇒ Object


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 142

def end_line
  # strip trailing whitespace
  @text.strip!
  # then insert a line break
  @text += "\n"
end

#end_word ⇒ Object


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 135

def end_word
  # add trailing space to plaintext buffer for between words:
  @text += ' '
  @words.push(@current) if word_complete?
  @current = nil # clear the current word
end

#s_coords(attrs) ⇒ Array

Returns coordinates from the `span.ocrx_word` element attribute hash.

Parameters:

  • attrs (Hash)

    hash with hOCR `span.ocrx_word` element attributes

Returns:

  • (Array)

    Array of position x, y, width, height in px.


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 90

def s_coords(attrs)
  element_title = attrs['title']
  bbox = element_title.split(';')[0].split('bbox ')[-1]
  x1, y1, x2, y2 = bbox.split(' ').map(&:to_i)
  height = y2 - y1
  width = x2 - x1
  hpos = x1
  vpos = y1
  [hpos, vpos, width, height]
end
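
A worked example of the bounding-box arithmetic: hOCR stores each word's box as `bbox x1 y1 x2 y2` in the element's `title`, and the method converts the two corners into a position plus a size. The sketch below is a condensed standalone copy. The sample `title` value is made up for illustration.

```ruby
# Standalone copy of the bbox arithmetic; the sample title string is made up.
def s_coords(attrs)
  bbox = attrs['title'].split(';')[0].split('bbox ')[-1]
  x1, y1, x2, y2 = bbox.split(' ').map(&:to_i)
  [x1, y1, x2 - x1, y2 - y1] # x, y, width, height
end

attrs = { 'title' => 'bbox 100 200 350 260; x_wconf 95' }
p s_coords(attrs) # => [100, 200, 250, 60]
```

Note that only the first `;`-separated field is inspected, so this relies on `bbox` being the first property in a word's `title`.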

#start_element(name, attrs = []) ⇒ Object

Callback for element start, ignores elements except for:

- `div.ocr_page` — to get page width/height
- `span.ocr_line` — to help make plain text readable
- `span.ocrx_word` — for word-coordinate JSON and plain text word

Parameters:

  • name (String)

    element name.

  • attrs (Array) (defaults to: [])

    Array of key, value pair Arrays.


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 156

def start_element(name, attrs = [])
  attributes = attrs.to_h
  @element_class_name = attributes['class']
  return unless consider?(name, @element_class_name)
  start_word(attributes) if @element_class_name == 'ocrx_word'
  start_page(attributes) if @element_class_name == 'ocr_page'
end
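
The `attrs` argument arrives as an array of `[name, value]` pairs, which `Array#to_h` turns into a hash for lookup by attribute name. A small sketch with made-up attribute values:

```ruby
# Sample attribute pairs in the shape the SAX callback receives (values are illustrative).
attrs = [
  ['class', 'ocrx_word'],
  ['id', 'word_1_1'],
  ['title', 'bbox 10 20 50 40; x_wconf 96']
]
attributes = attrs.to_h

p attributes['class']
p attributes['title']

# With the default empty array there is no class, so the element is skipped.
no_class = [].to_h['class']
p no_class
```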

#start_page(attrs) ⇒ Object


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 120

def start_page(attrs)
  title = attrs['title']
  fields = title.split(';')
  bbox = fields[1].split('bbox ')[-1].split(' ').map(&:to_i)
  # width and height:
  @width = bbox[2]
  @height = bbox[3]
end
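
The page size is read from the second `;`-separated field of the `div.ocr_page` title. The sketch below replays that parsing on a made-up title string. The helper name `page_size` is hypothetical. It assumes, as the method does, that `bbox` is the second property and that the page box starts at `0 0`, so that `x2` and `y2` equal the width and height.

```ruby
# Hypothetical helper mirroring the page-title parsing; not part of DocStream.
def page_size(title)
  fields = title.split(';')
  bbox = fields[1].split('bbox ')[-1].split(' ').map(&:to_i)
  [bbox[2], bbox[3]] # x2 and y2, which equal width and height when x1 = y1 = 0
end

title = 'image "page_001.tif"; bbox 0 0 2550 3300; ppageno 0'
p page_size(title) # => [2550, 3300]
```

A title that lists `bbox` first, or omits the `image` property, would not be handled by this field position.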

#start_word(attrs) ⇒ Object


# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 113

def start_word(attrs)
  @current = {}
  # will be replaced during #characters method call:
  @current[:word] = nil
  @current[:coordinates] = s_coords(attrs)
end

#word_complete? ⇒ Boolean

Returns:

  • (Boolean)

# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 129

def word_complete?
  return false if @current.nil?
  coords = @current[:coordinates]
  @current[:word].present? && coords.size == 4
end
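
`present?` is provided by ActiveSupport rather than core Ruby. For readers without ActiveSupport loaded, the following sketch approximates the same check in plain Ruby. It takes the word hash as an argument instead of reading `@current`, and it uses `strip.empty?` as a stand-in for ActiveSupport's blank test, which is an assumption of this example.

```ruby
# Plain-Ruby approximation of the completeness check; not the service's own code.
def word_complete?(current)
  return false if current.nil?
  word = current[:word]
  coords = current[:coordinates]
  # A word counts as complete when it has non-blank text and four coordinates.
  !word.nil? && !word.strip.empty? && coords.size == 4
end

p word_complete?({ word: 'Hello', coordinates: [10, 20, 40, 20] })
p word_complete?({ word: nil, coordinates: [10, 20, 40, 20] })
p word_complete?({ word: '   ', coordinates: [10, 20, 40, 20] })
p word_complete?(nil)
```

Only the first call reports a complete word. The others are rejected because the text is missing, blank, or there is no current word at all.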