Class: DerivativeRodeo::Services::ExtractWordCoordinatesFromHocrSgmlService::DocStream
- Inherits:
  Nokogiri::XML::SAX::Document
  - Object
  - Nokogiri::XML::SAX::Document
  - DerivativeRodeo::Services::ExtractWordCoordinatesFromHocrSgmlService::DocStream
- Defined in:
  lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb
Overview
SAX Document Stream class to gather text and word tokens from hOCR
Instance Attribute Summary
- #height ⇒ Object
  Returns the value of attribute height.
- #text ⇒ Object
  Returns the value of attribute text.
- #width ⇒ Object
  Returns the value of attribute width.
- #words ⇒ Object
  Returns the value of attribute words.
Instance Method Summary
- #characters(value) ⇒ Object
- #consider?(name, class_name) ⇒ Boolean
  Consider element for processing? True only for `div.ocr_page` (page width/height), `span.ocr_line` (readable plain text), and `span.ocrx_word` (word-coordinate JSON and plain-text word).
- #end_document ⇒ Object
  Callback for completion of parsing hOCR, used to normalize generated text content (strip unneeded whitespace incidental to output).
- #end_element(name) ⇒ Object
  Callback for element end: flushes the word-coordinate state for the current word and appends line endings to the plain text.
- #end_line ⇒ Object
- #end_word ⇒ Object
- #initialize ⇒ DocStream (constructor)
  A new instance of DocStream.
- #s_coords(attrs) ⇒ Array
  Return coordinates from `span.ocrx_word` element attribute hash.
- #start_element(name, attrs = []) ⇒ Object
  Callback for element start; ignores all elements except `div.ocr_page` (page width/height), `span.ocr_line` (readable plain text), and `span.ocrx_word` (word-coordinate JSON and plain-text word).
- #start_page(attrs) ⇒ Object
- #start_word(attrs) ⇒ Object
- #word_complete? ⇒ Boolean
Constructor Details
#initialize ⇒ DocStream
Returns a new instance of DocStream.
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 70
def initialize
  super()
  # plain text buffer:
  @text = ''
  # list of word hashes, each containing word + coordinates:
  @words = []
  # page width and height, to be found in hOCR for `div.ocr_page`
  @width = nil
  @height = nil
  # holds current word data state across #start_element, #characters,
  # and #end_element methods (to associate word with coordinates).
  @current = nil
  # preserves the element class name from #start_element for use by #end_element
  @element_class_name = nil
end
Instance Attribute Details
#height ⇒ Object
Returns the value of attribute height.
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 68
def height
  @height
end
#text ⇒ Object
Returns the value of attribute text.
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 68
def text
  @text
end
#width ⇒ Object
Returns the value of attribute width.
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 68
def width
  @width
end
#words ⇒ Object
Returns the value of attribute words.
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 68
def words
  @words
end
Instance Method Details
#characters(value) ⇒ Object
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 164
def characters(value)
  return if @current.nil?
  return if @current[:coordinates].nil?
  @current[:word] ||= ''
  @current[:word] += value
  @text += value
end
#consider?(name, class_name) ⇒ Boolean
Consider element for processing?
- `div.ocr_page` — to get page width/height
- `span.ocr_line` — to help make plain text readable
- `span.ocrx_word` — for word-coordinate JSON and plain text word
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 108
def consider?(name, class_name)
  selector = "#{name}.#{class_name}"
  ['div.ocr_page', 'span.ocr_line', 'span.ocrx_word'].include?(selector)
end
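To illustrate the selector check, here is a minimal standalone sketch of the same comparison. The names `CONSIDERED` and `considered?` and the sample pairs are hypothetical, introduced only for this demo. Because the check is an exact string match on `name.class`, an element with no class attribute, or with a different class, is not considered.

```ruby
# Standalone sketch of the selector test. CONSIDERED and considered? are
# illustrative names and are not part of DerivativeRodeo.
CONSIDERED = ['div.ocr_page', 'span.ocr_line', 'span.ocrx_word'].freeze

def considered?(name, class_name)
  # Build "element.class" and look it up in the fixed list.
  CONSIDERED.include?("#{name}.#{class_name}")
end

# Hypothetical element/class pairs, as a SAX parser might report them.
samples = [['div', 'ocr_page'], ['span', 'ocrx_word'], ['p', 'ocr_par'], ['span', nil]]
results = samples.map { |name, klass| considered?(name, klass) }
# results is [true, true, false, false]
```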
#end_document ⇒ Object
Callback for completion of parsing hOCR, used to normalize generated
text content (strip unneeded whitespace incidental to output).
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 186
def end_document
  # postprocess @text to remove trailing spaces on lines
  @text = @text.split("\n").map(&:strip).join("\n")
  # remove excess line breaks and carriage returns
  @text.gsub!(/\n+/, "\n")
  @text.delete!("\r")
  # remove trailing whitespace at end of buffer
  @text.strip!
end
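The normalization steps can be tried in isolation. The sketch below applies the same sequence of operations to a plain string; `normalize_text` and the sample input are hypothetical, and the helper returns a new string rather than mutating an instance variable.

```ruby
# Hypothetical helper mirroring the clean-up steps on an ordinary string.
def normalize_text(raw)
  # Strip leading/trailing whitespace from every line.
  text = raw.split("\n").map(&:strip).join("\n")
  # Collapse runs of line breaks into a single break.
  text = text.gsub(/\n+/, "\n")
  # Drop any carriage returns that remain.
  text = text.delete("\r")
  # Trim whitespace at both ends of the buffer.
  text.strip
end

raw = "Hello world \n\n\nsecond line \r\n \n"
clean = normalize_text(raw)
# clean is "Hello world\nsecond line"
```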
#end_element(name) ⇒ Object
Callback for element end: flushes the word-coordinate state for the
current word and appends line endings to the plain text.
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 176
def end_element(name)
  if name == 'span'
    end_word if @element_class_name == 'ocrx_word'
    @text += "\n" if @element_class_name.nil?
  end
  @element_class_name = nil
end
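To see how the start, characters, and end callbacks cooperate, the following sketch replays a hand-written event sequence against a small stand-in class. `MiniStream` is hypothetical: it mirrors the state handling documented on this page but does not inherit from Nokogiri, keeps the raw `title` string instead of parsed coordinates, and uses a simplified completeness check.

```ruby
# MiniStream is a dependency-free stand-in for illustration only.
class MiniStream
  attr_reader :text, :words

  def initialize
    @text = ''
    @words = []
    @current = nil
    @element_class_name = nil
  end

  def start_element(name, attrs = [])
    attributes = attrs.to_h
    @element_class_name = attributes['class']
    return unless name == 'span' && @element_class_name == 'ocrx_word'

    # Simplification: store the raw title rather than parsed coordinates.
    @current = { word: nil, coordinates: attributes['title'] }
  end

  def characters(value)
    return if @current.nil? # text outside a word span is ignored

    @current[:word] = @current[:word].to_s + value
    @text += value
  end

  def end_element(name)
    if name == 'span'
      if @element_class_name == 'ocrx_word'
        @text += ' '
        @words.push(@current) unless @current[:word].to_s.empty?
        @current = nil
      elsif @element_class_name.nil?
        # The class name was cleared when the last word closed, so this
        # end tag belongs to the enclosing line span.
        @text += "\n"
      end
    end
    @element_class_name = nil
  end
end

stream = MiniStream.new
stream.start_element('span', [['class', 'ocr_line']])
stream.start_element('span', [['class', 'ocrx_word'], ['title', 'bbox 0 0 10 10']])
stream.characters('Hello')
stream.end_element('span') # closes the first word
stream.characters(' ')     # inter-word whitespace, ignored
stream.start_element('span', [['class', 'ocrx_word'], ['title', 'bbox 12 0 30 10']])
stream.characters('world')
stream.end_element('span') # closes the second word
stream.end_element('span') # closes the line
```

After this sequence the text buffer holds `"Hello world \n"` and two words have been collected; the trailing space before the line break is the kind of incidental whitespace that `#end_document` later removes.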
#end_line ⇒ Object
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 142
def end_line
  # strip trailing whitespace
  @text.strip!
  # then insert a line break
  @text += "\n"
end
#end_word ⇒ Object
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 135
def end_word
  # add trailing space to plaintext buffer for between words:
  @text += ' '
  @words.push(@current) if word_complete?
  @current = nil # clear the current word
end
#s_coords(attrs) ⇒ Array
Return coordinates from `span.ocrx_word` element attribute hash
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 90
def s_coords(attrs)
  element_title = attrs['title']
  bbox = element_title.split(';')[0].split('bbox ')[-1]
  x1, y1, x2, y2 = bbox.split(' ').map(&:to_i)
  height = y2 - y1
  width = x2 - x1
  hpos = x1
  vpos = y1
  [hpos, vpos, width, height]
end
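A worked example may make the coordinate arithmetic easier to follow. The sketch below repeats the same parsing on a bare string; `bbox_to_coords` and the sample `title` value are hypothetical, and the input is assumed to list the `bbox` property first, as the method itself assumes.

```ruby
# Hypothetical standalone version of the bbox arithmetic.
def bbox_to_coords(title)
  # Take the first ';'-separated property and drop the 'bbox ' keyword.
  bbox = title.split(';')[0].split('bbox ')[-1]
  x1, y1, x2, y2 = bbox.split(' ').map(&:to_i)
  # Convert corner coordinates to [hpos, vpos, width, height].
  [x1, y1, x2 - x1, y2 - y1]
end

# Sample title in the usual hOCR shape (values invented for the demo).
coords = bbox_to_coords('bbox 100 200 300 250; x_wconf 95')
# coords is [100, 200, 200, 50]
```

The word box with corners (100, 200) and (300, 250) therefore becomes a box at position (100, 200) that is 200 wide and 50 high.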
#start_element(name, attrs = []) ⇒ Object
Callback for element start, ignores elements except for:
- `div.ocr_page` — to get page width/height
- `span.ocr_line` — to help make plain text readable
- `span.ocrx_word` — for word-coordinate JSON and plain text word
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 156
def start_element(name, attrs = [])
  attributes = attrs.to_h
  @element_class_name = attributes['class']
  return unless consider?(name, @element_class_name)
  start_word(attributes) if @element_class_name == 'ocrx_word'
  start_page(attributes) if @element_class_name == 'ocr_page'
end
#start_page(attrs) ⇒ Object
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 120
def start_page(attrs)
  title = attrs['title']
  fields = title.split(';')
  bbox = fields[1].split('bbox ')[-1].split(' ').map(&:to_i)
  # width and height:
  @width = bbox[2]
  @height = bbox[3]
end
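As a rough illustration of the page-size parsing, the sketch below runs the same steps on a sample `title` string. `page_size` and the sample value are hypothetical. Note the assumptions inherited from the method: `bbox` must be the second `;`-separated property, and the third and fourth bbox numbers are read directly as width and height, which holds when the page box starts at 0 0.

```ruby
# Hypothetical standalone version of the page bbox parsing.
def page_size(title)
  fields = title.split(';')
  # The second property is expected to be the page bounding box.
  bbox = fields[1].split('bbox ')[-1].split(' ').map(&:to_i)
  [bbox[2], bbox[3]]
end

# Sample page title (file name and numbers invented for the demo).
title = 'image "page-001.tif"; bbox 0 0 2550 3300; ppageno 0'
width, height = page_size(title)
# width is 2550 and height is 3300
```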
#start_word(attrs) ⇒ Object
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 113
def start_word(attrs)
  @current = {}
  # will be replaced during #characters method call:
  @current[:word] = nil
  @current[:coordinates] = s_coords(attrs)
end
#word_complete? ⇒ Boolean
# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 129
def word_complete?
  return false if @current.nil?
  coords = @current[:coordinates]
  @current[:word].present? && coords.size == 4
end
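The `present?` call comes from ActiveSupport rather than core Ruby. For readers without ActiveSupport loaded, the sketch below approximates the same check in plain Ruby; `complete?` is a hypothetical name, and `strip.empty?` is only a close stand-in for `present?` (it may differ on unusual Unicode whitespace).

```ruby
# Hypothetical plain-Ruby approximation of the completeness check.
def complete?(current)
  return false if current.nil?

  word = current[:word]
  coords = current[:coordinates]
  # A word counts as complete when it has non-blank text and four coordinates.
  !word.nil? && !word.strip.empty? && coords.size == 4
end

checks = [
  complete?(nil),
  complete?({ word: nil, coordinates: [1, 2, 3, 4] }),
  complete?({ word: '  ', coordinates: [1, 2, 3, 4] }),
  complete?({ word: 'cat', coordinates: [1, 2] }),
  complete?({ word: 'cat', coordinates: [1, 2, 3, 4] })
]
# checks is [false, false, false, false, true]
```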