Class: DerivativeRodeo::Services::ExtractWordCoordinatesFromHocrSgmlService
- Inherits: Object
- Defined in: lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb
Overview
Responsible for converting an SGML string into JSON coordinates
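To make the conversion concrete: hOCR marks each recognized word with an element of class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` bounding box. The following is a minimal sketch of pulling words and boxes out of such markup. It uses a regular expression purely for brevity, whereas this service streams the document through a Nokogiri SAX parser. The `extract_words` helper, the sample markup, and the `[x, y, width, height]` output shape are assumptions for illustration, not the library's API.

```ruby
# Hypothetical helper (not part of DerivativeRodeo): scan hOCR markup for
# ocrx_word spans and return each word with an [x, y, width, height] box.
WORD_SPAN = /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)<\/span>/

def extract_words(hocr)
  hocr.scan(WORD_SPAN).map do |x0, y0, x1, y1, text|
    x0, y0, x1, y1 = [x0, y0, x1, y1].map(&:to_i)
    # hOCR stores two corners; convert them to an origin plus a size.
    { word: text.strip, coordinates: [x0, y0, x1 - x0, y1 - y0] }
  end
end

# Assumed sample input: two words on an 800x600 page.
SAMPLE_HOCR = <<~HOCR
  <div class='ocr_page' title='bbox 0 0 800 600'>
    <span class='ocrx_word' title='bbox 100 200 300 250; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 320 200 480 250; x_wconf 91'>world</span>
  </div>
HOCR

extract_words(SAMPLE_HOCR).each { |word| puts word.inspect }
```

A regular expression is fragile against attribute reordering and nested markup, which is one reason a streaming parser is the more robust choice for real hOCR files.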
Defined Under Namespace
Classes: AltoXml, DocStream, WordCoordinates
Instance Attribute Summary
- #doc_stream ⇒ Object (readonly)
  Returns the value of attribute doc_stream.
- #source ⇒ Object (readonly)
  Returns the value of attribute source.
Class Method Summary
- .call(sgml) ⇒ String
  A JSON document.
Instance Method Summary
- #initialize(html) ⇒ ExtractWordCoordinatesFromHocrSgmlService (constructor)
  Construct with either a file path or an HTML string.
- #to_alto ⇒ Object
- #to_json ⇒ String (also: #json)
  Output JSON flattened word coordinates.
- #to_text ⇒ String
  Output plain text. The method is named #to_text to stay consistent with the other output methods.
Constructor Details
#initialize(html) ⇒ ExtractWordCoordinatesFromHocrSgmlService
Construct with either a file path or an HTML string.
    # File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 23
    def initialize(html)
      @source = xml?(html) ? html : File.read(html)
      @doc_stream = DocStream.new
      parser = Nokogiri::HTML::SAX::Parser.new(@doc_stream)
      parser.parse(@source)
    end
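The constructor's first line decides whether its argument is already markup or a path to read from disk. The private `xml?` predicate is not shown on this page, so the sketch below substitutes a hypothetical `markup?` check (does the string start with `<` after leading whitespace?) to illustrate the dispatch. The real predicate may use a different test, and `resolve_source` is likewise an illustrative name.

```ruby
require 'tempfile'

# Hypothetical stand-in for the service's private `xml?` predicate, whose
# source is not shown here: treat strings that begin with "<" as markup.
def markup?(candidate)
  candidate.lstrip.start_with?('<')
end

# Mirror the constructor's dispatch: use markup as-is, otherwise read the path.
def resolve_source(html_or_path)
  markup?(html_or_path) ? html_or_path : File.read(html_or_path)
end

inline = "<html><body><span class='ocrx_word'>Hi</span></body></html>"

# Write the same markup to a temporary file and resolve it by path.
Tempfile.create(['page', '.hocr']) do |file|
  file.write(inline)
  file.flush
  puts resolve_source(inline) == resolve_source(file.path)
end
```

Either way the parser receives the markup itself, so callers can pass whichever form they have at hand.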
Instance Attribute Details
#doc_stream ⇒ Object (readonly)
Returns the value of attribute doc_stream.
    # File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 29
    def doc_stream
      @doc_stream
    end
#source ⇒ Object (readonly)
Returns the value of attribute source.
    # File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 29
    def source
      @source
    end
Class Method Details
.call(sgml) ⇒ String
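Documented as returning a JSON document [String]. As listed here, the method returns a new service instance; call #to_json on it to obtain the JSON string.
    # File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 15
    def self.call(sgml)
      new(sgml)
    end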
Instance Method Details
#to_alto ⇒ Object
    # File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 52
    def to_alto
      @to_alto ||= AltoXml.to_alto(
        words: doc_stream.words,
        width: doc_stream.width,
        height: doc_stream.height
      )
    end
#to_json ⇒ String Also known as: json
Output JSON flattened word coordinates
    # File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 36
    def to_json
      @to_json ||= WordCoordinates.to_json(
        words: doc_stream.words,
        width: doc_stream.width,
        height: doc_stream.height
      )
    end
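`WordCoordinates.to_json` is documented only as producing "flattened" word coordinates. One plausible reading, sketched below with the standard `json` library, groups every bounding box under the word it belongs to and records the page size alongside. The `flatten_words` helper, the input hash keys, and the `width`/`height`/`coords` layout are assumptions for illustration; consult the `WordCoordinates` class for the actual format.

```ruby
require 'json'

# Hypothetical helper: group [x, y, width, height] boxes by word so that a
# repeated word maps to a list of every place it occurs on the page.
def flatten_words(words:, width:, height:)
  coords = Hash.new { |hash, key| hash[key] = [] }
  words.each { |entry| coords[entry[:word]] << entry[:coordinates] }
  JSON.generate({ width: width, height: height, coords: coords })
end

# Assumed input shape: one hash per recognized word.
words = [
  { word: 'the', coordinates: [10, 20, 30, 12] },
  { word: 'cat', coordinates: [45, 20, 28, 12] },
  { word: 'the', coordinates: [10, 40, 30, 12] }
]

puts flatten_words(words: words, width: 800, height: 600)
```

Keying by word makes lookups for search-term highlighting cheap, at the cost of losing the original reading order of the page.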
#to_text ⇒ String
Output plain text. The method is named #to_text to stay consistent with the other output methods.
    # File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 48
    def to_text
      @to_text ||= doc_stream.text
    end