Class: DerivativeRodeo::Services::ExtractWordCoordinatesFromHocrSgmlService

Inherits:
Object
Defined in:
lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb

Overview

Responsible for converting an hOCR SGML string (XML or HTML) into JSON word coordinates.

Defined Under Namespace

Classes: AltoXml, DocStream, WordCoordinates

Constructor Details

#initialize(html) ⇒ ExtractWordCoordinatesFromHocrSgmlService

Constructs the service from either a file path or a markup (HTML/XML) String.

Parameters:

  • html (String)

    either an XML string or a path to a file.



# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 23

def initialize(html)
  @source = xml?(html) ? html : File.read(html)
  @doc_stream = DocStream.new
  parser = Nokogiri::HTML::SAX::Parser.new(@doc_stream)
  parser.parse(@source)
end
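The constructor accepts either markup or a path and decides between them with `xml?`, a private helper whose body is not shown on this page. The following is a minimal sketch of that dispatch, using only the standard library. The `markup?` heuristic (first non-blank character is `<`) and the `load_source` name are assumptions made for illustration, not the library's implementation.

```ruby
require 'tempfile'

# Hypothetical stand-in for the private `xml?` check, which this page does not show.
# Assumption: input whose first non-blank character is "<" is treated as markup.
def markup?(input)
  input.lstrip.start_with?('<')
end

# Mirrors the constructor's ternary: use markup as-is, otherwise read it from disk.
def load_source(input)
  markup?(input) ? input : File.read(input)
end

hocr = "<html><body><span class='ocrx_word' title='bbox 1 2 3 4'>Hi</span></body></html>"

# Case 1: the markup itself is passed in.
from_string = load_source(hocr)

# Case 2: a path is passed in; the file contents become the source.
from_file = Tempfile.create(['page', '.hocr']) do |file|
  file.write(hocr)
  file.flush
  load_source(file.path)
end
```

Under this heuristic both calls yield the same source text. A path that does not exist would raise `Errno::ENOENT` from `File.read`.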

Instance Attribute Details

#doc_stream ⇒ Object (readonly)

Returns the value of attribute doc_stream.



# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 29

def doc_stream
  @doc_stream
end

#source ⇒ Object (readonly)

Returns the value of attribute source.



# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 29

def source
  @source
end

Class Method Details

.call(sgml) ⇒ String

Returns a JSON document.

Parameters:

  • sgml (String)

    The SGML (e.g. XML or HTML) text of a HOCR file.

Returns:

  • (String)

    A JSON document



# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 15

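def self.call(sgml)
  # Build the service, then return its JSON output so the result
  # matches the documented String return type.
  new(sgml).to_json
end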

Instance Method Details

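#to_alto ⇒ Object

Outputs the result of AltoXml.to_alto for the parsed words and page dimensions. The result is memoized.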



# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 52

def to_alto
  @to_alto ||= AltoXml.to_alto(
    words: doc_stream.words,
    width: doc_stream.width,
    height: doc_stream.height
  )
end

#to_json ⇒ String

Also known as: json

Outputs the flattened word coordinates as JSON.

Returns:

  • (String)

    JSON serialization of flattened word coordinates



# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 36

def to_json
  @to_json ||= WordCoordinates.to_json(
    words: doc_stream.words,
    width: doc_stream.width,
    height: doc_stream.height
  )
end
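To illustrate what "flattened word coordinates" could look like, here is a minimal sketch that turns hOCR `bbox x0 y0 x1 y1` title values into `[x, y, width, height]` entries grouped by word. The input records, the `bbox_to_xywh` helper, and the `width`/`height`/`coords` output shape are all assumptions made for illustration. This page does not show what `WordCoordinates.to_json` actually emits.

```ruby
require 'json'

# Parse an hOCR title attribute such as "bbox 36 92 96 116; x_wconf 95"
# into [x, y, width, height]. Returns nil when no bbox is present.
def bbox_to_xywh(title)
  match = title.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match

  x0, y0, x1, y1 = match.captures.map(&:to_i)
  [x0, y0, x1 - x0, y1 - y0]
end

# Hypothetical word records, standing in for what a SAX handler might collect.
words = [
  { word: 'Hello', title: 'bbox 36 92 96 116; x_wconf 95' },
  { word: 'world', title: 'bbox 110 92 180 116; x_wconf 91' },
  { word: 'Hello', title: 'bbox 36 130 96 154; x_wconf 93' }
]

# Group every occurrence of a word under one key (the "flattened" shape assumed here).
coords = Hash.new { |hash, key| hash[key] = [] }
words.each do |entry|
  box = bbox_to_xywh(entry[:title])
  coords[entry[:word]] << box if box
end

# Page dimensions are made-up values for the example.
flattened = JSON.generate({ width: 600, height: 800, coords: coords })
```

Grouping by word keeps repeated words in a single entry with several boxes, which suits lookups such as search-term highlighting.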

#to_text ⇒ String

Outputs plain text. The method is named #to_text to keep it consistent with the other output methods (#to_json and #to_alto).

Returns:

  • (String)

    plain text of OCR’d document



# File 'lib/derivative_rodeo/services/extract_word_coordinates_from_hocr_sgml_service.rb', line 48

def to_text
  @to_text ||= doc_stream.text
end
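The output methods above all cache their result with `||=`. The sketch below shows that memoization idiom in isolation. `MemoizedReport`, `build_text`, and `build_count` are hypothetical names introduced only to make the caching observable.

```ruby
# Hypothetical class demonstrating the `@ivar ||= expensive_call` idiom.
class MemoizedReport
  attr_reader :build_count

  def initialize(words)
    @words = words
    @build_count = 0
  end

  # Computes the text on the first call, then returns the cached String.
  def to_text
    @to_text ||= build_text
  end

  private

  # Counts invocations so callers can confirm the work happens once.
  def build_text
    @build_count += 1
    @words.join(' ')
  end
end

report = MemoizedReport.new(%w[Hello world])
first = report.to_text
second = report.to_text
```

After both calls `build_count` is 1 and `second` is the same object as `first`. Note that `||=` recomputes whenever the cached value is `nil` or `false`, so the idiom only caches truthy results.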