Class: Rika::Parser

Inherits:
Object
Defined in:
lib/rika/parser.rb

Overview

Parses a document and returns a ParseResult. This class is intended to be used only by the Rika module, not by users of the gem, who should instead call Rika.parse.

Instance Method Summary

  • #parse ⇒ ParseResult

    Entry point method for parsing a document.

Constructor Details

#initialize(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new) ⇒ Parser

Returns a new instance of Parser.

Parameters:

  • data_source (String)

    file path or HTTP(S) URL of the document to parse

  • key_sort (Boolean) (defaults to: true)

    whether to sort the keys of the metadata hash (case-insensitively)

  • max_content_length (Integer) (defaults to: -1)

    maximum length of the content to return; -1 returns all content

  • detector (Detector) (defaults to: DefaultDetector.new)

    Tika detector used to identify the document type



# File 'lib/rika/parser.rb', line 15

def initialize(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new)
  @data_source = data_source
  @key_sort = key_sort
  @max_content_length = max_content_length
  @detector = detector
  @input_type = data_source_input_type
  @tika = Tika.new(@detector)
end
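
The constructor derives @input_type by calling data_source_input_type, whose body is not shown on this page. As a rough sketch only, and not necessarily how Rika does it, a "file path or HTTP(S) URL" string could be classified as below. The method name classify_data_source and the :http / :file symbols are hypothetical and chosen for illustration.

```ruby
require 'uri'

# Hypothetical helper (not part of Rika): classify a data source string
# as an HTTP(S) URL or, failing that, as a file path.
def classify_data_source(data_source)
  uri = URI.parse(data_source)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  return :http if uri.is_a?(URI::HTTP) && uri.host

  :file
rescue URI::InvalidURIError
  # Strings that are not valid URIs (e.g. paths containing spaces)
  # are treated as file paths.
  :file
end
```

Under these assumptions, classify_data_source('https://example.com/a.pdf') yields :http, while a relative path such as 'docs/report.pdf' yields :file. A real implementation would probably also check that the file exists.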

Instance Method Details

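#parse ⇒ ParseResult

Entry point method for parsing a document. Extracts the content and metadata, detects the content's language, and returns everything in a ParseResult.

Returns:

  • (ParseResult)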


# File 'lib/rika/parser.rb', line 26

def parse
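  # Tika fills this Java Metadata object while extracting the text.
  metadata_java = Metadata.new
  @tika.set_max_string_length(@max_content_length)
  content = with_input_stream { |stream| @tika.parse_to_string(stream, metadata_java) }
  language = Rika.language(content)
  metadata_java.set('rika:language', language)
  metadata_java.set('rika:data-source', @data_source)
  # Convert the Java metadata to a Ruby hash, then optionally sort it by key, ignoring case.
  metadata = metadata_java_to_ruby(metadata_java)
  metadata = metadata.sort_by { |key, _value| key.downcase }.to_h if @key_sort

  ParseResult.new(
    content:            content,
    metadata:           metadata,
    metadata_java:      metadata_java,
    content_type:       metadata['Content-Type'],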
    language:           language,
    input_type:         @input_type,
    data_source:        @data_source,
    max_content_length: @max_content_length
  )
end
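
When key_sort is true, the metadata hash is rebuilt with its keys in case-insensitive order. The following standalone sketch applies the same sort_by expression to a made-up hash (the keys and values are illustrative, not real Tika output) and contrasts it with a plain sort, which orders uppercase letters before lowercase ones.

```ruby
# Sample metadata hash; keys and values are invented for illustration.
metadata = {
  'dc:title'     => 'Annual Report',
  'Content-Type' => 'application/pdf',
  'author'       => 'Jane Doe'
}

# Same expression as in #parse: compare keys by their lowercase form,
# then turn the sorted array of pairs back into a hash.
sorted = metadata.sort_by { |key, _value| key.downcase }.to_h

# For comparison: a plain sort compares raw strings, so 'Content-Type'
# comes before 'author' because 'C' sorts before 'a'.
plain = metadata.sort.to_h

puts sorted.keys.inspect
puts plain.keys.inspect
```

Sorting on key.downcase keeps mixed-case keys such as 'Content-Type' and 'dc:title' in the order a reader would expect, without changing how the keys themselves are spelled.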