Class: Traject::Marc4JReader

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/traject/marc4j_reader.rb

Overview

Uses Marc4J to read the MARC records, then translates them to ruby-marc MARC::Record objects before delivering them. Marc4J stays inside the black box.

This is one way to get the ability to transcode from MARC8. The records it delivers are ALWAYS in UTF-8; they are transcoded if needed.

We also hope it gives some performance benefit.

For binary input it uses Marc4J's MarcPermissiveStreamReader, in permissive or non-permissive mode depending on the marc4j_reader.permissive setting. For XML input it uses Marc4J's MarcXmlReader.

NOTE: If you aren’t reading binary records encoded in MARC8, you may find the pure-ruby Traject::MarcReader faster; the extra step of reading with Marc4J and then translating to a ruby MARC::Record adds some overhead.

Settings:

  • marc_source.type: serialization type. Default 'binary'; 'xml' is also supported (TODO: json/marc-in-json).

  • marc4j_reader.permissive: default true; set to false to turn off permissive reading. Used as the value of the 'permissive' argument of the MarcPermissiveStreamReader constructor. Only used for 'binary'.

  • marc4j_reader.source_encoding: only used for 'binary', otherwise always UTF-8. A string, one of the values MarcPermissiveStreamReader accepts:
    * BESTGUESS  (tries to use the MARC leader and believe it, I think)
    * ISO8859_1
    * UTF-8
    * MARC8
    Default 'BESTGUESS', but MARC records in the wild are so often wrong here that setting it explicitly is recommended. (Records will ALWAYS be transcoded to UTF-8 on the way out. We insist.)

  • marc4j_reader.jar_dir: path to a directory containing the Marc4J jar file to use. All .jar files in the directory will be loaded. If unset, uses the marc4j.jar bundled with traject.

  • marc4j_reader.keep_marc4j: keeps the original Marc4J record accessible from the eventual ruby-marc record via record#original_marc4j.
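
As a rough illustration of how these settings fit together, the sketch below builds a plain Ruby hash for reading MARC8 binary input and checks the encoding name against the four values listed above. DEFAULTS, SOURCE_ENCODINGS and reader_settings are hypothetical names invented for this example and are not part of traject; only the setting keys and the stated defaults come from the list above.

```ruby
# Hypothetical helpers for illustration only; not part of traject.
# Setting keys and defaults are taken from the settings list above.
SOURCE_ENCODINGS = %w[BESTGUESS ISO8859_1 UTF-8 MARC8].freeze

DEFAULTS = {
  "marc_source.type"              => "binary",
  "marc4j_reader.permissive"      => true,
  "marc4j_reader.source_encoding" => "BESTGUESS"
}.freeze

# Merge caller overrides onto the documented defaults and reject
# encoding names that are not in the documented list.
def reader_settings(overrides = {})
  settings = DEFAULTS.merge(overrides)
  encoding = settings["marc4j_reader.source_encoding"]
  unless SOURCE_ENCODINGS.include?(encoding)
    raise ArgumentError, "Unknown source_encoding: #{encoding}"
  end
  settings
end

marc8_settings = reader_settings({
  "marc4j_reader.source_encoding" => "MARC8",
  "marc4j_reader.keep_marc4j"     => true
})
```

Under JRuby, a hash like marc8_settings could then be passed as the settings argument of the constructor documented below.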
    

Constructor Details

#initialize(input_stream, settings) ⇒ Marc4JReader

Returns a new instance of Marc4JReader.



# File 'lib/traject/marc4j_reader.rb', line 50

def initialize(input_stream, settings)
  @settings     = Traject::Indexer::Settings.new settings
  @input_stream = input_stream

  if @settings['marc4j_reader.keep_marc4j'] &&
    ! (MARC::Record.instance_methods.include?(:original_marc4j) &&
       MARC::Record.instance_methods.include?(:"original_marc4j="))
    MARC::Record.class_eval('attr_accessor :original_marc4j')
  end
  
  # Creating a converter will do the following:
  #  - nothing, if it detects that the marc4j jar is already loaded
  #  - load all the .jar files in settings['marc4j_reader.jar_dir'] if set
  #  - load the marc4j jar file bundled with MARC::MARC4J otherwise
   
  @converter = MARC::MARC4J.new(:jardir => settings['marc4j_reader.jar_dir'], :logger => logger)
  
  # Convenience
  java_import org.marc4j.MarcPermissiveStreamReader
  java_import org.marc4j.MarcXmlReader

end
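
The keep_marc4j guard in the constructor adds an accessor to MARC::Record only when it is not already defined. A minimal sketch of that pattern in plain Ruby follows; DemoRecord and ensure_original_marc4j_accessor are hypothetical names, with DemoRecord standing in for MARC::Record so the example runs without ruby-marc.

```ruby
# DemoRecord is a hypothetical stand-in for MARC::Record.
class DemoRecord; end

# Define the reader/writer pair only if the class lacks it.
# Returns true when the accessor was added, false when it already existed.
def ensure_original_marc4j_accessor(klass)
  methods = klass.instance_methods
  return false if methods.include?(:original_marc4j) &&
                  methods.include?(:original_marc4j=)

  klass.class_eval { attr_accessor :original_marc4j }
  true
end

first_call  = ensure_original_marc4j_accessor(DemoRecord) # adds the accessor
second_call = ensure_original_marc4j_accessor(DemoRecord) # already present

record = DemoRecord.new
record.original_marc4j = :some_marc4j_record
```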

Instance Attribute Details

#input_stream ⇒ Object (readonly)

Returns the value of attribute input_stream.



# File 'lib/traject/marc4j_reader.rb', line 48

def input_stream
  @input_stream
end

#settings ⇒ Object (readonly)

Returns the value of attribute settings.



# File 'lib/traject/marc4j_reader.rb', line 48

def settings
  @settings
end

Instance Method Details

#create_marc_reader! ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 83

def create_marc_reader!
  case input_type
  when "binary"
    permissive = settings["marc4j_reader.permissive"].to_s == "true"

    # #to_inputstream turns our ruby IO into a Java InputStream
    # third arg means 'convert to UTF-8, yes'
    MarcPermissiveStreamReader.new(input_stream.to_inputstream, permissive, true, settings["marc4j_reader.source_encoding"])
  when "xml"
    MarcXmlReader.new(input_stream.to_inputstream)
  else
    # ArgumentError is Ruby's built-in error class for bad argument values
    raise ArgumentError.new("Unrecognized marc_source.type: #{input_type}")
  end
end
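
The expression settings["marc4j_reader.permissive"].to_s == "true" accepts both the boolean and the string form of the setting. The short sketch below shows how a few values come out; permissive? is a hypothetical helper wrapping that same expression. Under this expression alone an unset (nil) value would be non-permissive, so the documented default of true presumably comes from the settings defaults rather than from this method.

```ruby
# Hypothetical helper wrapping the same coercion used in create_marc_reader!
def permissive?(value)
  value.to_s == "true"
end

# Evaluate the coercion for boolean, string, and unset inputs.
results = [true, "true", false, "false", nil].map { |v| permissive?(v) }
```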

#each ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 98

def each
  while (internal_reader.hasNext)
    begin
      marc4j = internal_reader.next
      rubymarc = @converter.marc4j_to_rubymarc(marc4j)
      if @settings['marc4j_reader.keep_marc4j']
        rubymarc.original_marc4j = marc4j
      end
    rescue Exception => e
      msg = "MARC4JReader: Error reading MARC, fatal, re-raising"
      if marc4j
        msg += "\n    001 id: #{marc4j.getControlNumber}"
      end
      msg += "\n    #{Traject::Util.exception_to_log_message(e)}"
      logger.fatal msg
      raise e
    end

    yield rubymarc
  end
end
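
Because the class includes Enumerable and defines each, callers get map, select, first and the rest for free. The sketch below mimics the loop above with stand-ins so it runs on any Ruby rather than only JRuby: FakeJavaReader imitates the hasNext/next interface of a Marc4J reader and DemoReader imitates the wrapper. Both classes are hypothetical, the "conversion" is just upcase, and the sketch rescues StandardError instead of Exception as a choice of its own.

```ruby
# Hypothetical stand-in for a Marc4J reader: exposes hasNext / next.
class FakeJavaReader
  def initialize(items)
    @items = items.dup
  end

  def hasNext
    !@items.empty?
  end

  def next
    @items.shift
  end
end

# Hypothetical stand-in for the wrapper: same loop shape as #each above.
class DemoReader
  include Enumerable

  def initialize(internal_reader)
    @internal_reader = internal_reader
  end

  def each
    while @internal_reader.hasNext
      begin
        raw = @internal_reader.next
        converted = raw.upcase # stands in for marc4j_to_rubymarc
      rescue StandardError => e
        # Report which record failed, then re-raise the same error.
        warn "DemoReader: error reading #{raw.inspect}: #{e.message}"
        raise
      end
      yield converted
    end
  end
end

# Enumerable#map is available because #each is defined.
titles = DemoReader.new(FakeJavaReader.new(%w[alpha beta gamma])).map { |r| r }
```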

#input_type ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 78

def input_type
  # maybe later add some guessing somehow
  settings["marc_source.type"]
end

#internal_reader ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 74

def internal_reader
  @internal_reader ||= create_marc_reader!
end
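
The ||= in internal_reader means the underlying reader is built once, on first use, and reused afterward. A small sketch of that behavior follows, using a hypothetical LazyHolder class that counts how often its builder runs.

```ruby
# Hypothetical class illustrating ||= memoization.
class LazyHolder
  attr_reader :build_count

  def initialize
    @build_count = 0
  end

  # Builds the reader on first call, returns the cached one afterward.
  def internal_reader
    @internal_reader ||= create_reader!
  end

  private

  def create_reader!
    @build_count += 1
    Object.new # stands in for the Marc4J reader
  end
end

holder = LazyHolder.new
first  = holder.internal_reader
second = holder.internal_reader
```

Note that ||= would call the builder again if it ever returned nil or false, which is not a concern when the builder always returns an object.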

#logger ⇒ Object



# File 'lib/traject/marc4j_reader.rb', line 120

def logger
  # Fall back to an effectively null logger when settings supply none
  @logger ||= (settings[:logger] || Yell.new(STDERR, :level => "gt.fatal"))
end