Class: Traject::Marc4JReader
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/traject/marc4j_reader.rb
Overview
Uses Marc4J to read the MARC records, but translates them to ruby-marc records before delivering them, so Marc4J stays inside the black box.
This is one way to get the ability to transcode from MARC8. The records it delivers are ALWAYS in UTF-8 and will be transcoded if needed.
We also hope it gives us some performance benefit.
Uses the Marc4J MarcPermissiveStreamReader for binary input, in permissive or non-permissive mode according to settings. Uses the Marc4J MarcXmlReader for XML.
NOTE: If you aren’t reading binary records encoded in MARC8, you may find the pure-ruby Traject::MarcReader faster; the extra step of reading with Marc4J and then translating to a ruby MARC::Record adds some overhead.
Settings:
- marc_source.type: serialization type. Default 'binary'; also accepts 'xml' (TODO: json/marc-in-json).
- marc4j_reader.permissive: default true; set to false to turn off permissive reading. Used as the value of the 'permissive' argument of the MarcPermissiveStreamReader constructor. Only used for 'binary'.
- marc4j_reader.source_encoding: only used for 'binary'; otherwise always UTF-8. A string, one of the values MarcPermissiveStreamReader accepts:
  * BESTGUESS (tries to use the MARC leader and believe it, I think)
  * ISO8859_1
  * UTF-8
  * MARC8
  Default 'BESTGUESS', but MARC records in the wild are so often wrong here that setting this explicitly is recommended. (Records will ALWAYS be transcoded to UTF-8 on the way out. We insist.)
- marc4j_reader.jar_dir: path to a directory containing the Marc4J jar file to use. All .jar files in the directory will be loaded. If unset, uses the marc4j.jar bundled with traject.
- marc4j_reader.keep_marc4j: keeps the original Marc4J record accessible from the eventual ruby-marc record via record#original_marc4j.
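The settings above are plain key/value pairs, so they can be sketched as a Ruby hash. This is a minimal illustration rather than traject's own code: `marc4j_reader_settings` is a hypothetical helper name, and the default values are simply the ones documented in the list above.

```ruby
# Hypothetical helper, not part of traject: returns the documented defaults
# for the settings this reader consults, with caller overrides merged on top.
def marc4j_reader_settings(overrides = {})
  defaults = {
    "marc_source.type"              => "binary",    # or "xml"
    "marc4j_reader.permissive"      => true,
    "marc4j_reader.source_encoding" => "BESTGUESS"  # or ISO8859_1, UTF-8, MARC8
  }
  defaults.merge(overrides)
end

# Explicitly declare the source encoding, as the documentation recommends.
settings = marc4j_reader_settings("marc4j_reader.source_encoding" => "MARC8")
```

Keys left out of the overrides keep their documented defaults, which is why only the encoding needs to be named here.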
Instance Attribute Summary
- #input_stream ⇒ Object (readonly): Returns the value of attribute input_stream.
- #settings ⇒ Object (readonly): Returns the value of attribute settings.
Instance Method Summary
- #create_marc_reader! ⇒ Object
- #each ⇒ Object
- #initialize(input_stream, settings) ⇒ Marc4JReader (constructor): A new instance of Marc4JReader.
- #input_type ⇒ Object
- #internal_reader ⇒ Object
- #logger ⇒ Object
Constructor Details
#initialize(input_stream, settings) ⇒ Marc4JReader
Returns a new instance of Marc4JReader.
# File 'lib/traject/marc4j_reader.rb', line 50
def initialize(input_stream, settings)
  @settings = Traject::Indexer::Settings.new settings
  @input_stream = input_stream

  if @settings['marc4j_reader.keep_marc4j'] &&
    ! (MARC::Record.instance_methods.include?(:original_marc4j) &&
       MARC::Record.instance_methods.include?(:"original_marc4j="))
    MARC::Record.class_eval('attr_accessor :original_marc4j')
  end

  # Creating a converter will do the following:
  # - nothing, if it detects that the marc4j jar is already loaded
  # - load all the .jar files in settings['marc4j_reader.jar_dir'] if set
  # - load the marc4j jar file bundled with MARC::MARC4J otherwise
  @converter = MARC::MARC4J.new(:jardir => settings['marc4j_reader.jar_dir'], :logger => logger)

  # Convenience
  java_import org.marc4j.MarcPermissiveStreamReader
  java_import org.marc4j.MarcXmlReader
end
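The constructor only adds the original_marc4j accessor to MARC::Record when the keep_marc4j setting is on and the accessor is not already defined. The pattern can be sketched in isolation as follows; `DemoRecord` and `keep_original` are hypothetical stand-ins for MARC::Record and the setting, so the example runs without ruby-marc or JRuby.

```ruby
# Stand-in for MARC::Record so the sketch runs without ruby-marc installed.
class DemoRecord; end

# Plays the role of settings['marc4j_reader.keep_marc4j'] (assumption).
keep_original = true

# Reopen the class only when the setting is on and the reader/writer pair
# is not already there, so repeated construction does not redefine it.
has_accessor = DemoRecord.method_defined?(:original_marc4j) &&
               DemoRecord.method_defined?(:original_marc4j=)
DemoRecord.class_eval { attr_accessor :original_marc4j } if keep_original && !has_accessor

record = DemoRecord.new
record.original_marc4j = :source_record  # now accepted by every DemoRecord
```

Because the accessor is added to the class itself, it is available on all records, not just those produced by one reader instance.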
Instance Attribute Details
#input_stream ⇒ Object (readonly)
Returns the value of attribute input_stream.
# File 'lib/traject/marc4j_reader.rb', line 48
def input_stream
  @input_stream
end
#settings ⇒ Object (readonly)
Returns the value of attribute settings.
# File 'lib/traject/marc4j_reader.rb', line 48
def settings
  @settings
end
Instance Method Details
#create_marc_reader! ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 83
def create_marc_reader!
  case input_type
  when "binary"
    permissive = settings["marc4j_reader.permissive"].to_s == "true"

    # #to_inputstream turns our ruby IO into a Java InputStream
    # third arg means 'convert to UTF-8, yes'
    MarcPermissiveStreamReader.new(input_stream.to_inputstream, permissive, true, settings["marc4j_reader.source_encoding"])
  when "xml"
    MarcXmlReader.new(input_stream.to_inputstream)
  else
    raise ArgumentError.new("Unrecognized marc_source.type: #{input_type}")
  end
end
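Two details of this method are easy to miss: the permissive flag is compared as a string, so both `true` and `"true"` enable it, and anything other than "binary" or "xml" is rejected. The sketch below isolates that logic. `permissive?` and `reader_name_for` are hypothetical helpers, the class names are returned as strings so no Java classes are needed, and the standard Ruby ArgumentError is used for the rejection.

```ruby
# Hypothetical helper mirroring the `.to_s == "true"` check: a boolean true
# and the string "true" both count; nil, false and "false" do not.
def permissive?(settings)
  settings["marc4j_reader.permissive"].to_s == "true"
end

# Hypothetical helper: names the Marc4J reader class chosen for each
# marc_source.type value (as strings, so this runs without JRuby).
def reader_name_for(type)
  case type
  when "binary" then "MarcPermissiveStreamReader"
  when "xml"    then "MarcXmlReader"
  else raise ArgumentError, "Unrecognized marc_source.type: #{type}"
  end
end

permissive?("marc4j_reader.permissive" => "true")  # string from a command line
permissive?("marc4j_reader.permissive" => true)    # boolean from a config file
```

The string comparison matters when a setting arrives as text rather than as a Ruby boolean.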
#each ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 98
def each
  while (internal_reader.hasNext)
    begin
      marc4j = internal_reader.next
      rubymarc = @converter.marc4j_to_rubymarc(marc4j)
      if @settings['marc4j_reader.keep_marc4j']
        rubymarc.original_marc4j = marc4j
      end
    rescue Exception => e
      msg = "MARC4JReader: Error reading MARC, fatal, re-raising"
      if marc4j
        msg += "\n 001 id: #{marc4j.getControlNumber}"
      end
      msg += "\n #{Traject::Util.(e)}"
      logger.fatal msg
      raise e
    end

    yield rubymarc
  end
end
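Because the class includes Enumerable, defining #each over the Java-style hasNext/next iterator is enough to make map, select, first and the rest available. A minimal sketch of that wrapping, runnable in plain Ruby: `FakeJavaReader` and `DemoReader` are hypothetical stand-ins, `upcase` stands in for the Marc4J-to-ruby-marc conversion, and the `enum_for` line is an addition that the original method does not have.

```ruby
# Stand-in for a Marc4J reader: exposes the Java-style hasNext/next pair.
class FakeJavaReader
  def initialize(items)
    @items = items.dup
  end

  def hasNext
    !@items.empty?
  end

  def next
    @items.shift
  end
end

# Hypothetical wrapper: #each plus Enumerable gives the full collection API.
class DemoReader
  include Enumerable

  def initialize(java_reader)
    @java_reader = java_reader
  end

  def each
    return enum_for(:each) unless block_given?  # addition, not in the original
    while @java_reader.hasNext
      yield @java_reader.next.upcase  # upcase stands in for record conversion
    end
  end
end

reader = DemoReader.new(FakeJavaReader.new(%w[a1 b2 c3]))
ids = reader.select { |id| id.start_with?("B") }
```

Like the real reader, this wrapper consumes its underlying stream, so a given instance can only be iterated once.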
#input_type ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 78
def input_type
  # maybe later add some guessing somehow
  settings["marc_source.type"]
end
#internal_reader ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 74
def internal_reader
  @internal_reader ||= create_marc_reader!
end
#logger ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 120
def logger
  @logger ||= (settings[:logger] || Yell.new(STDERR, :level => "gt.fatal")) # null logger
end
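The logger method combines two idioms: prefer a logger passed in through settings, otherwise fall back to one that stays silent, and memoize the result with `||=`. The following sketch shows the same shape using Ruby's standard Logger in place of Yell, which is a substitution made so the example has no gem dependencies; `DemoLoggerHolder` is a hypothetical class name.

```ruby
require "logger"

# Hypothetical holder class illustrating the fallback-and-memoize pattern.
class DemoLoggerHolder
  def initialize(settings)
    @settings = settings
  end

  # Use the caller's logger when one is supplied; otherwise build a logger
  # with no output device, which discards every message (a "null logger").
  def logger
    @logger ||= (@settings[:logger] || Logger.new(nil))
  end
end

quiet = DemoLoggerHolder.new({})
quiet.logger.fatal("discarded")  # produces no output
```

Memoizing means the fallback logger is built at most once per instance, however often `logger` is called.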