Class: Traject::Marc4JReader
- Inherits:
- Object
- Object
- Traject::Marc4JReader
- Includes:
- Enumerable
- Defined in:
- lib/traject/marc4j_reader.rb
Overview
Traject::Marc4JReader uses the Marc4J Java package to parse MARC records into standard ruby-marc MARC::Record objects. This reader may be faster than Traject::MarcReader, especially for XML.
Marc4JReader can read MARC ISO 2709 ("binary") or MARCXML. Binary input is read with the Marc4J MarcPermissiveStreamReader, which runs in permissive or non-permissive mode depending on settings. XML input is read with the Marc4J MarcXmlReader. The actual code for dealing with Marc4J is in the separate marc-marc4j gem.
See also the pure-Ruby Traject::MarcReader as an alternative. Use it if you need to read marc-in-json; if you don't need binary Marc8 support, it may also be faster in some cases.
Settings
marc_source.type: serialization type. default 'binary', also 'xml' (TODO: json/marc-in-json)
marc4j_reader.permissive: default true, false to turn off permissive reading. Used as value to 'permissive' arg of MarcPermissiveStreamReader constructor. Only used for 'binary'
marc_source.encoding: Only used for 'binary'; other types are always read as UTF-8. Must be one of the string values MarcPermissiveStreamReader accepts:
* BESTGUESS (default; it is not entirely clear what Marc4J does with this)
* ISO-8859-1 (also accepted: ISO8859_1)
* UTF-8
* MARC-8 (also accepted: MARC8)
The default is 'BESTGUESS', but setting an explicit value is HIGHLY recommended, because Marc4J's "BESTGUESS" can be unpredictable in a variety of ways. Whatever the source encoding, records are ALWAYS transcoded to UTF-8 on the way out. We insist.
marc4j_reader.jar_dir: Path to a directory containing the Marc4J jar file to use. All .jar files in the directory will be loaded. If unset, uses the marc4j.jar bundled with traject.
marc4j_reader.keep_marc4j: Keeps the original marc4j record accessible from the eventual ruby-marc record via record#original_marc4j. Intended for those who have legacy Java code that needs a marc4j object.
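As a rough illustration of the defaults listed above, the sketch below models settings as a plain Hash. `DOCUMENTED_DEFAULTS` and `with_documented_defaults` are hypothetical names invented for this example and are not part of Traject; the sketch only restates the documented defaults and does not show how Traject applies them internally.

```ruby
# Hypothetical helper, not part of Traject: fills in the defaults
# documented above for any setting the caller left out.
DOCUMENTED_DEFAULTS = {
  "marc_source.type"         => "binary",
  "marc4j_reader.permissive" => true,
  "marc_source.encoding"     => "BESTGUESS"
}.freeze

def with_documented_defaults(settings)
  # Caller-supplied values win over the documented defaults.
  DOCUMENTED_DEFAULTS.merge(settings)
end

# Setting the encoding explicitly, as the documentation recommends.
explicit = with_documented_defaults("marc_source.encoding" => "MARC-8")
```

Here `explicit` keeps the caller's "MARC-8" encoding while the type and permissive settings fall back to their documented defaults.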
Example
In a configuration file:
require 'traject/marc4j_reader'
settings do
provide "reader_class_name", "Traject::Marc4JReader"
#for MarcXML:
# provide "marc_source.type", "xml"
# Or instead for binary:
provide "marc4j_reader.permissive", true
provide "marc_source.encoding", "MARC8"
end
Instance Attribute Summary
-
#input_stream ⇒ Object
readonly
Returns the value of attribute input_stream.
-
#settings ⇒ Object
readonly
Returns the value of attribute settings.
Instance Method Summary
- #create_marc_reader! ⇒ Object
- #each ⇒ Object
-
#initialize(input_stream, settings) ⇒ Marc4JReader
constructor
A new instance of Marc4JReader.
- #input_type ⇒ Object
- #internal_reader ⇒ Object
- #logger ⇒ Object
- #specified_source_encoding ⇒ Object
Constructor Details
#initialize(input_stream, settings) ⇒ Marc4JReader
Returns a new instance of Marc4JReader.
# File 'lib/traject/marc4j_reader.rb', line 65
def initialize(input_stream, settings)
  @settings     = Traject::Indexer::Settings.new settings
  @input_stream = input_stream

  if @settings['marc4j_reader.keep_marc4j'] &&
     ! (MARC::Record.instance_methods.include?(:original_marc4j) &&
        MARC::Record.instance_methods.include?(:"original_marc4j="))
    MARC::Record.class_eval('attr_accessor :original_marc4j')
  end

  # Creating a converter will do the following:
  # - nothing, if it detects that the marc4j jar is already loaded
  # - load all the .jar files in settings['marc4j_reader.jar_dir'] if set
  # - load the marc4j jar file bundled with MARC::MARC4J otherwise
  @converter = MARC::MARC4J.new(:jardir => settings['marc4j_reader.jar_dir'], :logger => logger)

  # Convenience
  java_import org.marc4j.MarcPermissiveStreamReader
  java_import org.marc4j.MarcXmlReader
end
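The constructor adds an original_marc4j accessor to MARC::Record only when one is not already defined. A minimal sketch of that guard follows, using a stand-in class so it runs without ruby-marc or JRuby. `DemoRecord` and `ensure_original_marc4j_accessor` are hypothetical names for this example only.

```ruby
# Stand-in for MARC::Record so the sketch runs without ruby-marc.
class DemoRecord; end

# Hypothetical helper: define the accessor pair only if it is missing.
def ensure_original_marc4j_accessor(klass)
  existing = klass.instance_methods
  return if existing.include?(:original_marc4j) &&
            existing.include?(:"original_marc4j=")

  klass.class_eval { attr_accessor :original_marc4j }
end

ensure_original_marc4j_accessor(DemoRecord)
ensure_original_marc4j_accessor(DemoRecord) # second call is a no-op

record = DemoRecord.new
record.original_marc4j = :java_record_placeholder
```

The guard avoids redefining the accessor when several readers are constructed in one process, or when another library has already added it.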
Instance Attribute Details
#input_stream ⇒ Object (readonly)
Returns the value of attribute input_stream.
# File 'lib/traject/marc4j_reader.rb', line 63
def input_stream
  @input_stream
end
#settings ⇒ Object (readonly)
Returns the value of attribute settings.
# File 'lib/traject/marc4j_reader.rb', line 63
def settings
  @settings
end
Instance Method Details
#create_marc_reader! ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 112
def create_marc_reader!
  case input_type
  when "binary"
    permissive = settings["marc4j_reader.permissive"].to_s == "true"

    # #to_inputstream turns our ruby IO into a Java InputStream
    # third arg means 'convert to UTF-8, yes'
    MarcPermissiveStreamReader.new(input_stream.to_inputstream, permissive, true, specified_source_encoding)
  when "xml"
    MarcXmlReader.new(input_stream.to_inputstream)
  else
    raise ArgumentError.new("Unrecognized marc_source.type: #{input_type}")
  end
end
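The permissive flag is derived by comparing the setting's string form to "true", so both the boolean true and the string "true" enable permissive mode and every other value disables it. The sketch below isolates that one expression; `permissive?` is a hypothetical helper name, and settings are modelled as a plain Hash.

```ruby
# Hypothetical helper isolating the coercion used in create_marc_reader!.
def permissive?(settings)
  settings["marc4j_reader.permissive"].to_s == "true"
end

from_boolean = permissive?("marc4j_reader.permissive" => true)
from_string  = permissive?("marc4j_reader.permissive" => "true")
disabled     = permissive?("marc4j_reader.permissive" => "false")
# With a plain Hash, a missing key becomes nil.to_s == "", which is not "true".
missing      = permissive?({})
```

In this isolated sketch a missing key yields false. The documented default of true is presumably supplied by the settings object before this check runs; that is an assumption, not something shown in the listing above.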
#each ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 127
def each
  while (internal_reader.hasNext)
    begin
      marc4j = internal_reader.next
      rubymarc = @converter.marc4j_to_rubymarc(marc4j)
      if @settings['marc4j_reader.keep_marc4j']
        rubymarc.original_marc4j = marc4j
      end
    rescue Exception => e
      msg = "MARC4JReader: Error reading MARC, fatal, re-raising"
      if marc4j
        msg += "\n 001 id: #{marc4j.getControlNumber}"
      end
      msg += "\n #{Traject::Util.exception_to_log_message(e)}"
      logger.fatal msg
      raise e
    end

    yield rubymarc
  end
end
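Because the class includes Enumerable and defines #each, callers get map, select, first and similar methods without extra code. The sketch below shows the same pattern with a toy class backed by an Array instead of a Marc4J stream. `DemoReader` is a hypothetical name used only for illustration.

```ruby
# Toy reader: an Array stands in for the Marc4J record stream.
class DemoReader
  include Enumerable

  def initialize(records)
    @records = records
  end

  # Enumerable only needs #each; the other iteration methods build on it.
  def each
    @records.each { |record| yield record }
  end
end

reader    = DemoReader.new(%w[rec1 rec2 rec3])
upcased   = reader.map(&:upcase)
first_two = reader.first(2)
```

Unlike this Array-backed toy, the real reader memoizes a single Java stream reader, so once #each has consumed the stream, hasNext returns false and a second pass yields nothing.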
#input_type ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 93
def input_type
  # maybe later add some guessing somehow
  settings["marc_source.type"]
end
#internal_reader ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 89
def internal_reader
  @internal_reader ||= create_marc_reader!
end
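The `||=` idiom here builds the underlying reader lazily on first access and reuses it afterwards. A small sketch of that memoization follows; `LazyHolder` and `build_reader` are hypothetical stand-ins, with `build_reader` playing the role of create_marc_reader!.

```ruby
class LazyHolder
  attr_reader :build_count

  def initialize
    @build_count = 0
  end

  # Built on first call, then cached in the instance variable.
  def internal_reader
    @internal_reader ||= build_reader
  end

  private

  # Stand-in for create_marc_reader!; counts how often it runs.
  def build_reader
    @build_count += 1
    Object.new
  end
end

holder = LazyHolder.new
first  = holder.internal_reader
second = holder.internal_reader
```

Both calls return the same object and the builder runs once, which is why the real reader opens the input stream only when iteration actually starts.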
#logger ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 149
def logger
  @logger ||= (settings[:logger] || Yell.new(STDERR, :level => "gt.fatal")) # null logger
end
#specified_source_encoding ⇒ Object
# File 'lib/traject/marc4j_reader.rb', line 98
def specified_source_encoding
  #settings["marc4j_reader.source_encoding"]
  enc = settings["marc_source.encoding"]

  # one is standard for ruby and we want to support,
  # the other is used by Marc4J and we have to pass it to Marc4J
  enc = "ISO8859_1" if enc == "ISO-8859-1"

  # default
  enc = "BESTGUESS" if enc.nil? || enc.empty?

  return enc
end
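The normalization above can be tried in isolation. The sketch below copies the two rules into a standalone function that takes the raw setting value directly; `normalize_source_encoding` is a hypothetical name, not a Traject method.

```ruby
# Hypothetical standalone version of the rules in specified_source_encoding.
def normalize_source_encoding(enc)
  # Ruby-style name is mapped to the spelling Marc4J expects.
  enc = "ISO8859_1" if enc == "ISO-8859-1"
  # Unset or blank falls back to Marc4J's guessing mode.
  enc = "BESTGUESS" if enc.nil? || enc.empty?
  enc
end

unset   = normalize_source_encoding(nil)
blank   = normalize_source_encoding("")
latin1  = normalize_source_encoding("ISO-8859-1")
marc8   = normalize_source_encoding("MARC-8")
```

Only the ISO-8859-1 spelling is rewritten; other values such as "MARC-8" are passed through unchanged for Marc4J to interpret.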