# Traject settings
Traject settings are a flat list of key/value pairs -- a single Hash, not nested. Keys are always strings, and dots (".") can be used for grouping and namespacing.
Values are usually strings, but occasionally something else. String values can be easily set via the command line.
Settings can be set in configuration files, usually like:

    settings do
      provide "key", "value"
    end

or on the command line: `-s key=value`. There are also some command-line shortcuts
for commonly used settings; see `traject -h`.
`provide` will only set the key if it was previously unset, so the first value set 'wins'. Command-line
settings are applied first of all. It's recommended you use `provide`.

`store` is also available; it forces setting of the new value, overriding any previously set value.
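The difference matters when the same key appears more than once. A minimal sketch (the particular keys are just illustrative, taken from the settings described below):

    settings do
      provide "solr.url", "http://example.org:8983/solr"   # set only if not already set (eg on the command line)
      provide "solr.url", "http://other.example.org/solr"  # ignored: the key already has a value
      store   "log.level", "info"
      store   "log.level", "debug"                         # overrides the previous value
    end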
## Known settings

### Reading (general)
* `reader_class_name`: a Traject Reader class, used by the indexer as a source of records. The default is format-specific: Traject::MarcReader (using the ruby marc gem) or Traject::NokogiriReader. Command-line shortcut `-r`.
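For instance, a configuration file can pin the reader explicitly (a minimal sketch; Traject::MarcReader is the MARC reader named above):

    settings do
      provide "reader_class_name", "Traject::MarcReader"
    end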
### Error handling
* `mapping_rescue`: Takes a proc/lambda/callable that accepts two arguments: a Traject::Context and an exception. Called if an unexpected error is raised when executing indexing rules. The default, when this is unset, is to log and re-raise, which will halt execution; such an error usually means a bug in your mapping code that you will want to know about. See the default logic at Traject::Indexer#default_mapping_rescue.

You may instead want to skip the record and continue with indexing, or even conditionally
decide which to do. In a custom handler, if you want to halt execution, re-raise the
exception (or raise another). If you want to skip the record and continue, call `context.skip!`
and do not raise.
The "stabby lambda" syntax is useful for providing a lambda object with proper parsing precedence to not need parentheses.
    error_count = Concurrent::AtomicFixnum.new(0)

    settings do
      provide "mapping_rescue", -> (context, exception) {
        total_errors = error_count.increment
        context.logger.error "Encountered exception: #{exception}, total errors #{total_errors}"

        if my_should_skip?(context, exception)  # my_should_skip? stands in for your own logic
          context.skip!
        else
          raise exception
        end
      }
    end
At present `mapping_rescue` only handles exceptions raised while running mapping/indexing logic; unexpected raises in readers or writers may not be caught here.
### Threads
* `processing_thread_pool`: Number of threads in the main thread pool used for processing records with input rules. On JRuby or Rubinius, defaults to 1 less than the number of processors detected on your machine. On other ruby platforms, defaults to 1.

  NOTE: If your processing code isn't thread-safe, set this to 0 or nil to disable the thread pool and do all processing in the main thread, as in the sketch below.

  Choose a pool size based on the size of your machine and the complexity of your indexing rules. You might want to try different sizes and measure which works best for you. There is probably no reason for it ever to be more than the number of cores on the indexing machine.
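For example, to disable multi-threaded processing when your indexing rules aren't thread-safe (a minimal sketch):

    settings do
      # disable the processing thread pool; all records are mapped in the main thread
      provide "processing_thread_pool", 0
    end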
### Writing (general)
* `writer`: An object that implements the Traject Writer interface. If set, takes precedence over `writer_class_name`.

* `writer_class_name`: a Traject Writer class, used by the indexer to send processed dictionaries off. Will be used if no explicit `writer` setting or `#writer=` is set. Default Traject::SolrJsonWriter; other writers for debugging or writing to files are also available. See Traject::Indexer for more info. Command-line shortcut `-w`.

* `output_file`: Output file to write to for operations that write to files: for instance the `marcout` command, or Writer classes that write to files, like Traject::JsonWriter. Has a command-line shortcut `-o`.
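For example, to write JSON-serialized output to a file instead of sending documents to Solr (a minimal sketch using the Traject::JsonWriter and output_file settings described above; the filename is a placeholder):

    settings do
      provide "writer_class_name", "Traject::JsonWriter"
      provide "output_file", "indexed_records.json"
    end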
### Writing to solr
* `json_writer.pretty_print`: used by the JsonWriter; if set to true, will output pretty-printed json (with added whitespace) for easier human readability. Default false.

* `solr.url`: URL to connect to a solr instance for indexing, eg http://example.org:8983/solr . Command-line shortcut `-u`.

* `solr.version`: Set to eg "1.4.0", "4.3.0"; currently un-used, but in the future will control some default settings, and/or sanity check and warn you if you're doing something that might not work with that version of solr. Set now for help in the future.

* `solr_writer.batch_size`: size of the batches in which SolrJsonWriter will send docs to Solr. Default 100. Set to nil, 0, or 1, and SolrJsonWriter will do one http transaction per document, with no batching.

* `solr_writer.commit_on_close`: default false; set to true to have the solr writer send an explicit commit message to Solr after indexing.

* `solr_writer.thread_pool`: defaults to 1 (a single background thread). A thread pool is used for submitting docs to solr. Set to 0 or nil to disable threading; set to 1 and there will still be a single background thread doing the adds. It may make sense to set this higher than the number of cores on your indexing machine, as these threads will mostly be waiting on Solr; the speed/capacity of your solr might be more relevant. Note that processing_thread_pool threads can end up submitting to solr too, if the solr_writer.thread_pool is full.

* `solr_writer.basic_auth_user`, `solr_writer.basic_auth_password`: Not set by default, but when both are set the default writer is configured with basic auth.
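A typical Solr-bound configuration might combine several of these settings (a sketch; the URL and credentials are placeholders):

    settings do
      provide "solr.url", "http://localhost:8983/solr/my_core"   # placeholder URL
      provide "solr_writer.batch_size", 100
      provide "solr_writer.commit_on_close", "true"
      provide "solr_writer.basic_auth_user", "indexer"            # placeholder credentials
      provide "solr_writer.basic_auth_password", "secret"
    end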
### Dealing with MARC data
* `marc_source.type`: default 'binary'. Can also be set to 'xml' or (not yet implemented, todo) 'json'. Command-line shortcut `-t`.

* `marcout.allow_oversized`: Used with the `-x marcout` command when outputting MARC as ISO 2709 binary. Set to true or the string "true", and the MARC::Writer will have allow_oversized=true set, allowing oversized records to be serialized with their length bytes zeroed out -- technically illegal, but readable by MARC::Reader in permissive mode.
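For instance, a configuration that reads MARC-XML input and tolerates oversized records on binary output might look like this (a minimal sketch using the two settings above):

    settings do
      provide "marc_source.type", "xml"            # input records are MARC-XML rather than binary
      provide "marcout.allow_oversized", "true"    # allow oversized records when serializing ISO 2709
    end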
### Logging and progress
* `debug_ascii_progress`: true/'true' to print ascii characters to STDERR indicating progress. Yes, this is fixed to STDERR, regardless of your logging setup.
  * `.` for every batch of records read and parsed
  * `^` for every batch of records batched and queued for adding to solr (possibly in a thread pool)
  * `%` for completion of a Solr 'add'
  * `!` when the thread pool for solr adds has a full queue, so the solr add is going to happen in the calling thread -- meaning solr adding can't keep up with production.
* `log.file`: filename to send logging to, or 'STDOUT' or 'STDERR' for those streams. Default STDERR.

* `log.error_file`: Default nil; if set, then all log lines of ERROR and higher will additionally be sent to the error file named.

* `log.format`: Formatting string used by the Yell logger. https://github.com/rudionrails/yell/wiki/101-formatting-log-messages

* `log.level`: Log this level and above. Default 'info'; set to eg 'debug' to get potentially more logging info, or 'error' to get less. https://github.com/rudionrails/yell/wiki/101-setting-the-log-level

* `log.batch_size`: If set to a number N (or its string representation), will output a progress line to the log every N records (by default as INFO, but see log.batch_size.severity).

* `log.batch_size.severity`: If `log.batch_size` is set, what logger severity level to log at. Default "INFO"; set to "DEBUG" etc. if desired.

* `logger`: Ignore all the other logger settings, and just pass a Logger-compatible logger instance in directly.
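For example, a configuration that logs to a file at debug level and reports progress periodically might look like this (a minimal sketch; the filename and batch size are placeholders):

    settings do
      provide "log.file", "traject.log"     # placeholder filename
      provide "log.level", "debug"
      provide "log.batch_size", 10_000      # emit a progress line every 10,000 records
    end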