Harvestdor::Indexer

A gem to harvest metadata and data from DOR, plus skeleton code to index that content and write it to Solr.

Installation

Add this line to your application's Gemfile:

gem 'harvestdor-indexer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install harvestdor-indexer

Usage

You must override the index method and provide configuration options. Writing a script to run the indexer is also recommended; an example appears under "Run it".

Configuration / Set up

Create a yml config file for your collection going to a Solr index.

See spec/config/ap.yml for an example. You will want to copy that file and change the following settings:

  • whitelist
  • dor fetcher service_url
  • solr url
  • harvestdor log_dir, log_name

Whitelist

The whitelist is how you specify which objects to index. The whitelist can be

  • an Array of druids inline in the config yml file
  • a filename containing a list of druids (one per line)
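
As a rough illustration of the two forms, the sketch below resolves a whitelist setting into an Array of druids. The resolve_whitelist helper is hypothetical and is not part of the gem, and the druids are made up; only the whitelist key and the two accepted forms come from the description above.

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical helper (NOT part of harvestdor-indexer): turn a whitelist
# setting into an Array of druids, whether it is an inline Array or a
# filename pointing at a list of druids.
def resolve_whitelist(value)
  case value
  when Array
    value.map(&:to_s)
  when String
    # one druid per line; blank lines are skipped
    File.readlines(value).map(&:strip).reject(&:empty?)
  else
    []
  end
end

# Form 1: an Array of druids inline in the config yml (made-up druids)
inline_config = YAML.load("whitelist:\n  - oo000oo0001\n  - oo000oo0002\n")
inline_druids = resolve_whitelist(inline_config['whitelist'])

# Form 2: a filename containing one druid per line
file = Tempfile.new('whitelist')
file.write("oo000oo0003\n\noo000oo0004\n")
file.close
file_config = YAML.load("whitelist: #{file.path}\n")
file_druids = resolve_whitelist(file_config['whitelist'])
```

Both forms end up as the same kind of value, so the rest of a harvest could treat them identically.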

Each druid is checked against the object's identityMetadata (from its purl page). If the druid is for a

  • collection record: then we process all the item druids in that collection (as if they were included individually in the whitelist)
  • non-collection record: then we process the druid as an individual item
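
The expansion rule above can be pictured with a small sketch. The Hash below is only a stand-in, an assumption made for illustration, for whatever lookup the gem performs against identityMetadata and the DOR Fetcher service; expand_whitelist is a hypothetical name and the druids are made up.

```ruby
# Stand-in for the real collection lookup: maps a collection druid to the
# item druids it contains. In the gem this information comes from DOR,
# not from a Hash.
COLLECTION_MEMBERS = {
  'oo000oo0010' => ['oo000oo0011', 'oo000oo0012']
}.freeze

# Hypothetical helper: replace each collection druid with its item druids,
# and keep non-collection druids as individual items.
def expand_whitelist(druids, collection_members)
  druids.flat_map do |druid|
    if collection_members.key?(druid)
      collection_members[druid] # collection record: process its items
    else
      [druid]                   # non-collection record: process as an item
    end
  end
end

expanded = expand_whitelist(['oo000oo0010', 'oo000oo0003'], COLLECTION_MEMBERS)
```

Here the collection druid is replaced by its two items, as if they had been listed individually, while the other druid passes through unchanged.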

Override the Harvestdor::Indexer.index method

In your code, override this method from the Harvestdor::Indexer class

# create Solr doc for the druid and add it to Solr
#  NOTE: don't forget to send commit to Solr, either once at end (already in harvest_and_index), or for each add, or ...
def index resource

  benchmark "Indexing #{resource.druid}" do
    logger.debug "About to index #{resource.druid}"
    doc_hash = {}
    doc_hash[:id] = resource.druid

    # you might add things from Indexer level class here
    #  (e.g. things that are the same across all documents in the harvest)
    solr.add doc_hash
    # TODO: provide call to code to update DOR object's workflow datastream??
  end
end

Run it

First run bundle install, if you have not already.

You may want to write a script to run the code. Your script might look like this:

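#!/usr/bin/env ruby
# make the project root and lib directory loadable
$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..'))
$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))

require 'rubygems'
begin
  require 'your_indexer'
rescue LoadError
  # fall back to Bundler's load path if the first require fails
  require 'bundler/setup'
  require 'your_indexer'
end

# the last command-line argument is the collection config file
config_yml_path = ARGV.pop
if config_yml_path.nil?
  puts "** You must provide the full path to a collection config yml file **"
  exit
end

# opts: any additional options for the indexer; left empty here
opts = {}
indexer = Harvestdor::Indexer.new(config_yml_path, opts)
indexer.harvest_and_index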

Then you run the script like so:

 $ ./bin/indexer config/(your coll).yml

Run the script from a deployed instance, since that box is already set up to talk to the DOR Fetcher service and to the SUL Solr indexes.

Contributing