Class: Traject::OaiPmhNokogiriReader
- Inherits:
- Object
- Includes:
- Enumerable
- Defined in:
- lib/traject/oai_pmh_nokogiri_reader.rb
Overview
Reads an OAI-PMH feed over HTTP and feeds it directly into a traject pipeline. You don't have to use this reader to process OAI-PMH: you can instead fetch the OAI-PMH responses and store them on disk yourself, then process them as ordinary XML.
Example command line:
traject -i xml -r Traject::OaiPmhNokogiriReader -s oai_pmh.start_url="http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc" -c your_config.rb
Settings
- oai_pmh.start_url: Required, e.g. "http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc"
- oai_pmh.timeout: (default 10). Timeout for http.rb, in seconds.
- oai_pmh.try_gzip: (default true). Ask the server for a gzip response if available.
- oai_pmh.http_persistent: (default true). Use persistent HTTP connections.
JRUBY NOTES:
- Does not work with jruby 9.2 until http.rb does: https://github.com/httprb/http/issues/475
- The JRuby version definitely reads the whole HTTP response into memory before parsing; the MRI version may or may not do the same.
TO DO
This would be a lot more useful with some sort of built-in HTTP caching.
Instance Attribute Summary
- #input_stream ⇒ Object (readonly): Returns the value of attribute input_stream.
- #settings ⇒ Object (readonly): Returns the value of attribute settings.
Instance Method Summary
- #each ⇒ Object
- #extra_xpath_hooks ⇒ Object
- #initialize(input_stream, settings) ⇒ OaiPmhNokogiriReader (constructor): A new instance of OaiPmhNokogiriReader.
- #logger ⇒ Object
- #resumption_url(resumption_token) ⇒ Object
- #start_url ⇒ Object
- #start_url_verb ⇒ Object
- #timeout ⇒ Object
Constructor Details
#initialize(input_stream, settings) ⇒ OaiPmhNokogiriReader
Returns a new instance of OaiPmhNokogiriReader.
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 33
    def initialize(input_stream, settings)
      namespaces = (settings["nokogiri.namespaces"] || {}).merge(
        "oai" => "http://www.openarchives.org/OAI/2.0/"
      )

      @settings = Traject::Indexer::Settings.new(
        "nokogiri_reader.extra_xpath_hooks" => extra_xpath_hooks,
        "nokogiri.each_record_xpath" => "/oai:OAI-PMH/oai:ListRecords/oai:record",
        "nokogiri.namespaces" => namespaces
      ).with_defaults(
        "oai_pmh.timeout" => 10,
        "oai_pmh.try_gzip" => true,
        "oai_pmh.http_persistent" => true
      ).fill_in_defaults!.merge(settings)

      @input_stream = input_stream
    end
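The constructor layers three sources of settings. As a rough illustration only, the sketch below mirrors that layering with plain Ruby hashes instead of Traject::Indexer::Settings. It assumes the final merge behaves like Hash#merge, and the names OAI_DEFAULTS and effective_settings are hypothetical, not part of traject.

```ruby
# Plain-Hash analogy for the layering in #initialize; this is NOT traject's
# Settings class, only an assumption-based illustration of precedence.
OAI_DEFAULTS = {
  "oai_pmh.timeout"         => 10,
  "oai_pmh.try_gzip"        => true,
  "oai_pmh.http_persistent" => true
}.freeze

# Hypothetical helper: defaults first, then values the reader sets itself,
# then the caller's settings, which are merged last.
def effective_settings(reader_values, caller_settings)
  OAI_DEFAULTS.merge(reader_values).merge(caller_settings)
end

result = effective_settings(
  { "nokogiri.each_record_xpath" => "/oai:OAI-PMH/oai:ListRecords/oai:record" },
  { "oai_pmh.start_url" => "http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc",
    "oai_pmh.timeout"   => 30 }
)

puts result["oai_pmh.timeout"]   # the caller's value
puts result["oai_pmh.try_gzip"]  # the default, since the caller did not set it
```

Under this assumption, a caller only needs to supply oai_pmh.start_url; the other oai_pmh.* keys fall back to their defaults unless overridden.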
Instance Attribute Details
#input_stream ⇒ Object (readonly)
Returns the value of attribute input_stream.
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 31
    def input_stream
      @input_stream
    end
#settings ⇒ Object (readonly)
Returns the value of attribute settings.
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 31
    def settings
      @settings
    end
Instance Method Details
#each ⇒ Object
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 72
    def each
      url = start_url

      resumption_token = nil
      last_resumption_token = nil
      pages_fetched = 0

      until url == nil
        resumption_token = read_and_parse_response(url) do |record|
          yield record
        end
        url = resumption_url(resumption_token)
        (last_resumption_token = resumption_token) if resumption_token
        pages_fetched += 1
      end

      logger.info("#{self.class.name}: fetched #{pages_fetched} pages; last resumptionToken found: #{last_resumption_token.inspect}")
    end
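The loop above keeps requesting pages until a response yields no resumption token. The following is a minimal sketch of that control flow with the HTTP fetch and XML parsing replaced by an in-memory lookup. FAKE_PAGES and each_record are hypothetical names used for illustration; they are not part of the reader.

```ruby
# Sketch of resumptionToken paging with a stubbed fetcher; no HTTP is performed.

# Hypothetical in-memory "server": each URL maps to its records and the
# resumption token for the next page (nil when the list is complete).
FAKE_PAGES = {
  "http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc" =>
    { records: %w[rec1 rec2], token: "page2" },
  "http://example.com/oai?verb=ListRecords&resumptionToken=page2" =>
    { records: %w[rec3], token: nil }
}.freeze

# Yields every record across all pages and returns the number of pages fetched.
def each_record(start_url, pages)
  url = start_url
  pages_fetched = 0
  until url.nil?
    page = pages.fetch(url)
    page[:records].each { |record| yield record }
    pages_fetched += 1
    token = page[:token]
    # A missing token ends the loop; otherwise build the next page's URL.
    url = token && "http://example.com/oai?verb=ListRecords&resumptionToken=#{token}"
  end
  pages_fetched
end

seen = []
count = each_record(FAKE_PAGES.keys.first, FAKE_PAGES) { |record| seen << record }
puts "#{count} pages, records: #{seen.join(', ')}"
```

Because records are yielded as each page arrives, a consumer can start processing before the whole feed has been fetched.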
#extra_xpath_hooks ⇒ Object
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 60
    def extra_xpath_hooks
      @extra_xpath_hooks ||= {
        "//oai:resumptionToken" =>
          lambda do |doc, clipboard|
            token = doc.text
            if token && token != ""
              clipboard[:resumption_token] = token
            end
          end
      }
    end
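The hook records a token only when the resumptionToken element has non-empty text. That matters because OAI-PMH marks the last page of a list with an empty resumptionToken element. The sketch below exercises the same check against a stand-in object. FakeNode, TOKEN_HOOK and token_from are hypothetical names, and FakeNode only mimics the #text call the hook makes; it is not a Nokogiri node.

```ruby
# Sketch: exercising a resumptionToken hook with a stand-in node object.
# FakeNode is hypothetical; it only responds to #text, like the hook expects.
FakeNode = Struct.new(:text)

# Same check as the hook above: store the token only if it is non-empty.
TOKEN_HOOK = lambda do |node, clipboard|
  token = node.text
  clipboard[:resumption_token] = token if token && token != ""
end

# Hypothetical helper: run the hook against a fresh clipboard and return
# whatever token it recorded (nil if none).
def token_from(text)
  clipboard = {}
  TOKEN_HOOK.call(FakeNode.new(text), clipboard)
  clipboard[:resumption_token]
end

p token_from("abc123")  # a token is recorded
p token_from("")        # empty element: nothing is recorded
```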
#logger ⇒ Object
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 105
    def logger
      @logger ||= (@settings[:logger] || Yell.new(STDERR, :level => "gt.fatal")) # null logger)
    end
#resumption_url(resumption_token) ⇒ Object
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 91
    def resumption_url(resumption_token)
      return nil if resumption_token.nil? || resumption_token == ""

      # resumption URL is just original verb with resumption token, that seems to be
      # the oai-pmh spec.
      parsed_uri = URI.parse(start_url)
      parsed_uri.query = "verb=#{CGI.escape start_url_verb}&resumptionToken=#{CGI.escape resumption_token}"
      parsed_uri.to_s
    end
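To make the URL rewriting concrete, here is a standalone sketch of the same idea: keep the original verb, drop every other query parameter, and add the escaped token. It uses the form helpers in Ruby's URI module in place of the CGI calls in the method above, and the names verb_of and next_page_url are hypothetical.

```ruby
require "uri"

# Sketch of rebuilding a request URL for the next page of results.
# Uses URI's form helpers rather than the CGI calls the reader itself uses.

# Returns the value of the "verb" query parameter, or nil if absent.
def verb_of(url)
  query = URI.parse(url).query
  return nil if query.nil?
  pair = URI.decode_www_form(query).assoc("verb")
  pair && pair.last
end

# Returns the next-page URL, or nil when there is no token to resume with.
def next_page_url(start_url, token)
  return nil if token.nil? || token.empty?
  uri = URI.parse(start_url)
  # Replace the whole query: only the verb and the escaped token remain.
  uri.query = URI.encode_www_form("verb" => verb_of(start_url), "resumptionToken" => token)
  uri.to_s
end

puts next_page_url("http://example.com/oai?verb=ListRecords&metadataPrefix=oai_dc", "abc/123")
```

Note that metadataPrefix is dropped from the follow-up request: in OAI-PMH, resumptionToken is an exclusive argument, so it is sent with the verb alone.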
#start_url ⇒ Object
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 52
    def start_url
      settings["oai_pmh.start_url"] or raise ArgumentError.new("#{self.class.name} needs a setting 'oai_pmh.start_url'")
    end
#start_url_verb ⇒ Object
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 56
    def start_url_verb
      @start_url_verb ||= (array = CGI.parse(URI.parse(start_url).query)["verb"]) && array.first
    end
#timeout ⇒ Object
    # File 'lib/traject/oai_pmh_nokogiri_reader.rb', line 101
    def timeout
      settings["oai_pmh.timeout"]
    end