Solr::CursorStream
"Stream" results from Solr with cursor-based fetching, exposing the stream as a normal Ruby enumerator.
Note that this is different from true streaming of results via, e.g., the default /export handler. Those queries can involve more complex processing, but don't fit all use cases:
- /export can't use relevancy ranking as a sort; Solr::CursorStream can.
- /export requires all fields to be docValues; Solr::CursorStream doesn't.
Cursor-based streaming allows, with some restrictions, downloading large sets of data without the "deep paging" out-of-memory problems associated with just using the start and rows parameters.
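To make the mechanism concrete, here is a minimal sketch of a cursor loop. It is not this gem's actual implementation. It relies on Solr's cursor protocol: the first request sends a cursorMark of *, every response carries a nextCursorMark, and the stream is done when Solr returns the same mark that was sent. The names each_cursor_doc and fetch_page are hypothetical; fetch_page stands in for the HTTP request so the sketch runs without a live Solr.

```ruby
# Hypothetical helper, not part of Solr::CursorStream: walk a Solr cursor
# and expose the documents as a plain Ruby Enumerator.
#
# fetch_page is any callable that takes a cursor mark and returns a parsed
# Solr response hash:
#   {"response" => {"docs" => [...]}, "nextCursorMark" => "..."}
def each_cursor_doc(fetch_page)
  Enumerator.new do |yielder|
    mark = "*" # Solr's starting cursor mark
    loop do
      page = fetch_page.call(mark)
      page["response"]["docs"].each { |doc| yielder << doc }
      next_mark = page["nextCursorMark"]
      break if next_mark == mark # same mark back means no more results
      mark = next_mark
    end
  end
end

# Stand-in for Solr: three pages keyed by the cursor mark that requests them.
fake_pages = {
  "*"  => {"response" => {"docs" => [{"id" => "1"}, {"id" => "2"}]}, "nextCursorMark" => "m1"},
  "m1" => {"response" => {"docs" => [{"id" => "3"}]}, "nextCursorMark" => "m2"},
  "m2" => {"response" => {"docs" => []}, "nextCursorMark" => "m2"}
}

docs = each_cursor_doc(->(mark) { fake_pages.fetch(mark) })
ids = docs.map { |doc| doc["id"] }
```

Because pages are fetched only as the enumerator is consumed, something like docs.first(2) stops after the first page instead of walking the whole result set.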
The only real restriction is that the sort specification MUST include the uniqueKey field. If you're just downloading a whole dataset and don't care about order, the default query of *:* and the default sort of id asc will be fine (assuming your uniqueKey is id).
If you want to sort by another field/value, just include the uniqueKey in a secondary sort (e.g., sort: "score desc, id asc") to guarantee a stable sort.
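One way to avoid forgetting the tie-breaker is to normalize the sort string before passing it along. The sketch below is an illustration only: ensure_unique_key_sort is a hypothetical helper, not something this gem provides, and it assumes simple "field direction" clauses separated by commas (function sorts that contain commas would need a real parser).

```ruby
# Hypothetical helper: append the uniqueKey as a final tie-breaker unless
# one of the sort clauses already sorts on it.
# Assumes plain "field dir" clauses separated by commas.
def ensure_unique_key_sort(sort, unique_key: "id")
  clauses = sort.split(",").map(&:strip).reject(&:empty?)
  has_key = clauses.any? { |clause| clause.split(/\s+/).first == unique_key }
  clauses << "#{unique_key} asc" unless has_key
  clauses.join(", ")
end

ensure_unique_key_sort("score desc")        # => "score desc, id asc"
ensure_unique_key_sort("year asc, id desc") # => "year asc, id desc"
```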
NOTE that if you don't need the score (relevancy) field, use the default query parameter of *:* so Solr doesn't have to work as hard. Just put your restrictions in the filters array.
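For illustration, this is roughly what a cursor request built that way looks like as a query string. The parameter names (q, fq, sort, rows, cursorMark) are standard Solr request parameters; how this gem assembles them internally is not shown in this README, so treat the hash below as a sketch under that assumption rather than its actual code.

```ruby
require "uri"

# Restrictions go in fq (one parameter per filter); q stays at *:* so Solr
# has less work to do.
params = {
  "q" => "*:*",
  "fq" => ["year:[* TO 1900]", 'title:"Civil War"'],
  "sort" => "id asc",  # must include the uniqueKey
  "rows" => 100,
  "cursorMark" => "*"  # starting mark for a new cursor
}

# encode_www_form repeats the key for array values: fq=...&fq=...
query_string = URI.encode_www_form(params)
```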
Usage
require 'solr/cursorstream'
core_url = "http://my.solr.com:8025/solr/mycore/"
# Get everything in the solr core, no restrictions
cs = Solr::CursorStream.new(url: core_url)
cs.each {|doc| ... }
# Filter for newer stuff
# Note that you need to lucene-escape any q/fq values on your own, since
# otherwise we'd need a full solr syntax parser to determine which
# bits to escape.
cs = Solr::CursorStream.new(url: core_url, filters: ['year:{2010 TO *}'])
# Find everything with the phrase "Civil War" in the title and
# pre-20th century, ordered by year
cs = Solr::CursorStream.new(url: core_url) do |s|
  s.filters = ['year:[* TO 1900]', 'title:"Civil War"']
  s.sort = 'year asc, id asc' # need to include the uniqueKey field (id)!
end
# #each yields a solr document hash until it runs out
cs.each {|doc| ... }
# The underlying Faraday http connection is available if you need
# to mess with it directly
cs.connection.set_basic_auth(user, password)
# There are a _lot_ of possible arguments to `new`. It may be easier
# to specify values in a block
cs = Solr::CursorStream.new(url: core_url) do |s|
  s.batch_size = 100
  s.fields = %w[id title author year]
  s.filters = ["year:[* TO 1900]"]
  s.query = "title:(Civil War)"
  s.sort = 'score desc, id asc'
end
# Get the first 10_000 results from a query
cs.each_with_index do |doc, i|
  break if i >= 10_000
  do_something_with_the_solr_doc(doc)
end
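As the comments above note, escaping q/fq values is left to the caller. Something along these lines may serve in the meantime. This is a hypothetical helper, not part of the gem; it backslash-escapes the characters the Lucene query parser treats as syntax, and is meant for a single value, not a whole query string. It does not escape whitespace, so multi-word values still need to be quoted as a phrase.

```ruby
# Hypothetical helper (not provided by this gem): backslash-escape the
# characters the Lucene query parser treats as syntax.
# Apply it to one value at a time, never to a whole query, or the field
# separators and range brackets you intended would be escaped too.
def lucene_escape(value)
  value.to_s.gsub(/[+\-&|!(){}\[\]^"~*?:\\\/]/) { |char| "\\#{char}" }
end

# An identifier full of query syntax characters becomes safe to embed.
escaped_id = lucene_escape("ark:/13960/t0000")
filter = "id:#{escaped_id}"
```

The resulting string can then go into the filters array like any other fq value.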
TODO
- [ ] Add a lucene_escape utility function
- [ ] Change q/fq to take either a string (as current) or a field => value hash
- [ ] Actual error handling for retries, or at least passing useful information along
- [ ] Figure out how to test without a live solr to bounce off of. Maybe use vcr or similar?
Installation
Add this line to your application's Gemfile:
gem 'solr_cursorstream'
# require with require "solr/cursorstream"
And then execute:
$ bundle install
Or install it yourself as:
$ gem install solr_cursorstream
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/mlibrary/solr_cursorstream.
License
The gem is available as open source under the terms of the MIT License.