SimpleSearch - Simple vector space search library
What is SimpleSearch?
SimpleSearch is a simple vector space text search engine.
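For readers new to the term, the following is a minimal sketch of what a vector space search does in general: each document and the query become term-frequency vectors, and documents are ranked by cosine similarity to the query. It is not SimpleSearch's own code. The names term_vector, cosine_similarity and rank are hypothetical, and the sketch assumes a current Ruby (2.4 or later) rather than Ruby 1.8.

```ruby
# Illustrative only: a generic vector space ranking, not SimpleSearch's implementation.

# Turn text into a sparse term-frequency vector: { "word" => count }.
def term_vector(text)
  text.downcase.scan(/[a-z0-9]+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# Cosine similarity between two sparse vectors (0.0 when either is empty).
def cosine_similarity(a, b)
  dot = a.sum { |term, weight| weight * b.fetch(term, 0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Rank documents (name => text) against a query, best match first,
# dropping documents that share no terms with the query.
def rank(documents, query)
  query_vector = term_vector(query)
  documents
    .map { |name, text| [name, cosine_similarity(term_vector(text), query_vector)] }
    .reject { |_, score| score.zero? }
    .sort_by { |_, score| -score }
end

docs = {
  "a.txt" => "ruby markup language notes",
  "b.txt" => "markup markup markup",
  "c.txt" => "completely unrelated text"
}
rank(docs, "markup").each { |name, score| puts format("%.3f:%s", score, name) }
```

A document made up only of the query term ranks above one that merely mentions it, and a document sharing no terms with the query is left out.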
Installation
Prerequisites
* Ruby 1.8 (http://www.ruby-lang.org/)
Optional
* RubyGems (http://rubygems.rubyforge.org)
Installing SimpleSearch
RubyGems (rubygems.rubyforge.org):
gem install SimpleSearch
…or…
.tar.gz installation:
ruby setup.rb #not yet available
Using SimpleSearch
SimpleSearch comes with a command line program that was primarily written as an example of how to use the API but might actually be useful.
To run the command line program, simply type:
  $ search-simple --help
An example:
  $ search-simple --cache=/tmp/mycache --dir=/usr/local/lib/ruby/gems/1.8/doc --extensions=html markup
This will cause search-simple to (re)index all of the files with a .html extension in your RubyGems rdoc directory and then search them for the words “markup” and “html”. The search indices will be stored in /tmp/mycache.
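The file selection described above (every file under a directory with a given extension) could be sketched roughly as follows. This is an assumption about the general approach, not the code search-simple actually uses. The helper name files_to_index is hypothetical and is not part of SimpleSearch.

```ruby
require "find"

# Hypothetical helper, not part of SimpleSearch: returns the regular files
# under dir whose extension appears in extensions (given without the dot).
def files_to_index(dir, extensions)
  wanted = extensions.map { |ext| ".#{ext.downcase}" }
  found = []
  Find.find(dir) do |path|
    next unless File.file?(path)  # skip directories and other non-files
    found << path if wanted.include?(File.extname(path).downcase)
  end
  found.sort
end

# Example call, mirroring the --dir and --extensions options:
#   files_to_index("/usr/local/lib/ruby/gems/1.8/doc", ["html"])
```

The extension comparison is case-insensitive here, so index.HTML would also be picked up. Whether search-simple behaves the same way is not stated in this document.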
At the heart of SimpleSearch is, of course, an API that can be embedded in other programs. The code of SimpleSearch was originally created by Dave Thomas as a search mechanism for his RubLog (rubyforge.org/projects/rublog) weblogging package. The API can be used as follows:
  require 'search/simple'

  # Build the collection of content to be indexed.
  contents = Search::Simple::Contents.new
  # silly example
  Dir['**/*'].each do |file_name|
    next unless File.file?(file_name)  # skip directories
    File.open(file_name) do |file|
      contents << Search::Simple::Content.new(file.read, File.(file_name), file.mtime)
    end
  end

  # Load the searcher; the search indices are cached in /tmp/search_cache.
  s = Search::Simple::Searcher.load(contents, "/tmp/search_cache")

  sr = s.find_words(['some', 'keywords', 'to', 'search', 'for'])
  if sr.contains_matches
    sr.results.sort.each do |res|
      puts "#{res.score}:#{res.name}"
    end
  else
    puts "No matches"
  end
Credits
Almost all of this code was written by Dave Thomas (pragprog.com/pragdave). The original code was a complete rewrite of an attempt that Chad Fowler (www.chadfowler.com) made to do a vector space search for RubLog. Chad Fowler adapted Dave’s working RubLog code to be RubLog-independent and created what is now SimpleSearch out of it.