Class: Search::Simple::Searcher
- Inherits: Object
- Defined in: lib/search/simple/searcher.rb
Class Method Summary
- .create_indices(entries, dict, document_vectors) ⇒ Object
  Create a new dictionary and document vectors from a blog archive.
- .extract_words_for_searcher(text) ⇒ Object
- .load(contents, cache_file) ⇒ Object
  Serialization support.
Instance Method Summary
- #dump ⇒ Object
- #find_words(words) ⇒ Object
  Return SearchResults based on trying to find the array of words in our document vectors.
- #initialize(dict, document_vectors, cache_file) ⇒ Searcher (constructor)
  A new instance of Searcher.
Constructor Details
#initialize(dict, document_vectors, cache_file) ⇒ Searcher
Returns a new instance of Searcher.
# File 'lib/search/simple/searcher.rb', line 61
def initialize(dict, document_vectors, cache_file)
  @dict = dict
  @document_vectors = document_vectors
  @cache_file = cache_file
end
Class Method Details
.create_indices(entries, dict, document_vectors) ⇒ Object
Create a new dictionary and document vectors from a blog archive.
# File 'lib/search/simple/searcher.rb', line 168
def Searcher.create_indices(entries, dict, document_vectors)
  word_count = doc_count = 0

  entries.each do |entry|
    doc_count += 1
    vector = Vector.new
    extract_words_for_searcher(entry.content.downcase) do |word|
      word_index = dict.add_word(word)
      if word_index
        vector.add_word_index(word_index)
        word_count += 1
      end
    end
    document_vectors[entry.identifier] = vector
  end

  $stderr.puts "#{dict.size} unique words out of #{word_count} " +
    "in #{doc_count} documents"
end
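The indexing loop can be tried without the rest of the library. The sketch below is only an illustration: TinyDictionary, Entry, and build_indices are hypothetical stand-ins, a Set of word indices plays the role of Vector, and this add_word never returns nil, whereas the real Dictionary#add_word apparently can (hence the `if word_index` guard in the method).

```ruby
require "set"

# Hypothetical stand-in for Dictionary: maps each word to a stable integer index.
class TinyDictionary
  def initialize
    @index_of = {}
  end

  # Returns the word's index, assigning the next free one on first sight.
  # Assumption: unlike the real Dictionary, this never rejects a word.
  def add_word(word)
    @index_of[word] ||= @index_of.size
  end

  def size
    @index_of.size
  end
end

# Hypothetical entry type exposing the two methods create_indices relies on.
Entry = Struct.new(:identifier, :content)

# Same shape as create_indices, with a Set standing in for Vector.
def build_indices(entries, dict, document_vectors)
  entries.each do |entry|
    vector = Set.new
    entry.content.downcase.scan(/[-+]?\w[\-\w]{2,}/) do |word|
      vector << dict.add_word(word)
    end
    document_vectors[entry.identifier] = vector
  end
end

entries = [
  Entry.new("post-1", "Ruby search engines"),
  Entry.new("post-2", "Search in Ruby")
]
dict = TinyDictionary.new
vectors = {}
build_indices(entries, dict, vectors)

p dict.size          # => 3
p vectors["post-2"]  # the indices of "search" and "ruby"; "in" is too short to index
```

Both documents share the indices for "ruby" and "search", which is what lets a query be compared against every document through the same dictionary.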
.extract_words_for_searcher(text) ⇒ Object
# File 'lib/search/simple/searcher.rb', line 159
def Searcher.extract_words_for_searcher(text)
  text.scan(/[-+]?\w[\-\w]{2,}/) do |word|
    yield word
  end
end
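To see which tokens the pattern yields, it can be run on its own. In this small sketch WORD_PATTERN and extract_words are hypothetical names; only the regular expression is taken from the method above.

```ruby
# Standalone copy of the pattern, for illustration only.
WORD_PATTERN = /[-+]?\w[\-\w]{2,}/

# Returns the matches as an array instead of yielding them one by one.
def extract_words(text)
  text.scan(WORD_PATTERN)
end

tokens = extract_words("+ruby -perl is a well-known db")
p tokens  # => ["+ruby", "-perl", "well-known"]
```

A token needs at least three word characters or hyphens after the optional sign, so "is", "a", and "db" are dropped. The leading "+" or "-" stays attached to the word, and hyphenated words are kept whole.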
.load(contents, cache_file) ⇒ Object
Serialization support. At some point we’ll need to do incremental indexing. For now, however, the following seems to work fairly effectively on 1000-entry blogs, so I’ll defer the change until later.
# File 'lib/search/simple/searcher.rb', line 124
def Searcher.load(contents, cache_file)

  dict = document_vectors = nil
  modified = false
  loaded = false
  begin
    File.open(cache_file, "r") do |f|
      if f.mtime > contents.latest_mtime
        dict = Marshal.load(f)
        document_vectors = Marshal.load(f)
        loaded = true
      end
    end
  rescue
    ;
  end

  unless loaded
    dict = Dictionary.new
    document_vectors = {}
    create_indices(contents, dict, document_vectors)
    modified = true
  end

  s = Searcher.new(dict, document_vectors, cache_file)
  s.dump if modified
  s
end
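The caching rule here is: reuse the cache file only if it is newer than the newest entry, otherwise rebuild and write it again. A minimal sketch of that pattern follows, with plain hashes in place of Dictionary and the document vectors. The helper name load_or_rebuild, its block-based interface, and its return value are assumptions made for illustration, not part of Searcher. Unlike the bare rescue above, it only treats a missing file as "no cache".

```ruby
require "tmpdir"

# Hypothetical helper mirroring the mtime check in Searcher.load.
# Returns [dict, vectors, rebuilt]; the block builds a fresh [dict, vectors] pair.
def load_or_rebuild(cache_file, source_mtime)
  begin
    File.open(cache_file, "rb") do |f|
      if f.mtime > source_mtime
        # Two objects were dumped back to back, so load them in the same order.
        return [Marshal.load(f), Marshal.load(f), false]
      end
    end
  rescue Errno::ENOENT
    # No cache yet; fall through and rebuild.
  end

  dict, vectors = yield
  File.open(cache_file, "wb") do |f|
    Marshal.dump(dict, f)
    Marshal.dump(vectors, f)
  end
  [dict, vectors, true]
end

first = second = third = nil
Dir.mktmpdir do |dir|
  cache = File.join(dir, "index.cache")
  fresh = -> { [{ "ruby" => 0 }, { "post-1" => [0] }] }

  # Content last changed an hour ago: build once, then reuse the cache.
  first  = load_or_rebuild(cache, Time.now - 3600, &fresh)
  second = load_or_rebuild(cache, Time.now - 3600) { raise "should not rebuild" }
  # Content newer than the cache: the cache is stale and gets rebuilt.
  third  = load_or_rebuild(cache, Time.now + 3600, &fresh)
end

p [first[2], second[2], third[2]]  # => [true, false, true]
```

Because any newer entry invalidates the whole cache, every change triggers a full re-index, which is the cost the note about incremental indexing refers to.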
Instance Method Details
#dump ⇒ Object
# File 'lib/search/simple/searcher.rb', line 152
def dump
  File.open(@cache_file, "w") do |fileInstance|
    Marshal.dump(@dict, fileInstance)
    Marshal.dump(@document_vectors, fileInstance)
  end
end
#find_words(words) ⇒ Object
Return SearchResults based on trying to find the array of words in our document vectors.
A word beginning with ‘+’ must appear in the target documents. A word beginning with ‘-’ must not appear. Other words are scored. The documents with the highest scores are returned first.
# File 'lib/search/simple/searcher.rb', line 75
def find_words(words)
  search_results = SearchResults.new

  general = Vector.new
  must_match = Vector.new
  must_not_match = Vector.new

  Searcher.extract_words_for_searcher(words.join(' ')) do |word|
    case word[0]
    when ?+
      word = word[1,99]
      vector = must_match
    when ?-
      word = word[1,99]
      vector = must_not_match
    else
      vector = general
    end

    index = @dict.find(word.downcase)
    if index
      vector.add_word_index(index)
    else
      search_results.add_warning "'#{word}' does not occur in the documents"
    end
  end

  # if (general.num_bits + must_match.num_bits).zero?
  #   search_results.add_warning "No valid search terms given"
  # else
    res = []
    @document_vectors.each do |entry, dvec|
      score = dvec.score_against(must_match, must_not_match, general)
      res << [ entry, score ] if score > 0
    end

    res.sort {|a,b| b[1] <=> a[1] }.each {|name, score|
      search_results.add_result(name, score)
    }

    search_results.add_warning "No matches" unless search_results.contains_matches
  # end
  search_results
end
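The ‘+’ / ‘-’ rules can be illustrated with ordinary sets. Note that Vector#score_against is not shown on this page, so the scoring below (one point per required or general word present) is purely an assumption. classify_terms, score, and the sample documents are hypothetical as well.

```ruby
require "set"

# Sort query words into the three groups find_words uses.
def classify_terms(words)
  groups = { must_match: [], must_not_match: [], general: [] }
  words.join(" ").scan(/[-+]?\w[\-\w]{2,}/) do |word|
    case word[0]
    when "+" then groups[:must_match] << word[1..-1].downcase
    when "-" then groups[:must_not_match] << word[1..-1].downcase
    else groups[:general] << word.downcase
    end
  end
  groups
end

# Assumed scoring: 0 if a '+' word is missing or a '-' word is present,
# otherwise one point per '+' word plus one per general word found.
def score(doc_words, groups)
  return 0 unless groups[:must_match].all? { |w| doc_words.include?(w) }
  return 0 if groups[:must_not_match].any? { |w| doc_words.include?(w) }
  groups[:must_match].size + groups[:general].count { |w| doc_words.include?(w) }
end

docs = {
  "post-a" => Set["ruby", "search", "index"],
  "post-b" => Set["ruby", "perl", "search"],
  "post-c" => Set["python", "search"],
  "post-d" => Set["ruby", "search"]
}

groups = classify_terms(["+ruby", "-perl", "search", "index"])
ranked = docs.map { |name, words| [name, score(words, groups)] }
             .select { |_, s| s > 0 }
             .sort_by { |_, s| -s }

p ranked  # => [["post-a", 3], ["post-d", 2]]
```

Here post-b is excluded because it contains "perl", and post-c because it lacks "ruby". The remaining documents are ordered by score, highest first, as in find_words.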