Class: Ferret::Index::Index
- Inherits: Object
  - Object
  - Ferret::Index::Index
- Defined in: lib/ferret/index.rb
Overview
This is a simplified interface to the index. See the TUTORIAL for more information on how to use this class.
Instance Attribute Summary
-
#options ⇒ Object
readonly
Returns the value of attribute options.
Instance Method Summary
-
#add_document(doc, analyzer = nil) ⇒ Object
(also: #<<)
Adds a document to this index, using the provided analyzer instead of the local analyzer if provided.
-
#add_indexes(indexes) ⇒ Object
Merges all segments from an index or an array of indexes into this index.
-
#close ⇒ Object
Closes this index by closing its associated reader and writer objects.
-
#delete(arg) ⇒ Object
Deletes a document/documents from the index.
-
#deleted?(n) ⇒ Boolean
Returns true if document n has been deleted.
-
#doc(*arg) ⇒ Object
(also: #[])
Retrieves a document/documents from the index.
-
#explain(query, doc) ⇒ Object
Returns an Explanation that describes how doc scored against query.
-
#field_infos ⇒ Object
Returns the field_infos object so that you can add new fields to the index.
-
#flush ⇒ Object
(also: #commit)
Flushes all writes to the index.
-
#has_deletions? ⇒ Boolean
Returns true if any documents have been deleted since the index was last flushed.
-
#highlight(query, doc_id, options = {}) ⇒ Object
Returns an array of strings with the matches highlighted.
-
#initialize(options = {}, &block) ⇒ Index
constructor
If you create an Index without any options, it’ll simply create an index in memory.
-
#optimize ⇒ Object
Optimizes the index.
-
#persist(directory, create = true) ⇒ Object
This is a simple utility method for saving an in memory or RAM index to the file system.
-
#process_query(query) ⇒ Object
Turn a query string into a Query object with the Index’s QueryParser.
-
#query_delete(query) ⇒ Object
Delete all documents returned by the query.
-
#query_update(query, new_val) ⇒ Object
Update all the documents returned by the query.
-
#reader ⇒ Object
Get the reader for this index.
-
#search(query, options = {}) ⇒ Object
Run a query through the Searcher on the index.
-
#search_each(query, options = {}) ⇒ Object
Run a query through the Searcher on the index.
-
#searcher ⇒ Object
Get the searcher for this index.
-
#size ⇒ Object
Returns the number of documents in the index.
-
#to_s ⇒ Object
-
#update(id, new_doc) ⇒ Object
Update the document referenced by the document number id if id is an integer, or all of the documents which have the term id if id is a term.
-
#writer ⇒ Object
Get the writer for this index.
Constructor Details
#initialize(options = {}, &block) ⇒ Index
If you create an Index without any options, it’ll simply create an index in memory. But this class is highly configurable and every option that you can supply to IndexWriter and QueryParser, you can also set here. Please look at the options for the constructors to these classes.
Options
See:
-
QueryParser
-
IndexWriter
- default_input_field
-
Default: “id”. This specifies the default field that will be used when you add a simple string to the index using #add_document or <<.
- id_field
-
Default: “id”. This field is used as the field to search when doing searches on a term. For example, if you do a lookup by term "cat", i.e. index["cat"], this will be the field that is searched.
- key
-
Default: nil. Expert: This should only be used if you really know what you are doing. Basically you can set a field or an array of fields to be the key for the index. So if you add a document with the same key as an existing document, the existing document will be replaced by the new one. Using a multiple-field key will slow down indexing, so it should not be done if performance is a concern. A single-field key (or id) should be fine, however. Also, you must make sure that your key/keys are either untokenized or that they are not broken up by the analyzer.
- auto_flush
-
Default: false. Set this option to true if you want the index automatically flushed every time you do a write (includes delete) to the index. This is useful if you have multiple processes accessing the index and you don’t want lock errors. Setting :auto_flush to true has a huge performance impact so don’t use it if you are concerned about performance. In that case you should think about setting up a DRb indexing service.
- lock_retry_time
-
Default: 2 seconds. This parameter specifies how long to wait before retrying to obtain the commit lock when detecting if the IndexReader is at the latest version.
- close_dir
-
Default: false. If you explicitly pass a Directory object to this class and you want Index to close it when it is closed itself then set this to true.
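The replacement rule described for the :key option can be modelled without Ferret at all. The following is only a sketch under stated assumptions: KeyedStore and its methods are hypothetical stand-ins (not part of Ferret), and it keeps documents in a plain array rather than a real index.

```ruby
# Hypothetical in-memory model of the :key replacement rule; not Ferret code.
class KeyedStore
  def initialize(key)
    @key_fields = Array(key)   # a single field or an array of fields
    @docs = []
  end

  def <<(doc)
    key_of = lambda { |d| @key_fields.map { |f| d[f].to_s } }
    new_key = key_of.call(doc)
    # Drop any stored document whose key fields all match the new one.
    @docs.reject! { |existing| key_of.call(existing) == new_key }
    @docs << doc
    self
  end

  def size
    @docs.size
  end

  def docs
    @docs.dup
  end
end

store = KeyedStore.new(:id)
store << {:id => "1", :title => "Programming Ruby"}
store << {:id => "1", :title => "Programming Ruby, 2nd edition"}
puts store.size  # prints 1
```

With a multiple-field key every key field has to match before the older document is dropped, which is the extra work the option description warns about.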
Some examples:
index = Index::Index.new(:analyzer => WhiteSpaceAnalyzer.new())
index = Index::Index.new(:path => '/path/to/index',
                         :create_if_missing => false,
                         :auto_flush => true)
index = Index::Index.new(:dir => directory,
                         :default_slop => 2,
                         :handle_parse_errors => false)
You can also pass a block if you like. The index will be yielded and closed at the end of the block. For example:
Ferret::I.new() do |index|
  # do stuff with index. Most of your actions will be cached.
end
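The yield-then-close behaviour of the block form can be sketched with a small stand-in class. BlockScopedResource is hypothetical and only imitates the last few lines of the constructor listing (yield self, then close); it does not touch a real index.

```ruby
# Hypothetical stand-in for the yield-then-close pattern; not Ferret code.
class BlockScopedResource
  attr_reader :log

  def initialize
    @open = true
    @log = []
    if block_given?
      yield self   # hand the open resource to the caller's block
      close        # then close it once the block returns
    end
  end

  def open?
    @open
  end

  def <<(item)
    raise "resource is closed" unless @open
    @log << item
    self
  end

  def close
    @open = false
  end
end

resource = BlockScopedResource.new do |r|
  r << "first document"
end
puts resource.open?  # prints "false"
```

Note that in this sketch, as in the listing, an exception raised inside the block would skip the close call, since there is no ensure clause.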
# File 'lib/ferret/index.rb', line 97
def initialize(options = {}, &block)
  super()

  if options[:key]
    @key = options[:key]
    if @key.is_a?(Array)
      @key.flatten.map {|k| k.to_s.intern}
    end
  else
    @key = nil
  end

  if (fi = options[:field_infos]).is_a?(String)
    options[:field_infos] = FieldInfos.load(fi)
  end

  @close_dir = options[:close_dir]
  if options[:dir].is_a?(String)
    options[:path] = options[:dir]
  end
  if options[:path]
    @close_dir = true
    begin
      @dir = FSDirectory.new(options[:path], options[:create])
    rescue IOError => io
      @dir = FSDirectory.new(options[:path],
                             options[:create_if_missing] != false)
    end
  elsif options[:dir]
    @dir = options[:dir]
  else
    options[:create] = true # this should always be true for a new RAMDir
    @close_dir = true
    @dir = RAMDirectory.new
  end

  @dir.extend(MonitorMixin).extend(SynchroLockMixin)
  options[:dir] = @dir
  options[:lock_retry_time]||= 2
  @options = options
  if (!@dir.exists?("segments")) || options[:create]
    IndexWriter.new(options).close
  end
  options[:analyzer]||= Ferret::Analysis::StandardAnalyzer.new

  @searcher = nil
  @writer = nil
  @reader = nil

  @options.delete(:create) # only create the first time if at all
  @auto_flush = @options[:auto_flush] || false
  if (@options[:id_field].nil? and @key.is_a?(Symbol))
    @id_field = @key
  else
    @id_field = @options[:id_field] || :id
  end
  @default_field = (@options[:default_field]||= :*)
  @default_input_field = options[:default_input_field] || @id_field

  if @default_input_field.respond_to?(:intern)
    @default_input_field = @default_input_field.intern
  end
  @open = true
  @qp = nil
  if block
    yield self
    self.close
  end
end
Instance Attribute Details
#options ⇒ Object (readonly)
Returns the value of attribute options.
# File 'lib/ferret/index.rb', line 26
def options
  @options
end
Instance Method Details
#add_document(doc, analyzer = nil) ⇒ Object Also known as: <<
Adds a document to this index, using the provided analyzer instead of the local analyzer if provided. If the document contains more than IndexWriter::MAX_FIELD_LENGTH terms for a given field, the remainder are discarded.
There are three ways to add a document to the index. To add a document you can simply add a string or an array of strings. This will store all the strings in the :default_input_field, which falls back to the id field (:id by default) unless you specify :default_input_field when you create the index.
index << "This is a new document to be indexed"
index << ["And here", "is another", "new document", "to be indexed"]
But these are pretty simple documents. If this is all you want to index you could probably just use SimpleSearch. So let’s give our documents some fields:
index << {:title => "Programming Ruby", :content => "blah blah blah"}
index << {:title => "Programming Ruby", :content => "yada yada yada"}
Or if you are indexing data stored in a database, you’ll probably want to store the id:
index << {:id => row.id, :title => row.title, :date => row.date}
See FieldInfos for more information on how to set field properties.
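The first step of add_document, wrapping a bare string or array in a hash keyed by the default input field, can be tried in isolation. This is a minimal sketch: normalize_document is a hypothetical helper name, and the :id default is an assumption taken from the constructor listing.

```ruby
# Mirrors the first step of the add_document listing: wrap bare input in a
# hash so that every document reaching the writer has named fields.
def normalize_document(doc, default_input_field = :id)
  if doc.is_a?(String) || doc.is_a?(Array)
    {default_input_field => doc}
  else
    doc   # hashes (and anything else) are passed through unchanged
  end
end

p normalize_document("This is a new document to be indexed")
p normalize_document(["And here", "is another"], :content)
p normalize_document({:title => "Programming Ruby"})
```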
# File 'lib/ferret/index.rb', line 266
def add_document(doc, analyzer = nil)
  @dir.synchrolock do
    ensure_writer_open()
    if doc.is_a?(String) or doc.is_a?(Array)
      doc = {@default_input_field => doc}
    end

    # delete existing documents with the same key
    if @key
      if @key.is_a?(Array)
        query = @key.inject(BooleanQuery.new()) do |bq, field|
          bq.add_query(TermQuery.new(field, doc[field].to_s), :must)
          bq
        end
        query_delete(query)
      else
        id = doc[@key].to_s
        if id
          ensure_writer_open()
          @writer.delete(@key, id)
          @writer.commit
        end
      end
    end
    ensure_writer_open()

    if analyzer
      old_analyzer = @writer.analyzer
      @writer.analyzer = analyzer
      @writer.add_document(doc)
      @writer.analyzer = old_analyzer
    else
      @writer.add_document(doc)
    end

    flush() if @auto_flush
  end
end
#add_indexes(indexes) ⇒ Object
Merges all segments from an index or an array of indexes into this index. You can pass a single Index::Index, Index::Reader, Store::Directory or an array of any single one of these.
This may be used to parallelize batch indexing. A large document collection can be broken into sub-collections. Each sub-collection can be indexed in parallel, on a different thread, process or machine and perhaps all in memory. The complete index can then be created by merging sub-collection indexes with this method.
After this completes, the index is optimized.
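The split, index in parallel, then merge workflow described above can be sketched with plain Ruby data structures. Everything here is a hypothetical stand-in: the "index" is just a hash from term to document ids, and build_partial_index, merge_partial_indexes and parallel_index are illustrative names, not Ferret methods.

```ruby
# Hypothetical sketch of the split / index in parallel / merge workflow.
def build_partial_index(docs)
  # Stand-in for indexing one sub-collection: map each term to doc ids.
  index = Hash.new { |hash, term| hash[term] = [] }
  docs.each do |id, text|
    text.downcase.split.uniq.each { |term| index[term] << id }
  end
  index
end

def merge_partial_indexes(partials)
  merged = Hash.new { |hash, term| hash[term] = [] }
  partials.each do |partial|
    partial.each { |term, ids| merged[term].concat(ids) }
  end
  merged.each_value(&:sort!)
  merged
end

def parallel_index(collection, slices)
  slice_size = [(collection.size.to_f / slices).ceil, 1].max
  # One thread per sub-collection; each builds its own partial index.
  threads = collection.each_slice(slice_size).map do |sub_collection|
    Thread.new { build_partial_index(sub_collection) }
  end
  merge_partial_indexes(threads.map(&:value))
end

docs = [[1, "ruby search"], [2, "search engine"], [3, "ruby engine"]]
p parallel_index(docs, 2)["ruby"]  # prints [1, 3]
```

In real use each sub-collection would be a separate index (possibly in memory), and the final merge step is the call to add_indexes.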
# File 'lib/ferret/index.rb', line 607
def add_indexes(indexes)
  @dir.synchrolock do
    ensure_writer_open()
    indexes = [indexes].flatten   # make sure we have an array
    return if indexes.size == 0 # nothing to do
    if indexes[0].is_a?(Index)
      indexes.delete(self) # don't merge with self
      indexes = indexes.map {|index| index.reader }
    elsif indexes[0].is_a?(Ferret::Store::Directory)
      indexes.delete(@dir) # don't merge with self
      indexes = indexes.map {|dir| IndexReader.new(dir) }
    elsif indexes[0].is_a?(IndexReader)
      indexes.delete(@reader) # don't merge with self
    else
      raise ArgumentError, "Unknown index type when trying to merge indexes"
    end
    ensure_writer_open
    @writer.add_readers(indexes)
  end
end
#close ⇒ Object
Closes this index by closing its associated reader and writer objects.
# File 'lib/ferret/index.rb', line 205
def close
  @dir.synchronize do
    if not @open
      raise(StandardError, "tried to close an already closed directory")
    end
    @searcher.close() if @searcher
    @reader.close() if @reader
    @writer.close() if @writer
    @dir.close() if @close_dir

    @open = false
  end
end
#delete(arg) ⇒ Object
Deletes a document/documents from the index. The method for determining the document to delete depends on the type of the argument passed.
If arg is an Integer then delete the document based on the internal document number. Will raise an error if the document does not exist.
If arg is a String then search for the documents with arg in the id field. The id field is either :id or whatever you set the :id_field parameter to when you create the Index object. Will fail quietly if no such document exists.
# File 'lib/ferret/index.rb', line 437
def delete(arg)
  @dir.synchrolock do
    ensure_writer_open()
    if arg.is_a?(String) or arg.is_a?(Symbol)
      ensure_writer_open()
      @writer.delete(@id_field, arg.to_s)
    elsif arg.is_a?(Integer)
      ensure_reader_open()
      cnt = @reader.delete(arg)
    else
      raise ArgumentError, "Cannot delete for arg of type #{arg.class}"
    end
    flush() if @auto_flush
  end
  return self
end
#deleted?(n) ⇒ Boolean
Returns true if document n has been deleted.
# File 'lib/ferret/index.rb', line 472
def deleted?(n)
  @dir.synchronize do
    ensure_reader_open()
    return @reader.deleted?(n)
  end
end
#doc(*arg) ⇒ Object Also known as: []
Retrieves a document/documents from the index. The method for retrieval depends on the type of the argument passed.
If arg is an Integer then return the document based on the internal document number.
If arg is a Range, then return the documents within the range based on internal document number.
If arg is a String then search for the first document with arg in the id field. The id field is either :id or whatever you set the :id_field parameter to when you create the Index object.
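The three lookup rules can be modelled over a plain array of hashes. This is only a sketch: lookup is a hypothetical helper, array positions stand in for internal document numbers, and raising on other argument types is a choice made for this sketch rather than something the listing does.

```ruby
# Hypothetical model of the Integer / Range / String lookup rules.
def lookup(docs, arg, id_field = :id)
  case arg
  when Integer, Range
    docs[arg]   # internal document number, or a range of them
  when String, Symbol
    # first document whose id field matches the given term
    docs.find { |d| d[id_field].to_s == arg.to_s }
  else
    raise ArgumentError, "Cannot look up by #{arg.class}"
  end
end

docs = [{:id => "a"}, {:id => "b"}, {:id => "c"}]
p lookup(docs, 1)
p lookup(docs, 0..1)
p lookup(docs, "c")
```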
# File 'lib/ferret/index.rb', line 412
def doc(*arg)
  @dir.synchronize do
    id = arg[0]
    if id.kind_of?(String) or id.kind_of?(Symbol)
      ensure_reader_open()
      term_doc_enum = @reader.term_docs_for(@id_field, id.to_s)
      return term_doc_enum.next? ? @reader[term_doc_enum.doc] : nil
    else
      ensure_reader_open(false)
      return @reader[*arg]
    end
  end
end
#explain(query, doc) ⇒ Object
Returns an Explanation that describes how doc scored against query.
This is intended to be used in developing Similarity implementations, and, for good performance, should not be displayed with every hit. Computing an explanation is as expensive as executing the query over the entire index.
# File 'lib/ferret/index.rb', line 673
def explain(query, doc)
  @dir.synchronize do
    ensure_searcher_open()
    query = do_process_query(query)

    return @searcher.explain(query, doc)
  end
end
#field_infos ⇒ Object
Returns the field_infos object so that you can add new fields to the index.
# File 'lib/ferret/index.rb', line 692
def field_infos
  @dir.synchrolock do
    ensure_writer_open()
    return @writer.field_infos
  end
end
#flush ⇒ Object Also known as: commit
Flushes all writes to the index. This will not optimize the index but it will make sure that all writes are written to it.
NOTE: this is not necessary if you are only using this class. All writes will automatically flush when you perform an operation that reads the index.
# File 'lib/ferret/index.rb', line 562
def flush()
  @dir.synchronize do
    if @reader
      if @searcher
        @searcher.close
        @searcher = nil
      end
      @reader.commit
    elsif @writer
      @writer.commit
    end
  end
end
#has_deletions? ⇒ Boolean
Returns true if any documents have been deleted since the index was last flushed.
# File 'lib/ferret/index.rb', line 549
def has_deletions?()
  @dir.synchronize do
    ensure_reader_open()
    return @reader.has_deletions?
  end
end
#highlight(query, doc_id, options = {}) ⇒ Object
Returns an array of strings with the matches highlighted. The query can be either a query String or a Ferret::Search::Query object. The doc_id is the id of the document you want to highlight (usually returned by the search methods). There are also a number of options you can pass:
Options
- field
-
Default: @options[:default_field]. The default_field is the field that is usually highlighted but you can specify which field you want to highlight here. If you want to highlight multiple fields then you will need to call this method multiple times.
- excerpt_length
-
Default: 150. Length of excerpt to show. Highlighted terms will be in the centre of the excerpt. Set to :all to highlight the entire field.
- num_excerpts
-
Default: 2. Number of excerpts to return.
- pre_tag
-
Default: “<b>”. Tag to place to the left of the match. You’ll probably want to change this to a “<span>” tag with a class. Try “\033[36m” for use in a terminal.
- post_tag
-
Default: “</b>”. This tag should close the :pre_tag. Try the tag “\033[m” in the terminal.
- ellipsis
-
Default: “…”. This is the string that is appended at the beginning and end of excerpts (unless the excerpt hits the start or end of the field). Alternatively you may want to use the HTML entity &#8230; or the UTF-8 string “\342\200\246”.
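To see what :pre_tag and :post_tag do to a match, here is a deliberately simplified sketch. wrap_matches is a hypothetical helper that wraps whole-word, case-insensitive matches of a single term; it does not model Ferret's query handling, excerpt selection or ellipsis logic.

```ruby
# Hypothetical, simplified highlighter; excerpt selection is not modelled.
def wrap_matches(text, term, pre_tag = "<b>", post_tag = "</b>")
  pattern = /\b#{Regexp.escape(term)}\b/i
  text.gsub(pattern) { |match| "#{pre_tag}#{match}#{post_tag}" }
end

puts wrap_matches("Ruby is fun. I like ruby.", "ruby")
# prints "<b>Ruby</b> is fun. I like <b>ruby</b>."

# ANSI escape codes instead of HTML tags, for output in a terminal.
puts wrap_matches("Ruby is fun.", "ruby", "\033[36m", "\033[m")
```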
# File 'lib/ferret/index.rb', line 194
def highlight(query, doc_id, options = {})
  @dir.synchronize do
    ensure_searcher_open()
    @searcher.highlight(do_process_query(query),
                        doc_id,
                        options[:field]||@options[:default_field],
                        options)
  end
end
#optimize ⇒ Object
Optimizes the index. This should only be called when the index will no longer be updated very often, but will be read a lot.
# File 'lib/ferret/index.rb', line 579
def optimize()
  @dir.synchrolock do
    ensure_writer_open()
    @writer.optimize()
    @writer.close()
    @writer = nil
  end
end
#persist(directory, create = true) ⇒ Object
This is a simple utility method for saving an in memory or RAM index to the file system. The same thing can be achieved by using the Index::Index#add_indexes method and you will have more options when creating the new index, however this is a simple way to turn a RAM index into a file system index.
- directory
-
This can either be a Store::Directory object or a String representing the path to the directory where you would like to store the index.
- create
-
True if you’d like to create the directory if it doesn’t exist or copy over an existing directory. False if you’d like to merge with the existing directory. This defaults to true.
# File 'lib/ferret/index.rb', line 642
def persist(directory, create = true)
  synchronize do
    close_all()
    old_dir = @dir
    if directory.is_a?(String)
      @dir = FSDirectory.new(directory, create)
    elsif directory.is_a?(Ferret::Store::Directory)
      @dir = directory
    end
    @dir.extend(MonitorMixin).extend(SynchroLockMixin)
    @options[:dir] = @dir
    @options[:create_if_missing] = true
    add_indexes([old_dir])
  end
end
#process_query(query) ⇒ Object
Turn a query string into a Query object with the Index’s QueryParser.
# File 'lib/ferret/index.rb', line 683
def process_query(query)
  @dir.synchronize do
    ensure_searcher_open()
    return do_process_query(query)
  end
end
#query_delete(query) ⇒ Object
Delete all documents returned by the query.
- query
-
The query to find documents you wish to delete. Can either be a string (in which case it is parsed by the standard query parser) or an actual query object.
# File 'lib/ferret/index.rb', line 459
def query_delete(query)
  @dir.synchrolock do
    ensure_writer_open()
    ensure_searcher_open()
    query = do_process_query(query)
    @searcher.search_each(query, :limit => :all) do |doc, score|
      @reader.delete(doc)
    end
    flush() if @auto_flush
  end
end
#query_update(query, new_val) ⇒ Object
Update all the documents returned by the query.
- query
-
The query to find documents you wish to update. Can either be a string (in which case it is parsed by the standard query parser) or an actual query object.
- new_val
-
The values we are updating. This can be a string in which case the default field is updated, or it can be a hash, in which case, all fields in the hash are merged into the old hash. That is, the old fields are replaced by values in the new hash if they exist.
Example
index << {:id => "26", :title => "Babylon", :artist => "David Grey"}
index << {:id => "29", :title => "My Oh My", :artist => "David Grey"}
# correct
index.query_update('artist:"David Grey"', {:artist => "David Gray"})
index["26"]
#=> {:id => "26", :title => "Babylon", :artist => "David Gray"}
index["29"]
#=> {:id => "29", :title => "My Oh My", :artist => "David Gray"}
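The per-document rule behind query_update (merge a hash, or overwrite the default input field with a string) can be checked on plain hashes. This is a sketch: apply_update is a hypothetical helper, the :id default is an assumption, and unlike the listing it returns a new hash instead of mutating the loaded document.

```ruby
# Sketch of the per-document update rule from the query_update listing.
def apply_update(document, new_val, default_input_field = :id)
  if new_val.is_a?(Hash)
    document.merge(new_val)   # values in new_val replace matching old fields
  else
    document.merge(default_input_field => new_val.to_s)
  end
end

old_doc = {:id => "26", :title => "Babylon", :artist => "David Grey"}
p apply_update(old_doc, {:artist => "David Gray"})
```

Fields that are absent from the new hash keep their old values, so only :artist changes in the call above.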
# File 'lib/ferret/index.rb', line 525
def query_update(query, new_val)
  @dir.synchrolock do
    ensure_writer_open()
    ensure_searcher_open()
    docs_to_add = []
    query = do_process_query(query)
    @searcher.search_each(query) do |id, score|
      document = @searcher[id].load
      if new_val.is_a?(Hash)
        document.merge!(new_val)
      else new_val.is_a?(String) or new_val.is_a?(Symbol)
        document[@default_input_field] = new_val.to_s
      end
      docs_to_add << document
      @reader.delete(id)
    end
    ensure_writer_open()
    docs_to_add.each {|doc| @writer << doc }
    flush() if @auto_flush
  end
end
#reader ⇒ Object
Get the reader for this index.
- NOTE
-
This will close the writer from this index.
# File 'lib/ferret/index.rb', line 221
def reader
  ensure_reader_open()
  return @reader
end
#search(query, options = {}) ⇒ Object
Run a query through the Searcher on the index. A TopDocs object is returned with the relevant results. The query is a built-in Query object or a query string that can be parsed by the Ferret::QueryParser. Here are the options:
Options
- offset
-
Default: 0. The offset of the start of the section of the result-set to return. This is used for paging through results. Let’s say you have a page size of 10. If you don’t find the result you want among the first 10 results then set :offset to 10 and look at the next 10 results, then 20 and so on.
- limit
-
Default: 10. This is the number of results you want returned, also called the page size. Set :limit to :all to return all results.
- sort
-
A Sort object or sort string describing how the field should be sorted. A sort string is made up of field names which cannot contain spaces and the word “DESC” if you want the field reversed, all separated by commas. For example: “rating DESC, author, title”. Note that Ferret will try to determine a field’s type by looking at the first term in the index and seeing if it can be parsed as an integer or a float. Keep this in mind as you may need to specify a field’s type to sort it correctly. For more on this, see the documentation for SortField.
- filter
-
A Filter object to filter the search results with.
- filter_proc
-
A filter Proc is a Proc which takes the doc_id, the score and the Searcher object as its parameters and returns a Boolean value specifying whether the result should be included in the result set.
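How :offset, :limit and :filter_proc interact can be sketched over an in-memory list of hits. Everything here is a hypothetical model: Hit, page_of_hits and offset_for_page are illustrative names, the :searcher entry is a stand-in object handed to the filter, and applying the filter before paging is an assumption of this sketch.

```ruby
# Hypothetical model of :offset, :limit and :filter_proc over an array of hits.
Hit = Struct.new(:doc, :score)

def page_of_hits(hits, options = {})
  offset = options[:offset] || 0
  limit = options[:limit] || 10
  filter_proc = options[:filter_proc]
  searcher = options[:searcher]   # stand-in passed through to the filter

  kept = hits
  if filter_proc
    # The proc receives doc id, score and searcher and returns true to keep.
    kept = hits.select { |h| filter_proc.call(h.doc, h.score, searcher) }
  end
  return kept.drop(offset) if limit == :all
  kept.drop(offset).first(limit)
end

def offset_for_page(page_number, page_size = 10)
  (page_number - 1) * page_size   # page 1 starts at offset 0
end

hits = (0...25).map { |i| Hit.new(i, 1.0 - i * 0.01) }
only_even = lambda { |doc, score, searcher| doc.even? }
p page_of_hits(hits, :offset => offset_for_page(3)).map(&:doc)
p page_of_hits(hits, :filter_proc => only_even, :limit => 3).map(&:doc)
```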
# File 'lib/ferret/index.rb', line 337
def search(query, options = {})
  @dir.synchronize do
    return do_search(query, options)
  end
end
#search_each(query, options = {}) ⇒ Object
Run a query through the Searcher on the index, yielding each hit to the given block. The query is a Query object or a query string that can be validly parsed by the Ferret::QueryParser. The Searcher#search_each method yields the internal document id (used to reference documents in the Searcher object like this: searcher[doc_id]) and the search score for that document. It is possible for the score to be greater than 1.0 for some queries and when boosts are taken into account. This method will also normalize scores to the range 0.0..1.0 when the max-score is greater than 1.0. Here are the options:
Options
- offset
-
Default: 0. The offset of the start of the section of the result-set to return. This is used for paging through results. Let’s say you have a page size of 10. If you don’t find the result you want among the first 10 results then set :offset to 10 and look at the next 10 results, then 20 and so on.
- limit
-
Default: 10. This is the number of results you want returned, also called the page size. Set :limit to :all to return all results.
- sort
-
A Sort object or sort string describing how the field should be sorted. A sort string is made up of field names which cannot contain spaces and the word “DESC” if you want the field reversed, all separated by commas. For example: “rating DESC, author, title”. Note that Ferret will try to determine a field’s type by looking at the first term in the index and seeing if it can be parsed as an integer or a float. Keep this in mind as you may need to specify a field’s type to sort it correctly. For more on this, see the documentation for SortField.
- filter
-
A Filter object to filter the search results with.
- filter_proc
-
A filter Proc is a Proc which takes the doc_id, the score and the Searcher object as its parameters and returns a Boolean value specifying whether the result should be included in the result set.
- returns
-
The total number of hits.
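The sort-string format from the :sort option (comma-separated field names, each optionally followed by “DESC”) is easy to take apart by hand. This is a sketch only: parse_sort_string is a hypothetical helper and says nothing about how Ferret itself parses the string or detects field types.

```ruby
# Hypothetical parser for the sort-string format described above.
def parse_sort_string(sort_string)
  parts = sort_string.split(",").map(&:strip).reject(&:empty?)
  parts.map do |part|
    tokens = part.split(/\s+/)
    # A trailing DESC marks the field as reversed.
    reverse = tokens.length > 1 && tokens.last.upcase == "DESC"
    {:field => tokens.first.to_sym, :reverse => reverse}
  end
end

p parse_sort_string("rating DESC, author, title")
```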
Example
index.search_each(query, options = {}) do |doc, score|
  puts "hit document number #{doc} with a score of #{score}"
end
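The score normalization mentioned in the description (scores are scaled into 0.0..1.0 only when the maximum score is greater than 1.0) can be written out as follows. normalize_scores is a hypothetical helper, and dividing by the maximum score is an assumption about how the scaling is done.

```ruby
# Sketch of the normalization rule: scale only when the top score exceeds 1.0.
def normalize_scores(scores)
  max_score = scores.max
  return scores.dup if max_score.nil? || max_score <= 1.0
  scores.map { |score| score / max_score }
end

p normalize_scores([2.0, 1.0, 0.5])  # prints [1.0, 0.5, 0.25]
p normalize_scores([0.8, 0.4])       # prints [0.8, 0.4]
```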
# File 'lib/ferret/index.rb', line 389
def search_each(query, options = {}) # :yield: doc, score
  @dir.synchronize do
    ensure_searcher_open()
    query = do_process_query(query)

    @searcher.search_each(query, options) do |doc, score|
      yield doc, score
    end
  end
end
#searcher ⇒ Object
Get the searcher for this index.
- NOTE
-
This will close the writer from this index.
# File 'lib/ferret/index.rb', line 228
def searcher
  ensure_searcher_open()
  return @searcher
end
#size ⇒ Object
Returns the number of documents in the index.
# File 'lib/ferret/index.rb', line 589
def size()
  @dir.synchronize do
    ensure_reader_open()
    return @reader.num_docs()
  end
end
#to_s ⇒ Object
# File 'lib/ferret/index.rb', line 658
def to_s
  buf = ""
  (0...(size)).each do |i|
    buf << self[i].to_s + "\n" if not deleted?(i)
  end
  buf
end
#update(id, new_doc) ⇒ Object
Update the document referenced by the document number id if id is an integer, or all of the documents which have the term id if id is a term.
- id
-
The number of the document to update. Can also be a string representing the value in the id field. Also consider using the :key attribute.
- new_doc
-
The document to replace the old document with
# File 'lib/ferret/index.rb', line 487
def update(id, new_doc)
  @dir.synchrolock do
    ensure_writer_open()
    delete(id)
    if id.is_a?(String) or id.is_a?(Symbol)
      @writer.commit
    else
      ensure_writer_open()
    end
    @writer << new_doc
    flush() if @auto_flush
  end
end
#writer ⇒ Object
Get the writer for this index.
- NOTE
-
This will close the reader from this index.
# File 'lib/ferret/index.rb', line 235
def writer
  ensure_writer_open()
  return @writer
end