Class: Indexer
Class Method Summary
- .add_or_update_file(complete_path) ⇒ Object
  Retrieves content and language from a given document, and adds it to the index.
- .clear!(all = false) ⇒ Object
  Ensures the index is closed, and removes every index file for RAILS_ENV.
- .close ⇒ Object
  Closes the index and ensures that a new Index is instantiated the next time index is called.
- .ensure_index_existence ⇒ Object
  Creates the index unless it already exists.
- .index ⇒ Object
  Only one IndexWriter should be instantiated.
- .index_directory_with_multithreads(dir) ⇒ Object
  Indexes a given directory, using @@threads_number threads.
- .index_every_directory(remove_first = false) ⇒ Object
  Finds every document included in IndexedDirectories, parses them with PlainTextExtractor and adds them to the index.
- .last_update ⇒ Object
  Returns the time at which the index was last created/updated.
- .locked? ⇒ Boolean
- .prune_index ⇒ Object
  Checks for indexed files that are missing from the filesystem and removes them from the index and the dbm file.
- .reload_file_mtime ⇒ Object
  Returns the time at which the reload file was last touched.
- .should_index_this_document?(complete_path) ⇒ Boolean
  Retrieves the time at which the document was last indexed, compares it to the file's modification time, and returns false unless the file has been modified since the last indexing process.
- .size ⇒ Object
  Returns how many files are indexed.
Class Method Details
.add_or_update_file(complete_path) ⇒ Object
Retrieves content and language from a given document, and adds it to the index. Since Document#probably_unique_id is used as index :key, no document will be added twice to the index, and the old document will just get updated.
If content cannot be extracted (for example, no content is found or no PlainTextExtractor is defined for the document), some basic information about the document (mtime, filename, complete_path) gets indexed anyway.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 67
    def add_or_update_file(complete_path)
      document = Document.default_fields_for(complete_path)
      begin
        PlainTextExtractor.extract_thumbnail_from(complete_path)
        document.merge! PlainTextExtractor.extract_information_from(complete_path)
        raise "empty document #{complete_path}" if document[:content].strip.empty?
        logger.add_document document
      rescue => e
        logger.reject_document document, e
      end
      index << document
    end
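The "update instead of duplicate" behaviour and the fallback to basic fields can be illustrated without Ferret. The following is only a sketch under stated assumptions: `SketchIndex`, `default_fields_for` and `add_or_update` are hypothetical stand-ins, a Hash replaces the real index, and the id is assumed to be an MD5 digest of the path (how Document#probably_unique_id is actually computed is not shown on this page).

```ruby
require 'digest'

# Hypothetical stand-in for the Ferret index: a Hash keyed by a document id.
class SketchIndex
  def initialize
    @docs = {}
  end

  # Adding a document whose key already exists replaces the old entry.
  def <<(document)
    @docs[document[:probably_unique_id]] = document
    self
  end

  def size
    @docs.size
  end

  def [](id)
    @docs[id]
  end
end

# Basic fields that can always be computed from the path alone.
# Assumption: the id is derived from the path.
def default_fields_for(complete_path)
  {
    probably_unique_id: Digest::MD5.hexdigest(complete_path),
    complete_path: complete_path,
    filename: File.basename(complete_path)
  }
end

# The extractor is passed in as a callable so that a failure can be simulated.
def add_or_update(index, complete_path, extractor)
  document = default_fields_for(complete_path)
  begin
    document.merge!(extractor.call(complete_path))
    raise "empty document #{complete_path}" if document[:content].to_s.strip.empty?
  rescue => e
    warn "rejected: #{e.message}"
  end
  index << document # indexed either way, with or without usable :content
end

index = SketchIndex.new
add_or_update(index, '/docs/a.txt', ->(_path) { { content: 'hello' } })
add_or_update(index, '/docs/a.txt', ->(_path) { { content: 'hello again' } })
add_or_update(index, '/docs/b.bin', ->(_path) { raise 'no extractor' })
puts index.size
```

Indexing the same path twice leaves a single entry, and the document whose extraction fails is still stored with its path and filename.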
.clear!(all = false) ⇒ Object
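Ensures the index is closed, and removes every index file for RAILS_ENV (under Picolena::IndexSavePath). If all is true, the files under Picolena::IndexesSavePath are removed instead.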
    # File 'lib/picolena/templates/app/models/indexer.rb', line 81
    def clear!(all=false)
      close
      to_remove=all ? Picolena::IndexesSavePath : Picolena::IndexSavePath
      Dir.glob(File.join(to_remove,'**/*')).each{|f| FileUtils.rm(f) if File.file?(f)}
    end
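The glob-and-remove step deletes regular files only, so the directory tree itself is left in place. A minimal sketch of that behaviour, run against a temporary directory; `remove_files_below` is a hypothetical name and the file names are made up for the example.

```ruby
require 'fileutils'
require 'tmpdir'

# Removes every regular file below root, leaving the directories in place.
def remove_files_below(root)
  Dir.glob(File.join(root, '**/*')).each do |f|
    FileUtils.rm(f) if File.file?(f)
  end
end

Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'development'))
  File.write(File.join(root, 'development', 'segments'), 'x')
  remove_files_below(root)
  puts Dir.exist?(File.join(root, 'development'))
  puts File.exist?(File.join(root, 'development', 'segments'))
end
```

Note that Dir.glob does not match hidden files (names starting with a dot) unless File::FNM_DOTMATCH is passed, so such files would survive this kind of cleanup.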
.close ⇒ Object
Closes the index and ensures that a new Index is instantiated the next time index is called.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 89
    def close
      @@index.close rescue nil
      @@index = nil
    end
.ensure_index_existence ⇒ Object
Creates the index unless it already exists or RAILS_ENV is "production".
    # File 'lib/picolena/templates/app/models/indexer.rb', line 114
    def ensure_index_existence
      index_every_directory(:remove_first) unless index_exists? or RAILS_ENV=="production"
    end
.index ⇒ Object
Only one IndexWriter should be instantiated. If an index already exists, returns it; otherwise creates it.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 109
    def index
      @@index ||= Ferret::Index::Index.new(default_index_params)
    end
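The interplay between .index and .close is a memoize-and-reset pattern: the first call creates the instance, later calls reuse it, and closing clears the reference so that the next call builds a fresh one. The sketch below shows the pattern in isolation; `SketchIndexer` and `FakeIndex` are hypothetical, and a class-level instance variable is used here instead of the class variable found in the source.

```ruby
class SketchIndexer
  # Hypothetical resource standing in for Ferret::Index::Index.
  class FakeIndex
    def initialize
      @closed = false
    end

    def close
      @closed = true
    end

    def closed?
      @closed
    end
  end

  @index = nil

  class << self
    # Returns the existing instance, or creates one on first use.
    def index
      @index ||= FakeIndex.new
    end

    # Closes the current instance (if any) and forgets it.
    def close
      @index.close rescue nil
      @index = nil
    end
  end
end

first = SketchIndexer.index
puts first.equal?(SketchIndexer.index) # same object while the index stays open
SketchIndexer.close
puts first.equal?(SketchIndexer.index) # a new object after close
```

The `rescue nil` modifier makes close safe to call even when no index has been created yet.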
.index_directory_with_multithreads(dir) ⇒ Object
Indexes a given directory, using @@threads_number threads. To do so, it retrieves a list of every included document, splits it into @@threads_number chunks, and creates a new indexing thread for every chunk.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 39
    def index_directory_with_multithreads(dir)
      logger.debug "Indexing #{dir}, #{Picolena::IndexingConfiguration[:threads_number]} threads"

      indexing_list=Dir[File.join(dir,"**/*")].select{|filename|
        File.file?(filename) && File.basename(filename) !~ Picolena::ToIgnore
      }

      indexing_list_chunks=indexing_list.in_transposed_slices(Picolena::IndexingConfiguration[:threads_number])

      prepare_multi_threads_environment

      indexing_list_chunks.each_with_thread{|chunk|
        chunk.each{|complete_path|
          if should_index_this_document?(complete_path) then
            add_or_update_file(complete_path)
          else
            logger.debug "Identical : #{complete_path}"
          end
          index_time_dbm_file[complete_path] = Time.now._dump
        }
      }
    end
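in_transposed_slices and each_with_thread are not core Ruby methods, and their definitions are not shown on this page. The sketch below shows one way the chunk-then-thread approach could work, assuming a round-robin distribution of items and one thread per chunk; `transposed_slices`, `each_with_thread` and `process_in_threads` are hypothetical helpers written for this example.

```ruby
# Hypothetical helper: distributes items round-robin over n slices,
# so that slice i receives items i, i+n, i+2n, ...
def transposed_slices(items, n)
  slices = Array.new(n) { [] }
  items.each_with_index { |item, i| slices[i % n] << item }
  slices.reject(&:empty?)
end

# Hypothetical helper: runs the block in one thread per slice
# and waits for all threads to finish.
def each_with_thread(slices, &block)
  slices.map { |slice| Thread.new(slice, &block) }.each(&:join)
end

# Processes every item in a worker thread and collects the processed items.
def process_in_threads(items, threads_number)
  done = []
  mutex = Mutex.new
  each_with_thread(transposed_slices(items, threads_number)) do |chunk|
    chunk.each do |item|
      # Shared state is only touched while holding the mutex.
      mutex.synchronize { done << item }
    end
  end
  done
end

p transposed_slices([1, 2, 3, 4, 5], 2)
p process_in_threads((1..10).to_a, 3).sort
```

A round-robin split keeps chunk sizes within one item of each other, which is a reasonable default when the cost of each document is unknown in advance.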
.index_every_directory(remove_first = false) ⇒ Object
Finds every document included in IndexedDirectories, parses them with PlainTextExtractor and adds them to the index.
Updates the existing index unless remove_first is true, in which case the index is removed and then rebuilt from scratch.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 21
    def index_every_directory(remove_first=false)
      clear! if remove_first
      lock!
      @from_scratch = remove_first
      logger.start_indexing
      Picolena::IndexedDirectories.each{|dir, alias_dir|
        index_directory_with_multithreads(dir)
      }
      logger.debug "Now optimizing index"
      index.optimize
      index_time_dbm_file['last']=Time.now._dump
      unlock!
      logger.show_report
    end
.last_update ⇒ Object
Returns the time at which the index was last created/updated. Returns “none” if it doesn’t exist.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 125
    def last_update
      Time._load(index_time_dbm_file['last']) rescue "none"
    end
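Indexing times are stored as strings and converted back to Time objects on read, with "none" returned when nothing has been stored yet. The sketch below shows the same round trip under two assumptions: a plain Hash stands in for the dbm file, and Marshal.dump / Marshal.load replace the direct Time#_dump / Time._load calls of the source, since those may not be publicly callable in every Ruby version. `store_time` and `load_time` are hypothetical names.

```ruby
# A plain Hash stands in for the dbm file, which only stores strings.
def store_time(store, key, time)
  store[key] = Marshal.dump(time)
end

# Returns the stored Time, or "none" when the key is missing or unreadable.
def load_time(store, key)
  Marshal.load(store[key])
rescue TypeError, ArgumentError
  "none"
end

store = {}
puts load_time(store, 'last')            # nothing stored yet
store_time(store, 'last', Time.at(1_200_000_000))
puts load_time(store, 'last').to_i
```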
.locked? ⇒ Boolean
    # File 'lib/picolena/templates/app/models/indexer.rb', line 145
    def locked?
      File.exists?(lock_file)
    end
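locked? only tests whether the lock file exists; how lock! and unlock! create and remove that file is not shown on this page. A minimal lock-file sketch, assuming that locking touches the file and unlocking deletes it, could look like this. `SketchLock` and `with_lock` are hypothetical, and the `ensure` clause is an addition of this sketch rather than something the source does.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical lock based on the presence of a file.
class SketchLock
  def initialize(path)
    @path = path
  end

  def lock!
    FileUtils.touch(@path)
  end

  def unlock!
    FileUtils.rm_f(@path)
  end

  def locked?
    File.exist?(@path)
  end

  # Runs the block while locked; the lock is released even if the block raises.
  def with_lock
    lock!
    yield
  ensure
    unlock!
  end
end

Dir.mktmpdir do |dir|
  lock = SketchLock.new(File.join(dir, 'indexing.lock'))
  lock.with_lock { puts lock.locked? }
  puts lock.locked?
end
```

Releasing the lock in an `ensure` clause avoids a stale lock file when indexing aborts with an exception.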
.prune_index ⇒ Object
Checks for indexed files that are missing from the filesystem, or that no longer belong to any directory in IndexedDirectories, and removes them from the index and the dbm file.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 96
    def prune_index
      missing_files=index_time_dbm_file.reject{|filename,itime| File.exists?(filename) && Picolena::IndexedDirectories.any?{|dir,alias_path| filename.starts_with?(dir)}}
      missing_files.each{|filename, itime|
        index.writer.delete(:complete_path, filename)
        index_time_dbm_file.delete(filename)
        logger.debug "Removed : #{filename}"
      }
      index.optimize
    end
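The selection step keeps an entry only if its file still exists and still lies under one of the indexed directories; everything else is considered stale. The sketch below isolates that filter. It assumes a Hash of path to indexing time and a plain Array of directories, and uses String#start_with? from core Ruby in place of the starts_with? call in the source; `stale_entries` is a hypothetical name.

```ruby
require 'tmpdir'

# index_times maps a path to its last indexing time;
# indexed_dirs lists the directories still configured for indexing.
def stale_entries(index_times, indexed_dirs)
  index_times.reject do |filename, _itime|
    File.exist?(filename) && indexed_dirs.any? { |dir| filename.start_with?(dir) }
  end
end

Dir.mktmpdir do |dir|
  kept = File.join(dir, 'kept.txt')
  gone = File.join(dir, 'gone.txt') # recorded as indexed, but never created
  File.write(kept, 'x')
  times = { kept => 1, gone => 2 }
  p stale_entries(times, [dir]).keys.map { |f| File.basename(f) }
end
```

If a directory is dropped from the configuration, every file recorded under it becomes stale as well, even though the files still exist on disk.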
.reload_file_mtime ⇒ Object
Returns the time at which the reload file was last touched. Useful to know if other processes have modified the shared index, and if the Indexer should be reloaded.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 132
    def reload_file_mtime
      touch_reload_file! unless File.exists?(reload_file)
      File.mtime(reload_file)
    end
.should_index_this_document?(complete_path) ⇒ Boolean
Retrieves the time at which the document was last indexed and compares it to the file's modification time. Returns true if the index is being rebuilt from scratch, if the document has never been indexed, or if the file has been modified since it was last indexed; returns false otherwise.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 140
    def should_index_this_document?(complete_path)
      last_itime=index_time_dbm_file[complete_path]
      @from_scratch || !last_itime || File.mtime(complete_path)> Time._load(last_itime)
    end
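The decision comes down to three cases, which the following sketch spells out with an explicit argument for each input. `should_index?` is a hypothetical function, and the last indexing time is passed in directly as a Time (or nil) rather than read from the dbm file.

```ruby
require 'tmpdir'

# Returns true when the file needs (re)indexing:
# a full rebuild, a file never indexed before, or a file modified since.
def should_index?(complete_path, last_indexed_at, from_scratch: false)
  from_scratch || last_indexed_at.nil? || File.mtime(complete_path) > last_indexed_at
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'doc.txt')
  File.write(path, 'some content')
  mtime = File.mtime(path)

  puts should_index?(path, nil)                           # never indexed
  puts should_index?(path, mtime - 60)                    # modified after last indexing
  puts should_index?(path, mtime + 60)                    # unchanged since last indexing
  puts should_index?(path, mtime + 60, from_scratch: true) # full rebuild
end
```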
.size ⇒ Object
Returns how many files are indexed.
    # File 'lib/picolena/templates/app/models/indexer.rb', line 119
    def size
      index.size
    end