Class: Indexer

Inherits: Object

Defined in: lib/picolena/templates/app/models/indexer.rb

Class Method Summary

.add_or_update_file(complete_path) ⇒ Object

Retrieves content and language from a given document, and adds it to the index.
.clear!(all = false) ⇒ Object

Ensures index is closed, and removes every index file for RAILS_ENV.
.close ⇒ Object

Closes the index and ensures that a new Index is instantiated next time index is called.
.ensure_index_existence ⇒ Object

Creates the index unless it already exists.
.index ⇒ Object

Only one IndexWriter should be instantiated.
.index_directory_with_multithreads(dir) ⇒ Object

Indexes a given directory, using @@threads_number threads.
.index_every_directory(remove_first = false) ⇒ Object

Finds every document included in IndexedDirectories, parses them with PlainTextExtractor and adds them to the index.
.last_update ⇒ Object

Returns the time at which the index was last created/updated.
.locked? ⇒ Boolean
.prune_index ⇒ Object

Checks for indexed files that are missing from the filesystem and removes them from the index and the dbm file.
.reload_file_mtime ⇒ Object

Returns the time at which the reload file was last touched.
.should_index_this_document?(complete_path) ⇒ Boolean

For a given document, retrieves the time it was last indexed, compares it to the file's modification time, and returns false unless the file has been modified since it was last indexed.
.size ⇒ Object

Returns how many files are indexed.

Class Method Details

.add_or_update_file(complete_path) ⇒ `Object`

Retrieves content and language from a given document, and adds it to the index. Since Document#probably_unique_id is used as index :key, no document will be added twice to the index, and the old document will just get updated.

If content cannot be extracted (no content found, or no PlainTextExtractor defined), some basic information about the document (mtime, filename, complete_path) is indexed anyway.

# File 'lib/picolena/templates/app/models/indexer.rb', line 67

def add_or_update_file(complete_path)
  document = Document.default_fields_for(complete_path)
  begin
    PlainTextExtractor.extract_thumbnail_from(complete_path)
    document.merge! PlainTextExtractor.extract_information_from(complete_path)
    raise "empty document #{complete_path}" if document[:content].strip.empty?
    logger.add_document document
  rescue => e
    logger.reject_document document, e
  end
  index << document
end
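The fallback described above can be sketched without Ferret or Picolena. In this minimal illustration, `FakeExtractor`, `SKETCH_INDEX` and `add_or_update` are hypothetical stand-ins (not part of Picolena): the extractor only handles `.txt` paths, and a Hash keyed by path plays the role of an index with a unique key.

```ruby
# Hypothetical stand-in for PlainTextExtractor: only knows how to read .txt paths.
module FakeExtractor
  def self.extract_information_from(path)
    raise ArgumentError, "no extractor for #{path}" unless path.end_with?('.txt')
    { :content => "text of #{File.basename(path)}", :language => 'en' }
  end
end

# Hypothetical stand-in for the index: a Hash keyed by a unique id, so adding
# the same document twice replaces the earlier entry instead of duplicating it.
SKETCH_INDEX = {}

def add_or_update(path)
  # Basic fields that are always available, even without extracted content.
  document = { :id => path, :complete_path => path, :filename => File.basename(path) }
  begin
    document.merge!(FakeExtractor.extract_information_from(path))
    raise "empty document #{path}" if document[:content].strip.empty?
  rescue => e
    # Extraction failed: remember why, and fall through with the basic fields.
    document[:error] = e.message
  end
  SKETCH_INDEX[document[:id]] = document
end

add_or_update('/docs/readme.txt')
add_or_update('/docs/readme.txt') # same key: the entry is replaced
add_or_update('/docs/photo.raw')  # no extractor: basic fields are still stored
```

After these three calls the sketch index holds two entries, and the entry for `photo.raw` has a filename but no `:content`.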

.clear!(all = false) ⇒ `Object`

Ensures index is closed, and removes every index file for RAILS_ENV.

# File 'lib/picolena/templates/app/models/indexer.rb', line 81

def clear!(all=false)
  close
  to_remove=all ? Picolena::IndexesSavePath : Picolena::IndexSavePath
  Dir.glob(File.join(to_remove,'**/*')).each{|f| FileUtils.rm(f) if File.file?(f)}
end
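The removal step can be tried in isolation. The sketch below uses a temporary directory instead of Picolena::IndexSavePath; `remove_index_files` and `clear_demo` are hypothetical names used only for this illustration.

```ruby
require 'fileutils'
require 'tmpdir'

# Removes every regular file below +root+ and leaves the directories in place.
def remove_index_files(root)
  Dir.glob(File.join(root, '**/*')).each do |f|
    FileUtils.rm(f) if File.file?(f)
  end
end

# Builds a throwaway tree, clears it, and reports what is left.
def clear_demo
  Dir.mktmpdir do |root|
    subdir  = File.join(root, 'development')
    segment = File.join(subdir, 'segments')
    FileUtils.mkdir_p(subdir)
    File.write(segment, 'x')
    remove_index_files(root)
    [File.directory?(subdir), File.exist?(segment)]
  end
end
```

Note that a `**/*` glob does not match hidden files by default, so dotfiles under the index path would survive this kind of cleanup.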

.close ⇒ `Object`

Closes the index and ensures that a new Index is instantiated next time index is called.

# File 'lib/picolena/templates/app/models/indexer.rb', line 89

def close
  @@index.close rescue nil
  @@index = nil
end

.ensure_index_existence ⇒ `Object`

Creates the index unless it already exists.



# File 'lib/picolena/templates/app/models/indexer.rb', line 114

def ensure_index_existence
  index_every_directory(:remove_first) unless index_exists? or RAILS_ENV=="production"
end

.index ⇒ `Object`

Only one IndexWriter should be instantiated. If one index already exists, returns it. Creates it otherwise.



# File 'lib/picolena/templates/app/models/indexer.rb', line 109

def index
  @@index ||= Ferret::Index::Index.new(default_index_params)
end
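The interplay between .index and .close is a plain memoization pattern. A minimal sketch, assuming a hypothetical `FakeIndex` class in place of Ferret::Index::Index and a hypothetical `MiniIndexer` class in place of Indexer:

```ruby
# Hypothetical stand-in for Ferret::Index::Index.
class FakeIndex
  attr_reader :closed

  def close
    @closed = true
  end
end

class MiniIndexer
  @@index = nil

  # Returns the existing index, or creates one on first use.
  def self.index
    @@index ||= FakeIndex.new
  end

  # Closes the current index (ignoring errors if none is open) and forgets it,
  # so that the next call to .index builds a fresh instance.
  def self.close
    @@index.close rescue nil
    @@index = nil
  end
end
```

Calling `MiniIndexer.index` twice returns the same object; after `MiniIndexer.close`, the next call returns a new one.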

.index_directory_with_multithreads(dir) ⇒ `Object`

Indexes a given directory, using @@threads_number threads. To do so, it retrieves a list of every included document, splits it into @@threads_number chunks, and creates a new indexing thread for each chunk.

# File 'lib/picolena/templates/app/models/indexer.rb', line 39

def index_directory_with_multithreads(dir)
  logger.debug "Indexing #{dir}, #{Picolena::IndexingConfiguration[:threads_number]} threads"
  indexing_list=Dir[File.join(dir,"**/*")].select{|filename|
    File.file?(filename) && File.basename(filename) !~ Picolena::ToIgnore
  }

  indexing_list_chunks=indexing_list.in_transposed_slices(Picolena::IndexingConfiguration[:threads_number])
  prepare_multi_threads_environment

  indexing_list_chunks.each_with_thread{|chunk|
    chunk.each{|complete_path|
      if should_index_this_document?(complete_path) then
        add_or_update_file(complete_path)
      else
        logger.debug "Identical : #{complete_path}"
      end
      index_time_dbm_file[complete_path] = Time.now._dump
    }
  }
end
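The chunk-and-thread approach can be sketched in plain Ruby. `in_transposed_slices` and `each_with_thread` are Picolena extensions whose source is not shown here, so `transposed_slices` and `each_chunk_in_thread` below are hypothetical re-implementations based only on their names; the real helpers may distribute files differently.

```ruby
require 'thread'

# Assumed behaviour: element i goes to slice (i % n), so consecutive files
# are spread over different slices. Empty slices are dropped.
def transposed_slices(list, n)
  slices = Array.new(n) { [] }
  list.each_with_index { |item, i| slices[i % n] << item }
  slices.reject(&:empty?)
end

# Runs the given block once per chunk, each in its own thread, and waits
# for every thread to finish.
def each_chunk_in_thread(chunks)
  chunks.map { |chunk| Thread.new { yield chunk } }.each(&:join)
end

# Collects every processed path through a thread-safe queue.
def process_in_threads(paths, threads_number)
  done = Queue.new
  each_chunk_in_thread(transposed_slices(paths, threads_number)) do |chunk|
    chunk.each { |path| done << path }
  end
  Array.new(done.size) { done.pop }.sort
end
```

In this sketch, `transposed_slices((1..7).to_a, 3)` gives `[[1, 4, 7], [2, 5], [3, 6]]`. Shared state touched from several threads (here the queue) has to be thread-safe.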

.index_every_directory(remove_first = false) ⇒ `Object`

Finds every document included in IndexedDirectories, parses them with PlainTextExtractor and adds them to the index.

Updates the existing index unless remove_first is set to true, in which case the index is removed first and then rebuilt from scratch.

# File 'lib/picolena/templates/app/models/indexer.rb', line 21

def index_every_directory(remove_first=false)
  clear! if remove_first
  lock!
  @from_scratch = remove_first
  logger.start_indexing
  Picolena::IndexedDirectories.each{|dir, alias_dir|
    index_directory_with_multithreads(dir)
  }
  logger.debug "Now optimizing index"
  index.optimize
  index_time_dbm_file['last']=Time.now._dump
  unlock!
  logger.show_report
end

.last_update ⇒ `Object`

Returns the time at which the index was last created/updated. Returns “none” if it doesn’t exist.



# File 'lib/picolena/templates/app/models/indexer.rb', line 125

def last_update
  Time._load(index_time_dbm_file['last']) rescue "none"
end
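The store-then-reload round trip behind last_update can be sketched as follows. A DBM file only stores strings, so times have to be serialized. This sketch swaps in `Marshal.dump` and `Marshal.load` for the `Time#_dump` and `Time._load` calls used above, since those may not be available on current Ruby versions. `TIME_STORE`, `record_time` and `read_time` are hypothetical names, and a Hash stands in for the DBM file.

```ruby
# Hypothetical stand-in for the DBM file (string keys, string values).
TIME_STORE = {}

# Serializes +time+ and stores it under +key+.
def record_time(key, time = Time.now)
  TIME_STORE[key] = Marshal.dump(time)
end

# Returns the stored Time, or "none" when nothing usable is stored.
def read_time(key)
  Marshal.load(TIME_STORE[key]) rescue "none"
end
```

For example, `read_time('last')` returns "none" until `record_time('last')` has been called.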

.locked? ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/picolena/templates/app/models/indexer.rb', line 145

def locked?
  File.exist?(lock_file)
end

.prune_index ⇒ `Object`

Checks for indexed files that are missing from the filesystem and removes them from the index and the dbm file.

# File 'lib/picolena/templates/app/models/indexer.rb', line 96

def prune_index
  # An entry is kept only if its file still exists and still lies under an indexed directory.
  missing_files=index_time_dbm_file.reject{|filename,itime|
    File.exist?(filename) && Picolena::IndexedDirectories.any?{|dir,alias_path| filename.starts_with?(dir)}
  }
  missing_files.each{|filename, itime|
    index.writer.delete(:complete_path, filename)
    index_time_dbm_file.delete(filename)
    logger.debug "Removed : #{filename}"
  }
  index.optimize
end
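The selection of entries to prune can be illustrated on its own. In this sketch `stale_entries` and `prune_demo` are hypothetical names, a Hash stands in for the dbm file, and core Ruby's `start_with?` replaces ActiveSupport's `starts_with?`.

```ruby
require 'tmpdir'

# Returns the entries whose file no longer exists, or no longer lies under
# one of the indexed directories.
def stale_entries(index_times, indexed_dirs)
  index_times.reject do |filename, _itime|
    File.exist?(filename) && indexed_dirs.any? { |dir| filename.start_with?(dir) }
  end
end

# Builds two throwaway directories, only one of which counts as indexed.
def prune_demo
  Dir.mktmpdir do |indexed|
    Dir.mktmpdir do |other|
      kept    = File.join(indexed, 'kept.txt')
      gone    = File.join(indexed, 'gone.txt') # never created on disk
      outside = File.join(other, 'outside.txt')
      [kept, outside].each { |f| File.write(f, 'x') }
      times = { kept => 't1', gone => 't2', outside => 't3' }
      stale_entries(times, [indexed]).keys.map { |f| File.basename(f) }.sort
    end
  end
end
```

Here `prune_demo` reports `gone.txt` (deleted) and `outside.txt` (no longer under an indexed directory) as stale. Because the check is a plain string prefix, a sibling directory whose name merely begins with an indexed path would also count as included.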

.reload_file_mtime ⇒ `Object`

Returns the time at which the reload file was last touched. Useful to know if other processes have modified the shared index, and if the Indexer should be reloaded.

# File 'lib/picolena/templates/app/models/indexer.rb', line 132

def reload_file_mtime
  touch_reload_file! unless File.exist?(reload_file)
  File.mtime(reload_file)
end
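One way a process could use this timestamp is to compare it with the time at which it last loaded the index. This is a sketch under assumptions: `reload_mtime` and `reload_needed?` are hypothetical helpers, and the path of the reload file is passed in explicitly.

```ruby
require 'fileutils'

# Returns the mtime of +reload_file+, creating the file first if needed.
def reload_mtime(reload_file)
  FileUtils.touch(reload_file) unless File.exist?(reload_file)
  File.mtime(reload_file)
end

# True when the reload file was touched after +loaded_at+, i.e. another
# process has probably modified the shared index since then.
def reload_needed?(reload_file, loaded_at)
  reload_mtime(reload_file) > loaded_at
end
```

A process that writes to the index would touch the reload file afterwards, and readers would call `reload_needed?` before searching.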

.should_index_this_document?(complete_path) ⇒ `Boolean`

For a given document, retrieves the time it was last indexed, compares it to the file's modification time, and returns false unless the file has been modified since it was last indexed.

Returns:

(Boolean)

# File 'lib/picolena/templates/app/models/indexer.rb', line 140

def should_index_this_document?(complete_path)
  last_itime=index_time_dbm_file[complete_path]
  @from_scratch || !last_itime || File.mtime(complete_path) > Time._load(last_itime)
end
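The decision reduces to three conditions, shown here as a pure function so that each case can be checked without touching the filesystem. `should_index?` and its parameters are hypothetical names for this sketch.

```ruby
# Re-index when rebuilding from scratch, when the file was never indexed,
# or when it was modified after its last indexing time.
def should_index?(from_scratch, last_indexed_at, modified_at)
  from_scratch || last_indexed_at.nil? || modified_at > last_indexed_at
end
```

With `t = Time.at(1000)`, `should_index?(false, t, t + 1)` is true, while `should_index?(false, t, t)` is false because the file has not changed since it was indexed.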

.size ⇒ `Object`

Returns how many files are indexed.



# File 'lib/picolena/templates/app/models/indexer.rb', line 119

def size
  index.size
end
