Class: Indexer

Inherits:
Object
Defined in:
lib/picolena/templates/app/models/indexer.rb

Class Method Details

.add_or_update_file(complete_path) ⇒ Object

Retrieves content and language from a given document, and adds it to the index. Because Document#probably_unique_id is used as the index :key, a document is never added twice: indexing it again simply updates the existing entry.

If content cannot be extracted (no content found, or no PlainTextExtractor defined), some basic information about the document (mtime, filename, complete_path) is indexed anyway.



# File 'lib/picolena/templates/app/models/indexer.rb', line 67

def add_or_update_file(complete_path)
  document = Document.default_fields_for(complete_path)
  begin
    PlainTextExtractor.extract_thumbnail_from(complete_path)
    document.merge! PlainTextExtractor.extract_information_from(complete_path)
    raise "empty document #{complete_path}" if document[:content].strip.empty?
    logger.add_document document
  rescue => e
    logger.reject_document document, e
  end
  index << document
end
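
The update-instead-of-duplicate behaviour and the fallback to basic fields can be illustrated with a minimal sketch. It is not Picolena's code: the `INDEX` hash stands in for the Ferret index, `extract_information` is a hypothetical replacement for `PlainTextExtractor.extract_information_from`, and the path is used as the key where the real class uses `Document#probably_unique_id`.

```ruby
# Hypothetical stand-in for the Ferret index: a Hash keyed by document key.
INDEX = {}

# Basic fields that are always available, whatever the file type.
def default_fields_for(path)
  { key: path, complete_path: path, filename: File.basename(path) }
end

# Hypothetical extractor: only knows how to read ".txt" files.
def extract_information(path)
  raise "no extractor for #{path}" unless path.end_with?(".txt")
  { content: "text of #{File.basename(path)}" }
end

def add_or_update(path)
  document = default_fields_for(path)
  begin
    document.merge!(extract_information(path))
  rescue StandardError
    # Extraction failed: the document keeps only its basic fields.
  end
  # Keyed storage: a second call with the same key overwrites the first.
  INDEX[document[:key]] = document
end
```

Calling `add_or_update` twice with the same path leaves a single entry in `INDEX`, and a path without an extractor is still stored with its filename and complete path.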

.clear!(all = false) ⇒ Object

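Ensures the index is closed, then removes every index file for RAILS_ENV. If all is true, the files under Picolena::IndexesSavePath are removed instead of those under Picolena::IndexSavePath.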



# File 'lib/picolena/templates/app/models/indexer.rb', line 81

def clear!(all=false)
  close
  to_remove=all ? Picolena::IndexesSavePath : Picolena::IndexSavePath
  Dir.glob(File.join(to_remove,'**/*')).each{|f| FileUtils.rm(f) if File.file?(f)}
end
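
As a rough illustration of the glob used above, the following sketch removes every regular file below a directory while leaving the directory tree in place. The helper name `remove_files_under` is made up for this example.

```ruby
require "fileutils"

# Deletes every regular file below +root+; directories are left untouched,
# because the File.file? guard skips them.
def remove_files_under(root)
  Dir.glob(File.join(root, "**/*")).each do |f|
    FileUtils.rm(f) if File.file?(f)
  end
end
```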

.close ⇒ Object

Closes the index and ensures that a new Index is instantiated next time index is called.



# File 'lib/picolena/templates/app/models/indexer.rb', line 89

def close
  @@index.close rescue nil
  @@index = nil
end

.ensure_index_existence ⇒ Object

Creates the index unless it already exists or RAILS_ENV is "production".



# File 'lib/picolena/templates/app/models/indexer.rb', line 114

def ensure_index_existence
  index_every_directory(:remove_first) unless index_exists? or RAILS_ENV=="production"
end

.index ⇒ Object

Only one IndexWriter should be instantiated. Returns the existing index if there is one, and creates it otherwise.



# File 'lib/picolena/templates/app/models/indexer.rb', line 109

def index
  @@index ||= Ferret::Index::Index.new(default_index_params)
end
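
The interplay between index and close can be sketched as follows. This is only an illustration: `FakeIndex` is a hypothetical stand-in for `Ferret::Index::Index`, and `IndexHolder` is a made-up class name.

```ruby
# Hypothetical stand-in for Ferret::Index::Index.
class FakeIndex
  def close; end
end

class IndexHolder
  @@index = nil

  # Returns the existing instance, creating it on first use.
  def self.index
    @@index ||= FakeIndex.new
  end

  # Drops the instance so that the next call to index builds a fresh one.
  # The rescue makes closing an already-closed (nil) index harmless.
  def self.close
    @@index.close rescue nil
    @@index = nil
  end
end
```

Repeated calls to `IndexHolder.index` return the same object until `IndexHolder.close` is called, after which a new instance is created on demand.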

.index_directory_with_multithreads(dir) ⇒ Object

Indexes a given directory, using @@threads_number threads. To do so, it retrieves a list of every included document, splits it into @@threads_number chunks, and creates a new indexing thread for each chunk.



# File 'lib/picolena/templates/app/models/indexer.rb', line 39

def index_directory_with_multithreads(dir)
  logger.debug "Indexing #{dir}, #{Picolena::IndexingConfiguration[:threads_number]} threads"
  indexing_list=Dir[File.join(dir,"**/*")].select{|filename|
    File.file?(filename) && File.basename(filename) !~ Picolena::ToIgnore
  }

  indexing_list_chunks=indexing_list.in_transposed_slices(Picolena::IndexingConfiguration[:threads_number])
  prepare_multi_threads_environment

  indexing_list_chunks.each_with_thread{|chunk|
    chunk.each{|complete_path|
      if should_index_this_document?(complete_path) then
        add_or_update_file(complete_path)
      else
        logger.debug "Identical : #{complete_path}"
      end
      index_time_dbm_file[complete_path] = Time.now._dump
    }
  }
end
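
The helpers `in_transposed_slices` and `each_with_thread` are not shown on this page. The sketch below is one plausible reading of them, not Picolena's implementation: it assumes that "transposed slices" means a round-robin distribution of the list, and both method names are hypothetical.

```ruby
# Distributes items round-robin into n chunks (assumed behaviour).
def transposed_slices(items, n)
  chunks = Array.new(n) { [] }
  items.each_with_index { |item, i| chunks[i % n] << item }
  chunks
end

# Starts one thread per chunk, passes the chunk to the block,
# and waits for every thread to finish.
def each_chunk_in_thread(chunks, &block)
  chunks.map { |chunk| Thread.new(chunk, &block) }.each(&:join)
end
```

With this distribution, five paths split across two threads give chunks of three and two paths, so no thread receives much more work than another when files are of similar size.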

.index_every_directory(remove_first = false) ⇒ Object

Finds every document included in IndexedDirectories, parses them with PlainTextExtractor and adds them to the index.

Updates the index unless the remove_first parameter is true, in which case it removes the index before re-creating it.



# File 'lib/picolena/templates/app/models/indexer.rb', line 21

def index_every_directory(remove_first=false)
  clear! if remove_first
  lock!
  @from_scratch = remove_first
  logger.start_indexing
  Picolena::IndexedDirectories.each{|dir, alias_dir|
    index_directory_with_multithreads(dir)
  }
  logger.debug "Now optimizing index"
  index.optimize
  index_time_dbm_file['last']=Time.now._dump
  unlock!
  logger.show_report
end

.last_update ⇒ Object

Returns the time at which the index was last created/updated. Returns “none” if it doesn’t exist.



# File 'lib/picolena/templates/app/models/indexer.rb', line 125

def last_update
  Time._load(index_time_dbm_file['last']) rescue "none"
end
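
The store-then-reload round trip behind last_update can be sketched with a plain Hash in place of the dbm file. As an assumption made for this example, `Marshal.dump` and `Marshal.load` stand in for the listing's `Time#_dump` and `Time._load`, so the sketch does not depend on those methods; `record_update` is a hypothetical helper.

```ruby
# Stores a serialized timestamp under the 'last' key.
def record_update(store, time = Time.now)
  store["last"] = Marshal.dump(time)
end

# Returns the stored Time, or "none" when nothing usable is stored
# (Marshal.load raises on nil, and the rescue turns that into "none").
def last_update(store)
  Marshal.load(store["last"]) rescue "none"
end
```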

.locked? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/picolena/templates/app/models/indexer.rb', line 145

def locked?
  File.exists?(lock_file)
end

.prune_index ⇒ Object

Checks for indexed files that are missing from the filesystem, or that no longer belong to any indexed directory, and removes them from the index and the dbm file.



# File 'lib/picolena/templates/app/models/indexer.rb', line 96

def prune_index
  missing_files=index_time_dbm_file.reject{|filename,itime| File.exists?(filename) && Picolena::IndexedDirectories.any?{|dir,alias_path| filename.starts_with?(dir)}}
  missing_files.each{|filename, itime|
    index.writer.delete(:complete_path, filename)
    index_time_dbm_file.delete(filename)
    logger.debug "Removed : #{filename}"
  }
  index.optimize
end
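
The selection step can be tried in isolation with the sketch below. It assumes a Hash in place of the dbm file and a plain array of directory paths in place of `Picolena::IndexedDirectories`; `stale_entries` is a hypothetical name, and `start_with?` replaces the ActiveSupport `starts_with?` used in the listing.

```ruby
require "tmpdir"

# Returns the recorded entries whose file no longer exists, or whose path
# is outside every indexed directory.
def stale_entries(recorded, indexed_dirs)
  recorded.reject do |filename, _itime|
    File.exist?(filename) && indexed_dirs.any? { |dir| filename.start_with?(dir) }
  end
end

Dir.mktmpdir do |dir|
  kept = File.join(dir, "kept.txt")
  File.write(kept, "still here")
  recorded = { kept => "t1", File.join(dir, "deleted.txt") => "t2" }
  stale_entries(recorded, [dir]).each_key { |f| puts "Removed : #{f}" }
end
```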

.reload_file_mtime ⇒ Object

Returns the time at which the reload file was last touched. Useful to know if other processes have modified the shared index, and if the Indexer should be reloaded.



# File 'lib/picolena/templates/app/models/indexer.rb', line 132

def reload_file_mtime
  touch_reload_file! unless File.exists?(reload_file)
  File.mtime(reload_file)
end
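
One way a process might use this timestamp is sketched below. The `reload_needed?` helper is hypothetical and only illustrates the idea of comparing the file's mtime with the time the process last looked at it.

```ruby
require "fileutils"

# Returns the mtime of +path+, creating the file first if it is missing.
def reload_mtime(path)
  FileUtils.touch(path) unless File.exist?(path)
  File.mtime(path)
end

# True when +path+ has been touched since +seen_at+, which suggests that
# another process has modified the shared index.
def reload_needed?(path, seen_at)
  reload_mtime(path) > seen_at
end
```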

.should_index_this_document?(complete_path) ⇒ Boolean

For a given document, retrieves the time it was last indexed, compares it to the file's modification time, and returns false unless the file has been modified since it was last indexed. Always returns true when indexing from scratch or when the document has never been indexed.

Returns:

  • (Boolean)


# File 'lib/picolena/templates/app/models/indexer.rb', line 140

def should_index_this_document?(complete_path)
  last_itime=index_time_dbm_file[complete_path]
  @from_scratch || !last_itime || File.mtime(complete_path)> Time._load(last_itime) 
end
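
The decision reduces to three cases, shown here as a standalone sketch that takes Time values directly instead of reading the dbm file. The method name and signature are hypothetical.

```ruby
# +mtime+ is the file's modification time; +last_itime+ is the time of the
# last indexing run for that file, or nil if it was never indexed.
def should_index?(mtime, last_itime, from_scratch = false)
  from_scratch || last_itime.nil? || mtime > last_itime
end
```

A file whose modification time equals its last indexing time is treated as unchanged, because the comparison is strict.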

.size ⇒ Object

Returns how many files are indexed.



# File 'lib/picolena/templates/app/models/indexer.rb', line 119

def size
  index.size
end