Class: Document

Inherits: Object

Defined in: lib/picolena/templates/app/models/document.rb,
lib/picolena/templates/spec/spec_helper.rb

Overview

The Document class retrieves information from the filesystem and from the index for any given document.

Instance Attribute Summary

#complete_path ⇒ Object (also: #to_s) readonly

Returns the value of attribute complete_path.
#matching_content ⇒ Object

Returns the value of attribute matching_content.
#score ⇒ Object

Returns the value of attribute score.

Class Method Summary

.default_fields_for(complete_path) ⇒ Object

Indexing fields that are shared between every document.
.find_by_extension(ext) ⇒ Object

Instance Method Summary

#alias_path ⇒ Object

End users should not necessarily know where documents are stored internally.
#basename ⇒ Object

Returns the filename without its extension, e.g. “buildings.odt” => “buildings”.
#cached ⇒ Object

Cache à la Google.
#content ⇒ Object

Retrieves content as it is now.
#extractor ⇒ Object
#filename ⇒ Object
#has_content? ⇒ Boolean

Did at least one letter get extracted from the document? This boolean is used in views to decide whether a link to the content should be displayed.
#highlighted_cache(raw_query) ⇒ Object

Returns cached content with matching terms between ‘<<’ ‘>>’.
#icon_path ⇒ Object

Returns thumbnail if available, mime icon otherwise.
#initialize(path) ⇒ Document constructor

Instantiates a new Document, and ensures that the given path exists and is included in an indexed directory.
#inspect ⇒ Object

Returns complete path as well as matching score and language if available.
#language ⇒ Object

Returns found language, if any.
#mime ⇒ Object
#mtime ⇒ Object

Returns the last modification time before the document got indexed, as a YYYYMMDDHHMMSS integer.
#pretty_date ⇒ Object

Returns the last modification date before the document got indexed.
#pretty_mtime ⇒ Object

Returns the last modification time before the document got indexed.
#pretty_score ⇒ Object

Returns the matching score as a percentage, e.g. 56.3%.
#probably_unique_id ⇒ Object

Returns an id for this document.
#supported? ⇒ Boolean

Returns true iff some PlainTextExtractor has been defined to convert it to plain text.

Constructor Details

#initialize(path) ⇒ `Document`

Instantiates a new Document, and ensures that the given path exists and is included in an indexed directory. Raises otherwise.

# File 'lib/picolena/templates/app/models/document.rb', line 9

def initialize(path)
  # Ensure that @complete_path is an absolute path.
  @complete_path=File.expand_path(path)
  validate_existence_of_file
  validate_in_indexed_directory
end
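
The two validation helpers called by the constructor are private and not shown on this page. As a rough illustration of the checks described above, here is a minimal sketch, assuming a hypothetical `SketchDocument` class that receives its indexed directories as an argument (Picolena reads them from its own configuration) and that raises `Errno::ENOENT` or `ArgumentError`; the exceptions actually raised by Picolena may differ.

```ruby
require "tmpdir"

# Sketch only: SketchDocument and its indexed_dirs argument are hypothetical
# stand-ins, not Picolena's actual private validators.
class SketchDocument
  attr_reader :complete_path

  def initialize(path, indexed_dirs)
    # expand_path turns a relative path into an absolute one.
    @complete_path = File.expand_path(path)
    # First check: the file must exist.
    raise Errno::ENOENT, @complete_path unless File.file?(@complete_path)
    # Second check: the file must live under one of the indexed directories.
    inside = indexed_dirs.any? do |dir|
      @complete_path.start_with?(File.join(File.expand_path(dir), ""))
    end
    raise ArgumentError, "#{@complete_path} is not in an indexed directory" unless inside
  end
end

Dir.mktmpdir do |dir|
  file = File.join(dir, "notes.txt")
  File.write(file, "hello")
  doc = SketchDocument.new(file, [dir])
  puts doc.complete_path
end
```

Appending an empty component with `File.join(dir, "")` adds a trailing separator, so a directory named `/data/docs` does not accidentally match a file under `/data/docs_old`.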

Instance Attribute Details

#complete_path ⇒ `Object` (readonly) Also known as: to_s

Returns the value of attribute complete_path.



# File 'lib/picolena/templates/app/models/document.rb', line 3

def complete_path
  @complete_path
end

#matching_content ⇒ `Object`

Returns the value of attribute matching_content.



# File 'lib/picolena/templates/app/models/document.rb', line 4

def matching_content
  @matching_content
end

#score ⇒ `Object`

Returns the value of attribute score.



# File 'lib/picolena/templates/app/models/document.rb', line 4

def score
  @score
end

Class Method Details

.default_fields_for(complete_path) ⇒ `Object`

Indexing fields that are shared between every document.

# File 'lib/picolena/templates/app/models/document.rb', line 130

def self.default_fields_for(complete_path)
  doc=Document.new(complete_path)
  {
    :complete_path      => complete_path,
    :probably_unique_id => complete_path.base26_hash,
    :alias_path         => doc.alias_path,
    :filename           => File.basename(complete_path),
    :basename           => File.basename(complete_path, File.extname(complete_path)).gsub(/_/,' '),
    :filetype           => File.extname(complete_path),
    :modified           => File.mtime(complete_path).strftime("%Y%m%d%H%M%S")
  }
end
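
To see what the filesystem-derived fields look like for a concrete file, here is a minimal sketch that reproduces only those entries. The helper name `filesystem_fields_for` is hypothetical; `:probably_unique_id` and `:alias_path` are left out because they depend on Picolena internals.

```ruby
require "tmpdir"

# Sketch of the filesystem-derived fields only (hypothetical helper name).
def filesystem_fields_for(complete_path)
  ext = File.extname(complete_path)
  {
    :complete_path => complete_path,
    :filename      => File.basename(complete_path),
    # Underscores become spaces, as in the method above.
    :basename      => File.basename(complete_path, ext).gsub(/_/, " "),
    :filetype      => ext,
    # 14-digit timestamp: year, month, day, hour, minute, second.
    :modified      => File.mtime(complete_path).strftime("%Y%m%d%H%M%S")
  }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "annual_report.odt")
  File.write(path, "")
  fields = filesystem_fields_for(path)
  puts fields[:basename]  # annual report
  puts fields[:filetype]  # .odt
end
```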

.find_by_extension(ext) ⇒ `Object`



# File 'lib/picolena/templates/spec/spec_helper.rb', line 16

def self.find_by_extension(ext)
  Finder.new("ext:#{ext}").matching_documents.first
end

Instance Method Details

#alias_path ⇒ `Object`

End users should not necessarily know where documents are stored internally. An alias path can be specified in config/indexed_directories.yml.

For example, with:

"/media/wiki_dump/" : "http://www.mycompany.com/wiki/"

The document

"/media/wiki_dump/organigram.odp"

will be displayed as being:

"http://www.mycompany.com/wiki/organigram.odp"

# File 'lib/picolena/templates/app/models/document.rb', line 48

def alias_path
  original_dir=indexed_directory
  alias_dir=Picolena::IndexedDirectories[original_dir]
  dirname.sub(original_dir,alias_dir)
end
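
The mapping in the example above can be reproduced with a plain Hash. This is a minimal sketch, assuming a hypothetical `ALIASES` constant in place of `Picolena::IndexedDirectories` and a hypothetical `alias_for` helper. Unlike the method above, which substitutes inside `dirname`, the sketch rewrites the full path so that its output matches the documented example.

```ruby
# Hypothetical mapping standing in for Picolena::IndexedDirectories.
ALIASES = { "/media/wiki_dump/" => "http://www.mycompany.com/wiki/" }.freeze

# Replaces the indexed directory prefix with its alias, if one is configured.
def alias_for(complete_path, aliases = ALIASES)
  original_dir, alias_dir = aliases.find { |dir, _| complete_path.start_with?(dir) }
  return complete_path unless original_dir
  # With a String pattern, String#sub matches literally (no regexp escaping needed).
  complete_path.sub(original_dir, alias_dir)
end

puts alias_for("/media/wiki_dump/organigram.odp")
# http://www.mycompany.com/wiki/organigram.odp
```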

#basename ⇒ `Object`

Returns the filename without its extension:

"buildings.odt" => "buildings"



# File 'lib/picolena/templates/app/models/document.rb', line 34

def basename
  filename.chomp(extname)
end

#cached ⇒ `Object`

Cache à la Google. Returns content as it was at the time it was indexed.



# File 'lib/picolena/templates/app/models/document.rb', line 84

def cached
  from_index[:content]
end

#content ⇒ `Object`

Retrieves content as it is now.



# File 'lib/picolena/templates/app/models/document.rb', line 78

def content
  PlainTextExtractor.extract_content_from(complete_path)
end

#extractor ⇒ `Object`



# File 'lib/picolena/templates/app/models/document.rb', line 69

def extractor
  PlainTextExtractor.find_by_extension(self.ext_as_sym) rescue nil
end

#filename ⇒ `Object`

# File 'lib/picolena/templates/app/models/document.rb', line 20

alias_method :filename, :basename

#has_content? ⇒ `Boolean`

Did at least one letter get extracted from the document? This boolean is used in views to decide whether a link to the content should be displayed.

Returns:

(Boolean)



# File 'lib/picolena/templates/app/models/document.rb', line 156

def has_content?
  cached =~ /\w/
end

#highlighted_cache(raw_query) ⇒ `Object`

Returns cached content with matching terms between ‘<<’ ‘>>’.

# File 'lib/picolena/templates/app/models/document.rb', line 89

def highlighted_cache(raw_query)
  excerpts=Indexer.index.highlight(Query.extract_from(raw_query), doc_id,
                          :field => :content, :excerpt_length => :all,
                          :pre_tag => "<<", :post_tag => ">>"
           )
  excerpts.is_an?(Array) ? excerpts.first : ""
end
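
The method above delegates highlighting to the index. To illustrate the ‘<<’ ‘>>’ output format without an index, here is a minimal plain-Ruby stand-in. The `highlight_terms` helper is hypothetical and uses simple whole-word, case-insensitive matching; it is not the index's own highlighter and does not parse queries.

```ruby
# Wraps every whole-word occurrence of the given terms in pre_tag/post_tag.
# Hypothetical stand-in for illustration only.
def highlight_terms(text, terms, pre_tag = "<<", post_tag = ">>")
  return text if terms.empty?
  # One case-insensitive alternative per term, with special characters escaped.
  pattern = Regexp.union(terms.map { |term| /\b#{Regexp.escape(term)}\b/i })
  text.gsub(pattern) { |match| "#{pre_tag}#{match}#{post_tag}" }
end

puts highlight_terms("The quick brown fox", ["quick", "fox"])
# The <<quick>> brown <<fox>>
```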

#icon_path ⇒ `Object`

Returns thumbnail if available, mime icon otherwise

# File 'lib/picolena/templates/app/models/document.rb', line 144

def icon_path
  if File.exist?(thumbnail_path)
    thumbnail_path(:public_dir)
  else
    icon_symbol=Picolena::FiletypeToIconSymbol[ext_as_sym]
    "icons/#{icon_symbol}.png" if icon_symbol
  end
end

#inspect ⇒ `Object`

Returns complete path as well as matching score and language if available.

../spec/test_dirs/indexed/just_one_doc/for_test.txt (56.3%) (language:en)

Used for example by

rake index:search query="some query"



# File 'lib/picolena/templates/app/models/document.rb', line 28

def inspect
  [self,("(#{pretty_score})" if @score),("(language:#{language})" if language)].compact.join(" ")
end

#language ⇒ `Object`

Returns found language, if any.



# File 'lib/picolena/templates/app/models/document.rb', line 120

def language
  from_index[:language]
end

#mime ⇒ `Object`



# File 'lib/picolena/templates/app/models/document.rb', line 73

def mime
  extractor.mime_name rescue 'application/octet-stream'
end

#mtime ⇒ `Object`

Returns the last modification time before the document got indexed, as a YYYYMMDDHHMMSS integer.

>> doc.mtime
=> 20080509093951



# File 'lib/picolena/templates/app/models/document.rb', line 115

def mtime
  from_index[:modified].to_i
end

#pretty_date ⇒ `Object`

Returns the last modification date before the document got indexed. Useful to know how old a document is, and to which version the cache corresponds.

>> doc.pretty_date
=> "2008-05-09"



# File 'lib/picolena/templates/app/models/document.rb', line 101

def pretty_date
  from_index[:modified].sub(/(\d{4})(\d{2})(\d{2})\d{6}/,'\1-\2-\3')
end

#pretty_mtime ⇒ `Object`

Returns the last modification time before the document got indexed.

>> doc.pretty_mtime
=> "2008-05-09 09:39:51"



# File 'lib/picolena/templates/app/models/document.rb', line 108

def pretty_mtime
  from_index[:modified].sub(/(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/,'\1-\2-\3 \4:\5:\6')
end
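
The `mtime`, `pretty_date` and `pretty_mtime` accessors all derive from the same 14-digit `modified` string stored in the index. The sketch below applies the same conversions to a literal string, so that they can be tried without an index; the `modified` variable stands in for `from_index[:modified]`.

```ruby
# Stand-in for from_index[:modified]: a YYYYMMDDHHMMSS string.
modified = "20080509093951"

# Same conversions as mtime, pretty_date and pretty_mtime.
mtime        = modified.to_i
pretty_date  = modified.sub(/(\d{4})(\d{2})(\d{2})\d{6}/, '\1-\2-\3')
pretty_mtime = modified.sub(/(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/, '\1-\2-\3 \4:\5:\6')

puts mtime         # 20080509093951
puts pretty_date   # 2008-05-09
puts pretty_mtime  # 2008-05-09 09:39:51
```

The single-quoted replacement strings matter: inside double quotes, `\1` would need to be written as `\\1` to reach `sub` as a backreference.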

#pretty_score ⇒ `Object`

Returns matching score as a percentage, e.g. 56.3%



# File 'lib/picolena/templates/app/models/document.rb', line 125

def pretty_score
  "%3.1f%%" % (@score*100)
end
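
As a standalone illustration of the percentage formatting, here is a minimal sketch that takes the score as an argument instead of reading `@score`. It assumes the score is a Float between 0 and 1, and it writes the literal percent sign as `%%`, which is the portable way to emit one in a Ruby format string.

```ruby
# Formats a 0..1 score as a percentage with one decimal place.
def pretty_score(score)
  "%3.1f%%" % (score * 100)
end

puts pretty_score(0.563)  # 56.3%
puts pretty_score(1.0)    # 100.0%
```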

#probably_unique_id ⇒ `Object`

Returns an id for this document. This id is used in controllers to build tiny URLs. Since it is a base26 hash of the absolute filename, it can only be “probably unique”. For a very large number of indexed documents, it would be wise to increase HashLength in config/custom/picolena.rb.



# File 'lib/picolena/templates/app/models/document.rb', line 58

def probably_unique_id
  @probably_unique_id||=complete_path.base26_hash
end
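
The implementation of `base26_hash` is not shown on this page. To make the idea of a short, letters-only, “probably unique” id concrete, here is one possible sketch. It assumes an MD5 digest converted to base 26; the `base26_hash_sketch` name, the `HASH_LENGTH` value and the choice of MD5 are all assumptions for illustration, not Picolena's actual code.

```ruby
require "digest/md5"

# Hypothetical length; Picolena reads HashLength from its configuration.
HASH_LENGTH = 10

# Maps a string to HASH_LENGTH lowercase letters derived from its MD5 digest.
def base26_hash_sketch(string, length = HASH_LENGTH)
  number = Digest::MD5.hexdigest(string).to_i(16)
  Array.new(length) do
    # Peel off one base-26 digit at a time and map it to a letter.
    number, remainder = number.divmod(26)
    ("a".ord + remainder).chr
  end.join
end

puts base26_hash_sketch("/media/wiki_dump/organigram.odp")
```

With `length` letters there are 26**length possible ids, so for n documents the chance of at least one collision is roughly n**2 / (2 * 26**length). Each extra letter divides that chance by 26, which is why a longer hash suits larger collections.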

#supported? ⇒ `Boolean`

Returns true iff some PlainTextExtractor has been defined to convert it to plain text.

Document.new("presentation.pdf").supported? => true
Document.new("presentation.some_weird_extension").supported? => false

Returns:

(Boolean)



# File 'lib/picolena/templates/app/models/document.rb', line 65

def supported?
  PlainTextExtractor.supported_extensions.include?(self.ext_as_sym) unless ext_as_sym==:no_extension and !plain_text?
end
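
The extension lookup can be illustrated without any extractor definitions. This is a minimal sketch, assuming a hypothetical `SUPPORTED_EXTENSIONS` list in place of `PlainTextExtractor.supported_extensions` and a hypothetical `ext_as_sym_sketch` helper; it leaves out the special handling of extensionless plain-text files present in the method above.

```ruby
# Hypothetical registry; the real list comes from the defined extractors.
SUPPORTED_EXTENSIONS = [:pdf, :odt, :txt].freeze

# Converts a file extension to a symbol, e.g. "a.PDF" => :pdf.
def ext_as_sym_sketch(path)
  ext = File.extname(path)
  ext.empty? ? :no_extension : ext[1..-1].downcase.to_sym
end

def supported_sketch?(path)
  SUPPORTED_EXTENSIONS.include?(ext_as_sym_sketch(path))
end

puts supported_sketch?("presentation.pdf")                   # true
puts supported_sketch?("presentation.some_weird_extension")  # false
```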
