Class: TfIdfSimilarity::Document
- Inherits: Object
  - Object
  - TfIdfSimilarity::Document
- Defined in: lib/tf-idf-similarity/document.rb, lib/tf-idf-similarity/extras/document.rb
Instance Attribute Summary
- #id ⇒ Object (readonly)
  The document's identifier.
- #size ⇒ Object (readonly)
  The number of tokens in the document.
- #term_counts ⇒ Object (readonly)
  The number of times each term appears in the document.
- #text ⇒ Object (readonly)
  The document's text.
Instance Method Summary
- #average_term_count ⇒ Float
  The average term count of all terms in the document.
- #initialize(text, opts = {}) ⇒ Document (constructor)
  A new instance of Document.
- #maximum_term_count ⇒ Float
  The maximum term count of any term in the document.
- #term_count(term) ⇒ Integer
  Returns the number of occurrences of the term in the document.
- #terms ⇒ Array<String>
  Returns the set of terms in the document.
Constructor Details
#initialize(text, opts = {}) ⇒ Document
Returns a new instance of Document.
    # File 'lib/tf-idf-similarity/document.rb', line 21
    def initialize(text, opts = {})
      @text = text
      @id = opts[:id] || object_id
      @tokens = Array(opts[:tokens]).map { |t| Token.new(t) } if opts[:tokens]
      @tokenizer = opts[:tokenizer] || Tokenizer.new

      if opts[:term_counts]
        @term_counts = opts[:term_counts]
        @size = opts[:size] || term_counts.values.reduce(0, :+)
        # Nothing to do.
      else
        @term_counts = Hash.new(0)
        @size = 0
        set_term_counts_and_size
      end
    end
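The two branches of the constructor can be illustrated with a small stand-in class. This is a sketch rather than the gem's implementation: `SketchDocument` is a hypothetical name, and the lowercase-and-split step is an assumed substitute for the gem's tokenizer and `set_term_counts_and_size`, neither of which is shown on this page.

```ruby
# Hypothetical stand-in for TfIdfSimilarity::Document; not the gem's class.
class SketchDocument
  attr_reader :id, :text, :term_counts, :size

  def initialize(text, opts = {})
    @text = text
    @id = opts[:id] || object_id

    if opts[:term_counts]
      # Precomputed counts: size defaults to the sum of the counts.
      @term_counts = opts[:term_counts]
      @size = opts[:size] || @term_counts.values.reduce(0, :+)
    else
      # Assumption: lowercase and split on non-word characters, standing in
      # for the gem's tokenizer, which is not shown here.
      @term_counts = Hash.new(0)
      text.downcase.scan(/\w+/) { |term| @term_counts[term] += 1 }
      @size = @term_counts.values.reduce(0, :+)
    end
  end
end

from_text   = SketchDocument.new("the cat and the hat", id: "doc-1")
from_counts = SketchDocument.new("", term_counts: { "the" => 2, "cat" => 1 })
with_size   = SketchDocument.new("", term_counts: { "the" => 2 }, size: 10)

puts from_text.size    # prints 5
puts from_counts.size  # prints 3
puts with_size.size    # prints 10
```

When `:term_counts` is supplied, `size` falls back to the sum of the counts unless `:size` is given explicitly, which mirrors the first branch of the listing above.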
Instance Attribute Details
#id ⇒ Object (readonly)
The document's identifier.
    # File 'lib/tf-idf-similarity/document.rb', line 7
    def id
      @id
    end
#size ⇒ Object (readonly)
The number of tokens in the document.
    # File 'lib/tf-idf-similarity/document.rb', line 13
    def size
      @size
    end
#term_counts ⇒ Object (readonly)
The number of times each term appears in the document.
    # File 'lib/tf-idf-similarity/document.rb', line 11
    def term_counts
      @term_counts
    end
#text ⇒ Object (readonly)
The document's text.
    # File 'lib/tf-idf-similarity/document.rb', line 9
    def text
      @text
    end
Instance Method Details
#average_term_count ⇒ Float
Returns the average term count of all terms in the document.
    # File 'lib/tf-idf-similarity/extras/document.rb', line 9
    def average_term_count
      @average_term_count ||= term_counts.values.reduce(0, :+) / term_counts.size.to_f
    end
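The arithmetic in the listing above can be reproduced on a plain hash. The sketch below uses illustrative counts in place of a real document's `term_counts`, and it also shows what the same expression yields when there are no terms at all.

```ruby
# Plain hash standing in for Document#term_counts (illustrative data).
term_counts = { "the" => 2, "cat" => 1, "and" => 1, "hat" => 1 }

# Same arithmetic as the listing: total occurrences / number of distinct terms.
total   = term_counts.values.reduce(0, :+)
average = total / term_counts.size.to_f

# With no terms the expression becomes 0 / 0.0, which Ruby evaluates to NaN.
no_terms      = {}
empty_average = no_terms.values.reduce(0, :+) / no_terms.size.to_f

puts average             # prints 1.25
puts empty_average.nan?  # prints true
```

Callers that may encounter documents without terms would probably want to check for `NaN` before using the result.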
#maximum_term_count ⇒ Float
Returns the maximum term count of any term in the document.
    # File 'lib/tf-idf-similarity/extras/document.rb', line 4
    def maximum_term_count
      @maximum_term_count ||= term_counts.values.max.to_f
    end
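As a quick illustration of the expression above, again on an illustrative plain hash rather than a real `Document`: the largest count is converted to a Float, and an empty hash yields `0.0` because `[].max` is `nil` and `nil.to_f` is `0.0`.

```ruby
# Plain hash standing in for Document#term_counts (illustrative data).
term_counts = { "the" => 2, "cat" => 1, "and" => 1, "hat" => 1 }

# Largest count among all terms, as a Float.
maximum = term_counts.values.max.to_f

# No terms: [].max is nil, and nil.to_f is 0.0.
empty_maximum = {}.values.max.to_f

puts maximum        # prints 2.0
puts empty_maximum  # prints 0.0
```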
#term_count(term) ⇒ Integer
Returns the number of occurrences of the term in the document.
    # File 'lib/tf-idf-similarity/document.rb', line 49
    def term_count(term)
      term_counts[term].to_i # need #to_i if unmarshalled
    end
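The `#to_i` call matters whenever `term_counts` is a hash without a default value of 0, such as a plain hash supplied through `opts[:term_counts]`. The sketch below shows only the general Ruby behaviour on illustrative hashes. It does not reproduce the unmarshalling scenario the source comment refers to.

```ruby
# Hash built the way the constructor does when it counts terms itself.
counted = Hash.new(0)
counted["cat"] += 1

# Plain hash with no default value, e.g. one handed in as precomputed counts.
plain = { "cat" => 1 }

puts counted["dog"]         # prints 0
puts plain["dog"].inspect   # prints nil
puts plain["dog"].to_i      # prints 0
```

Because `nil.to_i` is `0`, `term_count` returns an Integer for unknown terms in both cases.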
#terms ⇒ Array<String>
Returns the set of terms in the document.
    # File 'lib/tf-idf-similarity/document.rb', line 41
    def terms
      term_counts.keys
    end
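Since `terms` is just the keys of `term_counts`, the result is an Array of distinct terms rather than a `Set`. A minimal sketch on illustrative data, relying on Ruby hashes preserving insertion order:

```ruby
# Count terms the way a Hash.new(0) accumulator would (illustrative tokens).
term_counts = Hash.new(0)
%w[the cat and the hat].each { |term| term_counts[term] += 1 }

# Each distinct term appears once, in first-seen order.
terms = term_counts.keys

puts terms.inspect  # prints ["the", "cat", "and", "hat"]
```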