Class: SClust::Util::DocumentCollection

Inherits: Object

Defined in: lib/sclust/util/doccol.rb

Instance Attribute Summary

#doclist ⇒ Object readonly

A list of documents.
#logger ⇒ Object readonly

Log4r::Logger for this document collection.
#terms ⇒ Object readonly

terms - a hash where the keys are the terms in the documents and the values are the number of documents containing each term.

Instance Method Summary

#<<(d) ⇒ Object

Add a document to the collection and adjust the @terms attribute to store any new terms in the document.
#average_terms_per_document ⇒ Object

The sum of the terms divided by the documents.
#average_words_per_document ⇒ Object

Average number of words per document.
#document_count ⇒ Object

Return the size of the document list.
#drop_terms(min_frequency = 0.10, max_frequency = 0.80) ⇒ Object
#each_term(&c) ⇒ Object
#filter_df(min = 1, max = 0.20) ⇒ Object

Filter out terms whose document frequency falls outside the given range, expressed as a fraction of the total number of documents in the collection.
#initialize ⇒ DocumentCollection constructor

A new instance of DocumentCollection.
#inverse_document_frequency(term) ⇒ Object (also: #idf)
#term_count ⇒ Object

Return the size of the term vector.
#word_count ⇒ Object

Sum the word counts of all documents in the collection.

Constructor Details

#initialize ⇒ `DocumentCollection`

Returns a new instance of DocumentCollection.

# File 'lib/sclust/util/doccol.rb', line 45

def initialize()
    @logger = Log4r::Logger.new(self.class.to_s)
    @logger.add('default')
    @terms   = SClust::Util::SparseVector.new(0)
    @doclist = []
end

Instance Attribute Details

#doclist ⇒ `Object` (readonly)

A list of documents.




# File 'lib/sclust/util/doccol.rb', line 40

def doclist
  @doclist
end

#logger ⇒ `Object` (readonly)

Log4r::Logger for this document collection.




# File 'lib/sclust/util/doccol.rb', line 43

def logger
  @logger
end

#terms ⇒ `Object` (readonly)

terms - a hash where the keys are the terms in the documents and the values are the number of documents containing each term.




# File 'lib/sclust/util/doccol.rb', line 37

def terms
  @terms
end

Instance Method Details

#<<(d) ⇒ `Object`

Add a document to the collection and adjust the @terms attribute to store any new terms in the document. The document is also added to the @doclist attribute.

# File 'lib/sclust/util/doccol.rb', line 54

def <<(d)
    
    seen_terms = {}
    
    d.each_term { |term, frequency| seen_terms[term] = 1 }
    
    if ( seen_terms.size > 0 )
    
        seen_terms.each_key { |term| @terms[term] += 1 }
        
        @doclist<<d
        
        #@logger.info("There are #{@doclist.size} documents and #{@terms.size} terms.")
    end
    
    self
end
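
The document-frequency bookkeeping that `<<` performs can be tried out in isolation. The following is a minimal sketch, not the library's code: it assumes a plain Hash with a default of 0 in place of SClust::Util::SparseVector, and `ToyDocument` and `document_frequencies` are hypothetical names that exist only for this illustration.

```ruby
# Hypothetical stand-in for a document. The collection only needs
# each_term, which yields (term, frequency) pairs.
class ToyDocument
  def initialize(text)
    @counts = Hash.new(0)
    text.downcase.split.each { |word| @counts[word] += 1 }
  end

  def each_term
    @counts.each { |term, frequency| yield term, frequency }
  end
end

# Assumption: a Hash defaulting to 0 stands in for SparseVector.new(0).
def document_frequencies(texts)
  terms   = Hash.new(0)
  doclist = []

  texts.each do |text|
    doc  = ToyDocument.new(text)
    seen = {}
    doc.each_term { |term, _frequency| seen[term] = 1 }

    next if seen.empty?                        # documents without terms are skipped

    seen.each_key { |term| terms[term] += 1 }  # one count per document, not per occurrence
    doclist << doc
  end

  [terms, doclist.size]
end

terms, document_count = document_frequencies(["the cat sat on the mat", "the dog barked", ""])
puts "documents: #{document_count}"
puts "the: #{terms['the']}, cat: #{terms['cat']}"
```

Note that "the" occurs twice in the first text but is counted once for that document, which is why the values in the hash are document counts rather than occurrence counts.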

#average_terms_per_document ⇒ `Object`

The sum of the per-term document counts divided by the number of documents, that is, the average number of distinct terms per document. If the documents only have 1-gram terms, this number will never exceed the average number of words per document. If, however, you enable 2-grams, 3-grams, etc. in a document, this value will not correlate perfectly with the word count.




# File 'lib/sclust/util/doccol.rb', line 75

def average_terms_per_document()
    @terms.reduce(0.0) { |count, keyval_pair| count + keyval_pair[1] } / @doclist.size
end

#average_words_per_document ⇒ `Object`

Average number of words per document. Words are not unique like terms are: two occurrences of the word “the” are a single term “the”. Get it? :) Great. One caveat is that a “term” is typically a 1-gram, that is, 1 word is 1 term. It is possible for a term to be constructed of two or more words (a 2-gram, 3-gram, … n-gram), in which case this relationship will vary widely.




# File 'lib/sclust/util/doccol.rb', line 84

def average_words_per_document()
    @doclist.reduce(0.0) { |count, doc| count + doc.words.size } / @doclist.size
end
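
To make the difference between the two averages concrete, here is a small sketch under stated assumptions: documents are modeled as plain arrays of words with 1-gram terms only, and `average_terms` and `average_words` are hypothetical helpers that mirror the two reductions above rather than the library's methods.

```ruby
# Hypothetical helper: build per-term document counts, then average them
# over the documents (the average number of distinct terms per document).
def average_terms(doc_word_lists)
  df = Hash.new(0)
  doc_word_lists.each { |words| words.uniq.each { |word| df[word] += 1 } }
  df.values.reduce(0.0) { |sum, count| sum + count } / doc_word_lists.size
end

# Hypothetical helper: average number of words, counting repeats.
def average_words(doc_word_lists)
  doc_word_lists.reduce(0.0) { |sum, words| sum + words.size } / doc_word_lists.size
end

docs = [%w[the cat saw the dog], %w[the dog]]
puts average_terms(docs)  # distinct terms per document: 4 and 2
puts average_words(docs)  # words per document: 5 and 2
```

The repeated "the" in the first document adds to the word average but not to the term average.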

#document_count ⇒ `Object`

Return the size of the document list.




# File 'lib/sclust/util/doccol.rb', line 89

def document_count()
    @doclist.size
end

#drop_terms(min_frequency = 0.10, max_frequency = 0.80) ⇒ `Object`

# File 'lib/sclust/util/doccol.rb', line 104

def drop_terms(min_frequency=0.10, max_frequency=0.80)
    
    min_docs = @doclist.length * min_frequency
    max_docs = @doclist.length * max_frequency
    
    @logger.info("Analyzing #{@terms.length} terms for removal.")
    @logger.info("Upper/lower boundary are #{max_frequency}/#{min_frequency}% document frequency or #{max_docs}/#{min_docs} documents.")
    
    remove_list = []
    
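    # Collect the terms first and delete afterwards, so that @terms is not
    # modified while it is being iterated.
    @terms.each do |term, frequency|
                    
        if ( frequency < min_docs or frequency > max_docs )
            @logger.info("Removing term #{term} occurring in #{frequency} documents out of #{@doclist.length}")
            remove_list << term
        end
    end
    
    remove_list.each { |term| @terms.delete(term) }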
    
    @logger.info("Removed #{remove_list.length} of #{@terms.length + remove_list.length} terms. Updating #{doclist.length} documents.")
    
    @doclist.each do |doc|
        remove_list.each do |term|
            doc.terms.delete(term)
        end
    end
end
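
The threshold arithmetic in `drop_terms` can be sketched on its own. This is an illustration under assumptions, not the library's API: `terms_to_drop` is a hypothetical helper, and a plain Hash of term to document count stands in for `@terms`. Both frequencies are fractions of the document total, and the comparison is strict, so a term sitting exactly on a boundary is kept.

```ruby
# Hypothetical helper mirroring the threshold arithmetic of drop_terms:
# returns the terms whose document count falls outside the allowed band.
def terms_to_drop(term_doc_counts, doc_total, min_frequency = 0.10, max_frequency = 0.80)
  min_docs = doc_total * min_frequency
  max_docs = doc_total * max_frequency

  term_doc_counts.select { |_term, count| count < min_docs || count > max_docs }.keys
end

# 20 documents: the band is 2.0 to 16.0 documents.
counts = { "the" => 19, "cluster" => 5, "zygote" => 1 }
puts terms_to_drop(counts, 20).inspect
```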

#each_term(&c) ⇒ `Object`




# File 'lib/sclust/util/doccol.rb', line 138

def each_term(&c)
    @terms.each_key { |k| yield k }
end

#filter_df(min = 1, max = 0.20) ⇒ `Object`

Filter out terms whose document frequency falls outside the given range. If floats are passed, they are treated as fractions of the total number of documents in the collection (0.20 means 20%). If integers are passed, they are treated as document counts. Terms at either boundary are removed as well, and documents left with no terms are then dropped from the collection.

# File 'lib/sclust/util/doccol.rb', line 146

def filter_df(min=1, max=0.20)
    
    delete_list = []
    delete_hash = {}
    
    mindocs = ( min.is_a?(Integer) ) ? min : ( min * @doclist.size )
    maxdocs = ( max.is_a?(Integer) ) ? max : ( max * @doclist.size )
    
    @logger.info("Building term to delete list for range #{mindocs} - #{maxdocs}.")
    
    @terms.each { |term, freq| delete_list << term if (freq <= mindocs or freq >= maxdocs ) }
    
    @logger.info("Identified #{delete_list.size} terms for removal.")
    
    # NOTE: We do a two-phase delete so we can delete from backing documents.
    
    delete_list.each do |term|
        @logger.debug { "Removing term #{term}."}
        @terms.delete(term)
        delete_hash[term] = 1
    end
    
    @logger.info("Updating documents.")
    
    i=0
    
    @doclist.each do |doc|
        @logger.debug { "Processing document #{i += 1} / #{@doclist.size}" }
        
        doc.delete_term_if { |term| delete_hash.member?(term) }
    end
    
    @logger.info("Deleting documents that now have no terms left in them. #{@doclist.size} documents.")
    
    @doclist.delete_if { |doc| doc.terms.size == 0 }
    
    @logger.info("Document count now #{@doclist.size} documents.")

end
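
How `filter_df` interprets its arguments is easiest to see with concrete numbers. The sketch below is an assumption-laden illustration: `resolve_threshold` and `terms_to_filter` are hypothetical helpers, and a plain Hash of term to document count stands in for `@terms`. An Integer is taken as a document count, while anything else is multiplied by the document total, and both bounds are inclusive.

```ruby
# Hypothetical helper: resolve a threshold the way filter_df does.
# Integers are document counts; floats are fractions of the total.
def resolve_threshold(value, doc_total)
  value.is_a?(Integer) ? value : value * doc_total
end

# Hypothetical helper: terms at or beyond either bound are selected for removal.
def terms_to_filter(term_doc_counts, doc_total, min = 1, max = 0.20)
  mindocs = resolve_threshold(min, doc_total)
  maxdocs = resolve_threshold(max, doc_total)

  term_doc_counts.select { |_term, freq| freq <= mindocs || freq >= maxdocs }.keys
end

counts = { "the" => 8, "half" => 4, "ruby" => 3, "typo" => 1 }

# Float max: 0.5 of 8 documents is 4.0, so "half" is removed too.
puts terms_to_filter(counts, 8, 1, 0.5).inspect

# Integer max: 6 is a document count, so "half" survives.
puts terms_to_filter(counts, 8, 1, 6).inspect
```

Because of the type check, passing `1` and passing `1.0` mean very different things: the first is one document, the second is every document in the collection.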

#inverse_document_frequency(term) ⇒ `Object` Also known as: idf




# File 'lib/sclust/util/doccol.rb', line 132

def inverse_document_frequency(term)
    Math.log( @doclist.length.to_f / @terms[term] )  # float division; Integer / Integer would truncate
end
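
Inverse document frequency here is the natural logarithm of the number of documents divided by the number of documents containing the term. The sketch below is a standalone illustration with a hypothetical helper, `idf_for`, that takes the two counts directly. It also shows why the division should be done in floating point: in Ruby, dividing one Integer by another truncates the quotient before the logarithm is taken.

```ruby
# Hypothetical helper: idf = ln(N / df), computed with float division.
def idf_for(doc_total, docs_with_term)
  Math.log(doc_total.to_f / docs_with_term)
end

puts idf_for(10, 10)    # a term found in every document carries no weight
puts idf_for(10, 4)     # ln(2.5)
puts Math.log(10 / 4)   # Integer division: 10 / 4 is 2, so this is ln(2)
```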

#term_count ⇒ `Object`

Return the size of the term vector.




# File 'lib/sclust/util/doccol.rb', line 99

def term_count()
    @terms.size
end

#word_count ⇒ `Object`

Sum the word counts of all documents in the collection.




# File 'lib/sclust/util/doccol.rb', line 94

def word_count()
    @doclist.reduce(0) { |count, doc| count+doc.words.size }
end
