Class: Glove::Corpus

Inherits:
Object
Defined in:
lib/glove/corpus.rb

Overview

Builds the token count hash, the token index hash and the token pairs array from a given text.

Constructor Details

#initialize(text, options = {}) ⇒ Corpus

Create a new Glove::Corpus instance

Parameters:

  • text

    the text to tokenize and build the corpus from.

  • options (Hash) (defaults to: {})

    the options to initialize the instance with.

Options Hash (options):

  • :window (Integer) — default: 2

    Number of context words to the left and to the right

  • :min_count (Integer) — default: 5

    Lower limit such that words which occur fewer than :min_count times are discarded.



# File 'lib/glove/corpus.rb', line 23

def initialize(text, options={})
  @tokens = Parser.new(text, options).tokenize
  @window = options[:window] || 2
  @min_count = options[:min_count] || 5
end
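The defaults above are resolved with `||`, so a missing key and an explicit `nil` both fall back to the default. The following is a minimal sketch of that behaviour only; `resolve_options` is a hypothetical helper written for illustration and is not part of Glove::Corpus.

```ruby
# Hypothetical helper (not part of the library) mirroring how the
# constructor resolves :window and :min_count.
def resolve_options(options = {})
  {
    window:    options[:window]    || 2, # context words on each side
    min_count: options[:min_count] || 5  # minimum occurrences to keep a word
  }
end

defaults   = resolve_options
customized = resolve_options(window: 4)
with_nil   = resolve_options(min_count: nil) # nil falls back to the default
```

Because `||` treats `nil` and `false` alike, there is no way to pass "no value" explicitly with this pattern; the default always wins.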

Instance Attribute Details

#min_count ⇒ Object (readonly)

Returns the value of attribute min_count.



# File 'lib/glove/corpus.rb', line 8

def min_count
  @min_count
end

#tokens ⇒ Array<String> (readonly)

Returns the parsed tokens array, which holds all the tokens in the exact order they appear in the text.

Returns:

  • (Array<String>)

    The parsed tokens, in the order they appear in the text.



# File 'lib/glove/corpus.rb', line 8

def tokens
  @tokens
end

#window ⇒ Object (readonly)

Returns the value of attribute window.



# File 'lib/glove/corpus.rb', line 8

def window
  @window
end

Class Method Details

.build(text, options = {}) ⇒ Object

Convenience method for creating an instance and building the token count, index and pairs (see #initialize)



# File 'lib/glove/corpus.rb', line 12

def self.build(text, options={})
  new(text, options).build_tokens
end

Instance Method Details

#build_tokens ⇒ Glove::Corpus

Builds the token count, token index and token pairs.

Returns:

  • (Glove::Corpus)

    The instance itself, so that calls can be chained.


# File 'lib/glove/corpus.rb', line 32

def build_tokens
  build_count
  build_index
  build_pairs
  self
end

#count ⇒ Hash{String=>Integer} Also known as: build_count

Hash that stores the occurrence count of unique tokens.

Returns:

  • (Hash{String=>Integer})

    Token-count pairs, where count is the total number of occurrences of the token in the (non-unique) tokens array. Tokens that occur fewer than min_count times are dropped.



# File 'lib/glove/corpus.rb', line 43

def count
  @count ||= tokens.inject(Hash.new(0)) do |hash,item|
    hash[item] += 1
    hash
  end.to_h.keep_if{ |word,count| count >= min_count }
end
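The counting and filtering step can be tried in isolation. This is a minimal sketch, assuming plain arguments in place of the instance attributes; `count_tokens` is a hypothetical stand-in, not the library method.

```ruby
# Stand-in for the #count logic: tally every token, then keep only
# the tokens that occur at least min_count times.
def count_tokens(tokens, min_count)
  tally = tokens.each_with_object(Hash.new(0)) do |token, hash|
    hash[token] += 1
  end
  tally.select { |_token, occurrences| occurrences >= min_count }
end

tokens = %w[the quick fox the lazy fox the end]
counts = count_tokens(tokens, 2)
# "the" occurs 3 times and "fox" twice; every other token occurs once
# and is dropped by the min_count filter.
```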

#index ⇒ Hash{String=>Integer} Also known as: build_index

A hash whose values hold the sequential index of each word as it appears in the #count hash.

Returns:

  • (Hash{String=>Integer})

    Token-index pairs, where index is the sequential position of the token in the unique vocabulary pool.



# File 'lib/glove/corpus.rb', line 56

def index
  # Use the #count reader rather than @count so this also works before #count has been memoized
  @index ||= count.keys.each_with_index.inject({}) do |hash,(word,idx)|
    hash[word] = idx
    hash
  end
end
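Since Ruby hashes preserve insertion order, the index simply numbers the keys of the count hash in the order they were first seen. A minimal sketch, where `build_index` is a hypothetical free-standing function rather than the library method:

```ruby
# Stand-in for the #index logic: number the vocabulary in the order
# the keys appear in the count hash.
def build_index(count)
  count.keys.each_with_index.to_h
end

count = { "the" => 3, "fox" => 2 }
index = build_index(count)
# "the" was inserted first, so it gets index 0; "fox" gets index 1.
```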

#marshal_dump ⇒ Object

Data to dump with Marshal.dump



# File 'lib/glove/corpus.rb', line 94

def marshal_dump
  [@tokens, @count, @index, @pairs]
end

#marshal_load(contents) ⇒ Object

Reconstruct the instance data via Marshal.load



# File 'lib/glove/corpus.rb', line 99

def marshal_load(contents)
  @tokens, @count, @index, @pairs = contents
end
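The `marshal_dump` / `marshal_load` pair is the standard Ruby hook for custom serialization. The sketch below uses a hypothetical `MiniCorpus` class (not Glove::Corpus) to show the round trip, and to illustrate that instance variables left out of the dumped array come back as `nil`. In the source listed above, `@window` and `@min_count` are not part of the dumped array, so the same would be expected there.

```ruby
# Hypothetical stand-in class; only @tokens is serialized.
class MiniCorpus
  attr_reader :tokens, :window

  def initialize(tokens, window: 2)
    @tokens = tokens
    @window = window
  end

  # Called by Marshal.dump: return the data to persist.
  def marshal_dump
    [@tokens]
  end

  # Called by Marshal.load on a freshly allocated object;
  # #initialize is not run.
  def marshal_load(contents)
    @tokens = contents.first
  end
end

original = MiniCorpus.new(%w[a b c], window: 3)
restored = Marshal.load(Marshal.dump(original))
```

Here `restored.tokens` equals the original tokens, while `restored.window` is `nil` because it was never dumped.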

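#pairs ⇒ Array<(Glove::TokenPair)> Also known as: build_pairs

Iterates over the tokens array and constructs a TokenPair for each token that occurs at least :min_count times, where neighbors holds the adjacent (context) words. The number of neighbors on each side is controlled by the :window option.

Returns:

  • (Array<(Glove::TokenPair)>)

    One TokenPair per retained token, in the order the tokens appear in the text.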


# File 'lib/glove/corpus.rb', line 69

def pairs
  @pairs ||= tokens.map.with_index do |word, index|
    next unless count[word] >= min_count

    TokenPair.new(word, token_neighbors(word, index))
  end.compact
end
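The pairing step can be sketched without the library. The example below is an illustration under stated assumptions: `Pair` is a hypothetical `Struct` standing in for Glove::TokenPair (whose real interface is not shown here), and `build_pairs` takes its inputs as plain arguments.

```ruby
# Hypothetical stand-in for Glove::TokenPair.
Pair = Struct.new(:token, :neighbors)

# Build one Pair per token occurrence whose word is frequent enough.
def build_pairs(tokens, min_count, window)
  count = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }

  tokens.each_with_index.map do |word, i|
    next if count[word] < min_count # rare words yield nil ...

    first = [i - window, 0].max
    neighbors = tokens[first..(i + window)].reject { |t| t == word }
    Pair.new(word, neighbors)
  end.compact # ... and the nils are removed here
end

tokens = %w[a b a c a]
pairs  = build_pairs(tokens, 2, 1)
```

Only "a" meets the minimum count, so three pairs are produced, one per occurrence. Note that the rare words "b" and "c" still appear as neighbors: the filter applies to the center word, not to its context.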

#token_neighbors(word, index) ⇒ Array<(String)>

Constructs the array of neighbors for the given word at the given index in the tokens array.

Parameters:

  • word (String)

    The word to get neighbors for

  • index (Integer)

    Index of the word in the @tokens array

Returns:

  • (Array<(String)>)

    List of the neighbors



# File 'lib/glove/corpus.rb', line 84

def token_neighbors(word, index)
  start_pos = index - window < 0 ? 0 : index - window
  end_pos   = (index + window >= tokens.size) ? tokens.size - 1 : index + window

  tokens[start_pos..end_pos].map do |neighbor|
    neighbor unless word == neighbor
  end.compact
end
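The window clamping and the neighbor filter can be exercised on their own. This is a minimal sketch with `tokens` and `window` passed as arguments; `neighbors_for` is a hypothetical name, not the library method.

```ruby
# Stand-in for the #token_neighbors logic: take the slice of tokens
# within `window` positions of `index`, clamped to the array bounds,
# and drop every token equal to the word itself.
def neighbors_for(tokens, word, index, window)
  start_pos = [index - window, 0].max
  end_pos   = [index + window, tokens.size - 1].min
  tokens[start_pos..end_pos].reject { |token| token == word }
end

tokens = %w[the cat sat on the mat]

middle = neighbors_for(tokens, "sat", 2, 2) # full window on both sides
edge   = neighbors_for(tokens, "the", 0, 2) # window clamped at the start
repeat = neighbors_for(tokens, "the", 4, 4) # window covers the whole array
```

The last call shows a consequence of comparing by value rather than by position: the other occurrence of "the" at index 0 is removed from the neighbors as well, not only the center word.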