Class: Glove::Corpus
- Inherits:
- Object (hierarchy: Object, Glove::Corpus)
- Defined in:
- lib/glove/corpus.rb
Overview
Builds the token count hash, the token index hash, and the token pairs array from a given text.
Instance Attribute Summary
-
#min_count ⇒ Object
readonly
Returns the value of attribute min_count.
-
#tokens ⇒ Array<String>
readonly
Returns the parsed tokens array.
-
#window ⇒ Object
readonly
Returns the value of attribute window.
Class Method Summary
-
.build(text, options = {}) ⇒ Object
Convenience method for creating an instance and building the token count, index and pairs (see #initialize).
Instance Method Summary
-
#build_tokens ⇒ Glove::Corpus
Builds the token count, token index and token pairs.
-
#count ⇒ Hash{String=>Integer}
(also: #build_count)
Hash that stores the occurrence count of unique tokens.
-
#index ⇒ Hash{String=>Integer}
(also: #build_index)
A hash whose values hold the sequential index of a word as it appears in the #count hash.
-
#initialize(text, options = {}) ⇒ Corpus
constructor
Create a new Corpus instance.
-
#marshal_dump ⇒ Object
Data to dump with Marshal.dump.
-
#marshal_load(contents) ⇒ Object
Reconstruct the instance data via Marshal.load.
-
#pairs ⇒ Array<(Glove::TokenPair)>
(also: #build_pairs)
Iterates over the tokens array and constructs TokenPairs where neighbors holds the adjacent (context) words.
-
#token_neighbors(word, index) ⇒ Array<(String)>
Construct array of neighbours to the given word and its index in the tokens array.
Constructor Details
#initialize(text, options = {}) ⇒ Corpus
Create a new Glove::Corpus instance
# File 'lib/glove/corpus.rb', line 23

def initialize(text, options={})
  @tokens = Parser.new(text, options).tokenize
  @window = options[:window] || 2
  @min_count = options[:min_count] || 5
end
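The constructor falls back to a window of 2 and a minimum count of 5 through the `options[:key] || default` idiom. The following is a minimal standalone sketch of that idiom, not code from the gem; `DEFAULTS` and `corpus_settings` are hypothetical names introduced only for illustration.

```ruby
# Hypothetical constant and helper; neither exists in the Glove gem.
DEFAULTS = { window: 2, min_count: 5 }.freeze

# Resolves the two settings the same way the constructor does:
# a missing, nil, or false value falls back to the default.
def corpus_settings(options = {})
  {
    window:    options[:window]    || DEFAULTS[:window],
    min_count: options[:min_count] || DEFAULTS[:min_count]
  }
end

corpus_settings                 # both defaults apply
corpus_settings(min_count: 1)   # only min_count is overridden
corpus_settings(window: nil)    # an explicit nil also yields the default window
```

One consequence of `||` is that an explicit `nil` cannot be used to switch a setting off; it is treated the same as an absent key.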
Instance Attribute Details
#min_count ⇒ Object (readonly)
Returns the value of attribute min_count.
# File 'lib/glove/corpus.rb', line 8

def min_count
  @min_count
end
#tokens ⇒ Array<String> (readonly)
Returns the parsed tokens array. Holds all the tokens in the exact order they appear in the text
# File 'lib/glove/corpus.rb', line 8

def tokens
  @tokens
end
#window ⇒ Object (readonly)
Returns the value of attribute window.
# File 'lib/glove/corpus.rb', line 8

def window
  @window
end
Class Method Details
.build(text, options = {}) ⇒ Object
Convenience method for creating an instance and building the token count, index and pairs (see #initialize)
# File 'lib/glove/corpus.rb', line 12

def self.build(text, options={})
  new(text, options).build_tokens
end
Instance Method Details
#build_tokens ⇒ Glove::Corpus
Builds the token count, token index and token pairs
# File 'lib/glove/corpus.rb', line 32

def build_tokens
  build_count
  build_index
  build_pairs
  self
end
#count ⇒ Hash{String=>Integer} Also known as: build_count
Hash that stores the occurrence count of unique tokens. Tokens that occur fewer than min_count times are dropped.
# File 'lib/glove/corpus.rb', line 43

def count
  @count ||= tokens.inject(Hash.new(0)) do |hash,item|
    hash[item] += 1
    hash
  end.to_h.keep_if{ |word,count| count >= min_count }
end
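To make the counting and filtering step concrete, here is a minimal standalone sketch of the same idea. It is not the gem's code; `token_counts` is a hypothetical helper and the sample sentence is made up.

```ruby
# Hypothetical helper, for illustration only.
# Counts each token, then keeps only tokens seen at least min_count times.
def token_counts(tokens, min_count: 5)
  counts = tokens.each_with_object(Hash.new(0)) { |token, hash| hash[token] += 1 }
  counts.select { |_token, n| n >= min_count }
end

tokens = %w[the cat sat on the mat the cat]
counts = token_counts(tokens, min_count: 2)
counts.keys  # => ["the", "cat"]
```

With the default threshold of 5, a text this short would produce an empty hash, which is worth keeping in mind when experimenting with small inputs.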
#index ⇒ Hash{String=>Integer} Also known as: build_index
A hash whose values hold the sequential index of a word as it appears in the #count hash. The method reads @count directly, so #count (or #build_tokens) must have run first.
# File 'lib/glove/corpus.rb', line 56

def index
  @index ||= @count.keys.each_with_index.inject({}) do |hash,(word,idx)|
    hash[word] = idx
    hash
  end
end
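The index simply numbers the keys of the count hash in insertion order. A minimal sketch of that mapping, using a hypothetical `token_index` helper and a made-up count hash rather than the gem itself:

```ruby
# Hypothetical helper, for illustration only.
# Maps each key of the count hash to its zero-based position.
def token_index(counts)
  counts.keys.each_with_index.to_h
end

counts = { "the" => 3, "cat" => 2, "sat" => 2 }
index = token_index(counts)
index["cat"]  # => 1
```

Because Ruby hashes preserve insertion order, the position of a word reflects the order in which it first appeared in the token stream.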
#marshal_dump ⇒ Object
Data to dump with Marshal.dump
# File 'lib/glove/corpus.rb', line 94

def marshal_dump
  [@tokens, @count, @index, @pairs]
end
#marshal_load(contents) ⇒ Object
Reconstruct the instance data via Marshal.load
# File 'lib/glove/corpus.rb', line 99

def marshal_load(contents)
  @tokens, @count, @index, @pairs = contents
end
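The marshal_dump and marshal_load pair controls exactly which instance variables survive a round trip through Marshal. The sketch below shows the mechanism on a hypothetical `TinyCorpus` class (not part of the gem). It also illustrates a point that follows from the listings above: any variable left out of the dumped array, as @window and @min_count are, comes back as nil, because Marshal.load does not call initialize.

```ruby
# Hypothetical class, for illustration only.
class TinyCorpus
  attr_reader :tokens, :count, :window

  def initialize(tokens, window: 2)
    @tokens = tokens
    @count  = tokens.tally
    @window = window          # deliberately not included in marshal_dump
  end

  # Only the values listed here are serialized.
  def marshal_dump
    [@tokens, @count]
  end

  # Receives the array produced by marshal_dump.
  def marshal_load(contents)
    @tokens, @count = contents
  end
end

original = TinyCorpus.new(%w[a b a], window: 3)
restored = Marshal.load(Marshal.dump(original))
restored.tokens  # => ["a", "b", "a"]
restored.window  # => nil
```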
#pairs ⇒ Array<(Glove::TokenPair)> Also known as: build_pairs
Iterates over the tokens array and constructs TokenPairs where neighbors holds the adjacent (context) words. The number of neighbours on each side is controlled by the :window option. Tokens that were filtered out of #count are skipped.
# File 'lib/glove/corpus.rb', line 69

def pairs
  @pairs ||= tokens.map.with_index do |word, index|
    next unless count[word] >= min_count
    TokenPair.new(word, token_neighbors(word, index))
  end.compact
end
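As a rough illustration of how pairs are produced, the sketch below walks a token list, skips tokens missing from the count hash, and collects the surrounding words for the rest. `Pair` is a hypothetical stand-in for Glove::TokenPair and `build_pairs` is a made-up helper; neither reflects the gem's actual classes.

```ruby
# Hypothetical stand-in for Glove::TokenPair, for illustration only.
Pair = Struct.new(:token, :neighbors)

# Builds one Pair per occurrence of every token that survived filtering.
def build_pairs(tokens, counts, window: 2)
  tokens.each_with_index.filter_map do |word, i|
    next unless counts.key?(word)   # token fell below the count threshold
    first = [i - window, 0].max
    last  = [i + window, tokens.size - 1].min
    Pair.new(word, tokens[first..last].reject { |t| t == word })
  end
end

tokens = %w[a b a c a b]
counts = { "a" => 3, "b" => 2 }     # "c" was filtered out
pairs  = build_pairs(tokens, counts, window: 1)
pairs.map(&:token)   # => ["a", "b", "a", "a", "b"]
pairs[2].neighbors   # => ["b", "c"]
```

Note that a filtered token such as "c" never gets a pair of its own but can still appear as a neighbour, since neighbours are sliced from the unfiltered token list.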
#token_neighbors(word, index) ⇒ Array<(String)>
Construct array of neighbours to the given word and its index in the tokens array
# File 'lib/glove/corpus.rb', line 84

def token_neighbors(word, index)
  start_pos = index - window < 0 ? 0 : index - window
  end_pos = (index + window >= tokens.size) ? tokens.size - 1 : index + window

  tokens[start_pos..end_pos].map do |neighbor|
    neighbor unless word == neighbor
  end.compact
end
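Two details of the window logic are easy to miss: the window is clamped at both ends of the token list, and every token equal to the word itself is removed, not only the one at the given position. A minimal standalone sketch, with a hypothetical `neighbors` helper and made-up sentences:

```ruby
# Hypothetical helper, for illustration only.
# Returns the tokens within `window` positions of `index`,
# excluding every token equal to the word at `index`.
def neighbors(tokens, index, window)
  first = [index - window, 0].max
  last  = [index + window, tokens.size - 1].min
  tokens[first..last].reject { |t| t == tokens[index] }
end

words = %w[the quick brown fox jumps]
neighbors(words, 0, 2)  # => ["quick", "brown"]   (clamped at the start)
neighbors(words, 2, 2)  # => ["the", "quick", "fox", "jumps"]
neighbors(words, 4, 2)  # => ["brown", "fox"]     (clamped at the end)

# A repeated word inside the window is dropped as well.
neighbors(%w[to be or not to be], 0, 4)  # => ["be", "or", "not"]
```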