Class: Rblearn::CountVectorizer

Inherits: Object

Defined in: lib/rblearn/CountVectorizer.rb

Instance Attribute Summary

#doc_matrix ⇒ Object

TODO: consider access control for all variables.

#feature_names ⇒ Object

TODO: consider access control for all variables.

#token2index ⇒ Object

TODO: consider access control for all variables.

Instance Method Summary

#fit_transform(features) ⇒ Object

features: the documents to vectorize, one string per document: Array<String> -> Numo::Int32 (the source listing builds the matrix with Numo::Int32.zeros).

#initialize(tokenizer, lowercase = true, max_features = 0.8) ⇒ CountVectorizer constructor

tokenizer: lambda function: string -> Array<string>
lowercase: whether tokens are lowercased: bool
stop_words: list of stop words: Array<string> (not a constructor argument; taken from Stopwords::STOP_WORDS)
max_features: fraction of the vocabulary to keep: Float in [0, 1]

TODO: with max_features, zero vectors are sometimes created.

Constructor Details

#initialize(tokenizer, lowercase = true, max_features = 0.8) ⇒ `CountVectorizer`

tokenizer: lambda function: string -> Array<string>
lowercase: whether tokens are lowercased: bool
stop_words: list of stop words: Array<string> (not a constructor argument; taken from Stopwords::STOP_WORDS)
max_features: fraction of the vocabulary to keep: Float in [0, 1]

TODO: with max_features, zero vectors are sometimes created.
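The truncation behind this TODO can be sketched in plain Ruby. This is an illustrative sketch, not the library's code: `top_tokens` is a hypothetical helper that mirrors the `(size * max_features).to_i` cut-off used in `fit_transform`, and the frequencies are made-up sample data.

```ruby
# Hypothetical helper (not part of Rblearn): keep the most frequent tokens,
# where max_features is the fraction of the vocabulary to retain.
def top_tokens(word_frequency, max_features)
  keep = (word_frequency.size * max_features).to_i
  word_frequency.sort_by { |_, count| -count }.first(keep).map(&:first)
end

# Made-up corpus frequencies with distinct counts, so the order is unambiguous.
freq = { "the" => 5, "cat" => 4, "sat" => 3, "mat" => 2, "dog" => 1 }
kept = top_tokens(freq, 0.8)
# 5 tokens * 0.8 = 4, so the rarest token ("dog") is dropped.

# A document made only of dropped tokens has no surviving feature,
# which is how an all-zero row can arise.
surviving = ["dog"].count { |token| kept.include?(token) }
```

Here `kept` holds four tokens and `surviving` is 0, so a document containing only "dog" would map to a zero vector.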

# File 'lib/rblearn/CountVectorizer.rb', line 14

def initialize(tokenizer, lowercase=true, max_features=0.8)
  @tokenizer = tokenizer
  @lowercase = lowercase

  # Stem (and optionally lowercase) the stop words. map returns a new
  # array, so the shared Stopwords::STOP_WORDS constant is not modified.
  stop_words = Stopwords::STOP_WORDS.map {|token| token.stem}
  stop_words = stop_words.map {|token| token.downcase} if @lowercase
  @stopwords = stop_words
  @max_feature = max_features
end
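As a rough illustration of the `tokenizer` argument, any callable that takes a String and returns an Array of Strings should fit the signature documented above. The regular expression below is an assumption chosen for the example, not something the library prescribes, and the commented-out constructor call assumes the library is already loaded.

```ruby
# A simple tokenizer lambda: string -> Array<string>.
# Splitting on runs of alphanumeric characters is an illustrative choice.
tokenizer = ->(text) { text.scan(/[[:alnum:]]+/) }

tokens = tokenizer.call("The cat sat on the mat.")
# => ["The", "cat", "sat", "on", "the", "mat"]

# With the library loaded, the lambda would be passed as the first argument:
# vectorizer = Rblearn::CountVectorizer.new(tokenizer, true, 0.8)
```

Lowercasing is left to the vectorizer here, since the constructor's `lowercase` flag already controls it.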

Instance Attribute Details

#doc_matrix ⇒ `Object`

TODO: consider access control for all variables.

# File 'lib/rblearn/CountVectorizer.rb', line 7

def doc_matrix
  @doc_matrix
end

#feature_names ⇒ `Object`

TODO: consider access control for all variables.

# File 'lib/rblearn/CountVectorizer.rb', line 7

def feature_names
  @feature_names
end

#token2index ⇒ `Object`

TODO: consider access control for all variables.

# File 'lib/rblearn/CountVectorizer.rb', line 7

def token2index
  @token2index
end
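One possible way to address the access-control TODO is to expose results through readers only and freeze what is handed out. The sketch below uses a hypothetical `ReadOnlyResults` class to show the idea; it is not how Rblearn is implemented.

```ruby
# Hypothetical class for illustration only (not part of Rblearn).
class ReadOnlyResults
  # attr_reader defines getters but no setters.
  attr_reader :feature_names, :token2index

  def initialize(feature_names)
    # Frozen copies prevent callers from modifying internal state in place.
    @feature_names = feature_names.dup.freeze
    @token2index = feature_names.each_with_index.to_h.freeze
  end
end

results = ReadOnlyResults.new(%w[red blue])
has_setter = results.respond_to?(:feature_names=) # => false
```

Whether read-only access suits the class depends on how callers use these attributes, which the source does not say.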

Instance Method Details

#fit_transform(features) ⇒ `Object`

features: the documents to vectorize, one string per document: Array<String> -> Numo::Int32 (the listing below builds the matrix with Numo::Int32.zeros)

# File 'lib/rblearn/CountVectorizer.rb', line 26

def fit_transform(features)
  # First pass: count each token's frequency over the whole corpus.
  word_frequency = Hash.new(0)

  features.each do |feature|
    @tokenizer.call(feature).each do |token|
      # Use a lowercased copy so the tokenizer's output is not mutated.
      token = token.downcase if @lowercase
      word_frequency[token] += 1
    end
  end

  # Keep the most frequent tokens; @max_feature is the fraction of the
  # vocabulary to retain.
  word_frequency = word_frequency.sort{|(_, value1), (_, value2)| value2 <=> value1}
  feature_names = (0...(word_frequency.size * @max_feature).to_i).map{|i| word_frequency[i][0]}

  # Map each retained token to its column index.
  token2index = {}
  feature_names.each_with_index do |token, i|
    token2index[token] = i
  end

  # Second pass: fill one row per document, skipping stop words.
  doc_matrix = Numo::Int32.zeros([features.size, feature_names.size])
  features.each_with_index do |feature, doc_id|
    tokens = []
    @tokenizer.call(feature).each do |token|
      token = token.downcase if @lowercase
      tokens << token unless @stopwords.include?(token)
    end

    # BoW representation
    counter = Hash.new(0)
    tokens.each do |token|
      counter[token] += 1
    end

    counter.each do |token, freq|
      doc_matrix[doc_id, token2index[token]] = freq if token2index[token]
    end
  end

  @doc_matrix = doc_matrix
  @feature_names = feature_names
  @token2index = token2index
  return @doc_matrix
end
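The steps in `fit_transform` can be approximated without any dependencies. The following is a simplified sketch under stated assumptions: `bag_of_words` is a hypothetical function, nested Arrays stand in for the Numo matrix, tokens are always lowercased, and stop words are passed in explicitly rather than read from `Stopwords::STOP_WORDS`.

```ruby
# Hypothetical plain-Ruby approximation of the fit_transform steps.
def bag_of_words(docs, tokenizer, stop_words: [], max_features: 1.0)
  tokenized = docs.map { |doc| tokenizer.call(doc).map(&:downcase) }

  # Corpus-wide token frequencies, most frequent first.
  counts = tokenized.flatten.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
  frequency = counts.sort_by { |_, count| -count }

  # Keep only the top fraction of the vocabulary.
  keep = (frequency.size * max_features).to_i
  feature_names = frequency.first(keep).map(&:first)
  token2index = feature_names.each_with_index.to_h

  # One row per document; stop words and dropped tokens are skipped.
  matrix = tokenized.map do |tokens|
    row = Array.new(feature_names.size, 0)
    tokens.each do |token|
      next if stop_words.include?(token)
      index = token2index[token]
      row[index] += 1 if index
    end
    row
  end

  [matrix, feature_names]
end

tokenizer = ->(text) { text.split }
matrix, feature_names = bag_of_words(["red red red blue blue", "green"],
                                     tokenizer, max_features: 0.7)
# feature_names => ["red", "blue"]
# matrix        => [[3, 2], [0, 0]]
```

With three distinct tokens and `max_features: 0.7`, only two features are kept, so the second document ("green") ends up as an all-zero row, the situation the constructor's TODO describes.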
