Class: Tokkens::Tokenizer

Inherits:
Object
Defined in:
lib/tokkens/tokenizer.rb

Constant Summary

MIN_LENGTH = 2

  default minimum token length
STOP_WORDS = %w(
    het de deze
    en of om te hier nog ook al
    in van voor mee per als tot uit bij
    waar waardoor waarvan wanneer
    je uw ze zelf jezelf
    ca bijvoorbeeld
    is bevat hebben kunnen mogen
    gemaakt aanbevolen
    belangrijke heerlijk heerlijke handig handige dagelijkse
    gebruik allergieinformatie bijdrage smaak hoeveelheid
  )

  default stop words to ignore (Dutch)
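
As a sketch (not taken from this page), these defaults can be reused when building a customized tokenizer; the extra stop words below are purely illustrative:

  # extend the default (Dutch) stop word list with illustrative extra words
  stop_words = Tokkens::Tokenizer::STOP_WORDS + %w(nieuw lekker)
  tokenizer  = Tokkens::Tokenizer.new(stop_words: stop_words)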

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS) ⇒ Tokenizer

Create a new tokenizer.

Parameters:

  • tokens (Tokens) (defaults to: nil)

    object to use for obtaining token numbers

  • min_length (Fixnum) (defaults to: MIN_LENGTH)

    minimum length for tokens

  • stop_words (Array<String>) (defaults to: STOP_WORDS)

    stop words to ignore


# File 'lib/tokkens/tokenizer.rb', line 46

def initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS)
  @tokens = tokens || Tokens.new
  @stop_words = stop_words
  @min_length = min_length
end
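
A minimal construction sketch based on the signature above; sharing a single Tokens instance between components is an assumed usage pattern, not something this page prescribes:

  # all defaults: a fresh Tokens instance, MIN_LENGTH and STOP_WORDS
  tokenizer = Tokkens::Tokenizer.new

  # share an existing Tokens instance and raise the minimum token length
  tokens    = Tokkens::Tokens.new
  tokenizer = Tokkens::Tokenizer.new(tokens, min_length: 3)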

Instance Attribute Details

#min_length ⇒ Object (readonly)

Returns the minimum length for tokens.


# File 'lib/tokkens/tokenizer.rb', line 39

attr_reader :tokens, :stop_words, :min_length

#stop_words ⇒ Array<String> (readonly)

Returns stop words to ignore.

Returns:

  • (Array<String>)

    stop words to ignore


# File 'lib/tokkens/tokenizer.rb', line 39

attr_reader :tokens, :stop_words, :min_length

#tokens ⇒ Tokens (readonly)

Returns object to use for obtaining tokens.

Returns:

  • (Tokens)

    object to use for obtaining tokens


# File 'lib/tokkens/tokenizer.rb', line 39

def tokens
  @tokens
end

Instance Method Details

#get(s, **kwargs) ⇒ Array<Fixnum>

Returns array of token numbers.

Returns:

  • (Array<Fixnum>)

    array of token numbers


# File 'lib/tokkens/tokenizer.rb', line 53

def get(s, **kwargs)
  return [] unless s and s.strip != ''
  tokenize(s).map {|token| @tokens.get(token, **kwargs) }.compact
end
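
A usage sketch for #get; extra keyword arguments are forwarded to Tokens#get, and the token numbers shown are illustrative only, since they depend on the state of the Tokens instance:

  tokenizer = Tokkens::Tokenizer.new
  tokenizer.get('deze kaas is heerlijk')  # => e.g. [1]  (stop words and too-short tokens are dropped)
  tokenizer.get('')                       # => []
  tokenizer.get(nil)                      # => []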