Class: WordsCounted::Tokeniser
- Inherits: Object
  - Object
  - WordsCounted::Tokeniser
- Defined in: lib/words_counted/tokeniser.rb
Constant Summary
- TOKEN_REGEXP = /[\p{Alpha}\-']+/
  Default tokenisation strategy
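To illustrate what the default pattern matches, here is a small sketch that applies the same regular expression with plain String#scan, outside the class. The sample sentences and variable names are illustrative only.

```ruby
# The same pattern as TOKEN_REGEXP, copied here so the snippet runs on its own.
token_regexp = /[\p{Alpha}\-']+/

# Runs of letters, hyphens and apostrophes are kept together;
# spaces and other punctuation act as separators.
words = "That's a well-known fact, isn't it?".scan(token_regexp)
# words == ["That's", "a", "well-known", "fact", "isn't", "it"]

# Digits are not part of the character class, so they are skipped.
no_digits = "route 66".scan(token_regexp)
# no_digits == ["route"]
```

Note that this shows the raw matches only; #tokenise additionally downcases each token.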
Instance Method Summary
- #initialize(input) ⇒ Tokeniser (constructor)
  Initialises state with the string to be tokenised.
- #tokenise(pattern: TOKEN_REGEXP, exclude: nil) ⇒ Array
  Converts a string into an array of tokens using a regular expression.
Constructor Details
#initialize(input) ⇒ Tokeniser
Initialises state with the string to be tokenised.
    # File 'lib/words_counted/tokeniser.rb', line 21
    def initialize(input)
      @input = input
    end
Instance Method Details
#tokenise(pattern: TOKEN_REGEXP, exclude: nil) ⇒ Array
Converts a string into an array of tokens using a regular expression.
If no regexp is provided, the default is used. See Tokeniser::TOKEN_REGEXP.
Use exclude to remove tokens from the final list. exclude can be a string, a regular expression, a lambda, a symbol, or an array of one or more of those types. This allows for powerful and flexible tokenisation strategies.
If a symbol is passed, it must name a predicate method that can be called on each token. Tokens are downcased before exclude is applied.
    # File 'lib/words_counted/tokeniser.rb', line 97
    def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
      filter_proc = filter_to_proc(exclude)
      @input.scan(pattern).map(&:downcase).reject { |token| filter_proc.call(token) }
    end
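
The following is a minimal, self-contained sketch of how the exclude option described above might behave for each accepted filter type. SketchTokeniser is a hypothetical stand-in, not the gem's class: only initialize and tokenise mirror the listing on this page, while filter_to_proc is an assumed implementation (its real source is not shown here), including the guess that a string filter is treated as a space-separated list of words.

```ruby
# Hypothetical stand-in for illustration; not the gem's source.
class SketchTokeniser
  TOKEN_REGEXP = /[\p{Alpha}\-']+/

  def initialize(input)
    @input = input
  end

  # Same body as the #tokenise listing on this page.
  def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
    filter_proc = filter_to_proc(exclude)
    @input.scan(pattern).map(&:downcase).reject { |token| filter_proc.call(token) }
  end

  private

  # ASSUMPTION: one possible way to turn each documented filter type
  # into a predicate that returns true for tokens to be removed.
  def filter_to_proc(filter)
    case filter
    when nil
      ->(_token) { false }                      # nothing is excluded
    when Array
      procs = filter.map { |f| filter_to_proc(f) }
      ->(token) { procs.any? { |p| p.call(token) } }
    when String
      words = filter.downcase.split             # assumed: space-separated word list
      ->(token) { words.include?(token) }
    when Regexp
      ->(token) { filter.match?(token) }
    when Symbol, Proc
      filter.to_proc                            # a Symbol calls that method on the token
    else
      raise ArgumentError, "unsupported filter: #{filter.inspect}"
    end
  end
end

tokeniser = SketchTokeniser.new("Bad, bad, piggy-wig!")

all_tokens = tokeniser.tokenise                                  # no filter
by_string  = tokeniser.tokenise(exclude: "bad")                  # string filter
by_regexp  = tokeniser.tokenise(exclude: /\Ap/)                  # regexp filter
by_lambda  = tokeniser.tokenise(exclude: ->(t) { t.length < 4 }) # lambda filter
by_array   = tokeniser.tokenise(exclude: ["bad", /wig/])         # array of filters

# A symbol must name a predicate method on String, such as :ascii_only?.
by_symbol = SketchTokeniser.new("caf\u00e9 bar").tokenise(exclude: :ascii_only?)
```

Under these assumptions, all_tokens is ["bad", "bad", "piggy-wig"], the string and lambda filters both leave ["piggy-wig"], the regexp filter leaves ["bad", "bad"], the array filter removes every token, and the symbol filter keeps only the non-ASCII token. The gem's actual handling of each type may differ in detail.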