Class: WordsCounted::Tokeniser

Inherits: Object

Defined in: lib/words_counted/tokeniser.rb

Constant Summary

TOKEN_REGEXP = /[\p{Alpha}\-']+/
Default tokenisation strategy.
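To make the default strategy concrete, here is a minimal sketch that applies the same pattern with plain String#scan. The names default_pattern and sample are illustrative only and are not part of the library.

```ruby
# The default pattern as given on this page: runs of letters, hyphens, and apostrophes.
default_pattern = /[\p{Alpha}\-']+/

# Hypothetical sample input; digits and punctuation are not matched.
sample = "Don't stop-me now, 42 times!"

tokens = sample.scan(default_pattern)
# => ["Don't", "stop-me", "now", "times"]
```

Note that #tokenise also downcases each match, so the tokens it returns for this input would be lowercase.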

Instance Method Summary

#initialize(input) ⇒ Tokeniser (constructor)
Initialises state with the string to be tokenised.
#tokenise(pattern: TOKEN_REGEXP, exclude: nil) ⇒ Array
Converts a string into an array of tokens using a regular expression.

Constructor Details

#initialize(input) ⇒ `Tokeniser`

Initialises state with the string to be tokenised.

Parameters:

input (String) —
The string to tokenise




# File 'lib/words_counted/tokeniser.rb', line 21

def initialize(input)
  @input = input
end

Instance Method Details

#tokenise(pattern: TOKEN_REGEXP, exclude: nil) ⇒ `Array`

Converts a string into an array of tokens using a regular expression. If no regexp is provided, the default is used; see Tokeniser::TOKEN_REGEXP. Tokens are downcased before filtering.

Use exclude to remove tokens from the final list. exclude can be a string, a regular expression, a lambda, a symbol, or an array mixing any of those types. When an array is given, a token is removed if any element matches it.

If a symbol is passed, it must name a predicate method that each token (a String) responds to, such as :ascii_only?.

Examples:

WordsCounted::Tokeniser.new("Hello World").tokenise
# => ['hello', 'world']

With `pattern`

WordsCounted::Tokeniser.new("Hello-Mohamad").tokenise(pattern: /[^-]+/)
# => ['hello', 'mohamad']

With `exclude` as a string

WordsCounted::Tokeniser.new("Hello Sami").tokenise(exclude: "hello")
# => ['sami']

With `exclude` as a regexp

WordsCounted::Tokeniser.new("Hello Dani").tokenise(exclude: /hello/i)
# => ['dani']

With `exclude` as a lambda

WordsCounted::Tokeniser.new("Goodbye Sami").tokenise(
  exclude: ->(token) { token.length > 6 }
)
# => ['sami']

With `exclude` as a symbol

WordsCounted::Tokeniser.new("Hello محمد").tokenise(exclude: :ascii_only?)
# => ['محمد']

With `exclude` as an array of strings

WordsCounted::Tokeniser.new("Goodbye Sami and hello Dani").tokenise(
  exclude: ["goodbye hello"]
)
# => ['sami', 'and', 'dani']

With `exclude` as an array of regular expressions

WordsCounted::Tokeniser.new("Goodbye and hello Dani").tokenise(
  exclude: [/goodbye/i, /and/i]
)
# => ['hello', 'dani']

With `exclude` as an array of lambdas

t = WordsCounted::Tokeniser.new("Special Agent 007")
t.tokenise(
  exclude: [
    ->(token) { token.to_i.odd? },
    ->(token) { token.length > 5 }
  ]
)
# => ['agent']

With `exclude` as a mixed array

t = WordsCounted::Tokeniser.new("Hello! اسماءنا هي محمد، كارولينا، سامي، وداني")
t.tokenise(
  exclude: [
    :ascii_only?,
    /محمد/,
    ->(token) { token.length > 6 },
    "و"
  ]
)
# => ["هي", "سامي", "وداني"]

Parameters:

pattern (Regexp) (defaults to: TOKEN_REGEXP) —
The regular expression used to match tokens
exclude (Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol, nil) (defaults to: nil) —
The filter to apply

Returns:

(Array) —
The array of filtered tokens

# File 'lib/words_counted/tokeniser.rb', line 97

def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
  filter_proc = filter_to_proc(exclude)
  @input.scan(pattern).map(&:downcase).reject { |token| filter_proc.call(token) }
end
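The method above calls filter_to_proc, which is not shown on this page. As a rough illustration of how the accepted exclude types could be normalised into a single predicate, here is a hypothetical sketch. The names sketch_filter_to_proc and sketch_tokenise are invented for this example, and the whitespace splitting of strings is an assumption inferred from the "array of strings" example above; the library's actual implementation may differ.

```ruby
# Hypothetical sketch, not the library's code: turn any supported filter
# into a proc that returns true for tokens that should be removed.
def sketch_filter_to_proc(filter)
  case filter
  when nil
    ->(_token) { false } # no filter: keep everything
  when Array
    procs = filter.map { |f| sketch_filter_to_proc(f) }
    ->(token) { procs.any? { |p| p.call(token) } } # remove if any element matches
  when String
    # Assumption: a string is a whitespace-separated list of words to exclude.
    words = filter.split.map(&:downcase)
    ->(token) { words.include?(token) }
  when Regexp
    ->(token) { filter.match?(token) }
  when Symbol, Proc
    filter.to_proc # a symbol must name a predicate method on String
  else
    raise ArgumentError, "unsupported filter: #{filter.inspect}"
  end
end

# Same shape as #tokenise above, written as a free function for the sketch.
def sketch_tokenise(input, pattern: /[\p{Alpha}\-']+/, exclude: nil)
  filter = sketch_filter_to_proc(exclude)
  input.scan(pattern).map(&:downcase).reject { |token| filter.call(token) }
end

sketch_tokenise("Goodbye Sami and hello Dani", exclude: ["goodbye hello", /and/i])
# => ["sami", "dani"]
```

Reducing every filter to one proc keeps the tokenising step itself to a single scan, map, and reject chain, whatever mix of filters the caller supplies.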
