Module: TextRank::Tokenizer

Defined in:
lib/text_rank/tokenizer.rb,
lib/text_rank/tokenizer/url.rb,
lib/text_rank/tokenizer/word.rb,
lib/text_rank/tokenizer/money.rb,
lib/text_rank/tokenizer/number.rb,
lib/text_rank/tokenizer/whitespace.rb,
lib/text_rank/tokenizer/punctuation.rb

Overview

Tokenizers are responsible for transforming a single String of text into an array of potential keywords ("tokens"). The only requirement of a token is that it be non-empty. When used in combination with token filters, it may make sense for a tokenizer to temporarily create tokens that seem like ill-suited keywords, because a token filter may use these "bad" keywords to help decide which tokens to keep and which to drop. For example, the part-of-speech token filter uses punctuation tokens to help guess the part of speech of each non-punctuation token.

When tokenizing a piece of text, the Tokenizer will combine one or more regular expressions (in the order given) to scan the text for matches. As such you need only tell the tokenizer which tokens you want; everything else will be ignored.
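
As a minimal sketch of why the order matters, the snippet below joins copies of the Url and Word patterns documented on this page with `|` and scans a sample sentence twice, once per ordering. The sample text and the `extract` lambda are illustrative assumptions, not part of the library.

```ruby
# Copies of the Url and Word patterns documented on this page.
url = %r{
  (
    (?:[\w-]+://?|www[.])
    [^\s()<>]+
    (?:
      \([\w\d]+\)
      |
      (?:[^[:punct:]\s]
      |
      /)
    )
  )
}xi
word = /([a-z][a-z0-9-]*)/i

text = "See www.example.com/docs, then reply."

# Hypothetical helper: join the patterns with '|' and keep the first
# non-nil capture group of every match.
extract = lambda do |patterns|
  text.scan(Regexp.new(patterns.join('|'))).map { |groups| groups.compact.first }
end

url_first  = extract.call([url, word])
# ["See", "www.example.com/docs", "then", "reply"]
word_first = extract.call([word, url])
# ["See", "www", "example", "com", "docs", "then", "reply"]
```

With Url first, the address survives as one token; with Word first, it is split at each punctuation mark. Characters that no pattern matches (spaces, the comma, the final period) are skipped.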

Constant Summary

Url =

A tokenizer regex that preserves entire URLs as a single token (rather than splitting them up)

%r{
  (
    (?:[\w-]+://?|www[.])
    [^\s()<>]+
    (?:
      \([\w\d]+\)
      |
      (?:[^[:punct:]\s]
      |
      /)
    )
  )
}xi
Word =

A tokenizer regex that preserves a non-space, non-punctuation "word". It allows hyphens and numerals, but the first character must be a letter (A-Z; matching is case-insensitive).

/
  (
    [a-z][a-z0-9-]*
  )
/xi
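
The following sketch probes a few edge cases of the Word pattern. The sample strings are illustrative assumptions; the pattern is copied from the definition above.

```ruby
# Copy of the Word pattern documented above.
word = /
  (
    [a-z][a-z0-9-]*
  )
/xi

# Hyphens are allowed after the first character, so this stays whole.
hyphenated = "state-of-the-art".scan(word).flatten
# ["state-of-the-art"]

# A token cannot start with a digit, so the leading "3" is dropped.
leading_digit = "3D printing".scan(word).flatten
# ["D", "printing"]

# Only ASCII letters are in the character class, so an accented
# letter (U+00EF) splits the word.
accented = "na\u00EFve".scan(word).flatten
# ["na", "ve"]
```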
Money =

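A tokenizer regex that preserves monetary amounts (a currency symbol from CURRENCY_SYMBOLS followed by a Number) as a single token. It also supports two alternative formats for negatives (a leading minus sign and accounting-style parentheses) as well as optional three-digit comma separation and optional decimals.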

/
  (
    #{CURRENCY_SYMBOLS} -? #{Number}       # $-45,231.21
    |
    -? #{CURRENCY_SYMBOLS} #{Number}       # -$45,231.21
    |
    \( #{CURRENCY_SYMBOLS} #{Number} \)    # ($45,231.21)
  )
/x
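
A rough sketch of the three Money formats in action follows. The value of CURRENCY_SYMBOLS is not shown on this page, so the Unicode currency-symbol property `\p{Sc}` is used here as a stand-in; that choice, the condensed copy of the Number pattern, and the sample sentence are all assumptions made for illustration.

```ruby
# Assumption: stand-in for CURRENCY_SYMBOLS, whose real value is not shown here.
currency_symbols = '\p{Sc}'

# Condensed, non-capturing copy of the Number pattern documented in this section.
number = /(?:[1-9]\d{3,}(?:\.\d+)?|[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?|0(?:\.\d+)?|\.\d+)/

# Same three alternatives as the Money pattern above.
money = /
  (
    #{currency_symbols} -? #{number}
    |
    -? #{currency_symbols} #{number}
    |
    \( #{currency_symbols} #{number} \)
  )
/x

text = "Net loss of ($45,231.21) versus -$300 last year and $1200.50 before."
amounts = text.scan(money).flatten
# ["($45,231.21)", "-$300", "$1200.50"]
```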
Number =

A tokenizer regex that preserves (optionally formatted) numbers as a single token.

/
  (
    [1-9]\d{3,}       # 453231162
    (?:\.\d+)?        # 453231162.17

    |

    [1-9]\d{0,2}      # 453
    (?:,\d{3})*       # 453,231,162
    (?:\.\d+)?        # 453,231,162.17

    |

    0                 # 0
    (?:\.\d+)?        # 0.17

    |

    (?:\.\d+)         # .17
  )
/x
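
To illustrate how the four alternatives of the Number pattern behave, here is a small sketch that scans a handful of sample strings. The samples are illustrative assumptions; the pattern is copied from the definition above.

```ruby
# Copy of the Number pattern documented above.
number = /
  (
    [1-9]\d{3,}       # 453231162
    (?:\.\d+)?        # 453231162.17
    |
    [1-9]\d{0,2}      # 453
    (?:,\d{3})*       # 453,231,162
    (?:\.\d+)?        # 453,231,162.17
    |
    0                 # 0
    (?:\.\d+)?        # 0.17
    |
    (?:\.\d+)         # .17
  )
/x

formatted = "453,231,162.17".scan(number).flatten
# ["453,231,162.17"]
plain     = "1234.5".scan(number).flatten
# ["1234.5"]
fraction  = ".17".scan(number).flatten
# [".17"]

# Leading zeros are not part of any multi-digit alternative,
# so each zero becomes its own token.
padded    = "007".scan(number).flatten
# ["0", "0", "7"]
```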
Whitespace =

A tokenizer regex that preserves single whitespace characters as a token. Use this if one or more of your TokenFilter classes need whitespace in order to make decisions.

/\s/
Punctuation =

A tokenizer regex that preserves single punctuation symbols as a token. Use this if one or more of your TokenFilter classes need punctuation in order to make decisions.

/(\p{Punct})/
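
As a brief sketch of what a token filter would receive when punctuation is kept, the snippet below combines copies of the Word and Punctuation patterns. The sample text is an illustrative assumption.

```ruby
# Copies of the Word and Punctuation patterns documented above.
word        = /([a-z][a-z0-9-]*)/i
punctuation = /(\p{Punct})/

combined = Regexp.new([word, punctuation].join('|'))

# Keep the first non-nil capture group of every match.
tokens = "Wait, what?!".scan(combined).map { |groups| groups.compact.first }
# ["Wait", ",", "what", "?", "!"]
```

Each punctuation mark becomes its own single-character token, in document order, while the space (matched by neither pattern) is dropped.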

Class Method Summary

Class Method Details

.tokenize(text, *regular_expressions) ⇒ Array<String>

Performs tokenization of a piece of text using one or more tokenizer regular expressions.

Parameters:

  • text (String)
  • regular_expressions (Array<Regexp|String>)

Returns:

  • (Array<String>)


# File 'lib/text_rank/tokenizer.rb', line 30

def self.tokenize(text, *regular_expressions)
  tokens = []
  text.scan(Regexp.new(regular_expressions.flatten.join('|'))) do |matches|
    m = matches.compact.first
    tokens << m if m&.size&.positive?
  end
  tokens
end
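
A usage sketch follows. To stay self-contained it copies the method body shown above into a hypothetical `TokenizerSketch` module rather than loading the library; the module name, the patterns, and the sample strings are assumptions for illustration.

```ruby
# Hypothetical stand-in module so the example runs without the gem.
module TokenizerSketch
  # Same body as the method shown above.
  def self.tokenize(text, *regular_expressions)
    tokens = []
    text.scan(Regexp.new(regular_expressions.flatten.join('|'))) do |matches|
      m = matches.compact.first
      tokens << m if m&.size&.positive?
    end
    tokens
  end
end

word = /([a-z][a-z0-9-]*)/i

# Patterns may be Regexp or String, and may be nested in arrays.
mixed = TokenizerSketch.tokenize("Ruby 3 is fast", '(\d+)', [word])
# ["Ruby", "3", "is", "fast"]

# Tokens are read from capture groups, so a pattern contributes
# tokens only if its match sits inside a group.
ungrouped = TokenizerSketch.tokenize("a b", word, /\s/)
# ["a", "b"]
grouped   = TokenizerSketch.tokenize("a b", word, /(\s)/)
# ["a", " ", "b"]
```

Because the method takes the first non-nil capture group of each match, a pattern combined this way needs a capturing group around whatever should become the token.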