Module: TextRank::CharFilter

Defined in:
lib/text_rank/char_filter.rb,
lib/text_rank/char_filter/lowercase.rb,
lib/text_rank/char_filter/strip_html.rb,
lib/text_rank/char_filter/strip_email.rb,
lib/text_rank/char_filter/ascii_folding.rb,
lib/text_rank/char_filter/strip_possessive.rb,
lib/text_rank/char_filter/undo_contractions.rb

Overview

Character filters pre-process text before tokenization. This is the phase in which the text should be "cleaned up" so that the tokenizer produces valid tokens. Character filters should not attempt to remove undesired tokens, however; that is the job of the token filter. Examples include converting non-ASCII characters to related ASCII characters, forcing text to lower case, stripping out HTML, and converting English contractions (e.g. "won't") to their non-contracted form ("will not"), among others.
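As a rough illustration of the idea, the sketch below shows a string-in, string-out cleanup step of the kind described above. The method name `sketch_strip_tags` is hypothetical and is not part of the TextRank API; the regex is a simplification and not a substitute for a real HTML parser.

```ruby
# Hypothetical helper for illustration only; this is NOT the gem's StripHtml class.
# A character filter takes raw text and returns cleaned text, before tokenization.
def sketch_strip_tags(text)
  # Replace anything that looks like a tag with a space so that adjacent
  # words do not fuse together, then collapse the extra spaces.
  text.gsub(/<[^>]*>/, " ").squeeze(" ").strip
end

cleaned = sketch_strip_tags("<p>Hello <b>world</b></p>")
puts cleaned # prints "Hello world"
```

Note that the helper only rewrites characters. It does not decide which words are worth keeping, which is left to a token filter later in the pipeline.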

Character filters are applied as a chain: each filter operates on the output of the one before it, so care should be taken to list them in the desired order.
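The effect of ordering can be sketched with two stand-in filters. Everything here is an assumption made for illustration: the lambdas, the tiny contraction table, and `apply_chain` are hypothetical and do not reflect how the gem's `Lowercase` or `UndoContractions` classes are implemented.

```ruby
# Hypothetical stand-in filters; the names and the contraction table are assumptions.
contractions = { "won't" => "will not", "can't" => "cannot" }.freeze

lowercase = ->(text) { text.downcase }

# Case-sensitive on purpose, so that the effect of ordering is visible.
undo_contractions = lambda do |text|
  text.gsub(/\b(?:won't|can't)\b/) { |match| contractions.fetch(match) }
end

# Apply filters left to right; each one receives the previous one's output.
apply_chain = lambda do |text, filters|
  filters.reduce(text) { |current, filter| filter.call(current) }
end

input = "Won't you stay?"
lower_first = apply_chain.call(input, [lowercase, undo_contractions])
undo_first  = apply_chain.call(input, [undo_contractions, lowercase])

puts lower_first # prints "will not you stay?"
puts undo_first  # prints "won't you stay?"
```

In this sketch, running the lowercase step first lets the case-sensitive contraction lookup match. With the order reversed, the capitalized "Won't" slips through unchanged and is only lowercased afterwards.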

Defined Under Namespace

Classes: AsciiFolding, Lowercase, StripEmail, StripHtml, StripPossessive, UndoContractions