Class: Ferret::Analysis::RegExpTokenizer
- Inherits: Object
  - Object
    - Ferret::Analysis::RegExpTokenizer
- Defined in: ext/r_analysis.c
Overview
Summary
A tokenizer that recognizes tokens based on a regular expression passed to the constructor. Most tokenizers can be built with this class by supplying a suitable pattern.
Example
Below is a simple implementation of a LetterTokenizer built on a RegExpTokenizer. Each token is a maximal run of alphabetic characters; any run of one or more non-alphabetic characters acts as a separator.
# of course you would add more than just é
RegExpTokenizer.new(input, /[[:alpha:]é]+/)
"Dave's résumé, at http://www.davebalmain.com/ 1234"
=> ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
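To see which substrings a given pattern would pick out before handing it to the tokenizer, the matching can be approximated in plain Ruby with String#scan. This is only a sketch of the matching behaviour described above, not Ferret's implementation: it does not use Ferret at all, the helper name letter_tokens is hypothetical, and it assumes Ruby 1.9 or later with UTF-8 strings, where [[:alpha:]] already matches accented letters such as é.

```ruby
# Hypothetical helper (not part of Ferret): returns every maximal run of
# alphabetic characters in +text+, mimicking what a letter tokenizer emits.
# Assumes a UTF-8 string, so [[:alpha:]] also matches letters such as "é".
LETTER_PATTERN = /[[:alpha:]]+/

def letter_tokens(text)
  # String#scan returns all non-overlapping matches, left to right.
  text.scan(LETTER_PATTERN)
end

p letter_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
# => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
```

Digits and punctuation never match the pattern, which is why "1234" and the URL separators are absent from the result.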