Class: Ferret::Analysis::StemFilter
- Inherits: Object
  - Object
  - Ferret::Analysis::StemFilter
- Defined in: ext/r_analysis.c
Overview
Summary
A StemFilter takes a term and transforms it according to the Snowball stemming algorithm. Note: the input to the stemming filter must already be in lower case, so you will need to use a LowerCaseFilter or a lowercasing Tokenizer earlier in the TokenStream chain for this to work properly!
Available algorithms and encodings
Algorithm     | Algorithm Pseudonyms     | Encoding
----------------------------------------------------------------
"danish",     | "da", "dan"              | "ISO_8859_1", "UTF_8"
"dutch",      | "dut", "nld"             | "ISO_8859_1", "UTF_8"
"english",    | "en", "eng"              | "ISO_8859_1", "UTF_8"
"finnish",    | "fi", "fin"              | "ISO_8859_1", "UTF_8"
"french",     | "fr", "fra", "fre"       | "ISO_8859_1", "UTF_8"
"german",     | "de", "deu", "ge", "ger" | "ISO_8859_1", "UTF_8"
"hungarian",  | "hu", "hun"              | "ISO_8859_1", "UTF_8"
"italian",    | "it", "ita"              | "ISO_8859_1", "UTF_8"
"norwegian",  | "nl", "no"               | "ISO_8859_1", "UTF_8"
"porter",     |                          | "ISO_8859_1", "UTF_8"
"portuguese", | "por", "pt"              | "ISO_8859_1", "UTF_8"
"romanian",   | "ro", "ron", "rum"       | "ISO_8859_2", "UTF_8"
"russian",    | "ru", "rus"              | "KOI8_R", "UTF_8"
"spanish",    | "es", "esl"              | "ISO_8859_1", "UTF_8"
"swedish",    | "sv", "swe"              | "ISO_8859_1", "UTF_8"
"turkish",    | "tr", "tur"              | "UTF_8"
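The table can be read as a lookup from a name or pseudonym to a canonical algorithm and its listed encodings. The following is a minimal pure-Ruby sketch of that reading. It does not call Ferret, and STEMMERS, resolve_algorithm and encoding_supported? are hypothetical names introduced for illustration only. It also assumes exact, case-sensitive matching, which may differ from how Ferret itself normalizes its arguments.

```ruby
# Illustrative lookup table transcribed from the table above.
# STEMMERS, resolve_algorithm and encoding_supported? are hypothetical
# helper names for this sketch; they are not part of Ferret's API.
STEMMERS = {
  "danish"     => { aliases: %w[da dan],       encodings: %w[ISO_8859_1 UTF_8] },
  "dutch"      => { aliases: %w[dut nld],      encodings: %w[ISO_8859_1 UTF_8] },
  "english"    => { aliases: %w[en eng],       encodings: %w[ISO_8859_1 UTF_8] },
  "finnish"    => { aliases: %w[fi fin],       encodings: %w[ISO_8859_1 UTF_8] },
  "french"     => { aliases: %w[fr fra fre],   encodings: %w[ISO_8859_1 UTF_8] },
  "german"     => { aliases: %w[de deu ge ger], encodings: %w[ISO_8859_1 UTF_8] },
  "hungarian"  => { aliases: %w[hu hun],       encodings: %w[ISO_8859_1 UTF_8] },
  "italian"    => { aliases: %w[it ita],       encodings: %w[ISO_8859_1 UTF_8] },
  "norwegian"  => { aliases: %w[nl no],        encodings: %w[ISO_8859_1 UTF_8] },
  "porter"     => { aliases: [],               encodings: %w[ISO_8859_1 UTF_8] },
  "portuguese" => { aliases: %w[por pt],       encodings: %w[ISO_8859_1 UTF_8] },
  "romanian"   => { aliases: %w[ro ron rum],   encodings: %w[ISO_8859_2 UTF_8] },
  "russian"    => { aliases: %w[ru rus],       encodings: %w[KOI8_R UTF_8] },
  "spanish"    => { aliases: %w[es esl],       encodings: %w[ISO_8859_1 UTF_8] },
  "swedish"    => { aliases: %w[sv swe],       encodings: %w[ISO_8859_1 UTF_8] },
  "turkish"    => { aliases: %w[tr tur],       encodings: %w[UTF_8] }
}.freeze

# Return the canonical algorithm name for a name or pseudonym,
# or nil when the table has no matching entry.
def resolve_algorithm(name)
  key = name.to_s
  return key if STEMMERS.key?(key)

  match = STEMMERS.find { |_algorithm, info| info[:aliases].include?(key) }
  match && match.first
end

# True when the table lists the given encoding for the algorithm
# (identified by its name or any of its pseudonyms).
def encoding_supported?(name, encoding)
  algorithm = resolve_algorithm(name)
  return false unless algorithm

  STEMMERS[algorithm][:encodings].include?(encoding)
end
```

With this table, resolve_algorithm("ger") returns "german", and encoding_supported?("tr", "ISO_8859_1") returns false because the Turkish row lists only "UTF_8".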
New Stemmers
The following stemmers have recently been added. Please try them out:
* Hungarian
* Romanian
* Turkish
Example
To use this filter with other analyzers, you’ll want to write an Analyzer class that sets up the TokenStream chain as you want it. To use this with a lowercasing Tokenizer, for example, you’d write an analyzer like this:
class MyAnalyzer < Analyzer
  def token_stream(field, str)
    # Lowercase before stemming: StemFilter expects lower-case input.
    return StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
  end
end
"debate debates debated debating debater"
=> ["debat", "debat", "debat", "debat", "debat"]
Attributes
- token_stream: TokenStream to be filtered
- algorithm: The algorithm (or language) to use (default: “english”)
- encoding: The encoding of the data (default: “UTF-8”)