Module: Punkt
- Defined in:
- lib/punkt-segmenter/punkt.rb,
lib/punkt-segmenter/punkt/base.rb,
lib/punkt-segmenter/punkt/token.rb,
lib/punkt-segmenter/punkt/trainer.rb,
lib/punkt-segmenter/punkt/parameters.rb,
lib/punkt-segmenter/punkt/language_vars.rb,
lib/punkt-segmenter/punkt/sentence_tokenizer.rb
Overview
Ruby implementation of Punkt sentence tokenizer
This code is a Ruby port of the algorithm implemented by the NLTK Project. It is distributed under the terms and conditions of the Apache License v2 (www.apache.org/licenses/LICENSE-2.0).
Copyright © 2001-2010 NLTK Project
Algorithm: Kiss & Strunk (2006)
Authors:
- Willy (original Python port)
- Steven Bird (additions)
- Edward Loper (rewrite)
- Joel Nothman (almost rewrite)
- Luis Cipriani (Ruby port)
URL: <www.nltk.org/>
The Punkt sentence tokenizer. The algorithm for this tokenizer is described in Kiss & Strunk (2006):
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
Defined Under Namespace
Classes: Base, LanguageVars, Parameters, SentenceTokenizer, Token, Trainer
Constant Summary
Orthographic context constants
- ORTHO_BEG_UC =
Orthographic context: beginning of sentence with upper case
1 << 1
- ORTHO_MID_UC =
Orthographic context: middle of sentence with upper case
1 << 2
- ORTHO_UNK_UC =
Orthographic context: unknown position in a sentence with upper case
1 << 3
- ORTHO_BEG_LC =
Orthographic context: beginning of sentence with lower case
1 << 4
- ORTHO_MID_LC =
Orthographic context: middle of sentence with lower case
1 << 5
- ORTHO_UNK_LC =
Orthographic context: unknown position in a sentence with lower case
1 << 6
- ORTHO_UC =
Orthographic context: occurs with upper case (any position)
ORTHO_BEG_UC + ORTHO_MID_UC + ORTHO_UNK_UC
- ORTHO_LC =
Orthographic context: occurs with lower case (any position)
ORTHO_BEG_LC + ORTHO_MID_LC + ORTHO_UNK_LC
- ORTHO_MAP =
{
  [:initial, :upper]  => ORTHO_BEG_UC,
  [:internal, :upper] => ORTHO_MID_UC,
  [:unknown, :upper]  => ORTHO_UNK_UC,
  [:initial, :lower]  => ORTHO_BEG_LC,
  [:internal, :lower] => ORTHO_MID_LC,
  [:unknown, :lower]  => ORTHO_UNK_LC,
}
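The constants above are single-bit flags, so several orthographic observations about one word type can be packed into a single integer. The following is a minimal sketch of that idea, not the library's own code: the `OrthoSketch` module and its `record` helper are hypothetical names introduced here, and the constant values are copied from the summary above so the example runs without the gem installed.

```ruby
# Standalone sketch: values copied from the constant summary above.
# OrthoSketch and OrthoSketch.record are hypothetical, for illustration only.
module OrthoSketch
  ORTHO_BEG_UC = 1 << 1
  ORTHO_MID_UC = 1 << 2
  ORTHO_UNK_UC = 1 << 3
  ORTHO_BEG_LC = 1 << 4
  ORTHO_MID_LC = 1 << 5
  ORTHO_UNK_LC = 1 << 6

  # Each flag occupies a distinct bit, so + gives the same result as |.
  ORTHO_UC = ORTHO_BEG_UC + ORTHO_MID_UC + ORTHO_UNK_UC
  ORTHO_LC = ORTHO_BEG_LC + ORTHO_MID_LC + ORTHO_UNK_LC

  ORTHO_MAP = {
    [:initial, :upper]  => ORTHO_BEG_UC,
    [:internal, :upper] => ORTHO_MID_UC,
    [:unknown, :upper]  => ORTHO_UNK_UC,
    [:initial, :lower]  => ORTHO_BEG_LC,
    [:internal, :lower] => ORTHO_MID_LC,
    [:unknown, :lower]  => ORTHO_UNK_LC,
  }.freeze

  # Merge one (position, casing) observation into an existing context value.
  # Unknown pairs contribute no bits.
  def self.record(context, position, casing)
    context | ORTHO_MAP.fetch([position, casing], 0)
  end
end

# Accumulate two observations for the same word type.
context = 0
context = OrthoSketch.record(context, :initial, :upper)
context = OrthoSketch.record(context, :internal, :lower)

# Query the accumulated flags with bitwise AND.
seen_upper     = (context & OrthoSketch::ORTHO_UC) != 0
seen_mid_upper = (context & OrthoSketch::ORTHO_MID_UC) != 0

puts "seen with upper case anywhere: #{seen_upper}"
puts "seen sentence-internally with upper case: #{seen_mid_upper}"
```

Testing against `ORTHO_UC` or `ORTHO_LC` answers "has this word ever appeared with this casing, at any position?", while testing against an individual flag answers the position-specific question.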