Module: Punkt

Defined in:
lib/punkt-segmenter/punkt.rb,
lib/punkt-segmenter/punkt/base.rb,
lib/punkt-segmenter/punkt/token.rb,
lib/punkt-segmenter/punkt/trainer.rb,
lib/punkt-segmenter/punkt/parameters.rb,
lib/punkt-segmenter/punkt/language_vars.rb,
lib/punkt-segmenter/punkt/sentence_tokenizer.rb

Overview

Ruby implementation of the Punkt sentence tokenizer.

This code is a Ruby port of the algorithm implemented by the NLTK Project. It is distributed under the terms of the Apache License, Version 2.0 (www.apache.org/licenses/LICENSE-2.0).

Copyright © 2001-2010 NLTK Project

Algorithm: Kiss & Strunk (2006)

Authors:

Willy (original Python port)
Steven Bird (additions)
Edward Loper (rewrite)
Joel Nothman (almost rewrite)
Luis Cipriani (Ruby port)

URL: <www.nltk.org/>

The Punkt sentence tokenizer. The algorithm for this tokenizer is described in Kiss & Strunk (2006):

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.

Defined Under Namespace

Classes: Base, LanguageVars, Parameters, SentenceTokenizer, Token, Trainer

Constant Summary

Orthographic context constants

ORTHO_BEG_UC =

Orthographic context: beginning of sentence with upper case

1 << 1
ORTHO_MID_UC =

Orthographic context: middle of sentence with upper case

1 << 2
ORTHO_UNK_UC =

Orthographic context: unknown position in a sentence with upper case

1 << 3
ORTHO_BEG_LC =

Orthographic context: beginning of sentence with lower case

1 << 4
ORTHO_MID_LC =

Orthographic context: middle of sentence with lower case

1 << 5
ORTHO_UNK_LC =

Orthographic context: unknown position in a sentence with lower case

1 << 6
ORTHO_UC =

Orthographic context: occurs with upper case (any position)

ORTHO_BEG_UC + ORTHO_MID_UC + ORTHO_UNK_UC
ORTHO_LC =

Orthographic context: occurs with lower case (any position)

ORTHO_BEG_LC + ORTHO_MID_LC + ORTHO_UNK_LC
ORTHO_MAP =
{
  [:initial,  :upper] => ORTHO_BEG_UC,
  [:internal, :upper] => ORTHO_MID_UC,
  [:unknown,  :upper] => ORTHO_UNK_UC,
  [:initial,  :lower] => ORTHO_BEG_LC,
  [:internal, :lower] => ORTHO_MID_LC,
  [:unknown,  :lower] => ORTHO_UNK_LC,
}
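
Because each orthographic constant occupies its own bit, several observations about a word type can be combined into one integer and queried with bitwise operators. The following is a minimal sketch of that idea, not the library's own code: it copies the constants into a hypothetical OrthoDemo module so that it runs without the gem, and context_for is an illustrative helper that does not exist in Punkt.

```ruby
# Standalone sketch. OrthoDemo and context_for are hypothetical names used
# only for illustration; the constant values are copied from the summary above.
module OrthoDemo
  ORTHO_BEG_UC = 1 << 1
  ORTHO_MID_UC = 1 << 2
  ORTHO_UNK_UC = 1 << 3
  ORTHO_BEG_LC = 1 << 4
  ORTHO_MID_LC = 1 << 5
  ORTHO_UNK_LC = 1 << 6

  # Masks covering every upper-case or every lower-case position.
  ORTHO_UC = ORTHO_BEG_UC + ORTHO_MID_UC + ORTHO_UNK_UC
  ORTHO_LC = ORTHO_BEG_LC + ORTHO_MID_LC + ORTHO_UNK_LC

  ORTHO_MAP = {
    [:initial,  :upper] => ORTHO_BEG_UC,
    [:internal, :upper] => ORTHO_MID_UC,
    [:unknown,  :upper] => ORTHO_UNK_UC,
    [:initial,  :lower] => ORTHO_BEG_LC,
    [:internal, :lower] => ORTHO_MID_LC,
    [:unknown,  :lower] => ORTHO_UNK_LC
  }.freeze

  # Fold a list of [position, case] observations into a single bit mask.
  def self.context_for(observations)
    observations.reduce(0) { |mask, obs| mask | ORTHO_MAP.fetch(obs) }
  end
end

# A word type seen capitalized at the start of a sentence and
# lower-cased in the middle of one.
context = OrthoDemo.context_for([[:initial, :upper], [:internal, :lower]])

seen_upper     = (context & OrthoDemo::ORTHO_UC) != 0      # any upper-case use
seen_lower_mid = (context & OrthoDemo::ORTHO_MID_LC) != 0  # lower case mid-sentence
seen_upper_mid = (context & OrthoDemo::ORTHO_MID_UC) != 0  # upper case mid-sentence

puts "context=#{context} upper=#{seen_upper} lower_mid=#{seen_lower_mid} upper_mid=#{seen_upper_mid}"
```

Keying ORTHO_MAP on a [position, case] pair lets a caller translate what it observed about a token directly into the matching flag, while the ORTHO_UC and ORTHO_LC masks answer "was this ever seen in upper (or lower) case?" with a single AND.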