Class: Treat::Workers::Processors::Segmenters::Punkt
- Inherits: Object
- Object
- Treat::Workers::Processors::Segmenters::Punkt
- Defined in:
- lib/treat/workers/processors/segmenters/punkt.rb
Overview
Sentence segmentation based on a set of log-likelihood-based heuristics that infer abbreviations and common sentence starters from a large text corpus. Easily adaptable, but requires a large (unlabeled) in-domain corpus for assembling statistics.
Original paper: Kiss, Tibor and Strunk, Jan. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32:485-525.
Constant Summary
- @@segmenters = {}
Hold one copy of the segmenter per language.
- @@trainers = {}
Hold only one trainer per language.
Class Method Summary
- .segment(entity, options = {}) ⇒ Object
Segment a text using the Punkt segmenter gem.
- .set_options(lang, options) ⇒ Object
Class Method Details
.segment(entity, options = {}) ⇒ Object
Segment a text using the Punkt segmenter gem. The models included with this segmenter have each been trained on one or two lengthy books in the corresponding language.
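Options:
(String) :training_text => Text to train on.
(String) :model => Path to a YAML model file to load instead of the bundled model for the language.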
# File 'lib/treat/workers/processors/segmenters/punkt.rb', line 32
def self.segment(entity, options = {})
  entity.check_hasnt_children
  lang = entity.language
  set_options(lang, options)
  s = entity.to_s

  # Replace the point in all floating-point numbers
  # by ^^; this is a fix since Punkt trips on decimal
  # numbers.
  s.escape_floats!

  # Take out suspension points temporarily.
  s.gsub!('...', '&;&.')
  # Remove abbreviations.
  s.scan(/(?:[A-Za-z]\.){2,}/).each do |abbr|
    s.gsub!(abbr, abbr.gsub(' ', '').gsub('.', '&-&'))
  end
  # Unstick sentences from each other.
  s.gsub!(/([^\.\?!]\.|\!|\?)([^\s"'])/) { $1 + ' ' + $2 }

  result = @@segmenters[lang].
    sentences_from_text(s, :output => :sentences_text)

  result.each do |sentence|
    # Unescape the sentence.
    sentence.unescape_floats!
    # Repair abbreviations in sentences.
    sentence.gsub!('&-&', '.')
    # Repair suspension points.
    sentence.gsub!('&;&.', '...')
    entity << Treat::Entities::Phrase.
      from_string(sentence)
  end
end
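The masking steps in .segment can be tried in isolation. The following is a minimal sketch in plain Ruby, not part of Treat: the PunktMasking module and its method names are hypothetical, and the float regex is only an assumption about what escape_floats! does (the source comment says the decimal point is replaced by ^^). The placeholders and the other two regexes are copied from the listing above.

```ruby
# Hypothetical stand-in for the masking done around the Punkt call;
# PunktMasking is not a Treat class.
module PunktMasking
  ELLIPSIS_MASK = '&;&.'  # placeholder for '...'
  ABBR_DOT_MASK = '&-&'   # placeholder for '.' inside abbreviations
  FLOAT_MASK    = '^^'    # placeholder for the decimal point

  # Hide periods that are not sentence boundaries.
  def self.mask(text)
    s = text.dup
    # Assumed behaviour of escape_floats!: "3.14" becomes "3^^14".
    s.gsub!(/(\d+)\.(\d+)/) { "#{$1}#{FLOAT_MASK}#{$2}" }
    # Take out suspension points temporarily.
    s.gsub!('...', ELLIPSIS_MASK)
    # Dotted abbreviations such as "U.S." lose their periods.
    s.scan(/(?:[A-Za-z]\.){2,}/).uniq.each do |abbr|
      s.gsub!(abbr, abbr.gsub('.', ABBR_DOT_MASK))
    end
    # Insert a space where a sentence end is glued to the next word.
    s.gsub!(/([^\.\?!]\.|\!|\?)([^\s"'])/) { "#{$1} #{$2}" }
    s
  end

  # Restore the original characters in a segmented sentence.
  def self.unmask(sentence)
    sentence.gsub(FLOAT_MASK, '.')
            .gsub(ABBR_DOT_MASK, '.')
            .gsub(ELLIPSIS_MASK, '...')
  end
end

masked = PunktMasking.mask('The U.S. paid 3.14 dollars... Really?Yes.')
puts masked
puts PunktMasking.unmask(masked)
```

After masking, the only periods left in the text are the trailing one in the ellipsis placeholder and those at likely sentence ends, which is what lets the segmenter ignore decimals and dotted abbreviations.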
.set_options(lang, options) ⇒ Object
# File 'lib/treat/workers/processors/segmenters/punkt.rb', line 73
def self.set_options(lang, options)
  return @@segmenters[lang] if @@segmenters[lang]

  if options[:model]
    model = options[:model]
  else
    model_path = Treat.libraries.punkt.model_path ||
      Treat.paths.models + 'punkt/'
    model = model_path + "#{lang}.yaml"
    unless File.readable?(model)
      raise Treat::Exception,
        "Could not get the language model " +
        "for the Punkt segmenter for #{lang.to_s.capitalize}."
    end
  end

  t = ::YAML.load(File.read(model))
  @@segmenters[lang] = ::Punkt::SentenceTokenizer.new(t)
end
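The per-language caching in .set_options follows a common load-once pattern. The sketch below illustrates that pattern on its own, under stated assumptions: ModelCache, its fetch method, and the sample YAML content are hypothetical and not part of Treat or the Punkt gem. It uses YAML.safe_load on plain data, whereas the listing above calls ::YAML.load on a trained model.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical illustration of a load-once cache keyed by language;
# ModelCache is not a Treat class.
class ModelCache
  @@models = {} # one loaded model per language

  def self.fetch(lang, model_dir)
    # Return the cached copy if this language was already loaded.
    return @@models[lang] if @@models[lang]

    path = File.join(model_dir, "#{lang}.yaml")
    unless File.readable?(path)
      raise ArgumentError,
            "Could not get the language model for #{lang.to_s.capitalize}."
    end
    # Plain data only, so safe_load is sufficient in this sketch.
    @@models[lang] = YAML.safe_load(File.read(path))
  end
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'english.yaml'),
             { 'abbreviations' => %w[dr mr] }.to_yaml)
  first  = ModelCache.fetch(:english, dir)
  second = ModelCache.fetch(:english, dir) # served from the cache
  puts first.equal?(second)
end
```

Because the cache is checked first, options passed on later calls for the same language are ignored once a model has been loaded; the early return in .set_options behaves the same way.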