Class: Treat::Workers::Processors::Segmenters::SRX
- Inherits: Object
  - Object
  - Treat::Workers::Processors::Segmenters::SRX
- Defined in: lib/treat/workers/processors/segmenters/srx.rb
Overview
Sentence segmentation based on a set of predefined rules, written in SRX (Segmentation Rules eXchange) format and developed by Marcin Miłkowski.
Original paper: Marcin Miłkowski, Jarosław Lipski, "Using SRX standard for sentence segmentation in LanguageTool", in: Human Language Technologies as a Challenge for Computer Science and Linguistics.
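To illustrate how SRX-style rules drive segmentation, here is a minimal sketch. It is not the srx-english implementation: the `Rule` struct, the `RULES` list and `srx_like_split` are hypothetical names, and the two patterns are toy examples. It only mirrors the general SRX idea that ordered rules pair a "before break" and an "after break" pattern, and that the first rule matching at a position decides whether to break there.

```ruby
# Minimal sketch of SRX-style segmentation (hypothetical, not srx-english).
# A rule says whether to break at a position where the text before the
# position matches `before` and the text after it matches `after`.
Rule = Struct.new(:breaks, :before, :after)

RULES = [
  # Exception rule: do not break after a few common abbreviations.
  Rule.new(false, /\b(?:Mr|Mrs|Dr|etc)\.\z/, /\A\s/),
  # Break rule: break after '.', '?' or '!' when whitespace follows.
  Rule.new(true, /[.?!]\z/, /\A\s/)
].freeze

def srx_like_split(text, rules = RULES)
  segments = []
  start = 0
  (1...text.length).each do |pos|
    before = text[0...pos]
    after = text[pos..-1]
    # Rules are tried in order; the first one that matches decides.
    rule = rules.find { |r| before =~ r.before && after =~ r.after }
    next unless rule && rule.breaks
    segments << text[start...pos].strip
    start = pos
  end
  segments << text[start..-1].strip
  segments.reject(&:empty?)
end

p srx_like_split("Dr. Smith arrived. He was late! Was he?")
```

Because the abbreviation rule comes first, the position after "Dr." is claimed by a no-break rule before the generic break rule is ever consulted. Real SRX rule sets are far larger and language-specific.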
Constant Summary
- @@segmenters = {}
Class Method Summary
- .segment(entity, options = {}) ⇒ Object
Segments a text into sentences using the SRX rules, requiring the language-specific library (e.g. srx-english) on first use.
Class Method Details
.segment(entity, options = {}) ⇒ Object
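Segments a text into sentences using the SRX rules. On first use for a given language, requires the matching SRX sentence splitter (for example, the srx-english library) and caches it in @@segmenters. Returns the entity, with each sentence appended as a Treat::Entities::Phrase.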
# File 'lib/treat/workers/processors/segmenters/srx.rb', line 15
def self.segment(entity, options = {})
  lang = entity.language
  entity.check_hasnt_children
  text = entity.to_s
  text.escape_floats!
  unless @@segmenters[lang]
    # Require the appropriate gem.
    require "srx/#{lang}/sentence_splitter"
    @@segmenters[lang] = SRX.const_get(
      lang.capitalize).const_get(
      'SentenceSplitter')
  end
  sentences = @@segmenters[lang].new(text)
  sentences.each do |sentence|
    sentence.unescape_floats!
    entity << Treat::Entities::Phrase.
      from_string(sentence.strip)
  end
  entity
end
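The method above combines two patterns: a per-language splitter class is looked up by name once and cached, and floats are escaped before splitting and restored afterwards. The following self-contained sketch shows both under stated assumptions. `DemoSRX`, `SPLITTERS`, `splitter_for`, `escape_floats`, `unescape_floats` and `demo_segment` are all hypothetical stand-ins, the `^^` placeholder is an assumed escaping scheme, and the stub splitter uses a naive regular expression rather than real SRX rules.

```ruby
# Hypothetical stand-in for a language-specific splitter; in the method
# above the real class is loaded with `require` before the lookup.
module DemoSRX
  module English
    class SentenceSplitter
      include Enumerable

      def initialize(text)
        @text = text
      end

      # Naive split after '.', '?' or '!' followed by whitespace.
      def each(&block)
        @text.split(/(?<=[.?!])\s+/).each(&block)
      end
    end
  end
end

SPLITTERS = {} # cache: language name => splitter class

# Resolve the splitter class by name on first use, then reuse it.
def splitter_for(lang)
  SPLITTERS[lang] ||=
    DemoSRX.const_get(lang.to_s.capitalize).const_get('SentenceSplitter')
end

# Assumed scheme: hide the decimal point of floats behind a placeholder.
def escape_floats(text)
  text.gsub(/(\d+)\.(\d+)/, '\1^^\2')
end

def unescape_floats(text)
  text.gsub('^^', '.')
end

def demo_segment(text, lang)
  splitter_for(lang).new(escape_floats(text)).map do |sentence|
    unescape_floats(sentence.strip)
  end
end

p demo_segment("Pi is about 3.14. It is irrational.", :english)
```

Caching the resolved class means the name lookup happens once per language rather than on every call, which matches the role `@@segmenters` plays in `segment`.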