srx-polish
DESCRIPTION
‘srx-polish’ is a Ruby library containing Polish sentence and word segmentation rules. The sentence segmentation rules are based on rules defined by Marcin Miłkowski: morfologik.blogspot.com/2009/11/talking-about-srx-in-lt-during-ltc.html
FEATURES/PROBLEMS
- This library is generated by ‘srx2ruby’, which has some limitations and might not be fully compliant with the SRX standard.
INSTALL
Standard RubyGems installation:
$ gem install srx-polish
BASIC USAGE
The library defines the SRX::Polish::SentenceSplitter class, which allows iterating over the matched sentences:
require 'srx/polish/sentence_splitter'
text = <<-END
Kiedy spotkałem p. Wojtka miał na sobie krótkie spodnie. Na s. 10 książki
sprawa jest szczegółowo opisana.
END
splitter = SRX::Polish::SentenceSplitter.new(text)
splitter.each do |sentence|
  puts sentence.gsub(/\n|\r/, "")
end
# Kiedy spotkałem p. Wojtka miał na sobie krótkie spodnie.
# Na s. 10 książki sprawa jest szczegółowo opisana.
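When the sentences are needed as an array rather than printed, they can be collected inside the block. The following is a minimal sketch under stated assumptions: `collect_sentences` is a hypothetical helper, not part of the library, and it relies only on the `each` method shown above. So that it runs without the gem installed, the demonstration uses a plain array as a stand-in for the splitter.

```ruby
# Hypothetical helper (not part of srx-polish): collects sentences from any
# object whose #each yields strings, e.g. SRX::Polish::SentenceSplitter.
def collect_sentences(splitter)
  sentences = []
  splitter.each do |sentence|
    # Collapse line breaks and the whitespace around them into single spaces.
    cleaned = sentence.gsub(/\s*[\r\n]+\s*/, " ").strip
    sentences << cleaned unless cleaned.empty?
  end
  sentences
end

# Stand-in for a splitter: a plain array of already segmented sentences.
stand_in = [
  "Kiedy spotkałem p. Wojtka miał na sobie krótkie spodnie. ",
  "Na s. 10 książki\nsprawa jest szczegółowo opisana.\n"
]
collect_sentences(stand_in).each { |s| puts s }
```

With a real splitter, the call would presumably be `collect_sentences(SRX::Polish::SentenceSplitter.new(text))`.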
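The SRX::Polish::WordSplitter class allows iterating over the words of a sentence together with their types:
require 'srx/polish/word_splitter'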
sentence = "Ala ma kota za 5zł i 10$."
splitter = SRX::Polish::WordSplitter.new(sentence)
splitter.each do |word,type|
  puts "'#{word}' #{type}"
end
# 'Ala' word
# ' ' other
# 'ma' word
# ' ' other
# 'kota' word
# ' ' other
# 'za' word
# ' ' other
# '5' number
# 'zł' word
# ' ' other
# 'i' word
# ' ' other
# '10' number
# '$' graph
# '.' punct
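A common follow-up is to keep only some token types, for example dropping whitespace and punctuation. The sketch below assumes only what the output above shows: each iteration yields a token and a type whose printed form is one of word, number, graph, punct or other. Whether the type is a Symbol or a String is not stated here, so the comparison goes through `to_s`. `select_tokens` is a hypothetical helper, and the array of pairs stands in for the splitter so that the example runs without the gem.

```ruby
# Hypothetical helper (not part of srx-polish): keeps tokens of the given types.
# Works with any object whose #each yields [token, type] pairs.
def select_tokens(splitter, types = %w[word number])
  selected = []
  splitter.each do |token, type|
    # to_s makes the check work for both Symbol and String types (assumption).
    selected << token if types.include?(type.to_s)
  end
  selected
end

# Stand-in for SRX::Polish::WordSplitter, mirroring the start of the output above.
stand_in = [
  ["Ala", :word], [" ", :other], ["ma", :word], [" ", :other],
  ["kota", :word], [" ", :other], ["za", :word], [" ", :other],
  ["5", :number], ["zł", :word]
]
p select_tokens(stand_in)
```

Passing a different list as the second argument, such as `%w[punct graph]`, selects other token types in the same way.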
LICENSE
Copyright © 2011 Aleksander Pohl, Marcin Miłkowski, Jarosław Lipski
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.