Class: StanfordParser::StandoffDocumentPreprocessor
- Inherits: DocumentPreprocessor
  - Object
  - Rjb::JavaObjectWrapper
  - DocumentPreprocessor
  - StanfordParser::StandoffDocumentPreprocessor
- Defined in: lib/stanfordparser.rb
Overview
A preprocessor that segments text into sentences and tokens. The tokens carry character offset and token context information that can be used for standoff annotation.
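To make "character offset information" concrete, here is a minimal pure-Ruby sketch of tokens that remember where they came from in the source text. It is only an illustration: `Token` and `tokenize_with_offsets` are hypothetical names, and the whitespace tokenizer is a stand-in, not the Penn Treebank tokenizer this class actually uses.

```ruby
# Hypothetical illustration only: a whitespace tokenizer that records
# character offsets. It is not the Stanford tokenizer.
Token = Struct.new(:text, :begin_offset, :end_offset)

def tokenize_with_offsets(text)
  tokens = []
  # Inside the block, Regexp.last_match holds the MatchData for the
  # current match, which gives us its character positions.
  text.scan(/\S+/) do |word|
    match = Regexp.last_match
    tokens << Token.new(word, match.begin(0), match.end(0))
  end
  tokens
end

text = "Standoff annotation keeps the source intact."
tokenize_with_offsets(text).each do |t|
  # The half-open range [begin_offset, end_offset) indexes back into text.
  puts "#{t.text} [#{t.begin_offset}, #{t.end_offset})"
end
```

Because each token stores offsets rather than a modified copy of the text, annotations can point into the original string without changing it, which is the idea behind standoff annotation.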
Instance Attribute Summary
Attributes inherited from Rjb::JavaObjectWrapper
Instance Method Summary
- #getSentencesFromString(s) ⇒ Object
  Returns a list of sentences in a string.
- #initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER) ⇒ StandoffDocumentPreprocessor (constructor)
  A new instance of StandoffDocumentPreprocessor.
Methods inherited from DocumentPreprocessor
Methods inherited from Rjb::JavaObjectWrapper
#each, #inspect, #method_missing, #to_s
Constructor Details
#initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER) ⇒ StandoffDocumentPreprocessor
Returns a new instance of StandoffDocumentPreprocessor.
# File 'lib/stanfordparser.rb', line 256

    def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
      # PTBTokenizer.factory is a static function, so use RJB to call it
      # directly instead of going through a JavaObjectWrapper. We do it this
      # way because the Stanford parser Java code does not provide a
      # constructor that allows you to set the second parameter,
      # invertible, to true, and we need this to write character offset
      # information into the tokens.
      ptb_tokenizer_class = Rjb::import(tokenizer)
      # See the documentation for
      # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
      # description of these parameters.
      ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
      super(ptb_tokenizer_factory)
    end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method in the class Rjb::JavaObjectWrapper.
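For readers unfamiliar with this style of delegation, the following pure-Ruby sketch shows the general method_missing forwarding pattern. `ForwardingWrapper` is a hypothetical stand-in written for this illustration; it is not the source of Rjb::JavaObjectWrapper, which forwards to a Java object through Rjb and may differ in detail.

```ruby
# Hypothetical stand-in for a wrapper such as Rjb::JavaObjectWrapper.
# This sketch wraps a plain Ruby object instead of a Java object.
class ForwardingWrapper
  def initialize(target)
    @target = target
  end

  # Forward any method this class does not define to the wrapped object.
  def method_missing(name, *args, &block)
    if @target.respond_to?(name)
      @target.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with what method_missing will accept.
  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name, include_private) || super
  end
end

wrapped = ForwardingWrapper.new([3, 1, 2])
puts wrapped.sort.inspect # sort is not defined on ForwardingWrapper itself
puts wrapped.size
```

Calls that neither the wrapper nor the wrapped object understands still raise NoMethodError, because the fallback branch defers to the default method_missing.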
Instance Method Details
#getSentencesFromString(s) ⇒ Object
Returns a list of the sentences in a string. Each returned sentence is wrapped in a StandoffSentence object.
# File 'lib/stanfordparser.rb', line 273

    def getSentencesFromString(s)
      super(s).map!{|s| StandoffSentence.new(s)}
    end
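The override above follows a common pattern: call the parent implementation, then replace each result in place with a richer wrapper. The sketch below reproduces that pattern in plain Ruby so it can be run without Java. All class and method names in it (`PlainPreprocessor`, `WrappingPreprocessor`, `WrappedSentence`, `sentences_from_string`) are hypothetical, and the regex splitter is a naive placeholder for the real sentence segmenter.

```ruby
# Hypothetical stand-ins; the real classes wrap Java objects via Rjb.
class PlainPreprocessor
  # Naive splitter used only for this sketch: break after ., ! or ?
  def sentences_from_string(text)
    text.split(/(?<=[.!?])\s+/)
  end
end

# Minimal wrapper that keeps the raw sentence and prints like it.
WrappedSentence = Struct.new(:raw) do
  def to_s
    raw.to_s
  end
end

class WrappingPreprocessor < PlainPreprocessor
  # Call the parent implementation, then replace each element in place
  # with a wrapper object, mirroring the super(...).map! pattern.
  def sentences_from_string(text)
    super(text).map! { |sentence| WrappedSentence.new(sentence) }
  end
end

result = WrappingPreprocessor.new.sentences_from_string("One. Two! Three?")
result.each { |s| puts "#{s.class}: #{s}" }
```

Using map! mutates the array returned by the parent rather than allocating a second one; callers only ever see the wrapped elements.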