Class: StanfordParser::StandoffDocumentPreprocessor

Inherits:
DocumentPreprocessor
Defined in:
lib/stanfordparser.rb

Overview

A preprocessor that segments text into sentences and tokens. Each token carries character offset and token context information that can be used for standoff annotation.
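
To make the idea of offset-carrying tokens concrete, here is a minimal plain-Ruby sketch. It does not use the Stanford parser or RJB; `OffsetToken` and `tokens_with_offsets` are hypothetical names invented for this illustration, and the whitespace split is a stand-in for the real tokenizer. It only shows how begin and end offsets let an annotation point back into the untouched source text.

```ruby
# Hypothetical illustration only: a plain-Ruby token that records where it
# came from in the source string. Not part of StanfordParser.
OffsetToken = Struct.new(:text, :begin_offset, :end_offset)

# Split on runs of non-whitespace and record the character offsets of each
# run. The end offset is exclusive.
def tokens_with_offsets(source)
  tokens = []
  source.scan(/\S+/) do |match|
    start = Regexp.last_match.begin(0)
    tokens << OffsetToken.new(match, start, start + match.length)
  end
  tokens
end

source = "Standoff annotation keeps text intact."
tokens = tokens_with_offsets(source)

# Each token can be recovered from the original text by slicing with its
# offsets, which is what makes standoff annotation possible.
second = tokens[1]
recovered = source[second.begin_offset...second.end_offset]
```

Because the offsets refer to the original string, annotations can be stored separately from the text and still be aligned with it later.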

Instance Attribute Summary

Attributes inherited from Rjb::JavaObjectWrapper

#java_object

Instance Method Summary

Methods inherited from DocumentPreprocessor

#inspect, #to_s

Methods inherited from Rjb::JavaObjectWrapper

#each, #inspect, #method_missing, #to_s

Constructor Details

#initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER) ⇒ StandoffDocumentPreprocessor

Returns a new instance of StandoffDocumentPreprocessor.

# File 'lib/stanfordparser.rb', line 256

def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
  # PTBTokenizer.factory is a static method, so use RJB to call it
  # directly instead of going through a JavaObjectWrapper.  We do it this
  # way because the Stanford parser Java code does not provide a
  # constructor that lets you set the second parameter, invertible,
  # to true, and we need this to write character offset information
  # into the tokens.
  ptb_tokenizer_class = Rjb::import(tokenizer)
  # See the documentation for
  # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
  # description of these parameters.
  ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
  super(ptb_tokenizer_factory)
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method in the class Rjb::JavaObjectWrapper.
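
As a rough picture of what dynamic method handling through method_missing looks like, the following is a minimal sketch in plain Ruby. `ForwardingWrapper` is a hypothetical class written for this example; it is not the implementation of Rjb::JavaObjectWrapper, which may forward calls and convert return values differently.

```ruby
# Hypothetical stand-in for a wrapper class; not Rjb::JavaObjectWrapper.
class ForwardingWrapper
  attr_reader :wrapped

  def initialize(wrapped)
    @wrapped = wrapped
  end

  # Forward any method this class does not define to the wrapped object.
  # Unknown methods still raise NoMethodError through super.
  def method_missing(name, *args, &block)
    if @wrapped.respond_to?(name)
      @wrapped.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with what method_missing will accept.
  def respond_to_missing?(name, include_private = false)
    @wrapped.respond_to?(name, include_private) || super
  end
end

wrapper = ForwardingWrapper.new("parser")
shouted = wrapper.upcase # forwarded to the wrapped String
```

Defining respond_to_missing? alongside method_missing is the usual Ruby convention, so that callers probing with respond_to? see the forwarded methods.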

Instance Method Details

#getSentencesFromString(s) ⇒ Object

Returns the list of sentences in the string s, with each sentence wrapped in a StandoffSentence object.

# File 'lib/stanfordparser.rb', line 273

def getSentencesFromString(s)
  # Use a distinct block parameter so it does not shadow the argument s.
  super(s).map! { |sentence| StandoffSentence.new(sentence) }
end
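
The override above follows a common pattern: call the parent implementation, then wrap each result in a richer object. The sketch below reproduces that pattern in plain Ruby so it can be run without the parser. `BaseSplitter`, `WrappingSplitter`, and `WrappedSentence` are hypothetical names used only here, and the punctuation-based split is a naive placeholder for real sentence segmentation.

```ruby
# Hypothetical stand-ins; none of these classes are part of StanfordParser.
class WrappedSentence
  attr_reader :raw

  def initialize(raw)
    @raw = raw
  end
end

class BaseSplitter
  # Naive split on sentence-final punctuation, for illustration only.
  def sentences_from_string(text)
    text.scan(/[^.!?]+[.!?]/).map(&:strip)
  end
end

class WrappingSplitter < BaseSplitter
  # Delegate the splitting to the parent, then wrap every result.
  def sentences_from_string(text)
    super(text).map { |sentence| WrappedSentence.new(sentence) }
  end
end

sentences = WrappingSplitter.new.sentences_from_string("It parses. It wraps!")
```

The subclass adds behavior without duplicating the parent's segmentation logic, which is the same design choice the override makes with StandoffSentence.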