Class: StanfordParser::StandoffNode

Inherits:
Treebank::ParentedNode
  • Object
Defined in:
lib/stanfordparser.rb

Overview

Standoff syntactic tree annotation of text. Terminal nodes are labeled with the corresponding StandoffToken objects. A standoff parse can reproduce, verbatim, the original string from which it was generated, optionally with brackets around the yields of specified non-terminal nodes.
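The round-trip property can be illustrated without the parser. The following is a minimal sketch, assuming a hypothetical `Token` struct that stands in for StandoffToken and models only the two readers used by the source on this page: `current` (the token text) and `after` (the text that follows it). The `tokenize` and `original_string` helpers are also hypothetical and are not part of the library.

```ruby
# Hypothetical stand-in for StandoffToken; only `current` and `after` are modeled.
Token = Struct.new(:current, :after)

# Split text on whitespace, remembering the whitespace that follows each
# token so the text can be rebuilt verbatim. (Leading whitespace is not
# preserved by this simplified sketch.)
def tokenize(text)
  text.scan(/(\S+)(\s*)/).map { |word, space| Token.new(word, space) }
end

# Concatenate each token's text with the spacing that follows it.
def original_string(tokens)
  tokens.map { |t| t.current + t.after }.join
end

text = "The  dog barked.\n"
puts original_string(tokenize(text)) == text   # prints "true"
```

Because each token records its own trailing text, irregular spacing such as the double space and final newline survives the round trip.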

Constructor Details

#initialize(stanford_parser_node, tokens) ⇒ StandoffNode

Create the standoff tree from a tree returned by the Stanford parser. For non-terminal nodes, the tokens argument is a StandoffSentence containing the StandoffToken objects for every token dominated by this node and every token that follows it in the sentence. For terminal nodes, the tokens argument is a single StandoffToken.



# File 'lib/stanfordparser.rb', line 357

def initialize(stanford_parser_node, tokens)
  # Annotate this node with a non-terminal label or a StandoffToken as
  # appropriate.
  super(tokens.instance_of?(StandoffSentence) ?
        stanford_parser_node.value : tokens)
  # Enumerate the children depth-first.  Tokens are removed from the list
  # left-to-right as terminal nodes are added to the tree.
  stanford_parser_node.children.each do |child|
    subtree = self.class.new(child, child.leaf? ? tokens.shift : tokens)
    attach_child!(subtree)
  end
end
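The token-consumption pattern in the constructor above can be tried in isolation. This is a sketch under stated assumptions, not the library's implementation: `ParseNode` is a hypothetical stand-in for the parser's tree node (exposing `value`, `children`, and `leaf?`), `SketchNode` is a hypothetical stand-in for StandoffNode, and a plain Array of strings plays the role of the StandoffSentence.

```ruby
# Hypothetical parser node: a label plus a list of children.
ParseNode = Struct.new(:value, :children) do
  def leaf?
    children.empty?
  end
end

# Hypothetical standoff node that mirrors the constructor's logic.
class SketchNode
  attr_reader :label, :children

  def initialize(parse_node, tokens)
    # Non-terminals receive the shared token list and keep the parser's
    # label; terminals receive the single token shifted off for them.
    @label = tokens.is_a?(Array) ? parse_node.value : tokens
    # Visiting children left to right consumes tokens in sentence order.
    @children = parse_node.children.map do |child|
      SketchNode.new(child, child.leaf? ? tokens.shift : tokens)
    end
  end

  def leaves
    children.empty? ? [self] : children.flat_map(&:leaves)
  end
end

leaf = ->(word) { ParseNode.new(word, []) }
tree = ParseNode.new("S", [
  ParseNode.new("NP", [ParseNode.new("DT", [leaf.("The")]),
                       ParseNode.new("NN", [leaf.("dog")])]),
  ParseNode.new("VP", [ParseNode.new("VBD", [leaf.("barked")])])
])

tokens = ["The ", "dog ", "barked"]
root = SketchNode.new(tree, tokens)
puts root.leaves.map(&:label).join   # prints "The dog barked"
puts tokens.empty?                   # prints "true"
```

Every node shares the same list, so each terminal removes exactly one token and the list is empty once the whole tree has been built.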

Instance Method Details

#to_bracketed_string(coords, open = "[", close = "]") ⇒ Object

Print the original string with brackets around word spans dominated by the specified constituents.

The constituents to bracket are specified by passing a list of node coordinates, which are arrays of integers of the form returned by the tree enumerators of Treebank::Node objects.

coords: the coordinates of the nodes around which to place brackets

open: the open bracket symbol

close: the close bracket symbol



# File 'lib/stanfordparser.rb', line 387

def to_bracketed_string(coords, open = "[", close = "]")
  # Get a list of all the leaf nodes and their coordinates.
  items = depth_first_enumerator(true).find_all {|n| n.first.leaf?}
  # Enumerate over all the matching constituents inserting open and close
  # brackets around their yields in the items list.
  coords.each do |matching|
    # Insert using a simple state machine with three states: :start,
    # :open, and :close.
    state = :start
    # Enumerate over the items list looking for nodes that are the
    # children of the matching constituent.
    items.each_with_index do |item, index|
      # Skip inserted bracket characters.
      next if item.is_a? String
      # Handle terminal node items with the state machine.
      node, terminal_coordinate = item
      if state == :start
        next if not in_yield?(matching, terminal_coordinate)
        items.insert(index, open)
        state = :open
      else # state == :open
        next if in_yield?(matching, terminal_coordinate)
        items.insert(index, close)
        state = :close
        break
      end
    end # items.each_with_index
    # Handle the case where a matching constituent is flush with the end
    # of the sentence.
    items << close if state == :open
  end # each
  # Replace terminal nodes with their string representations.  Insert
  # spacing characters in the list.
  items.each_with_index do |item, index|
    next if item.is_a? String
    text = item.first.label.current
    spacing = item.first.label.after
    # Replace the terminal node with its text.
    items[index] = text
    # Insert the spacing that comes after this text before the first
    # non-close bracket character.
    close_pos = find_index(items[index+1..-1]) {|item| not item == close}
    items.insert(index + close_pos + 1, spacing)
  end
  items.join
end
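The effect of the method above can be sketched on plain data. The `in_yield?` helper and the exact coordinate format are not shown on this page, so the sketch below makes two labeled assumptions: a coordinate is the list of child indices on the path from the root, and a leaf lies in a node's yield when the node's coordinate is a prefix of the leaf's coordinate. The names `in_yield_sketch?` and `bracket_sketch` are hypothetical, and the algorithm is a simplified single pass rather than the state machine used in the source.

```ruby
# Assumption: a leaf is in a node's yield when the node's coordinate is a
# prefix of the leaf's coordinate.
def in_yield_sketch?(node_coord, leaf_coord)
  leaf_coord.first(node_coord.length) == node_coord
end

# leaves: [coordinate, text, spacing_after] triples in sentence order.
# coords: coordinates of the constituents to bracket.
def bracket_sketch(leaves, coords, open = "[", close = "]")
  out = String.new
  leaves.each_with_index do |(coord, text, after), i|
    prev_coord = i.zero? ? nil : leaves[i - 1][0]
    next_coord = leaves[i + 1] && leaves[i + 1][0]
    coords.each do |c|
      # Open a bracket where a constituent's yield begins.
      starts = in_yield_sketch?(c, coord) &&
               !(prev_coord && in_yield_sketch?(c, prev_coord))
      out << open if starts
    end
    out << text
    coords.each do |c|
      # Close a bracket where the yield ends, before the trailing spacing.
      ends = in_yield_sketch?(c, coord) &&
             !(next_coord && in_yield_sketch?(c, next_coord))
      out << close if ends
    end
    out << after
  end
  out
end

# Leaves of (S (NP (DT The) (NN dog)) (VP (VBD barked))).
leaves = [
  [[0, 0, 0], "The", " "],
  [[0, 1, 0], "dog", " "],
  [[1, 0, 0], "barked", ""]
]
puts bracket_sketch(leaves, [[0]])   # prints "[The dog] barked"
```

As in the source, the close bracket is emitted before the spacing that follows the last word of the constituent, so the brackets hug the text they enclose.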

#to_original_string ⇒ Object

Return the original text string dominated by this node.



# File 'lib/stanfordparser.rb', line 371

def to_original_string
  leaves.inject("") do |s, leaf|
    s += leaf.label.current + leaf.label.after
  end
end