Class: Lingua::EN::Sentence

Inherits: Object

Defined in: lib/lingua/en/sentence.rb

Overview

The class Lingua::EN::Sentence takes English text and attempts to split it into sentences while respecting abbreviations.

Constant Summary

EOS =

Temporary end-of-sentence marker.

"\001"

Titles =

[ 'jr', 'mr', 'mrs', 'ms', 'dr', 'prof', 'sr', 'sen', 'rep',
'rev', 'gov', 'atty', 'supt', 'det', 'rev', 'col', 'gen', 'lt',
'cmdr', 'adm', 'capt', 'sgt', 'cpl', 'maj' ]

Entities =

[ 'dept', 'univ', 'uni', 'assn', 'bros', 'inc', 'ltd', 'co',
'corp', 'plc' ]

Months =

[ 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul',
'aug', 'sep', 'oct', 'nov', 'dec', 'sept' ]

Days =

[ 'mon', 'tue', 'wed', 'thu',
'fri', 'sat', 'sun' ]

Misc =

[ 'vs', 'etc', 'no', 'esp', 'cf' ]

Streets =

[ 'ave', 'bld', 'blvd', 'cl', 'ct',
'cres', 'dr', 'rd', 'st' ]

ABBR_DETECT =

Finds abbreviations such as e.g., i.e., U.S., u.S. and U.S.S.R.

/(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
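To make the pattern concrete, the following sketch applies a local copy of it to a few sample strings. The constant name ABBR_DETECT_DEMO and the sample strings are illustrative assumptions, not part of the library. Note that the pattern requires a whitespace character before the abbreviation, so an abbreviation at the very start of a string is not detected.

```ruby
# Local copy of the pattern listed above; the constant name is hypothetical.
ABBR_DETECT_DEMO = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/

samples = [
  "such as e.g. apples",  # dotted abbreviation after a space
  "the U.S.S.R. fell",    # longer run of letter-dot pairs
  "e.g. at the start",    # no leading whitespace, so no match
  "no dots here"          # nothing that looks like an abbreviation
]

# String#[] with a Regexp returns the matched text, or nil if nothing matches.
matches = samples.map { |sample| sample[ABBR_DETECT_DEMO] }

p matches
# => [" e.g.", " U.S.S.R.", nil, nil]
```

The matched text includes the leading whitespace because the whitespace is part of the pattern itself.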

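PUNCTUATION_DETECT =

Finds punctuation that ends a sentence or paragraph: a full stop, question mark, exclamation mark or run of line breaks, optionally followed by a closing quote or bracket, together with the whitespace after it.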

/((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/
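As an illustration of how this pattern is used for marking, the sketch below inserts the marker character between the captured punctuation and the captured whitespace. The names EOS_DEMO and PUNCT_DEMO and the sample sentence are assumptions made for this example only.

```ruby
# Local copies of the listed constants; the _DEMO names are hypothetical.
EOS_DEMO   = "\001"
PUNCT_DEMO = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/

sample = "Hi there! How are you? Fine."

# $1 is the punctuation (plus an optional closing quote or bracket),
# $2 is the whitespace that follows it.
marked = sample.gsub(PUNCT_DEMO) { $1 + EOS_DEMO + $2 }

p marked
# => "Hi there!\u0001 How are you?\u0001 Fine."

p marked.split(EOS_DEMO)
# => ["Hi there!", " How are you?", " Fine."]
```

The final full stop receives no marker because no whitespace follows it; the last sentence is still recovered by the split, since it is simply the remainder of the string.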

CORRECT_ABBR =

/(#{ABBR_DETECT})#{EOS}(\s+[a-z0-9])/

Class Attribute Summary

.abbr_regex ⇒ Object (readonly)

Returns the value of attribute abbr_regex.

.abbreviations ⇒ Object (readonly)

Returns the value of attribute abbreviations.

Class Method Summary

.abbreviation(*abbreviations) ⇒ Object

Adds a list of abbreviations to the list that’s used to detect false sentence ends.

.initialize_abbreviations! ⇒ Object

.sentences(text) ⇒ Object

Split the passed text into individual sentences, trim these and return as an array.

.set_abbr_regex! ⇒ Object

Class Attribute Details

.abbr_regex ⇒ `Object` (readonly)

Returns the value of attribute abbr_regex.




# File 'lib/lingua/en/sentence.rb', line 9

def abbr_regex
  @abbr_regex
end

.abbreviations ⇒ `Object` (readonly)

Returns the value of attribute abbreviations.




# File 'lib/lingua/en/sentence.rb', line 8

def abbreviations
  @abbreviations
end

Class Method Details

.abbreviation(*abbreviations) ⇒ `Object`

Adds a list of abbreviations to the list that’s used to detect false sentence ends. Returns the current list of abbreviations in use.

# File 'lib/lingua/en/sentence.rb', line 68

def self.abbreviation(*abbreviations)
  @abbreviations += abbreviations
  @abbreviations.uniq!
  set_abbr_regex!
  @abbreviations
end
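The behaviour of this method can be tried out without the library by using a small stand-in. The module AbbreviationListDemo below is hypothetical: it copies the listed method bodies and starts from a shortened abbreviation list, so it is a sketch of the mechanism rather than the library class itself.

```ruby
# Hypothetical stand-in that mirrors the listed method bodies.
module AbbreviationListDemo
  EOS = "\001"
  @abbreviations = %w[mr dr etc] # shortened starting list (assumption)

  class << self
    attr_reader :abbreviations, :abbr_regex

    # Append new entries, drop exact duplicates, rebuild the pattern.
    def abbreviation(*abbreviations)
      @abbreviations += abbreviations
      @abbreviations.uniq!
      set_abbr_regex!
      @abbreviations
    end

    def set_abbr_regex!
      @abbr_regex = / (#{abbreviations.join("|")})\.#{EOS}/i
    end
  end
end

result = AbbreviationListDemo.abbreviation("approx", "dr", "fig")

p result
# => ["mr", "dr", "etc", "approx", "fig"]

# The rebuilt pattern now recognises the new entries, ignoring case.
p AbbreviationListDemo.abbr_regex.match?(" Fig.\001")
# => true
```

The second "dr" is discarded by uniq!, which compares strings exactly, while the generated pattern itself is case-insensitive.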

.initialize_abbreviations! ⇒ `Object`

# File 'lib/lingua/en/sentence.rb', line 75

def self.initialize_abbreviations!
  @abbreviations = Titles + Entities + Months + Days + Streets + Misc
  set_abbr_regex!
end

.sentences(text) ⇒ `Object`

Splits the passed text into individual sentences, trims them and returns them as an array. A sentence is marked by one of the punctuation marks “.”, “?” or “!” followed by whitespace. Sequences of full stops (such as an ellipsis marker “...”) and stops after a known abbreviation are ignored.

# File 'lib/lingua/en/sentence.rb', line 40

def self.sentences(text)
  # Make sure we work with a duplicate, as we are modifying the
  # text with #gsub!
  text = text.dup

  # Mark end of sentences with EOS marker.
  # We preserve the trailing whitespace ($2) so that we can
  # fix ellipses (...)!
  text.gsub!(PUNCTUATION_DETECT) { $1 << EOS << $2 }

  # Correct ellipsis marks.
  text.gsub!(/(\.\.\.*)#{EOS}/) { $1 }

  # Correct e.g., i.e. marks.
  text.gsub!(CORRECT_ABBR, "\\1\\2")

  # Correct abbreviations
  text.gsub!(@abbr_regex) { $1 << '.' }

  # Split on EOS marker, get rid of trailing whitespace.
  # Remove empty sentences.
  text.split(EOS).
    map { |sentence| sentence.strip }.
    delete_if { |sentence| sentence.nil? || sentence.empty? }
end
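The marking and correction steps can be followed on a short input with the self-contained sketch below. It uses local copies of the listed patterns under hypothetical SKETCH_ names, leaves out the abbreviation-list step (which depends on @abbr_regex), and uses non-mutating gsub for readability, so it is an approximation of the method rather than a replacement for it.

```ruby
# Local copies of the listed constants; the SKETCH_ names are hypothetical.
SKETCH_EOS   = "\001"
SKETCH_PUNCT = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/
SKETCH_ABBR  = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
SKETCH_FIX   = /(#{SKETCH_ABBR})#{SKETCH_EOS}(\s+[a-z0-9])/

text = "Wait... what? Bring fruit, e.g. apples. Done!"

# 1. Put a marker after each candidate sentence end, keeping the whitespace.
step1 = text.gsub(SKETCH_PUNCT) { $1 + SKETCH_EOS + $2 }
# => "Wait...\u0001 what?\u0001 Bring fruit, e.g.\u0001 apples.\u0001 Done!"

# 2. Remove the marker after runs of two or more full stops.
step2 = step1.gsub(/(\.\.\.*)#{SKETCH_EOS}/) { $1 }

# 3. Remove the marker after a dotted abbreviation that is followed by
#    a lowercase letter or a digit.
step3 = step2.gsub(SKETCH_FIX, "\\1\\2")

# 4. Split on the remaining markers, trim, and drop empty pieces.
sentences = step3.split(SKETCH_EOS).map(&:strip).reject(&:empty?)

p sentences
# => ["Wait... what?", "Bring fruit, e.g. apples.", "Done!"]
```

Because step 3 only fires when a lowercase letter or digit follows, a dotted abbreviation followed by a capitalised word keeps its marker and is treated as a sentence end.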

.set_abbr_regex! ⇒ `Object`




# File 'lib/lingua/en/sentence.rb', line 80

def self.set_abbr_regex!
  @abbr_regex = / (#{abbreviations.join("|")})\.#{EOS}/i
end
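The sketch below builds the same kind of pattern from a two-entry list and applies it to text that already carries markers. The names EOS_MARK, abbreviations and marked are assumptions for illustration. One detail worth noticing is that the pattern begins with a literal space, so that space is part of the match; with the pattern exactly as listed, a replacement built only from the captured group does not put it back.

```ruby
# Hypothetical names; the pattern is built the same way as in the listing.
EOS_MARK = "\001"
abbreviations = %w[mr dr]
abbr_regex = / (#{abbreviations.join("|")})\.#{EOS_MARK}/i

marked = "I met Dr.#{EOS_MARK} Smith.#{EOS_MARK} Bye."

match = abbr_regex.match(marked)
p match[0] # => " Dr.\u0001"  (whole match, including the leading space)
p match[1] # => "Dr"          (captured abbreviation only)

# Replacement as in the listing: captured group plus a full stop.
as_listed = marked.gsub(abbr_regex) { $1 + "." }
p as_listed
# => "I metDr. Smith.\u0001 Bye."

# Variant (not from the library) that re-inserts the consumed space.
keeps_space = marked.gsub(abbr_regex) { " " + $1 + "." }
p keeps_space
# => "I met Dr. Smith.\u0001 Bye."
```

The entries are joined into the pattern without escaping, so an abbreviation containing regular-expression metacharacters would be interpreted as pattern syntax; passing entries through Regexp.escape first would be one way to avoid that.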
