Class: Lingua::EN::Sentence
- Inherits:
- Object (hierarchy: Object, Lingua::EN::Sentence)
- Defined in:
- lib/lingua/en/sentence.rb
Overview
The class Lingua::EN::Sentence takes English text and attempts to split it into sentences while respecting abbreviations.
Constant Summary
- EOS =
temporary end of sentence marker
"\001"
- Titles =
[ 'jr', 'mr', 'mrs', 'ms', 'dr', 'prof', 'sr', 'sen', 'rep', 'rev', 'gov', 'atty', 'supt', 'det', 'rev', 'col','gen', 'lt', 'cmdr', 'adm', 'capt', 'sgt', 'cpl', 'maj' ]
- Entities =
[ 'dept', 'univ', 'uni', 'assn', 'bros', 'inc', 'ltd', 'co', 'corp', 'plc' ]
- Months =
[ 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec', 'sept' ]
- Days =
[ 'mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun' ]
- Misc =
[ 'vs', 'etc', 'no', 'esp', 'cf' ]
- Streets =
[ 'ave', 'bld', 'blvd', 'cl', 'ct', 'cres', 'dr', 'rd', 'st' ]
- ABBR_DETECT =
Finds abbreviations, like e.g., i.e., U.S., u.S., U.S.S.R.
/(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
- PUNCTUATION_DETECT =
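Finds punctuation that ends a sentence (".", "?", "!") or a run of line breaks, optionally followed by a closing quote or bracket, together with the whitespace after it.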
/((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/
- CORRECT_ABBR =
/(#{ABBR_DETECT})#{EOS}(\s+[a-z0-9])/
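To make the two detection patterns easier to read, here is a minimal sketch that applies them to sample strings. The lowercase names are local stand-ins for the documented constants (copied verbatim so the snippet runs without the gem), and the sample sentences are made up for illustration.

```ruby
# Local copies of the documented constants, so the snippet runs without the gem.
eos = "\001"
abbr_detect = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
punctuation_detect = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/

# ABBR_DETECT requires leading whitespace, so each match includes it.
abbr_hits = "Use e.g. the U.S.S.R. case".scan(abbr_detect)
# => [" e.g.", " U.S.S.R."]

# PUNCTUATION_DETECT: group 1 is the terminator plus an optional closing
# quote or bracket, group 2 is the whitespace that follows.
marked = 'He said "Stop!" Then he left. Done'.gsub(punctuation_detect) { $1 + eos + $2 }
pieces = marked.split(eos).map(&:strip)
# => ["He said \"Stop!\"", "Then he left.", "Done"]
```

Because the pattern needs whitespace after the terminator, punctuation at the very end of the string gets no marker; the final piece is simply whatever remains after the last split.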
Class Attribute Summary
-
.abbr_regex ⇒ Object
readonly
Returns the value of attribute abbr_regex.
-
.abbreviations ⇒ Object
readonly
Returns the value of attribute abbreviations.
Class Method Summary
-
.abbreviation(*abbreviations) ⇒ Object
Adds a list of abbreviations to the list that’s used to detect false sentence ends.
- .initialize_abbreviations! ⇒ Object
-
.sentences(text) ⇒ Object
Splits the passed text into individual sentences, trims them, and returns them as an array.
- .set_abbr_regex! ⇒ Object
Class Attribute Details
.abbr_regex ⇒ Object (readonly)
Returns the value of attribute abbr_regex.
    # File 'lib/lingua/en/sentence.rb', line 9
    def abbr_regex
      @abbr_regex
    end
.abbreviations ⇒ Object (readonly)
Returns the value of attribute abbreviations.
    # File 'lib/lingua/en/sentence.rb', line 8
    def abbreviations
      @abbreviations
    end
Class Method Details
.abbreviation(*abbreviations) ⇒ Object
Adds a list of abbreviations to the list that’s used to detect false sentence ends. Returns the current list of abbreviations in use.
    # File 'lib/lingua/en/sentence.rb', line 68
    def self.abbreviation(*abbreviations)
      @abbreviations += abbreviations
      @abbreviations.uniq!
      set_abbr_regex!
      @abbreviations
    end
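The append-and-deduplicate behaviour of the method body can be sketched in isolation. In the snippet below, `abbreviations` is a plain local and `add_abbreviations` is a hypothetical helper, not part of the library; the regex rebuild done by `set_abbr_regex!` is left out.

```ruby
# Stand-in for the class-level list; a local, not the library's @abbreviations.
abbreviations = %w[mr mrs dr]

# Hypothetical helper mirroring the method body minus the regex rebuild.
add_abbreviations = lambda do |*extra|
  abbreviations += extra   # append the new entries
  abbreviations.uniq!      # drop exact duplicates, keeping first occurrences
  abbreviations            # return the list explicitly (uniq! may return nil)
end

result = add_abbreviations.call('approx', 'dr', 'fig')
# => ["mr", "mrs", "dr", "approx", "fig"]
```

Entries are given without a trailing period, since the pattern built by `set_abbr_regex!` appends the escaped period itself.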
.initialize_abbreviations! ⇒ Object
    # File 'lib/lingua/en/sentence.rb', line 75
    def self.initialize_abbreviations!
      @abbreviations = Titles + Entities + Months + Days + Streets + Misc
      set_abbr_regex!
    end
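Unlike `.abbreviation`, this method concatenates the lists without calling `uniq!`, so repeated entries carry over into the initial list. The following sketch, using local copies of two of the documented lists, shows which entries repeat. It assumes Ruby 2.7 or later for `Enumerable#tally`.

```ruby
# Two of the documented lists, copied verbatim (lowercase names are local stand-ins).
titles  = %w[jr mr mrs ms dr prof sr sen rep rev gov atty supt det rev col gen lt cmdr adm capt sgt cpl maj]
streets = %w[ave bld blvd cl ct cres dr rd st]

combined = titles + streets

# Entries that occur more than once after plain concatenation.
repeated = combined.tally.select { |_, count| count > 1 }.keys
# => ["dr", "rev"]
```

Repeats are harmless inside a regex alternation; they only make the generated pattern slightly longer.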
.sentences(text) ⇒ Object
Splits the passed text into individual sentences, trims them, and returns them as an array. A sentence is marked by one of the punctuation marks “.”, “?” or “!” followed by whitespace. Sequences of full stops (such as an ellipsis marker “…”) and stops after a known abbreviation are ignored.
    # File 'lib/lingua/en/sentence.rb', line 40
    def self.sentences(text)
      # Make sure we work with a duplicate, as we are modifying the
      # text with #gsub!
      text = text.dup

      # Mark end of sentences with EOS marker.
      # We preserve the trailing whitespace ($2) so that we can
      # fix ellipses (...)!
      text.gsub!(PUNCTUATION_DETECT) { $1 << EOS << $2 }

      # Correct ellipsis marks.
      text.gsub!(/(\.\.\.*)#{EOS}/) { $1 }

      # Correct e.g, i.e. marks.
      text.gsub!(CORRECT_ABBR, "\\1\\2")

      # Correct abbreviations
      text.gsub!(@abbr_regex) { $1 << '.' }

      # Split on EOS marker, get rid of trailing whitespace.
      # Remove empty sentences.
      text.split(EOS).
        map { |sentence| sentence.strip }.
        delete_if { |sentence| sentence.nil? || sentence.empty? }
    end
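The mark-then-correct approach is easier to follow with the intermediate strings visible. The sketch below replays the marking, ellipsis, and "e.g."/"i.e." steps on a made-up sample, using local copies of the documented constants. The known-abbreviation step is omitted because it depends on the class-level `@abbr_regex`, and non-mutating `gsub` is used so each stage can be inspected.

```ruby
# Local copies of the documented constants (lowercase names are stand-ins).
eos = "\001"
abbr_detect = /(?:\s(?:(?:(?:\w\.){2,}\w?)|(?:\w\.\w)))/
punctuation_detect = /((?:[\.?!]|[\r\n]+)(?:\"|\'|\)|\]|\})?)(\s+)/
correct_abbr = /(#{abbr_detect})#{eos}(\s+[a-z0-9])/

text = "He waited... then left, i.e. went home. Was it late? Yes!"

# 1. Put a marker after every terminator that is followed by whitespace.
marked = text.gsub(punctuation_detect) { $1 + eos + $2 }

# 2. Remove the marker after runs of two or more full stops (ellipses).
no_ellipsis = marked.gsub(/(\.\.\.*)#{eos}/) { $1 }

# 3. Remove the marker after a dotted abbreviation when the next word
#    starts with a lowercase letter or a digit.
no_abbr = no_ellipsis.gsub(correct_abbr, "\\1\\2")

# 4. Split on the remaining markers and tidy up.
sentences = no_abbr.split(eos).map(&:strip).reject(&:empty?)
# => ["He waited... then left, i.e. went home.", "Was it late?", "Yes!"]
```

Step 1 places four markers in this sample; steps 2 and 3 each remove one, which leaves the two real sentence boundaries.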
.set_abbr_regex! ⇒ Object
    # File 'lib/lingua/en/sentence.rb', line 80
    def self.set_abbr_regex!
      @abbr_regex = / (#{abbreviations.join("|")})\.#{EOS}/i
    end
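The generated pattern consists of a literal space, one alternation group, an escaped period, and the EOS marker, matched case-insensitively. The sketch below builds it from a shortened list and applies the replacement block shown in `.sentences` to a hand-marked sample string. All names are local stand-ins, and the lookbehind variant is a hypothetical alternative, not the library's code.

```ruby
eos = "\001"
abbreviations = %w[dr mr st] # shortened list, for illustration only

# Same construction as in set_abbr_regex!.
abbr_regex = / (#{abbreviations.join("|")})\.#{eos}/i

# A string as it might look after the marking step.
marked = "I saw Dr.#{eos} Smith on Main St.#{eos} He waved.#{eos}"

# Replacement as listed in .sentences: the marker is removed, and so is
# the space that the pattern matched in front of the abbreviation.
as_listed = marked.gsub(abbr_regex) { $1 + "." }
# => "I sawDr. Smith on MainSt. He waved.\u0001"

# Hypothetical variant: a lookbehind checks for the space without consuming it.
keeps_space = marked.gsub(/(?<= )(#{abbreviations.join("|")})\.#{eos}/i) { $1 + "." }
# => "I saw Dr. Smith on Main St. He waved.\u0001"
```

Since the leading space is part of the match, an abbreviation at the very start of the text is not recognised and its marker stays in place. The sample also shows a known limit of the approach: a real sentence end directly after a listed abbreviation ("Main St. He waved.") loses its marker.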