Class: PragmaticSegmenter::Languages::Danish::AbbreviationReplacer

Inherits:

Object
AbbreviationReplacer
PragmaticSegmenter::Languages::Danish::AbbreviationReplacer

Defined in: lib/pragmatic_segmenter/languages/danish.rb

Constant Summary

SENTENCE_STARTERS =

%w(
  At De Dem Den Der Det Du En Et For Få Gjorde Han Hun Hvad Hvem Hvilke
  Hvor Hvordan Hvorfor Hvorledes Hvornår I Jeg Mange Vi Være
).freeze

Instance Attribute Summary

Attributes inherited from AbbreviationReplacer

#text

Instance Method Summary

#replace_abbreviation_as_sentence_boundary(txt) ⇒ Object

Methods inherited from AbbreviationReplacer

#initialize, #replace

Constructor Details

This class inherits a constructor from PragmaticSegmenter::AbbreviationReplacer

Instance Method Details

#replace_abbreviation_as_sentence_boundary(txt) ⇒ Object

# File 'lib/pragmatic_segmenter/languages/danish.rb', line 54

def replace_abbreviation_as_sentence_boundary(txt)
  # As we are being conservative and keeping ambiguous
  # sentence boundaries as one sentence instead of
  # splitting into two, we can split at words that
  # we know for certain never follow these abbreviations.
  # Some might say that the set of words that follow an
  # abbreviation such as U.S. (i.e. U.S. Government) is smaller than
  # the set of words that could start a sentence and
  # never follow U.S. However, we are being conservative
  # and not splitting by default, so we need to look for places
  # where we definitely can split. Obviously SENTENCE_STARTERS
  # will never cover all cases, but as the gem is named
  # 'Pragmatic Segmenter' we need to be pragmatic
  # and try to cover the words that most often start a
  # sentence but could never follow one of the abbreviations below.

  @language::AbbreviationReplacer::SENTENCE_STARTERS.each do |word|
    escaped = Regexp.escape(word)
    txt.gsub!(/U∯S∯\s#{escaped}\s/, "U∯S\.\s#{escaped}\s")
    txt.gsub!(/U\.S∯\s#{escaped}\s/, "U\.S\.\s#{escaped}\s")
    txt.gsub!(/U∯K∯\s#{escaped}\s/, "U∯K\.\s#{escaped}\s")
    txt.gsub!(/U\.K∯\s#{escaped}\s/, "U\.K\.\s#{escaped}\s")
    txt.gsub!(/E∯U∯\s#{escaped}\s/, "E∯U\.\s#{escaped}\s")
    txt.gsub!(/E\.U∯\s#{escaped}\s/, "E\.U\.\s#{escaped}\s")
    txt.gsub!(/U∯S∯A∯\s#{escaped}\s/, "U∯S∯A\.\s#{escaped}\s")
    txt.gsub!(/U\.S\.A∯\s#{escaped}\s/, "U\.S\.A\.\s#{escaped}\s")
    txt.gsub!(/I∯\s#{escaped}\s/, "I\.\s#{escaped}\s")
    txt.gsub!(/s.u∯\s#{escaped}\s/, "s\.u\.\s#{escaped}\s")
    txt.gsub!(/S.U∯\s#{escaped}\s/, "S\.U\.\s#{escaped}\s")
  end
  txt
end
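
The effect of the method above can be illustrated with a minimal standalone sketch. It does not load the gem: `STARTERS` is a hand-picked subset of `SENTENCE_STARTERS`, and `restore_boundary` is a hypothetical helper that mirrors only the `U∯S∯` substitution. It assumes, based on the patterns in the source, that earlier processing has already replaced abbreviation periods with the `∯` placeholder.

```ruby
# Hypothetical sketch, not part of the gem: mirrors one substitution
# from replace_abbreviation_as_sentence_boundary.
STARTERS = %w(Det Han Hun Jeg Vi).freeze # subset of SENTENCE_STARTERS

# Restores the final period of a masked "U∯S∯" when the next word is a
# known sentence starter, so that position can be split on later.
def restore_boundary(txt, starters = STARTERS)
  out = txt.dup # leave the caller's string untouched
  starters.each do |word|
    escaped = Regexp.escape(word)
    out.gsub!(/U∯S∯\s#{escaped}\s/, "U∯S. #{word} ")
  end
  out
end

# "Det" is a sentence starter, so the last placeholder becomes a period.
split_case = restore_boundary("Han bor i U∯S∯ Det er langt væk.")
puts split_case

# "regeringen" is not a starter, so the abbreviation stays masked.
kept_case = restore_boundary("Den U∯S∯ regeringen svarede ikke.")
puts kept_case
```

Only the final placeholder is restored; the inner `∯` stays in place so the abbreviation itself is never treated as a boundary. Unlike the library method, which mutates `txt` with `gsub!`, this sketch works on a copy.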
