Class: PragmaticSegmenter::Languages::Danish::AbbreviationReplacer
- Inherits: AbbreviationReplacer
  - Object
  - AbbreviationReplacer
  - PragmaticSegmenter::Languages::Danish::AbbreviationReplacer
- Defined in: lib/pragmatic_segmenter/languages/danish.rb
Constant Summary
- SENTENCE_STARTERS =
%w( At De Dem Den Der Det Du En Et For Få Gjorde Han Hun Hvad Hvem Hvilke Hvor Hvordan Hvorfor Hvorledes Hvornår I Jeg Mange Vi Være ).freeze
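As a rough illustration of how a word list like this can be used, the sketch below interpolates each starter into a pattern of the same shape as the `U∯S∯` patterns in the method source further down. `DEMO_STARTERS` and `starter_patterns` are hypothetical names introduced here, and the list is a small subset chosen for the example, not the full constant.

```ruby
# Hypothetical subset of the Danish sentence starters, for illustration only.
DEMO_STARTERS = %w(Den Det Han Hun Jeg Vi).freeze

# Hypothetical helper: builds one pattern per starter word, of the form
# "U∯S∯", whitespace, starter word, whitespace.
def starter_patterns(starters)
  starters.map { |word| /U∯S∯\s#{Regexp.escape(word)}\s/ }
end

patterns = starter_patterns(DEMO_STARTERS)

# "Han" is in the list and follows the masked abbreviation, so one pattern matches.
text = "Han bor i U∯S∯ Han arbejder der."
matched = patterns.any? { |re| re.match?(text) }
```

Because each pattern requires whitespace on both sides of the word, a starter at the very end of the text, or one followed directly by punctuation, would not match in this sketch.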
Instance Attribute Summary
Attributes inherited from AbbreviationReplacer
Instance Method Summary
Methods inherited from AbbreviationReplacer
Constructor Details
This class inherits a constructor from PragmaticSegmenter::AbbreviationReplacer
Instance Method Details
#replace_abbreviation_as_sentence_boundary(txt) ⇒ Object
# File 'lib/pragmatic_segmenter/languages/danish.rb', line 54

def replace_abbreviation_as_sentence_boundary(txt)
  # As we are being conservative and keeping ambiguous
  # sentence boundaries as one sentence instead of
  # splitting into two, we can split at words that
  # we know for certain never follow these abbreviations.
  # Some might say that the set of words that follow an
  # abbreviation such as U.S. (i.e. U.S. Government) is smaller than
  # the set of words that could start a sentence and
  # never follow U.S. However, we are being conservative
  # and not splitting by default, so we need to look for places
  # where we definitely can split. Obviously SENTENCE_STARTERS
  # will never cover all cases, but as the gem is named
  # 'Pragmatic Segmenter' we need to be pragmatic
  # and try to cover the words that most often start a
  # sentence but could never follow one of the abbreviations below.

  @language::AbbreviationReplacer::SENTENCE_STARTERS.each do |word|
    escaped = Regexp.escape(word)
    txt.gsub!(/U∯S∯\s#{escaped}\s/, "U∯S\.\s#{escaped}\s")
    txt.gsub!(/U\.S∯\s#{escaped}\s/, "U\.S\.\s#{escaped}\s")
    txt.gsub!(/U∯K∯\s#{escaped}\s/, "U∯K\.\s#{escaped}\s")
    txt.gsub!(/U\.K∯\s#{escaped}\s/, "U\.K\.\s#{escaped}\s")
    txt.gsub!(/E∯U∯\s#{escaped}\s/, "E∯U\.\s#{escaped}\s")
    txt.gsub!(/E\.U∯\s#{escaped}\s/, "E\.U\.\s#{escaped}\s")
    txt.gsub!(/U∯S∯A∯\s#{escaped}\s/, "U∯S∯A\.\s#{escaped}\s")
    txt.gsub!(/U\.S\.A∯\s#{escaped}\s/, "U\.S\.A\.\s#{escaped}\s")
    txt.gsub!(/I∯\s#{escaped}\s/, "I\.\s#{escaped}\s")
    txt.gsub!(/s.u∯\s#{escaped}\s/, "s\.u\.\s#{escaped}\s")
    txt.gsub!(/S.U∯\s#{escaped}\s/, "S\.U\.\s#{escaped}\s")
  end
  txt
end
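The effect of the method can be sketched without loading the gem. The example below assumes, based on the replacement strings in the source, that `∯` stands in for an abbreviation's periods and that turning the last `∯` back into `.` marks a point where the text may be split. `STARTERS` and `restore_boundary` are hypothetical names used only here, and only two of the abbreviations are covered.

```ruby
# Hypothetical, trimmed-down starter list; the real one is SENTENCE_STARTERS.
STARTERS = %w(Den Det Han Hun Jeg Vi).freeze

# Hypothetical helper (not part of the gem): restores the final period of
# "U∯S∯" or "E∯U∯" only when a known sentence starter follows.
def restore_boundary(text, starters)
  result = text.dup
  starters.each do |word|
    escaped = Regexp.escape(word)
    result.gsub!(/U∯S∯\s#{escaped}\s/, "U∯S. #{word} ")
    result.gsub!(/E∯U∯\s#{escaped}\s/, "E∯U. #{word} ")
  end
  result
end

# "Hun" is a starter, so the last placeholder becomes a real period again.
split_here = restore_boundary("Hun bor i U∯S∯ Hun arbejder der.", STARTERS)

# "regeringen" is not a starter, so the text is left untouched.
left_alone = restore_boundary("Det er U∯S∯ regeringen.", STARTERS)

# An unescaped "." in a pattern matches any single character, so a pattern
# written like /s.u∯/ also matches the fully masked form "s∯u∯".
dot_matches_placeholder = /s.u∯/.match?("s∯u∯")
```

In the source listing, the `s.u∯` and `S.U∯` patterns leave the first dot unescaped, which is why the last line of the sketch checks that behaviour. Whether that is intentional is not stated in the source.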