Module: PragmaticSegmenter::Languages::Common
- Included in: Amharic, Arabic, Armenian, Bulgarian, Burmese, Chinese, Danish, Deutsch, Dutch, English, French, Greek, Hindi, Italian, Japanese, Kazakh, Persian, Polish, Russian, Spanish, Urdu
- Defined in: lib/pragmatic_segmenter/languages/common.rb, lib/pragmatic_segmenter/languages/common/numbers.rb, lib/pragmatic_segmenter/languages/common/ellipsis.rb
Defined Under Namespace
Modules: Abbreviation, Abbreviations, AmPmRules, DoublePunctuationRules, EllipsisRules, ExclamationPointRules, Numbers, ReinsertEllipsisRules, SingleLetterAbbreviationRules, SubSymbolsRules
Classes: AbbreviationReplacer
Constant Summary
- Punctuations =
This constant holds the punctuation marks, in both ASCII and full-width forms.
['。', '.', '.', '!', '!', '?', '?'].freeze
- GeoLocationRule =
Rubular: rubular.com/r/G2opjedIm9
Rule.new(/(?<=[a-zA-z]°)\.(?=\s*\d+)/, '∯')
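As a rough illustration of what this rule protects, the sketch below applies the same pattern and replacement with plain `String#gsub`. It assumes that applying a rule amounts to a pattern/replacement substitution; the constant and variable names are invented for the example and are not part of the library.

```ruby
# Pattern and replacement copied from GeoLocationRule above.
# Assumption: applying a rule behaves like a plain gsub.
GEO_LOCATION_PATTERN = /(?<=[a-zA-z]°)\.(?=\s*\d+)/
GEO_PLACEHOLDER = '∯'

text = 'You can find it at N°. 1026.253.553.'

# Only the period that follows "N°" and precedes digits is swapped
# for the placeholder; the other periods are left alone.
geo_protected = text.gsub(GEO_LOCATION_PATTERN, GEO_PLACEHOLDER)

puts geo_protected
```

The placeholder keeps the period after the degree sign from being read as a sentence boundary.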
- FileFormatRule =
Rule.new(/(?<=\s)\.(?=(jpe?g|png|gif|tiff?|pdf|ps|docx?|xlsx?|svg|bmp|tga|exif|odt|html?|txt|rtf|bat|sxw|xml|zip|exe|msi|blend|wmv|mp[34]|pptx?|flac|rb|cpp|cs|js)\s)/, '∯')
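The lookbehind in this pattern means only a file extension that stands alone after whitespace is affected. A minimal sketch, again assuming the rule behaves like a plain `gsub` and using made-up names:

```ruby
# Pattern copied from FileFormatRule above.
# Assumption: applying a rule behaves like a plain gsub.
FILE_FORMAT_PATTERN = /(?<=\s)\.(?=(jpe?g|png|gif|tiff?|pdf|ps|docx?|xlsx?|svg|bmp|tga|exif|odt|html?|txt|rtf|bat|sxw|xml|zip|exe|msi|blend|wmv|mp[34]|pptx?|flac|rb|cpp|cs|js)\s)/

text = 'Export the chart to .pdf or .docx before sending report.pdf along.'

# ".pdf" and ".docx" are preceded by whitespace, so their periods are
# replaced. "report.pdf" is not, because its period follows a letter.
file_protected = text.gsub(FILE_FORMAT_PATTERN, '∯')

puts file_protected
```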
- SingleNewLineRule =
Rule.new(/\n/, 'ȹ')
- QuestionMarkInQuotationRule =
Rubular: rubular.com/r/aXPUGm6fQh
Rule.new(/\?(?=(\'|\"))/, '&ᓷ&')
- ExtraWhiteSpaceRule =
Rule.new(/\s{3,}/, ' ')
- SubSingleQuoteRule =
Rule.new(/&⎋&/, "'")
- SENTENCE_BOUNDARY_REGEX =
/\u{ff08}(?:[^\u{ff09}])*\u{ff09}(?=\s?[A-Z])|\u{300c}(?:[^\u{300d}])*\u{300d}(?=\s[A-Z])|\((?:[^\)]){2,}\)(?=\s[A-Z])|'(?:[^'])*[^,]'(?=\s[A-Z])|"(?:[^"])*[^,]"(?=\s[A-Z])|“(?:[^”])*[^,]”(?=\s[A-Z])|\S.*?[。..!!??ȸȹ☉☈☇☄]/
- QUOTATION_AT_END_OF_SENTENCE_REGEX =
Rubular: rubular.com/r/NqCqv372Ix
/[!?\.-][\"\'\u{201d}\u{201c}]\s{1}[A-Z]/
- PARENS_BETWEEN_DOUBLE_QUOTES_REGEX =
Rubular: rubular.com/r/6flGnUMEVl
/["”]\s\(.*\)\s["“]/
- BETWEEN_DOUBLE_QUOTES_REGEX =
Rubular: rubular.com/r/TYzr4qOW1Q
/"(?:[^"])*[^,]"|“(?:[^”])*[^,]”/
- SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX =
Rubular: rubular.com/r/JMjlZHAT4g
/(?<=[!?\.-][\"\'\u{201d}\u{201c}])\s{1}(?=[A-Z])/
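Because this pattern consists of a lookbehind, a single whitespace character, and a lookahead, it can serve directly as a split point. The sketch below shows that with `String#split`; the variable names are illustrative only, and how the library itself uses the constant is not shown here.

```ruby
# Pattern copied from SPLIT_SPACE_QUOTATION_AT_END_OF_SENTENCE_REGEX above.
SPLIT_AFTER_QUOTE = /(?<=[!?\.-][\"\'\u{201d}\u{201c}])\s{1}(?=[A-Z])/

text = 'She asked, "Are you sure?" Then she left.'

# The only space that matches is the one between the closing quote
# (preceded by "?") and the capital "T".
quote_parts = text.split(SPLIT_AFTER_QUOTE)

quote_parts.each { |part| puts part }
```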
- CONTINUOUS_PUNCTUATION_REGEX =
Rubular: rubular.com/r/mQ8Es9bxtk
/(?<=\S)(!|\?){3,}(?=(\s|\z|$))/
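The quantifier `{3,}` means a run needs at least three marks to match. A small sketch of that threshold, using hypothetical variable names and plain `String#[]` lookup:

```ruby
# Pattern copied from CONTINUOUS_PUNCTUATION_REGEX above.
CONTINUOUS_PUNCTUATION = /(?<=\S)(!|\?){3,}(?=(\s|\z|$))/

# Four mixed marks after a non-space character: the whole run matches.
long_run = 'Really?!?! I had no idea.'[CONTINUOUS_PUNCTUATION]

# Only two marks: below the threshold, so there is no match.
short_run = 'Wait!! Not yet.'[CONTINUOUS_PUNCTUATION]

p long_run
p short_run
```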
- NUMBERED_REFERENCE_REGEX =
/(?<=[^\d\s])(\.|∯)((\[(\d{1,3},?\s?-?\s?)?\b\d{1,3}\])+|((\d{1,3}\s?){0,3}\d{1,3}))(\s)(?=[A-Z])/
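This pattern targets a period followed immediately by a citation marker and then a capitalized word. The sketch below only inspects the match; the variable names are invented, and what the library does with the captured groups is not assumed.

```ruby
# Pattern copied from NUMBERED_REFERENCE_REGEX above.
NUMBERED_REFERENCE = /(?<=[^\d\s])(\.|∯)((\[(\d{1,3},?\s?-?\s?)?\b\d{1,3}\])+|((\d{1,3}\s?){0,3}\d{1,3}))(\s)(?=[A-Z])/

text = 'The result was confirmed.[12] Further work followed.'
reference_match = text.match(NUMBERED_REFERENCE)

# Group 0 is the period, the bracketed reference and the trailing space;
# group 2 is the reference marker on its own.
p reference_match[0]
p reference_match[2]
```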
- PossessiveAbbreviationRule =
Rubular: rubular.com/r/yqa4Rit8EY
Rule.new(/\.(?='s\s)|\.(?='s$)|\.(?='s\z)/, '∯')
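To make the effect concrete, the sketch below runs the same pattern and replacement through `String#gsub`, on the assumption that a rule behaves like a plain substitution. The names are illustrative.

```ruby
# Pattern copied from PossessiveAbbreviationRule above.
# Assumption: applying a rule behaves like a plain gsub.
POSSESSIVE_PATTERN = /\.(?='s\s)|\.(?='s$)|\.(?='s\z)/

text = "Calif.'s governor spoke."

# The period before "'s" is replaced; the final period is untouched
# because it is not followed by "'s".
possessive_protected = text.gsub(POSSESSIVE_PATTERN, '∯')

puts possessive_protected
```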
- KommanditgesellschaftRule =
Rubular: rubular.com/r/NEv265G2X2
Rule.new(/(?<=Co)\.(?=\sKG)/, '∯')
- MULTI_PERIOD_ABBREVIATION_REGEX =
Rubular: rubular.com/r/xDkpFZ0EgH
/\b[a-z](?:\.[a-z])+[.]/i
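As a quick check of what this pattern picks up, the sketch below scans a sentence with `String#scan`. The sample text and variable names are made up for the example.

```ruby
# Pattern copied from MULTI_PERIOD_ABBREVIATION_REGEX above.
MULTI_PERIOD_ABBREVIATION = /\b[a-z](?:\.[a-z])+[.]/i

text = 'He lives in the U.S.A. and works e.g. remotely.'

# Each hit is a run of single letters separated by periods and ending
# in a period. Ordinary words never match, since their first letter is
# followed by another letter rather than a period.
abbreviations = text.scan(MULTI_PERIOD_ABBREVIATION)

p abbreviations
```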