Class: PragmaticSegmenter::Cleaner
- Inherits:
- Object (superclass chain: PragmaticSegmenter::Cleaner < Object)
- Includes:
- Rules
- Defined in:
- lib/pragmatic_segmenter/cleaner.rb,
lib/pragmatic_segmenter/cleaner/rules.rb
Overview
An opinionated class that removes errant newlines, XHTML, inline formatting, and similar unwanted formatting from text.
Direct Known Subclasses
Languages::Danish::Cleaner, Languages::English::Cleaner, Languages::Japanese::Cleaner
Defined Under Namespace
Modules: Rules
Constant Summary
Constants included from Rules
Rules::ConsecutiveForwardSlashRule, Rules::ConsecutivePeriodsRule, Rules::DoubleNewLineRule, Rules::DoubleNewLineWithSpaceRule, Rules::EscapedCarriageReturnRule, Rules::EscapedNewLineRule, Rules::InlineFormattingRule, Rules::NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_REGEX, Rules::NewLineFollowedByBulletRule, Rules::NewLineFollowedByPeriodRule, Rules::NewLineInMiddleOfWordRule, Rules::NoSpaceBetweenSentencesDigitRule, Rules::NoSpaceBetweenSentencesRule, Rules::QuotationsFirstRule, Rules::QuotationsSecondRule, Rules::ReplaceNewlineWithCarriageReturnRule, Rules::TableOfContentsRule, Rules::TypoEscapedCarriageReturnRule, Rules::TypoEscapedNewLineRule, Rules::URL_EMAIL_KEYWORDS
Instance Attribute Summary
- #doc_type ⇒ Object (readonly): Returns the value of attribute doc_type.
- #text ⇒ Object (readonly): Returns the value of attribute text.
Instance Method Summary
- #clean ⇒ Object: Clean text of unwanted formatting.
- #initialize(text:, doc_type: nil, language: Languages::Common) ⇒ Cleaner (constructor): A new instance of Cleaner.
Constructor Details
#initialize(text:, doc_type: nil, language: Languages::Common) ⇒ Cleaner
Returns a new instance of Cleaner.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 13
def initialize(text:, doc_type: nil, language: Languages::Common)
  @text = text.dup
  @doc_type = doc_type
  @language = language
end
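The constructor stores `text.dup` rather than the caller's string, which suggests that later in-place edits are meant to touch only the cleaner's private copy. The following is a minimal sketch of that effect. It uses a hypothetical stand-in class, `TinyCleaner`, and a made-up `squeeze_spaces!` step; neither is part of pragmatic_segmenter.

```ruby
# Hypothetical stand-in that mirrors the constructor shown above.
# It is NOT the gem's Cleaner; it only illustrates the effect of `text.dup`.
class TinyCleaner
  attr_reader :text, :doc_type

  def initialize(text:, doc_type: nil)
    @text = text.dup       # private copy; the caller's string is left alone
    @doc_type = doc_type
  end

  # Made-up cleanup step that mutates the private copy in place.
  def squeeze_spaces!
    @text.gsub!(/ {2,}/, ' ')
    self
  end
end

original = "Too  many   spaces."
cleaner  = TinyCleaner.new(text: original).squeeze_spaces!

puts cleaner.text  # Too many spaces.
puts original      # Too  many   spaces.
```

Without the `dup`, the `gsub!` call would also rewrite `original`, since both names would refer to the same String object.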
Instance Attribute Details
#doc_type ⇒ Object (readonly)
Returns the value of attribute doc_type.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 12
def doc_type
  @doc_type
end
#text ⇒ Object (readonly)
Returns the value of attribute text.
# File 'lib/pragmatic_segmenter/cleaner.rb', line 12
def text
  @text
end
Instance Method Details
#clean ⇒ Object
Clean text of unwanted formatting.
Example:
>> text = "This is a sentence\ncut off in the middle because pdf."
>> PragmaticSegmenter::Cleaner.new(text: text).clean
=> "This is a sentence cut off in the middle because pdf."
Arguments:
text: (String) *required
language: (String) *optional
(two character ISO 639-1 code e.g. 'en')
doc_type: (String) *optional
(e.g. 'pdf')
# File 'lib/pragmatic_segmenter/cleaner.rb', line 33
def clean
  return unless text
  remove_all_newlines
  replace_double_newlines
  replace_newlines
  replace_escaped_newlines
  Rule.apply(@text, HTML::All)
  replace_punctuation_in_brackets
  Rule.apply(@text, InlineFormattingRule)
  clean_quotations
  clean_table_of_contents
  check_for_no_space_in_between_sentences
  clean_consecutive_characters
end
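The `clean` method above returns early when there is no text and otherwise runs a fixed sequence of cleanup steps. The general idea of applying ordered substitution rules can be sketched in plain Ruby. Everything here (`SketchRule`, `apply_rules`, `sketch_clean`, and both regexes) is a hypothetical illustration, not the gem's actual rules or its `Rule.apply` implementation.

```ruby
# Hypothetical rule type: a pattern plus its replacement string.
SketchRule = Struct.new(:pattern, :replacement)

# Illustrative rules only; these regexes are assumptions, not the gem's.
# Join a line break that is followed by a lowercase letter and does not
# follow sentence-ending punctuation (a likely mid-sentence break).
NEWLINE_MID_SENTENCE = SketchRule.new(/(?<=[^.!?\n])\n(?=[a-z])/, ' ')
# Collapse runs of two or more spaces into one.
EXTRA_SPACES = SketchRule.new(/ {2,}/, ' ')

# Apply each rule in order, feeding the result of one into the next.
def apply_rules(text, *rules)
  rules.reduce(text) { |acc, rule| acc.gsub(rule.pattern, rule.replacement) }
end

def sketch_clean(text)
  return unless text  # mirrors the early return in the method above
  apply_rules(text.dup, NEWLINE_MID_SENTENCE, EXTRA_SPACES)
end

puts sketch_clean("This is a sentence\ncut off in the middle because pdf.")
# This is a sentence cut off in the middle because pdf.
```

Order matters in a pipeline like this: a rule that joins broken lines can create input that a later spacing rule then has to tidy, which is one reason such steps tend to run in a fixed sequence.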