Class: PragmaticSegmenter::Cleaner

Inherits:
Object
Includes:
Rules
Defined in:
lib/pragmatic_segmenter/cleaner.rb,
lib/pragmatic_segmenter/cleaner/rules.rb

Overview

This is an opinionated class that removes errant newlines, XHTML markup, inline formatting, and similar noise from text.

Defined Under Namespace

Modules: Rules

Constant Summary

Constants included from Rules

Rules::ConsecutiveForwardSlashRule, Rules::ConsecutivePeriodsRule, Rules::DoubleNewLineRule, Rules::DoubleNewLineWithSpaceRule, Rules::EscapedCarriageReturnRule, Rules::EscapedNewLineRule, Rules::InlineFormattingRule, Rules::NEWLINE_IN_MIDDLE_OF_SENTENCE_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_DIGIT_REGEX, Rules::NO_SPACE_BETWEEN_SENTENCES_REGEX, Rules::NewLineFollowedByBulletRule, Rules::NewLineFollowedByPeriodRule, Rules::NewLineInMiddleOfWordRule, Rules::NoSpaceBetweenSentencesDigitRule, Rules::NoSpaceBetweenSentencesRule, Rules::QuotationsFirstRule, Rules::QuotationsSecondRule, Rules::ReplaceNewlineWithCarriageReturnRule, Rules::TableOfContentsRule, Rules::TypoEscapedCarriageReturnRule, Rules::TypoEscapedNewLineRule, Rules::URL_EMAIL_KEYWORDS

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(text:, doc_type: nil, language: Languages::Common) ⇒ Cleaner

Returns a new instance of Cleaner.



# File 'lib/pragmatic_segmenter/cleaner.rb', line 13

def initialize(text:, doc_type: nil, language: Languages::Common)
  @text = text.dup
  @doc_type = doc_type
  @language = language
end
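
The constructor stores a copy of the input (`text.dup`), so later in-place edits should not touch the caller's string. The following is a minimal sketch of that pattern, not the library's code: `SketchCleaner` and `squeeze_spaces!` are hypothetical names introduced only for illustration.

```ruby
# Hypothetical stand-in class for illustration; not PragmaticSegmenter::Cleaner.
class SketchCleaner
  attr_reader :text, :doc_type

  def initialize(text:, doc_type: nil)
    @text = text.dup # private copy, mirroring the constructor above
    @doc_type = doc_type
  end

  # Hypothetical helper: collapses runs of spaces in the private copy only.
  def squeeze_spaces!
    @text.gsub!(/ {2,}/, ' ')
    @text
  end
end

original = "Hello   world."
cleaner = SketchCleaner.new(text: original, doc_type: 'pdf')
cleaner.squeeze_spaces!

puts original     # prints "Hello   world."
puts cleaner.text # prints "Hello world."
```

Because the instance works on its own copy, the caller can keep using the original string after cleaning.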

Instance Attribute Details

#doc_type ⇒ Object (readonly)

Returns the value of attribute doc_type.



# File 'lib/pragmatic_segmenter/cleaner.rb', line 12

def doc_type
  @doc_type
end

#text ⇒ Object (readonly)

Returns the value of attribute text.



# File 'lib/pragmatic_segmenter/cleaner.rb', line 12

def text
  @text
end

Instance Method Details

#clean ⇒ Object

Cleans the text of unwanted formatting. Returns nil if text is nil.

Example:

>> text = "This is a sentence\ncut off in the middle because pdf."
>> PragmaticSegmenter::Cleaner.new(text: text).clean
=> "This is a sentence cut off in the middle because pdf."

Arguments:

text:       (String)  *required
language:   (Module)  *optional
            (a language module; defaults to Languages::Common)
doc_type:   (String)  *optional
            (e.g. 'pdf')


# File 'lib/pragmatic_segmenter/cleaner.rb', line 33

def clean
  return unless text
  remove_all_newlines
  replace_double_newlines
  replace_newlines
  replace_escaped_newlines

  Rule.apply(@text, HTML::All)

  replace_punctuation_in_brackets
  Rule.apply(@text, InlineFormattingRule)
  clean_quotations
  clean_table_of_contents
  check_for_no_space_in_between_sentences
  clean_consecutive_characters
end
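
The method above runs a fixed sequence of cleaning passes over the stored text. As a rough, standalone sketch of that ordered-pass idea (not the library's implementation), the example below applies a short list of regex substitutions in order. `SKETCH_RULES` and `sketch_clean` are hypothetical names, and the three patterns are simplified assumptions; the real rules are defined in `PragmaticSegmenter::Cleaner::Rules`.

```ruby
# Hypothetical, simplified rules for illustration only.
SKETCH_RULES = [
  [/\n(?=[a-z])/, ' '], # newline before a lowercase letter: join the lines
  [/<[^>]+>/, ''],      # strip anything that looks like a tag
  [/ {2,}/, ' ']        # collapse runs of spaces
].freeze

# Applies each rule in order to a copy of the input.
# Returns nil for nil input, mirroring the guard in #clean.
def sketch_clean(text)
  return unless text

  result = text.dup
  SKETCH_RULES.each { |pattern, replacement| result.gsub!(pattern, replacement) }
  result
end

puts sketch_clean("This is a sentence\ncut off in the middle because pdf.")
# prints "This is a sentence cut off in the middle because pdf."
```

Order matters in a pipeline like this: here the tag-stripping pass can leave doubled spaces behind, so the space-collapsing pass runs last.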