Class: String

Inherits:
Object
Defined in:
lib/classifier/lsi/summary.rb,
lib/classifier/extensions/word_hash.rb

Overview

Extensions to the String class that provide convenience methods for the Classifier package.

Constant Summary

ABBREVIATIONS =
%w[Mr Mrs Ms Dr Prof Jr Sr Inc Ltd Corp Co vs etc al eg ie].freeze

Instance Method Details

#clean_word_hash ⇒ Object

Returns a word hash without extra punctuation or short symbols, containing only stemmed words.



# File 'lib/classifier/extensions/word_hash.rb', line 32

def clean_word_hash
  word_hash_for_words gsub(/[^\w\s]/, '').split
end
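
As a rough illustration of the punctuation-stripping step, the sketch below applies the same gsub/split chain to a plain string. Stemming and counting happen inside word_hash_for_words, which is not shown on this page, so they are left out; the sample text is made up.

```ruby
# Same regex as clean_word_hash: drop every character that is neither a
# word character (\w) nor whitespace (\s), then split on whitespace.
text  = "Hello, world! It's a test."
words = text.gsub(/[^\w\s]/, '').split

p words
# => ["Hello", "world", "Its", "a", "test"]
```

Because the apostrophe is deleted rather than replaced with a space, "It's" collapses to "Its" instead of splitting into two tokens.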

#paragraph_summary(count = 1, separator = ' [...] ') ⇒ Object



# File 'lib/classifier/lsi/summary.rb', line 12

def paragraph_summary(count = 1, separator = ' [...] ')
  perform_lsi split_paragraphs, count, separator
end

#split_paragraphs ⇒ Object



# File 'lib/classifier/lsi/summary.rb', line 22

def split_paragraphs
  split(/\r?\n\r?\n+/)
end
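
To make the paragraph-break pattern concrete, here is a small sketch that applies the same regex to Unix and Windows line endings. The sample strings are invented for illustration.

```ruby
# Same pattern as split_paragraphs: a newline, an optional carriage
# return, then one or more further newlines.
paragraph_break = /\r?\n\r?\n+/

unix    = "First.\n\nSecond.\n\n\nThird."
windows = "First.\r\n\r\nSecond."
mixed   = "A.\r\n\r\n\r\nB."

p unix.split(paragraph_break)
# => ["First.", "Second.", "Third."]
p windows.split(paragraph_break)
# => ["First.", "Second."]
p mixed.split(paragraph_break)
# => ["A.", "\r\nB."]
```

The last case shows a limitation of the pattern: with three or more consecutive CRLF pairs, the match stops after the second pair, so the leftover "\r\n" stays attached to the next paragraph.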

#split_sentences ⇒ Object



# File 'lib/classifier/lsi/summary.rb', line 16

def split_sentences
  return pragmatic_segment if defined?(PragmaticSegmenter)

  split_sentences_regex
end
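
The optional-dependency pattern used here can be sketched as follows. split_into_sentences is a hypothetical helper, not part of the Classifier API, and the fallback regex is an assumption: the real split_sentences_regex is not shown on this page. The PragmaticSegmenter call follows that gem's documented Segmenter interface.

```ruby
# Hypothetical helper illustrating the defined? guard.
def split_into_sentences(text)
  if defined?(PragmaticSegmenter)
    # Used only when the pragmatic_segmenter gem has been loaded.
    PragmaticSegmenter::Segmenter.new(text: text).segment
  else
    # Naive fallback (an assumption): break after ., ! or ? when
    # followed by whitespace.
    text.split(/(?<=[.!?])\s+/)
  end
end

sentences = split_into_sentences("It works. Does it? Yes!")
p sentences
```

Checking with defined? avoids a hard dependency: the gem is used when the caller has required it and is silently skipped otherwise. Note that a naive fallback like the one above also breaks after abbreviations such as "Dr.".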

#summary(count = 10, separator = ' [...] ') ⇒ Object



# File 'lib/classifier/lsi/summary.rb', line 8

def summary(count = 10, separator = ' [...] ')
  perform_lsi split_sentences, count, separator
end

#without_punctuation ⇒ Object

Removes common punctuation symbols, returning a new string. E.g.,

"Hello (greeting's), with {braces} < >...?".without_punctuation
=> "Hello  greetings   with  braces         "


# File 'lib/classifier/extensions/word_hash.rb', line 17

def without_punctuation
  tr(',?.!;:"@#$%^&*()_=+[]{}|<>/`~', ' ').tr("'-", '')
end
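
The two tr passes treat characters differently: most symbols become a space, while apostrophes and hyphens are deleted outright. The sketch below shows the effect on a hyphenated word and a contraction; strip_punctuation is a hypothetical standalone copy of the method body, written so it runs without the gem.

```ruby
# Hypothetical standalone version of the two tr passes.
def strip_punctuation(text)
  # First pass: replace each listed symbol with a single space.
  spaced = text.tr(',?.!;:"@#$%^&*()_=+[]{}|<>/`~', ' ')
  # Second pass: an empty replacement string deletes ' and -.
  spaced.tr("'-", '')
end

result = strip_punctuation("well-known, isn't it?")
p result
# => "wellknown  isnt it "
```

Deleting rather than spacing keeps "well-known" and "isn't" as single tokens when the string is later split on whitespace.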

#word_hash ⇒ Object

Returns a Hash of strings => ints. Each word in the string is stemmed, interned, and mapped to its frequency in the document.



# File 'lib/classifier/extensions/word_hash.rb', line 24

def word_hash
  word_hash = clean_word_hash
  symbol_hash = word_hash_for_symbols(gsub(/\w/, ' ').split)
  word_hash.merge(symbol_hash)
end
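
A minimal sketch of the word/symbol split and merge is shown below. It only counts raw tokens: word_hash_for_words and word_hash_for_symbols are not shown on this page, so any stemming, interning, or filtering they perform is omitted, and count_tokens is a hypothetical stand-in for both.

```ruby
# Hypothetical stand-in for the gem's hashing helpers: plain frequency count.
def count_tokens(tokens)
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

text = "Stop! Stop now..."

# Words: strip everything except word characters and whitespace.
word_counts = count_tokens(text.gsub(/[^\w\s]/, '').split)
# Symbols: blank out word characters so only punctuation runs remain.
symbol_counts = count_tokens(text.gsub(/\w/, ' ').split)

combined = word_counts.merge(symbol_counts)
p combined
# => {"Stop"=>2, "now"=>1, "!"=>1, "..."=>1}
```

A run of adjacent symbols such as "..." survives as one token, since only word characters are turned into separators.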