Module: Ebooks::NLP

Defined in:
lib/twitter_ebooks/nlp.rb

Constant Summary

PUNCTUATION =

We deliberately limit our punctuation handling to stuff we can do consistently. It'll just be a part of another token if we don't split it out, and that's fine.

".?!,"

Class Method Summary

Class Method Details

.adjectives ⇒ Object



# File 'lib/twitter_ebooks/nlp.rb', line 23

def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end

.gingerice ⇒ Object

Gingerice text correction service



# File 'lib/twitter_ebooks/nlp.rb', line 34

def self.gingerice
  require 'gingerice'
  Gingerice::Parser.new # No caching for this one
end
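
A usage sketch, assuming the gingerice gem's Parser#parse, which returns a hash of corrections; the input and output here are illustrative:

Ebooks::NLP.gingerice.parse("the quik brown fox")["result"]
# => "the quick brown fox" (illustrative; actual output depends on the Ginger service)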

.htmlentities ⇒ Object

For decoding html entities



# File 'lib/twitter_ebooks/nlp.rb', line 40

def self.htmlentities
  require 'htmlentities'
  @htmlentities ||= HTMLEntities.new
end
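
For example, decoding entities that appear in tweet text:

Ebooks::NLP.htmlentities.decode("I &lt;3 you &amp; you")
# => "I <3 you & you"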

.keywords(sentences) ⇒ Object



# File 'lib/twitter_ebooks/nlp.rb', line 72

def self.keywords(sentences)
  # Preprocess to remove stopwords (highscore's blacklist is v. slow)
  text = sentences.flatten.reject { |t| stopword?(t) }.join(' ')

  text = Highscore::Content.new(text)

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1                     # => default: 0 = not considered
    #set :consonants, 5                 # => default: 0 = not considered
    #set :ignore_case, true             # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\w']+/           # => default: /\w+/
    #set :stemming, true                # => default: false
  end

  text.keywords
end
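
A usage sketch: the argument is expected to be tokenized sentences (arrays of tokens), since they are flattened before stopword filtering. Highscore's keyword collection exposes top(n) and Keyword#text (per the highscore gem); the ranking shown is illustrative:

keywords = Ebooks::NLP.keywords([%w[the quick brown fox], %w[fox eats owls]])
keywords.top(3).map(&:text)
# => ["fox", "brown", "owls"] (illustrative ranking)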

.normalize(text) ⇒ Object

We don’t really want to deal with all this weird unicode punctuation



# File 'lib/twitter_ebooks/nlp.rb', line 48

def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
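
For example, smart quotes, ellipses and HTML entities are normalized in one pass (the gsub targets are the Unicode characters “, ”, ’ and …):

Ebooks::NLP.normalize("“Hello” &amp; goodbye…")
# => "\"Hello\" & goodbye..."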

.nouns ⇒ Object



# File 'lib/twitter_ebooks/nlp.rb', line 19

def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end

.punctuation?(token) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 121

def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
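
A token counts as punctuation only if every one of its characters comes from PUNCTUATION:

Ebooks::NLP.punctuation?("?!")    # => true
Ebooks::NLP.punctuation?("foo.")  # => false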

.reconstruct(tokens) ⇒ Object

Takes a list of tokens and builds a nice-looking sentence



# File 'lib/twitter_ebooks/nlp.rb', line 94

def self.reconstruct(tokens)
  text = ""
  last_token = nil
  tokens.each do |token|
    next if token == INTERIM
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
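
For example, punctuation tokens attach directly to the preceding word, and INTERIM sentinels are skipped:

Ebooks::NLP.reconstruct(["hello", ",", "world", "!"])
# => "hello, world!"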

.sentences(text) ⇒ Object

Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.



# File 'lib/twitter_ebooks/nlp.rb', line 56

def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
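
For example, both newlines and terminal punctuation followed by whitespace act as sentence boundaries:

Ebooks::NLP.sentences("Hi there. How are you?\nGreat!")
# => ["Hi there.", "How are you?", "Great!"]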

.space_between?(token1, token2) ⇒ Boolean

Determine if we need to insert a space between two tokens

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 107

def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)
  if p1 && p2 # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else # "foo rah"
    true
  end
end
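
Matching the cases noted in the comments above:

Ebooks::NLP.space_between?("foo", ".")  # => false ("foo.")
Ebooks::NLP.space_between?(".", "rah")  # => true  (". rah")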

.stem(word) ⇒ Object



# File 'lib/twitter_ebooks/nlp.rb', line 68

def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
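
A usage sketch, assuming the fast-stemmer gem (which provides Stemmer::stem_word):

Ebooks::NLP.stem("Running")  # => "run"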

.stopword?(token) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 125

def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
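
The check is case-insensitive; assuming "the" appears in the bundled stopwords.txt:

Ebooks::NLP.stopword?("The")         # => true
Ebooks::NLP.stopword?("grapefruit")  # => false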

.stopwords ⇒ Object

Lazy-load NLP libraries and resources. Some of this stuff is pretty heavy, and we don't necessarily need to be using it all of the time.



# File 'lib/twitter_ebooks/nlp.rb', line 15

def self.stopwords
  @stopwords ||= File.read(File.join(DATA_PATH, 'stopwords.txt')).split
end

.subseq?(a1, a2) ⇒ Boolean

Determine if a2 is a contiguous subsequence of a1

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 155

def self.subseq?(a1, a2)
  a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end
end
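
Note that despite the Boolean signature, this returns the starting index (truthy) or nil rather than true/false, which still works in boolean contexts:

Ebooks::NLP.subseq?([1, 2, 3, 4], [2, 3])  # => 1
Ebooks::NLP.subseq?([1, 2, 3, 4], [4, 2])  # => nil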

.tagger ⇒ Object

POS tagger



# File 'lib/twitter_ebooks/nlp.rb', line 28

def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
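
A usage sketch, assuming engtagger's add_tags API, which wraps words in lowercased Penn Treebank tags; the output shown is approximate:

Ebooks::NLP.tagger.add_tags("the quick brown fox")
# => "<det>the</det> <jj>quick</jj> <jj>brown</jj> <nn>fox</nn>" (approximate)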

.tokenize(sentence) ⇒ Object

Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps.



# File 'lib/twitter_ebooks/nlp.rb', line 63

def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
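
For example, punctuation is split off only when followed by whitespace, so the trailing "ok?" keeps its question mark:

Ebooks::NLP.tokenize("hi there. you ok?")
# => ["hi", "there", ".", "you", "ok?"]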

.unmatched_enclosers?(text) ⇒ Boolean

Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the Markov generator; we can just tell it to retry.

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 133

def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
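
For example, a lone opening quote trips the check:

Ebooks::NLP.unmatched_enclosers?('he said "hello')      # => true
Ebooks::NLP.unmatched_enclosers?('he said "hello" ok')  # => false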