Module: Ebooks::NLP
- Defined in:
- lib/twitter_ebooks/nlp.rb
Constant Summary
- PUNCTUATION = ".?!,"
  We deliberately limit our punctuation handling to stuff we can do consistently. It’ll just be a part of another token if we don’t split it out, and that’s fine.
Class Method Summary
- .adjectives ⇒ Object
- .gingerice ⇒ Object
  Gingerice text correction service.
- .htmlentities ⇒ Object
  For decoding html entities.
- .keywords(sentences) ⇒ Object
- .normalize(text) ⇒ Object
  We don’t really want to deal with all this weird unicode punctuation.
- .nouns ⇒ Object
- .punctuation?(token) ⇒ Boolean
- .reconstruct(tokens) ⇒ Object
  Takes a list of tokens and builds a nice-looking sentence.
- .sentences(text) ⇒ Object
  Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.
- .space_between?(token1, token2) ⇒ Boolean
  Determine if we need to insert a space between two tokens.
- .stem(word) ⇒ Object
- .stopword?(token) ⇒ Boolean
- .stopwords ⇒ Object
  Lazy-load NLP libraries and resources. Some of this stuff is pretty heavy and we don’t necessarily need to be using it all of the time.
- .subseq?(a1, a2) ⇒ Boolean
  Determine if a2 is a subsequence of a1.
- .tagger ⇒ Object
  POS tagger.
- .tokenize(sentence) ⇒ Object
  Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with things like emoticons and timestamps.
- .unmatched_enclosers?(text) ⇒ Boolean
  Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the markov generator; we can just tell it to retry.
Class Method Details
.adjectives ⇒ Object
# File 'lib/twitter_ebooks/nlp.rb', line 23

def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end
.gingerice ⇒ Object
Gingerice text correction service.

# File 'lib/twitter_ebooks/nlp.rb', line 34

def self.gingerice
  require 'gingerice'
  Gingerice::Parser.new # No caching for this one
end
.htmlentities ⇒ Object
For decoding html entities.

# File 'lib/twitter_ebooks/nlp.rb', line 40

def self.htmlentities
  require 'htmlentities'
  @htmlentities ||= HTMLEntities.new
end
.keywords(sentences) ⇒ Object
# File 'lib/twitter_ebooks/nlp.rb', line 72

def self.keywords(sentences)
  # Preprocess to remove stopwords (highscore's blacklist is v. slow)
  text = sentences.flatten.reject { |t| stopword?(t) }.join(' ')

  text = Highscore::Content.new(text)

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1                   # => default: 0 = not considered
    #set :consonants, 5               # => default: 0 = not considered
    #set :ignore_case, true           # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\w']+/ # => default: /\w+/
    #set :stemming, true              # => default: false
  end

  text.keywords
end
.normalize(text) ⇒ Object
We don’t really want to deal with all this weird unicode punctuation.

# File 'lib/twitter_ebooks/nlp.rb', line 48

def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
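A standalone sketch of the same normalization. `CGI.unescapeHTML` from the standard library stands in for the htmlentities gem here (an assumption made for self-containment; the gem handles a much wider range of entities):

```ruby
require 'cgi'

# Map curly quotes and ellipses to ASCII equivalents, then decode
# HTML entities. CGI.unescapeHTML is a stdlib stand-in for
# HTMLEntities#decode.
def normalize(text)
  CGI.unescapeHTML(text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...'))
end

normalize('“Hello” &amp; goodbye…') # => "\"Hello\" & goodbye..."
```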
.nouns ⇒ Object
# File 'lib/twitter_ebooks/nlp.rb', line 19

def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end
.punctuation?(token) ⇒ Boolean
# File 'lib/twitter_ebooks/nlp.rb', line 121

def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
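The set difference means a token counts as punctuation only when every one of its characters appears in PUNCTUATION. A self-contained sketch with the constant inlined:

```ruby
require 'set'

PUNCTUATION = ".?!,"

# True only if the token is built entirely from PUNCTUATION characters.
def punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

punctuation?("?!")   # => true
punctuation?("foo.") # => false
```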
.reconstruct(tokens) ⇒ Object
Takes a list of tokens and builds a nice-looking sentence.

# File 'lib/twitter_ebooks/nlp.rb', line 94

def self.reconstruct(tokens)
  text = ""
  last_token = nil
  tokens.each do |token|
    next if token == INTERIM
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
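A self-contained sketch of the reconstruction logic. The helpers are inlined in reduced form (space_between? collapses to "no space before punctuation", which matches its truth table), and INTERIM is stubbed as a symbol — in the gem it is a sentinel defined elsewhere:

```ruby
require 'set'

PUNCTUATION = ".?!,"
INTERIM = :interim # stand-in for the gem's sentinel token

def punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

# Reduced form of space_between?: insert a space except before punctuation.
def space_between?(token1, token2)
  !punctuation?(token2)
end

# Join tokens, skipping sentinels and spacing via space_between?.
def reconstruct(tokens)
  text = ""
  last_token = nil
  tokens.each do |token|
    next if token == INTERIM
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end

reconstruct(%w[Hello , world !]) # => "Hello, world!"
```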
.sentences(text) ⇒ Object
Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.

# File 'lib/twitter_ebooks/nlp.rb', line 56

def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
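The split regex fires on newlines, or on whitespace preceded by a sentence terminator; the lookbehind keeps the terminator attached to its sentence. A quick illustration:

```ruby
# Split on runs of newlines, or on whitespace that follows ., ?, or !.
def sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end

sentences("One. Two!\nThree") # => ["One.", "Two!", "Three"]
```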
.space_between?(token1, token2) ⇒ Boolean
Determine if we need to insert a space between two tokens.

# File 'lib/twitter_ebooks/nlp.rb', line 107

def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)

  if p1 && p2 # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else # "foo rah"
    true
  end
end
.stem(word) ⇒ Object
# File 'lib/twitter_ebooks/nlp.rb', line 68

def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
.stopword?(token) ⇒ Boolean
# File 'lib/twitter_ebooks/nlp.rb', line 125

def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
.stopwords ⇒ Object
Lazy-load NLP libraries and resources. Some of this stuff is pretty heavy and we don’t necessarily need to be using it all of the time.

# File 'lib/twitter_ebooks/nlp.rb', line 15

def self.stopwords
  @stopwords ||= File.read(File.join(DATA_PATH, 'stopwords.txt')).split
end
.subseq?(a1, a2) ⇒ Boolean
Determine if a2 is a subsequence of a1.

# File 'lib/twitter_ebooks/nlp.rb', line 155

def self.subseq?(a1, a2)
  a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end
end
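Note that despite the `?` name, this returns the starting index of the match (truthy) or nil, and "subsequence" here means a contiguous run of a1, not a general subsequence. A standalone version:

```ruby
# Find whether a2 occurs as a contiguous run inside a1.
# Returns the starting index (truthy) or nil (falsy), so it still
# works in boolean context.
def subseq?(a1, a2)
  a1.each_index.find do |i|
    a1[i...i + a2.length] == a2
  end
end

subseq?([1, 2, 3, 4], [2, 3]) # => 1
subseq?([1, 2, 3, 4], [3, 2]) # => nil
```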
.tagger ⇒ Object
POS tagger.

# File 'lib/twitter_ebooks/nlp.rb', line 28

def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
.tokenize(sentence) ⇒ Object
Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with things like emoticons and timestamps.

# File 'lib/twitter_ebooks/nlp.rb', line 63

def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
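The regex splits on whitespace, and the zero-width lookaround branches additionally peel punctuation off a word when that punctuation is followed by whitespace — so "world!" mid-sentence becomes "world" and "!", while tokens like ":)" survive intact. A self-contained illustration with the constant inlined:

```ruby
PUNCTUATION = ".?!,"

# Split on whitespace; zero-width matches separate a word from
# trailing punctuation when the punctuation precedes whitespace.
def tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end

tokenize("Hello, world! :)") # => ["Hello", ",", "world", "!", ":)"]
```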
.unmatched_enclosers?(text) ⇒ Boolean
Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the markov generator; we can just tell it to retry.

# File 'lib/twitter_ebooks/nlp.rb', line 133

def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
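For each encloser pair, a running counter goes up at openers and down at closers; a dip below zero (close before open) or a nonzero final count (open never closed) flags the text. A self-contained sketch, substituting plain whitespace splitting for the module's tokenize:

```ruby
# Simplified stand-in: whitespace tokenization instead of the module's
# tokenize. An opener must touch a non-space character on its right,
# a closer on its left, so bare operators like " ( " don't count.
def unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')
    opened = 0
    text.split(/\s+/).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)
      return true if opened < 0 # a closer with no matching opener
    end
    return true if opened != 0 # an opener that never closed
  end
  false
end

unmatched_enclosers?('a (b c) d') # => false
unmatched_enclosers?('a (b c d')  # => true
```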