Module: Ebooks::NLP
- Defined in:
- lib/twitter_ebooks/nlp.rb
Constant Summary
- PUNCTUATION = ".?!,"
  We deliberately limit our punctuation handling to stuff we can do consistently. It’ll just be a part of another token if we don’t split it out, and that’s fine.
Class Method Summary
- .adjectives ⇒ Object
- .gingerice ⇒ Object
  Gingerice text correction service.
- .htmlentities ⇒ Object
  For decoding html entities.
- .keywords(sentences) ⇒ Object
- .normalize(text) ⇒ Object
  We don’t really want to deal with all this weird unicode punctuation.
- .nouns ⇒ Object
- .punctuation?(token) ⇒ Boolean
- .reconstruct(tokens) ⇒ Object
  Takes a list of tokens and builds a nice-looking sentence.
- .sentences(text) ⇒ Object
  Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.
- .space_between?(token1, token2) ⇒ Boolean
  Determine if we need to insert a space between two tokens.
- .stem(word) ⇒ Object
- .stopword?(token) ⇒ Boolean
- .stopwords ⇒ Object
  Lazy-load NLP libraries and resources. Some of this stuff is pretty heavy and we don’t necessarily need to be using it all of the time.
- .subseq?(a1, a2) ⇒ Boolean
  Determine if a2 is a subsequence of a1.
- .tagger ⇒ Object
  POS tagger.
- .tokenize(sentence) ⇒ Object
  Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with things like emoticons and timestamps.
- .unmatched_enclosers?(text) ⇒ Boolean
  Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the markov generator; we can just tell it to retry.
Class Method Details
.adjectives ⇒ Object
# File 'lib/twitter_ebooks/nlp.rb', line 23

def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end
.gingerice ⇒ Object
Gingerice text correction service.

# File 'lib/twitter_ebooks/nlp.rb', line 34

def self.gingerice
  require 'gingerice'
  Gingerice::Parser.new # No caching for this one
end
.htmlentities ⇒ Object
For decoding html entities.

# File 'lib/twitter_ebooks/nlp.rb', line 40

def self.htmlentities
  require 'htmlentities'
  @htmlentities ||= HTMLEntities.new
end
.keywords(sentences) ⇒ Object
# File 'lib/twitter_ebooks/nlp.rb', line 72

def self.keywords(sentences)
  # Preprocess to remove stopwords (highscore's blacklist is v. slow)
  text = sentences.flatten.reject { |t| stopword?(t) }.join(' ')

  text = Highscore::Content.new(text)

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1                   # => default: 0 = not considered
    #set :consonants, 5               # => default: 0 = not considered
    #set :ignore_case, true           # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\w']+/ # => default: /\w+/
    #set :stemming, true              # => default: false
  end

  text.keywords
end
.normalize(text) ⇒ Object
We don’t really want to deal with all this weird unicode punctuation.

# File 'lib/twitter_ebooks/nlp.rb', line 48

def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
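A standalone sketch of the same normalization. `CGI.unescapeHTML` from the standard library stands in for the htmlentities gem here (an assumption made for self-containment; the gem handles a much wider range of entities):

```ruby
require 'cgi'

# Map curly quotes and ellipses to ASCII equivalents, then decode
# HTML entities. CGI.unescapeHTML is a stdlib stand-in for
# HTMLEntities#decode.
def normalize(text)
  CGI.unescapeHTML(text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...'))
end

normalize('“Hello” &amp; goodbye…') # => "\"Hello\" & goodbye..."
```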
.nouns ⇒ Object
# File 'lib/twitter_ebooks/nlp.rb', line 19

def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end
.punctuation?(token) ⇒ Boolean
# File 'lib/twitter_ebooks/nlp.rb', line 121

def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
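The set difference means a token counts as punctuation only when every one of its characters appears in PUNCTUATION. A self-contained sketch with the constant inlined:

```ruby
require 'set'

PUNCTUATION = ".?!,"

# True only if the token is built entirely from PUNCTUATION characters.
def punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

punctuation?("?!")   # => true
punctuation?("foo.") # => false
```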
.reconstruct(tokens) ⇒ Object
Takes a list of tokens and builds a nice-looking sentence.

# File 'lib/twitter_ebooks/nlp.rb', line 94

def self.reconstruct(tokens)
  text = ""
  last_token = nil
  tokens.each do |token|
    next if token == INTERIM
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
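A self-contained sketch of the reconstruction logic. The helpers are inlined in reduced form (space_between? collapses to "no space before punctuation", which matches its truth table), and INTERIM is stubbed as a symbol — in the gem it is a sentinel defined elsewhere:

```ruby
require 'set'

PUNCTUATION = ".?!,"
INTERIM = :interim # stand-in for the gem's sentinel token

def punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

# Reduced form of space_between?: insert a space except before punctuation.
def space_between?(token1, token2)
  !punctuation?(token2)
end

# Join tokens, skipping sentinels and spacing via space_between?.
def reconstruct(tokens)
  text = ""
  last_token = nil
  tokens.each do |token|
    next if token == INTERIM
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end

reconstruct(%w[Hello , world !]) # => "Hello, world!"
```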
.sentences(text) ⇒ Object
Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.

# File 'lib/twitter_ebooks/nlp.rb', line 56

def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
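The split regex fires on newlines, or on whitespace preceded by a sentence terminator; the lookbehind keeps the terminator attached to its sentence. A quick illustration:

```ruby
# Split on runs of newlines, or on whitespace that follows ., ?, or !.
def sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end

sentences("One. Two!\nThree") # => ["One.", "Two!", "Three"]
```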
.space_between?(token1, token2) ⇒ Boolean
Determine if we need to insert a space between two tokens.

# File 'lib/twitter_ebooks/nlp.rb', line 107

def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)

  if p1 && p2 # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else # "foo rah"
    true
  end
end
.stem(word) ⇒ Object
# File 'lib/twitter_ebooks/nlp.rb', line 68

def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
.stopword?(token) ⇒ Boolean
# File 'lib/twitter_ebooks/nlp.rb', line 125

def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
.stopwords ⇒ Object
Lazy-load NLP libraries and resources. Some of this stuff is pretty heavy and we don’t necessarily need to be using it all of the time.

# File 'lib/twitter_ebooks/nlp.rb', line 15

def self.stopwords
  @stopwords ||= File.read(File.join(DATA_PATH, 'stopwords.txt')).split
end
.subseq?(a1, a2) ⇒ Boolean
Determine if a2 is a subsequence of a1.

# File 'lib/twitter_ebooks/nlp.rb', line 155

def self.subseq?(a1, a2)
  a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end
end
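Note that despite the `?` name, this returns the starting index of the match (truthy) or nil, and "subsequence" here means a contiguous run of a1, not a general subsequence. A standalone version:

```ruby
# Find whether a2 occurs as a contiguous run inside a1.
# Returns the starting index (truthy) or nil (falsy), so it still
# works in boolean context.
def subseq?(a1, a2)
  a1.each_index.find do |i|
    a1[i...i + a2.length] == a2
  end
end

subseq?([1, 2, 3, 4], [2, 3]) # => 1
subseq?([1, 2, 3, 4], [3, 2]) # => nil
```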
.tagger ⇒ Object
POS tagger.

# File 'lib/twitter_ebooks/nlp.rb', line 28

def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
.tokenize(sentence) ⇒ Object
Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with things like emoticons and timestamps.

# File 'lib/twitter_ebooks/nlp.rb', line 63

def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
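The regex splits on whitespace, and the zero-width lookaround branches additionally peel punctuation off a word when that punctuation is followed by whitespace — so "world!" mid-sentence becomes "world" and "!", while tokens like ":)" survive intact. A self-contained illustration with the constant inlined:

```ruby
PUNCTUATION = ".?!,"

# Split on whitespace; zero-width matches separate a word from
# trailing punctuation when the punctuation precedes whitespace.
def tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end

tokenize("Hello, world! :)") # => ["Hello", ",", "world", "!", ":)"]
```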
.unmatched_enclosers?(text) ⇒ Boolean
Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the markov generator; we can just tell it to retry.

# File 'lib/twitter_ebooks/nlp.rb', line 133

def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
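For each encloser pair, a running counter goes up at openers and down at closers; a dip below zero (close before open) or a nonzero final count (open never closed) flags the text. A self-contained sketch, substituting plain whitespace splitting for the module's tokenize:

```ruby
# Simplified stand-in: whitespace tokenization instead of the module's
# tokenize. An opener must touch a non-space character on its right,
# a closer on its left, so bare operators like " ( " don't count.
def unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')
    opened = 0
    text.split(/\s+/).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)
      return true if opened < 0 # a closer with no matching opener
    end
    return true if opened != 0 # an opener that never closed
  end
  false
end

unmatched_enclosers?('a (b c) d') # => false
unmatched_enclosers?('a (b c d')  # => true
```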