Class: String

Inherits:
Object
Defined in:
lib/unsupervised-language-detection/language-detector.rb

Instance Method Details

#normalize_tweet ⇒ Object

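Normalizes a tweet for language detection: removes mentions, links, and hashtags, downcases the text, converts every whitespace character to a plain space, strips every character other than a-z, 0-9, and whitespace, and trims leading and trailing whitespace.

TODO: Try keeping non-ASCII characters instead of stripping them all out! This should significantly reduce the false positive rate.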



# File 'lib/unsupervised-language-detection/language-detector.rb', line 11

def normalize_tweet
  self.remove_tweeters.remove_links.remove_hashtags.downcase.gsub(/\s/, " ").gsub(/[^a-z0-9\s]/, "").strip
end
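As an illustration of the normalization steps and of the TODO above, here is a minimal sketch. The method bodies are copied from the listings on this page so that the snippet runs on its own. The method normalize_tweet_keeping_unicode is hypothetical (it is not part of the library) and shows only one possible way to keep non-ASCII letters and digits.

```ruby
# Method bodies copied from this page so the snippet is self-contained.
class String
  def remove_tweeters
    gsub(/@\w+/, "")
  end

  def remove_hashtags
    gsub(/#\w+/, "")
  end

  def remove_links
    gsub(/http\S+/, "").gsub(/www\S+/, "").gsub(/\S+\.com/, "")
  end

  def normalize_tweet
    remove_tweeters.remove_links.remove_hashtags.downcase.gsub(/\s/, " ").gsub(/[^a-z0-9\s]/, "").strip
  end

  # Hypothetical variant, NOT in the library: keeps any Unicode letter (\p{L})
  # or number (\p{N}) instead of only a-z and 0-9.
  def normalize_tweet_keeping_unicode
    remove_tweeters.remove_links.remove_hashtags.downcase.gsub(/\s/, " ").gsub(/[^\p{L}\p{N}\s]/, "").strip
  end
end

normalized = "@bob Check THIS out: http://example.org/x #wow".normalize_tweet
stripped   = "¿Qué pasa?".normalize_tweet
kept       = "¿Qué pasa?".normalize_tweet_keeping_unicode

puts normalized  # check this out
puts stripped    # qu pasa
puts kept        # qué pasa
```

With the current implementation the accented letter is lost ("qu pasa"), which is the behavior the TODO proposes to revisit.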

#remove_hashtags ⇒ Object

Remove any words beginning with ‘#’.



# File 'lib/unsupervised-language-detection/language-detector.rb', line 21

def remove_hashtags
  self.gsub(/#\w+/, "")
end
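A short usage sketch, with the method body copied from the listing above. It also shows two edge cases that follow from the pattern rather than from anything the author states: a ‘#’ in the middle of a word is matched too, and because Ruby's \w matches only ASCII word characters, a hashtag containing non-ASCII letters is removed only up to the first such letter.

```ruby
# Method body copied from this page so the snippet is self-contained.
class String
  def remove_hashtags
    gsub(/#\w+/, "")
  end
end

plain    = "learning #ruby today".remove_hashtags
mid_word = "issue#42 fixed".remove_hashtags
accented = "#café".remove_hashtags

puts plain.inspect     # "learning  today" (the surrounding spaces remain)
puts mid_word.inspect  # "issue fixed"
puts accented.inspect  # "é" (\w stops before the non-ASCII letter)
```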

#remove_links ⇒ Object

Remove any run of non-whitespace characters that starts at ‘http’ or ‘www’, or that ends with ‘.com’. (Not the most sophisticated link remover, I know.)



# File 'lib/unsupervised-language-detection/language-detector.rb', line 27

def remove_links
  ret = self.gsub(/http\S+/, "")
  ret = ret.gsub(/www\S+/, "")
  ret = ret.gsub(/\S+\.com/, "")
  ret
end
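The following sketch exercises the three patterns on a few sample strings (the inputs are made up for illustration; the method body is equivalent to the listing above). It shows the limits the author alludes to: a path after ‘.com’ is left behind, other top-level domains are untouched, and any token containing ‘http’ is removed even when it is not a link.

```ruby
# Method body equivalent to the listing on this page, written as one chain.
class String
  def remove_links
    gsub(/http\S+/, "").gsub(/www\S+/, "").gsub(/\S+\.com/, "")
  end
end

full_url  = "see http://t.co/abc now".remove_links
with_path = "visit example.com/page".remove_links
other_tld = "visit example.org".remove_links
not_a_url = "the httpd daemon".remove_links

puts full_url.inspect   # "see  now"
puts with_path.inspect  # "visit /page" (only up to ".com" is removed)
puts other_tld.inspect  # "visit example.org" (no pattern matches)
puts not_a_url.inspect  # "the  daemon" ("httpd" starts with "http")
```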

#remove_tweeters ⇒ Object

Remove mentions of other Twitter users (an ‘@’ followed by one or more word characters).



# File 'lib/unsupervised-language-detection/language-detector.rb', line 16

def remove_tweeters
  self.gsub(/@\w+/, "")
end
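A brief usage sketch, with the method body copied from the listing above and sample inputs invented for illustration. The second case shows a side effect of the pattern: the part of an e-mail address after the ‘@’ is treated as a mention.

```ruby
# Method body copied from this page so the snippet is self-contained.
class String
  def remove_tweeters
    gsub(/@\w+/, "")
  end
end

mentions = "@alice hi @bob_42!".remove_tweeters
email    = "mail bob@example.org".remove_tweeters

puts mentions.inspect  # " hi !"
puts email.inspect     # "mail bob.org" ("@example" is removed)
```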

#to_ngrams(n) ⇒ Object

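Returns an array of character ‘n’-grams computed from the normalized form of this string (see #normalize_tweet). The normalized string is split into consecutive, non-overlapping chunks of ‘n’ characters; a trailing remainder shorter than ‘n’ is dropped.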



# File 'lib/unsupervised-language-detection/language-detector.rb', line 6

def to_ngrams(n)
  self.normalize_tweet.scan(/.{#{n}}/)
end
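To make the chunking behavior concrete, here is a small sketch with the method bodies copied from this page. For comparison it also defines to_overlapping_ngrams, a hypothetical helper that is not part of the library and that produces sliding-window n-grams with Enumerable#each_cons. Whether the detector would benefit from overlapping n-grams is an open question, not something the source states.

```ruby
# Method bodies copied from this page so the snippet is self-contained.
class String
  def remove_tweeters
    gsub(/@\w+/, "")
  end

  def remove_hashtags
    gsub(/#\w+/, "")
  end

  def remove_links
    gsub(/http\S+/, "").gsub(/www\S+/, "").gsub(/\S+\.com/, "")
  end

  def normalize_tweet
    remove_tweeters.remove_links.remove_hashtags.downcase.gsub(/\s/, " ").gsub(/[^a-z0-9\s]/, "").strip
  end

  def to_ngrams(n)
    normalize_tweet.scan(/.{#{n}}/)
  end

  # Hypothetical alternative, NOT in the library: sliding-window n-grams.
  def to_overlapping_ngrams(n)
    normalize_tweet.chars.each_cons(n).map(&:join)
  end
end

chunks      = "Hello world".to_ngrams(3)
overlapping = "Hello world".to_overlapping_ngrams(3)

p chunks       # ["hel", "lo ", "wor"] (the trailing "ld" is dropped)
p overlapping  # ["hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
```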