Class: String
- Inherits: Object
  - Object
  - String
- Defined in: lib/unsupervised-language-detection/language-detector.rb
Instance Method Summary
- #normalize_tweet ⇒ Object
  Normalizes a tweet: removes mentions, links, and hashtags, downcases, and strips every character other than a-z, 0-9, and whitespace. TODO: try keeping non-ASCII characters instead of stripping them; this should significantly reduce the false positive rate.
- #remove_hashtags ⇒ Object
  Remove any words beginning with ‘#’.
- #remove_links ⇒ Object
  Remove anything beginning with ‘http’, ‘www’, or ending with ‘.com’.
- #remove_tweeters ⇒ Object
  Remove mentions of other Twitter users.
- #to_ngrams(n) ⇒ Object
  Returns an array of character n-grams computed from the normalized form of this string.
Instance Method Details
#normalize_tweet ⇒ Object
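Normalizes a tweet: removes mentions, links, and hashtags, downcases, and strips every character other than a-z, 0-9, and whitespace.
TODO: try keeping non-ASCII characters instead of stripping them; this should significantly reduce the false positive rate.
# File 'lib/unsupervised-language-detection/language-detector.rb', line 11
def normalize_tweet
  self.remove_tweeters.remove_links.remove_hashtags.downcase.gsub(/\s/, " ").gsub(/[^a-z0-9\s]/, "").strip
end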
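To make the normalization steps concrete, here is a minimal sketch that reproduces the String extensions from this page and runs them on a sample tweet. It assumes the chain in normalize_tweet includes remove_hashtags; the sample tweet is made up for illustration.

```ruby
# Sketch only: the String extensions documented on this page, reproduced
# so the example runs on its own. The presence of remove_hashtags in the
# normalize_tweet chain is an assumption.
class String
  def remove_tweeters
    gsub(/@\w+/, "")
  end

  def remove_links
    gsub(/http\S+/, "").gsub(/www\S+/, "").gsub(/\S+\.com/, "")
  end

  def remove_hashtags
    gsub(/#\w+/, "")
  end

  def normalize_tweet
    remove_tweeters.remove_links.remove_hashtags
      .downcase.gsub(/\s/, " ").gsub(/[^a-z0-9\s]/, "").strip
  end
end

# Hypothetical sample tweet.
tweet = "@bob Check THIS out: http://t.co/abc #wow café!"
normalized = tweet.normalize_tweet
puts normalized.inspect  # => "check this out   caf"
```

Two things stand out in the result: the runs of whitespace left behind by the removed tokens are not collapsed, and the non-ASCII "é" is dropped, which is the behavior the TODO above proposes to revisit.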
#remove_hashtags ⇒ Object
Remove any words beginning with ‘#’.
# File 'lib/unsupervised-language-detection/language-detector.rb', line 21
def remove_hashtags
  self.gsub(/#\w+/, "")
end
#remove_links ⇒ Object
Remove anything beginning with ‘http’, ‘www’, or ending with ‘.com’. (Not the most sophisticated link remover, I know.)
# File 'lib/unsupervised-language-detection/language-detector.rb', line 27
def remove_links
  ret = self.gsub(/http\S+/, "")
  ret = ret.gsub(/www\S+/, "")
  ret = ret.gsub(/\S+\.com/, "")
  ret
end
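The three substitutions can be exercised in isolation. The sketch below reproduces remove_links as listed above and applies it to a made-up input; it is meant only to illustrate what the patterns match, including one limitation of the ‘.com’ rule.

```ruby
# Sketch only: remove_links reproduced from the listing above so the
# example is self-contained.
class String
  def remove_links
    ret = gsub(/http\S+/, "")
    ret = ret.gsub(/www\S+/, "")
    ret.gsub(/\S+\.com/, "")
  end
end

# Hypothetical input: one "www" link and one bare ".com" link with a path.
cleaned = "see www.example.org and example.com/page now".remove_links
puts cleaned.inspect  # => "see  and /page now"
```

The ‘www’ link is removed whole, but the last pattern stops at ‘.com’, so the trailing path "/page" survives. In normalize_tweet the later punctuation stripping would remove the slash, leaving "page" as a stray token.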
#remove_tweeters ⇒ Object
Remove mentions of other Twitter users.
# File 'lib/unsupervised-language-detection/language-detector.rb', line 16
def remove_tweeters
  self.gsub(/@\w+/, "")
end
#to_ngrams(n) ⇒ Object
Returns an array of character n-grams computed from the normalized form of this string.
# File 'lib/unsupervised-language-detection/language-detector.rb', line 6
def to_ngrams(n)
  self.normalize_tweet.scan(/.{#{n}}/)
end
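Because String#scan returns consecutive non-overlapping matches, the pattern used in to_ngrams appears to split the normalized string into back-to-back chunks of n characters and to drop any shorter remainder. The sketch below shows that behavior on an already-normalized string and contrasts it with a sliding-window variant. Both helper names (chunked_ngrams, sliding_ngrams) are hypothetical and not part of the library.

```ruby
# Hypothetical helper: same scan pattern as to_ngrams, applied to a string
# that is assumed to be normalized already.
def chunked_ngrams(text, n)
  text.scan(/.{#{n}}/)
end

# Hypothetical alternative: every overlapping window of n characters.
def sliding_ngrams(text, n)
  text.chars.each_cons(n).map(&:join)
end

p chunked_ngrams("hello world", 3)  # => ["hel", "lo ", "wor"]
p sliding_ngrams("hello", 3)        # => ["hel", "ell", "llo"]
```

The chunked form yields fewer, position-dependent n-grams (the trailing "ld" is lost), while the sliding form yields one n-gram per starting position. Which one suits language detection better is not stated on this page.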