What?
Given a set of strings from different languages, build a detector for the majority language (often, but not necessarily, English). More information on the algorithm here.
Example
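# Train a detector on unlabeled sentences, using n-grams of size 3.
training_sentences = File.readlines("datasets/gutenberg-training.txt")
detector = LanguageDetector.new(:ngram_size => 3)
detector.train(30, training_sentences)

puts "Testing on English sentences..."
true_english = 0   # English lines classified as the majority language
false_spanish = 0  # English lines classified as something else

IO.foreach("datasets/gutenberg-test-en.txt") do |line|
  next if line.strip.empty?
  if detector.classify(line) == "majority"
    true_english += 1
  else
    # Print each misclassified line so it can be inspected.
    puts line
    false_spanish += 1
  end
end

puts "Misclassified: #{false_spanish}"
puts "Correct: #{true_english}"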
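The example above passes `:ngram_size => 3` to the detector. As a rough illustration of what size-3 n-grams look like, here is a minimal sketch that assumes the features are overlapping character n-grams. That is an assumption: the detector may tokenize differently, and `char_ngrams` and `ngram_counts` are hypothetical helpers, not part of the gem.

```ruby
# Hypothetical helper, not part of the gem: split a string into
# overlapping character n-grams of the given size (assumed feature type).
def char_ngrams(text, size = 3)
  text.downcase.chars.each_cons(size).map(&:join)
end

# Hypothetical helper: count how often each n-gram occurs across sentences.
def ngram_counts(sentences, size = 3)
  counts = Hash.new(0)
  sentences.each do |sentence|
    char_ngrams(sentence.strip, size).each { |gram| counts[gram] += 1 }
  end
  counts
end

p char_ngrams("hello")  # => ["hel", "ell", "llo"]
```

Strings shorter than the n-gram size produce no n-grams at all, so very short inputs carry little or no signal under this scheme.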
Using the Gem
gem install unsupervised-language-detection
require 'rubygems'
require 'unsupervised-language-detection'
UnsupervisedLanguageDetection.is_english_tweet?("I am an English sentence.") # => true
UnsupervisedLanguageDetection.is_english_tweet?("Hola, me llamo Edwin.") # => false
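A common use of a true/false check like this is filtering a batch of strings. The sketch below shows one way to do that. `select_english` is a hypothetical helper, and the classifier is passed in as a callable so that the snippet runs without the gem installed; the stand-in lambda is a toy rule and not the gem's logic.

```ruby
# Hypothetical helper: keep only the non-blank strings the classifier accepts.
# `classifier` is any object that responds to #call and returns true or false.
def select_english(texts, classifier)
  texts.reject { |text| text.strip.empty? }
       .select { |text| classifier.call(text) }
end

# With the gem installed, the classifier could plausibly be:
#   classifier = UnsupervisedLanguageDetection.method(:is_english_tweet?)
# A toy stand-in is used here so the sketch runs on its own.
stand_in = lambda { |text| text.downcase.include?(" the ") }

tweets = ["I saw the movie.", "Hola, me llamo Edwin.", "   "]
p select_english(tweets, stand_in)  # => ["I saw the movie."]
```

Passing the classifier as an argument keeps the filtering logic independent of any one detector, which also makes it easy to test.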
Demo
See a demo here.