Class: String

Inherits:
Object
  • Object
show all
Defined in:
lib/ngrams_search.rb

Overview

This is an extension of Ruby’s core String class. It add methods to extract a set of n-grams from a string. Typically, the most used set of n-grams are unigrams, bigrams, and trigrams; sets of n-grams of length 1, 2, and 3, respectively.

Instance Method Summary collapse

Instance Method Details

#bigrams(regex = //) ⇒ Object

This function splits the string into bigrams tokenizes into chars by default



37
# File 'lib/ngrams_search.rb', line 37

def bigrams(regex = //) ngrams({:regex => regex, :n => 2}); end

#ngrams(options = {:regex=>//, :n=>2}) ⇒ Object

An n-gram is a sequence of units of text of length n, where those units are typically single characters or words delimited by space characters. However, a token could also be a fixed length character sequence, strings with embedded spaces, etc. depending on the intended application. Typically, n-grams are formed of contiguous tokens.

This function splits the string into a set of n-grams. The default regex used tokenizes the string into characters.

Regex Examples: // => splits into characters /s+/ => splits into words delimited by one or more space characters /n+/ => splits into lines delimted by one or more newline characters



21
22
23
24
25
26
27
28
29
# File 'lib/ngrams_search.rb', line 21

def ngrams(options = {:regex=>//, :n=>2})
	ngrams = []
	tokens = self.split(options[:regex])
	max_pos = tokens.length - options[:n]
	for i in 0..max_pos
		ngrams.push(tokens[i..i+(options[:n]-1)])
	end
	ngrams
end

#trigrams(regex = //) ⇒ Object

This function splits the string into trigrams tokenizes into chars by default



41
# File 'lib/ngrams_search.rb', line 41

def trigrams(regex = //) ngrams({:regex => regex, :n => 3}); end

#unigrams(regex = //) ⇒ Object

This function splits the string into unigrams, tokenizes into chars by default



33
# File 'lib/ngrams_search.rb', line 33

def unigrams(regex = //) ngrams({:regex => regex, :n => 1}); end