Class: String

Inherits: Object

Defined in: lib/ngrams_search.rb

Overview

This is an extension of Ruby’s core String class. It adds methods to extract n-grams from a string. The most commonly used n-grams are unigrams, bigrams, and trigrams: n-grams of length 1, 2, and 3, respectively.

Instance Method Summary

#bigrams(regex = //) ⇒ Object

This function splits the string into bigrams. It tokenizes into characters by default.
#ngrams(options = {:regex=>//, :n=>2}) ⇒ Object

An n-gram is a sequence of units of text of length n, where those units are typically single characters or words delimited by space characters.
#trigrams(regex = //) ⇒ Object

This function splits the string into trigrams. It tokenizes into characters by default.
#unigrams(regex = //) ⇒ Object

This function splits the string into unigrams. It tokenizes into characters by default.

Instance Method Details

#bigrams(regex = //) ⇒ `Object`

This function splits the string into bigrams. It tokenizes into characters by default.

# File 'lib/ngrams_search.rb', line 37

def bigrams(regex = //) ngrams({:regex => regex, :n => 2}); end

#ngrams(options = {:regex=>//, :n=>2}) ⇒ `Object`

An n-gram is a sequence of units of text of length n, where those units are typically single characters or words delimited by space characters. However, a token could also be a fixed length character sequence, strings with embedded spaces, etc. depending on the intended application. Typically, n-grams are formed of contiguous tokens.

This function splits the string into n-grams and returns them as an array, one array of n tokens per n-gram. The default regex tokenizes the string into characters.

Regex Examples:

// => splits into characters
/\s+/ => splits into words delimited by one or more whitespace characters
/\n+/ => splits into lines delimited by one or more newline characters
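The effect of each pattern can be checked with String#split on its own, since that is the call ngrams makes on the receiver. The sketch below assumes the word and line patterns are meant to be /\s+/ and /\n+/. The sample text is made up for illustration.

```ruby
# Tokenizing with String#split, the core method that ngrams calls internally.
# Assumption: the word and line patterns are /\s+/ and /\n+/.

text = "to be\nor  not"

chars = text.split(//)     # one token per character, whitespace included
words = text.split(/\s+/)  # runs of whitespace (spaces, tabs, newlines) separate tokens
lines = text.split(/\n+/)  # runs of newlines separate tokens

p chars.length  # 13
p words         # ["to", "be", "or", "not"]
p lines         # ["to be", "or  not"]
```

Note that \s matches any whitespace, so the word pattern also splits on newlines and tabs, not only on spaces.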

# File 'lib/ngrams_search.rb', line 21

def ngrams(options = {:regex=>//, :n=>2})
	ngrams = []
	tokens = self.split(options[:regex])
	max_pos = tokens.length - options[:n]
	for i in 0..max_pos
		ngrams.push(tokens[i..i+(options[:n]-1)])
	end
	ngrams
end
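A short usage sketch may help show what the method returns. It repeats the listing above so that it runs on its own. The sample strings are made up, and the last call illustrates a likely pitfall rather than documented behavior: when only :n is passed, options[:regex] is nil, and String#split(nil) falls back to splitting on whitespace.

```ruby
# Copy of the ngrams listing above, repeated so this sketch is self-contained.
class String
  def ngrams(options = {:regex=>//, :n=>2})
    ngrams = []
    tokens = self.split(options[:regex])
    max_pos = tokens.length - options[:n]
    for i in 0..max_pos
      ngrams.push(tokens[i..i+(options[:n]-1)])
    end
    ngrams
  end
end

# Default options: character bigrams.
char_bigrams = "abc".ngrams
p char_bigrams  # [["a", "b"], ["b", "c"]]

# Word bigrams: pass both keys, because the default hash is replaced, not merged.
word_bigrams = "to be or not".ngrams(:regex => /\s+/, :n => 2)
p word_bigrams  # [["to", "be"], ["be", "or"], ["or", "not"]]

# Fewer tokens than n: the range 0..-1 is empty, so the result is [].
too_short = "a".ngrams
p too_short     # []

# Only :n given: :regex is nil, so split(nil) tokenizes on whitespace.
partial = "to be or".ngrams(:n => 2)
p partial       # [["to", "be"], ["be", "or"]]
```

Callers who want character n-grams of a length other than 2 should therefore pass :regex => // explicitly along with :n.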

#trigrams(regex = //) ⇒ `Object`

This function splits the string into trigrams. It tokenizes into characters by default.

# File 'lib/ngrams_search.rb', line 41

def trigrams(regex = //) ngrams({:regex => regex, :n => 3}); end

#unigrams(regex = //) ⇒ `Object`

This function splits the string into unigrams. It tokenizes into characters by default.

# File 'lib/ngrams_search.rb', line 33

def unigrams(regex = //) ngrams({:regex => regex, :n => 1}); end
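The three convenience methods differ only in the value of :n they forward to ngrams. The sketch below shows them side by side. To keep it short, ngrams is replaced by a hypothetical stand-in built on Enumerable#each_cons, which is not the library's implementation but yields the same sliding windows for n of 1 or more. The wrappers are copied from the listings above, and the sample strings are made up.

```ruby
class String
  # Hypothetical stand-in for ngrams (NOT the library's code):
  # each_cons(n) yields every run of n consecutive tokens.
  def ngrams(options = {:regex => //, :n => 2})
    split(options[:regex]).each_cons(options[:n]).to_a
  end

  # Wrappers as documented above: each fixes :n and forwards the regex.
  def unigrams(regex = //) ngrams({:regex => regex, :n => 1}); end
  def bigrams(regex = //)  ngrams({:regex => regex, :n => 2}); end
  def trigrams(regex = //) ngrams({:regex => regex, :n => 3}); end
end

word = "ruby"
uni = word.unigrams
bi  = word.bigrams
tri = word.trigrams

p uni  # [["r"], ["u"], ["b"], ["y"]]
p bi   # [["r", "u"], ["u", "b"], ["b", "y"]]
p tri  # [["r", "u", "b"], ["u", "b", "y"]]

# Passing a regex switches the token unit, here to whitespace-delimited words.
word_tri = "a rose is a rose".trigrams(/\s+/)
p word_tri  # [["a", "rose", "is"], ["rose", "is", "a"], ["is", "a", "rose"]]
```

A string of k tokens produces k - n + 1 n-grams, which is why "ruby" gives four unigrams, three bigrams, and two trigrams.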
