Class: String
- Inherits: Object
  - Object
    - String
- Defined in: lib/ngrams_search.rb
Overview
This is an extension of Ruby’s core String class. It adds methods to extract a set of n-grams from a string. The most commonly used n-grams are unigrams, bigrams, and trigrams: n-grams of length 1, 2, and 3, respectively.
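To make the three terms concrete, here is a minimal sketch that uses only Ruby core methods. It is not this library's code: `char_ngrams` is a hypothetical helper introduced purely for illustration, and it assumes character-level tokens.

```ruby
# Hypothetical helper for illustration only; it is not part of lib/ngrams_search.rb.
# Returns every run of n consecutive characters in str, in order.
def char_ngrams(str, n)
  str.chars.each_cons(n).to_a
end

word = "ruby"
unigrams = char_ngrams(word, 1) # [["r"], ["u"], ["b"], ["y"]]
bigrams  = char_ngrams(word, 2) # [["r", "u"], ["u", "b"], ["b", "y"]]
trigrams = char_ngrams(word, 3) # [["r", "u", "b"], ["u", "b", "y"]]
```

A string of k characters yields k - n + 1 n-grams, so the lists get shorter as n grows.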
Instance Method Summary
- #bigrams(regex = //) ⇒ Object
  Splits the string into bigrams; tokenizes into characters by default.
- #ngrams(options = {:regex=>//, :n=>2}) ⇒ Object
  An n-gram is a sequence of n units of text, where those units are typically single characters or words delimited by space characters.
- #trigrams(regex = //) ⇒ Object
  Splits the string into trigrams; tokenizes into characters by default.
- #unigrams(regex = //) ⇒ Object
  Splits the string into unigrams; tokenizes into characters by default.
Instance Method Details
#bigrams(regex = //) ⇒ Object
Splits the string into bigrams; tokenizes into characters by default.
# File 'lib/ngrams_search.rb', line 37
def bigrams(regex = //) ngrams({:regex => regex, :n => 2}); end
#ngrams(options = {:regex=>//, :n=>2}) ⇒ Object
An n-gram is a sequence of n units of text (tokens), where those units are typically single characters or words delimited by space characters. However, a token could also be a fixed-length character sequence, a string with embedded spaces, and so on, depending on the intended application. Typically, n-grams are formed of contiguous tokens.
This function splits the string into tokens and returns an array of n-grams, each of which is itself an array of n contiguous tokens. The default regex tokenizes the string into characters.
Regex Examples:
// => splits into characters
/\s+/ => splits into words delimited by one or more whitespace characters
/\n+/ => splits into lines delimited by one or more newline characters
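The tokenizing step relies on Ruby's core `String#split`. The sketch below shows what each example pattern produces, assuming the word and line patterns are the escaped forms `/\s+/` and `/\n+/`. The sample strings are made up for illustration.

```ruby
# Character tokens: an empty regex splits between every character.
chars = "abc".split(//)               # ["a", "b", "c"]

# Word tokens: runs of whitespace collapse into a single delimiter.
words = "to be  or not".split(/\s+/)  # ["to", "be", "or", "not"]

# Line tokens: consecutive newlines collapse, so blank lines are skipped.
lines = "one\n\ntwo".split(/\n+/)     # ["one", "two"]
```

Note that a string with leading whitespace gives an empty first token under `/\s+/`, so it may be worth stripping the input first.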
# File 'lib/ngrams_search.rb', line 21
def ngrams(options = {:regex=>//, :n=>2})
  ngrams = []
  tokens = self.split(options[:regex])
  max_pos = tokens.length - options[:n]
  for i in 0..max_pos
    ngrams.push(tokens[i..i+(options[:n]-1)])
  end
  ngrams
end
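A short usage sketch may help show the shape of the return value. To keep it self-contained, the block redefines `ngrams` locally, assuming the method body matches the listing for line 21 with the parameter named `options` as in the documented signature. In a real project you would load the library file instead.

```ruby
# Local copy of the documented method so this example runs on its own.
# Assumption: the body matches lib/ngrams_search.rb as listed on this page.
class String
  def ngrams(options = {:regex=>//, :n=>2})
    ngrams = []
    tokens = self.split(options[:regex])
    max_pos = tokens.length - options[:n]
    for i in 0..max_pos
      ngrams.push(tokens[i..i+(options[:n]-1)])
    end
    ngrams
  end
end

# Defaults: character tokens, n = 2.
char_bigrams = "abcd".ngrams
# [["a", "b"], ["b", "c"], ["c", "d"]]

# Word trigrams: pass both keys explicitly.
word_trigrams = "the quick brown fox".ngrams({:regex => /\s+/, :n => 3})
# [["the", "quick", "brown"], ["quick", "brown", "fox"]]

# Fewer tokens than n: the loop range is empty, so the result is [].
too_short = "a".ngrams({:regex => //, :n => 2})
```

Each n-gram is an array of tokens rather than a joined string. Because the defaults live in a single hash argument, a call such as `"abc".ngrams({:n => 3})` replaces the whole default hash, so `options[:regex]` is nil and `String#split` falls back to splitting on whitespace (assuming `$;` is unset). Passing both `:regex` and `:n` avoids that surprise.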
#trigrams(regex = //) ⇒ Object
Splits the string into trigrams; tokenizes into characters by default.
# File 'lib/ngrams_search.rb', line 41
def trigrams(regex = //) ngrams({:regex => regex, :n => 3}); end
#unigrams(regex = //) ⇒ Object
Splits the string into unigrams; tokenizes into characters by default.
# File 'lib/ngrams_search.rb', line 33
def unigrams(regex = //) ngrams({:regex => regex, :n => 1}); end