WordStats

WordStats provides methods for counting character and word frequencies in text.

Installation

Add this line to your application's Gemfile:

gem 'word_stats'

And then execute:

$ bundle

Or install it yourself as:

$ gem install word_stats

Usage

Require the WordStats gem as follows:

require 'word_stats' # On Ruby 1.8, require 'rubygems' first

text = "The quick brown fox jumps over the lazy dog."
# Note: WordStats downcases every string it processes.

WordStats provides shortcuts for single-letter frequencies, bigrams and trigrams. The WordStats::Characters.ngrams(n, text) method finds n-grams of any length. Each method returns a hash that maps every n-gram, as a symbol, to its count.

letter_frequencies = WordStats::Characters.letters(text)
letter_frequencies[:'u'] #=> 2

bigrams = WordStats::Characters.bigrams(text)
bigrams[:'th'] #=> 2

trigrams = WordStats::Characters.trigrams(text)
trigrams['qui'.to_sym] #=> 1

octocats = WordStats::Characters.ngrams(8,text)
octocats[:'The quic'] #=> 0
octocats[:'the quic'] #=> 1
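
To make the counting behaviour above concrete, here is a minimal sketch of how character n-grams could be counted in plain Ruby. It is not the gem's actual implementation: the helper name char_ngrams, the punctuation regex, and the default count of 0 for missing keys are all assumptions chosen to match the examples shown here.

```ruby
# Hypothetical helper, NOT part of WordStats: counts character n-grams.
# Assumption: normalization means downcasing and dropping everything
# except ASCII letters, digits and whitespace.
def char_ngrams(n, text)
  normalized = text.downcase.gsub(/[^a-z0-9\s]/, '')
  counts = Hash.new(0) # missing n-grams report a count of 0
  # each_cons(n) slides a window of n characters across the text
  normalized.chars.each_cons(n) do |gram|
    counts[gram.join.to_sym] += 1
  end
  counts
end

text = "The quick brown fox jumps over the lazy dog."
char_ngrams(2, text)[:th]          # => 2 ("The" and "the")
char_ngrams(8, text)[:'the quic']  # => 1
char_ngrams(8, text)[:'The quic']  # => 0 (keys are always lowercase)
```

Because the window slides over the whole normalized string, n-grams in this sketch may span a space, as the 'the quic' key does.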

Similarly, WordStats provides a method to count single words and word sequences of any length:

word_count = WordStats::Words.nwords(1,text)
word_count[:'the'] #=> 2

word_pairs = WordStats::Words.nwords(2,text)
word_pairs[:'quick brown'] #=> 1
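
The word-level counts above can be illustrated with a similar plain-Ruby sketch. Again, this is only an approximation under stated assumptions: word_ngrams is a hypothetical helper, words are assumed to be split on whitespace after downcasing and stripping punctuation, and the gem itself may tokenize differently.

```ruby
# Hypothetical helper, NOT part of WordStats: counts sequences of n words.
# Assumption: words are whitespace-separated after normalization.
def word_ngrams(n, text)
  words = text.downcase.gsub(/[^a-z0-9\s]/, '').split
  counts = Hash.new(0) # unseen sequences report a count of 0
  # each_cons(n) yields every run of n adjacent words
  words.each_cons(n) do |group|
    counts[group.join(' ').to_sym] += 1
  end
  counts
end

text = "The quick brown fox jumps over the lazy dog."
word_ngrams(1, text)[:the]            # => 2
word_ngrams(2, text)[:'quick brown']  # => 1
word_ngrams(2, text)[:'lazy dog']     # => 1 (trailing period is stripped)
```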

Important Notes

WordStats downcases every string passed to it and strips punctuation before processing, so always look up results with lowercase keys.

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Added some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new pull request