StrMetrics


A Ruby gem (with a native extension written in Rust) that implements several string metrics. The metrics currently supported are Sørensen–Dice, Levenshtein, Damerau–Levenshtein, Jaro and Jaro–Winkler. Any string that can be converted to a UTF-8 representation is supported. All string comparisons are done at the grapheme cluster level as described by Unicode Standard Annex #29, so results may differ from those of many other gems that calculate string metrics. The gem should work on Linux, macOS and Windows.
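To make the grapheme cluster point concrete, here is a small sketch that uses only Ruby's standard String methods (not this gem). The helper names are hypothetical and exist only for illustration. It shows how a single user-perceived character can consist of more than one code point, which is why metrics computed per code point can disagree with metrics computed per grapheme cluster.

```ruby
# Illustration only: uses Ruby's built-in String methods, not str_metrics.
# Helper names are hypothetical.

def code_point_count(str)
  str.chars.length
end

def grapheme_cluster_count(str)
  str.grapheme_clusters.length
end

# "e" followed by U+0301 (combining acute accent) renders as a single "é".
decomposed = "e\u0301"

puts code_point_count(decomposed)       # 2
puts grapheme_cluster_count(decomposed) # 1
```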

Getting Started

Prerequisites

Install Rust (tested with version >= 1.38.0) with:

curl https://sh.rustup.rs -sSf | sh

Installation

With bundler

Add this line to your application's Gemfile:

gem 'str_metrics'

And then execute:

$ bundle install

Without bundler

$ gem install str_metrics

Usage

To use the metrics provided by this gem, require str_metrics:

require 'str_metrics'

Each metric is shown below with an example and a description of its optional parameters.

Sørensen–Dice

StrMetrics::SorensenDice.coefficient('abc', 'bcd', ignore_case: false)
 => 0.5

Options:

Keyword | Type | Default | Description
ignore_case | boolean | false | Case insensitive comparison?
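For readers unfamiliar with the metric, the following is a minimal pure-Ruby sketch of a Sørensen–Dice coefficient computed over bigrams of grapheme clusters. It is not the gem's Rust implementation. The method name, the use of bigrams, and the handling of strings that have no bigrams are assumptions made for illustration. It reproduces the 0.5 result for the example above.

```ruby
# Illustration only: not the gem's implementation. The method name and the
# edge-case behaviour are assumptions.
def dice_coefficient_sketch(a, b, ignore_case: false)
  if ignore_case
    a = a.downcase
    b = b.downcase
  end

  # Adjacent pairs of grapheme clusters, e.g. "abc" -> [["a","b"], ["b","c"]].
  bigrams_a = a.grapheme_clusters.each_cons(2).to_a
  bigrams_b = b.grapheme_clusters.each_cons(2).to_a
  total = bigrams_a.length + bigrams_b.length
  return 0.0 if total.zero? # assumption: no bigrams means no similarity

  # Multiset intersection: each bigram of b can be matched at most once.
  remaining = bigrams_b.tally
  shared = bigrams_a.count do |bigram|
    next false unless remaining.fetch(bigram, 0).positive?

    remaining[bigram] -= 1
    true
  end

  2.0 * shared / total
end

puts dice_coefficient_sketch('abc', 'bcd') # 0.5
```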

Levenshtein

StrMetrics::Levenshtein.distance('abc', 'acb', ignore_case: false)
 => 2

Options:

Keyword | Type | Default | Description
ignore_case | boolean | false | Case insensitive comparison?
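As a reference for what the distance measures, here is a minimal pure-Ruby sketch of the classic dynamic-programming Levenshtein distance over grapheme clusters. It is not the gem's Rust implementation, and the method name is hypothetical. It counts the minimum number of single-cluster insertions, deletions and substitutions, which is why swapping two adjacent characters costs 2 in the example above.

```ruby
# Illustration only: not the gem's implementation. The method name is an assumption.
def levenshtein_sketch(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters

  # previous[j] is the distance between the clusters of s processed so far
  # (initially none) and the first j clusters of t.
  previous = (0..t.length).to_a

  s.each_with_index do |s_cluster, i|
    current = [i + 1]
    t.each_with_index do |t_cluster, j|
      cost = s_cluster == t_cluster ? 0 : 1
      current << [
        previous[j + 1] + 1, # deletion
        current[j] + 1,      # insertion
        previous[j] + cost   # substitution (or match)
      ].min
    end
    previous = current
  end

  previous.last
end

puts levenshtein_sketch('abc', 'acb') # 2
```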

Damerau–Levenshtein

StrMetrics::DamerauLevenshtein.distance('abc', 'acb', ignore_case: false)
 => 1

Options:

Keyword | Type | Default | Description
ignore_case | boolean | false | Case insensitive comparison?
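The difference from plain Levenshtein is that a swap of two adjacent characters counts as a single edit. The sketch below is a pure-Ruby version of the restricted variant (often called optimal string alignment), chosen here because it is short. Whether the gem uses the restricted or the unrestricted Damerau–Levenshtein variant is not stated in this README, so treat the variant and the method name as assumptions. Both variants give 1 for the example above.

```ruby
# Illustration only: restricted (optimal string alignment) variant.
# Not the gem's implementation; the method name is an assumption.
def osa_distance_sketch(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters

  # d[i][j] = distance between the first i clusters of s and the first j of t.
  d = Array.new(s.length + 1) { Array.new(t.length + 1, 0) }
  (0..s.length).each { |i| d[i][0] = i }
  (0..t.length).each { |j| d[0][j] = j }

  (1..s.length).each do |i|
    (1..t.length).each do |j|
      cost = s[i - 1] == t[j - 1] ? 0 : 1
      d[i][j] = [
        d[i - 1][j] + 1,       # deletion
        d[i][j - 1] + 1,       # insertion
        d[i - 1][j - 1] + cost # substitution (or match)
      ].min

      # Adjacent transposition counts as one edit.
      if i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]
        d[i][j] = [d[i][j], d[i - 2][j - 2] + 1].min
      end
    end
  end

  d[s.length][t.length]
end

puts osa_distance_sketch('abc', 'acb') # 1
```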

Jaro

StrMetrics::Jaro.similarity('abc', 'aac', ignore_case: false)
 => 0.7777777777777777

Options:

Keyword | Type | Default | Description
ignore_case | boolean | false | Case insensitive comparison?
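To show where a value such as 0.777... comes from, here is a minimal pure-Ruby sketch of the standard Jaro similarity over grapheme clusters. It is not the gem's Rust implementation. The method name and the treatment of empty strings are assumptions. For 'abc' and 'aac' the match window is 0, so only 'a' and 'c' match in place, giving (2/3 + 2/3 + 2/2) / 3.

```ruby
# Illustration only: not the gem's implementation. The method name and the
# empty-string behaviour are assumptions.
def jaro_sketch(a, b)
  s = a.grapheme_clusters
  t = b.grapheme_clusters
  return 1.0 if s.empty? && t.empty? # assumption
  return 0.0 if s.empty? || t.empty?

  # Clusters match only if they are equal and at most `window` positions apart.
  window = [[s.length, t.length].max / 2 - 1, 0].max
  s_matched = Array.new(s.length, false)
  t_matched = Array.new(t.length, false)

  s.each_with_index do |cluster, i|
    lo = [i - window, 0].max
    hi = [i + window, t.length - 1].min
    (lo..hi).each do |j|
      next if t_matched[j] || t[j] != cluster

      s_matched[i] = t_matched[j] = true
      break
    end
  end

  matches = s_matched.count(true)
  return 0.0 if matches.zero?

  # Matched clusters in order; half the out-of-order pairs are transpositions.
  s_seq = s.select.with_index { |_, i| s_matched[i] }
  t_seq = t.select.with_index { |_, j| t_matched[j] }
  transpositions = s_seq.zip(t_seq).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / s.length + m / t.length + (m - transpositions) / m) / 3.0
end

puts jaro_sketch('abc', 'aac')
```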

Jaro–Winkler

StrMetrics::JaroWinkler.similarity('abc', 'aac', ignore_case: false, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
 => 0.7999999999999999

StrMetrics::JaroWinkler.distance('abc', 'aac', ignore_case: false, prefix_scaling_factor: 0.1, prefix_scaling_bonus_threshold: 0.7)
 => 0.20000000000000007

Options:

Keyword | Type | Default | Description
ignore_case | boolean | false | Case insensitive comparison?
prefix_scaling_factor | decimal | 0.1 | Constant scaling factor for how much to weight common prefixes. Should not exceed 0.25.
prefix_scaling_bonus_threshold | decimal | 0.7 | Prefix bonus weighting will only be applied if the Jaro similarity is greater than the given value.
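The way the two prefix options interact can be sketched in pure Ruby as below. This is the standard Jaro–Winkler adjustment applied to an already-computed Jaro similarity, not the gem's Rust implementation. The method name, the max_prefix parameter, and its cap of 4 clusters are assumptions (4 is the conventional cap, and it is why a scaling factor above 0.25 could push the result past 1). The distance is 1 minus the similarity.

```ruby
# Illustration only: not the gem's implementation. The method name and the
# max_prefix parameter (conventional cap of 4) are assumptions.
def jaro_winkler_sketch(jaro, a, b,
                        prefix_scaling_factor: 0.1,
                        prefix_scaling_bonus_threshold: 0.7,
                        max_prefix: 4)
  # No prefix bonus unless the Jaro similarity exceeds the threshold.
  return jaro unless jaro > prefix_scaling_bonus_threshold

  # Length of the common prefix in grapheme clusters, capped at max_prefix.
  prefix = a.grapheme_clusters
            .zip(b.grapheme_clusters)
            .take(max_prefix)
            .take_while { |x, y| x == y }
            .length

  jaro + prefix * prefix_scaling_factor * (1.0 - jaro)
end

# Jaro similarity of 'abc' and 'aac' is 7/9; they share the one-cluster prefix "a".
similarity = jaro_winkler_sketch(7.0 / 9, 'abc', 'aac')
puts similarity       # approximately 0.8
puts 1.0 - similarity # approximately 0.2 (the distance)
```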

Motivation

The main motivation was to have a central gem which can provide a variety of string metric calculations. Secondary motivation was to experiment with writing a native extension in Rust (instead of C).

Development

Getting started

gem install bundler
git clone https://github.com/anirbanmu/str_metrics.git
cd ./str_metrics
bundle install

Building (for native component)

rake rust_build

Testing (will build native component before running tests)

rake spec

Local installation

rake install

Deploying a new version

To deploy a new version of the gem to rubygems:

  1. Bump version in version.rb according to SemVer.
  2. Get your code merged to master.
  3. After a git pull on master:
rake build && rake release

Authors

See the list of contributors to the repository.

Versioning

SemVer is employed. See tags for released versions.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/anirbanmu/str_metrics.

Code of Conduct

Everyone interacting in this project's codebase, issue trackers, etc. is expected to follow the code of conduct.

License

This project is licensed under the MIT License. See the LICENSE file for details.