Neologdish::Normalizer for Ruby
A Japanese text normalization library for Ruby that follows the conventions of neologd/mecab-ipadic-neologd, with some performance optimizations and no external dependencies. It is designed to preprocess Japanese text before applying NLP techniques.
The specific rules are documented here: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
Usage
require "neologdish-normalizer"
Neologdish::Normalizer.normalize("南アルプスの 天然水- Sparking* Lemon+ レモン一絞り")
# => 南アルプスの天然水-Sparking*Lemon+レモン一絞り
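To give a feel for the kind of rules involved, here is a minimal sketch that uses only Ruby's core String methods. It is not this library's implementation and covers only a small subset of the neologd rules (for example, it does not remove spaces between Japanese and Latin text, so it would not reproduce the output above). The helper name `simple_normalize` and the constant `JA_CHARS` are hypothetical and exist only for illustration.

```ruby
# Simplified illustration only; NOT the implementation of Neologdish::Normalizer.
# `simple_normalize` and `JA_CHARS` are hypothetical names used for this sketch.

# Characters treated as "Japanese" here: hiragana, katakana, kanji,
# and the prolonged sound mark (which is not part of the Katakana script class).
JA_CHARS = '\p{Hiragana}\p{Katakana}\p{Han}ー'

def simple_normalize(text)
  # Convert full-width ASCII digits and letters to their half-width forms.
  s = text.tr("０-９Ａ-Ｚａ-ｚ", "0-9A-Za-z")
  # Collapse runs of ASCII or ideographic spaces into a single ASCII space.
  s = s.gsub(/[ 　]+/, " ")
  # Trim leading and trailing whitespace.
  s = s.strip
  # Remove a single space that sits between two Japanese characters.
  s.gsub(/(?<=[#{JA_CHARS}]) (?=[#{JA_CHARS}])/, "")
end

puts simple_normalize("南アルプスの 天然水") # => 南アルプスの天然水
puts simple_normalize("ＡＢＣ１２３")          # => ABC123
```

For real preprocessing, call Neologdish::Normalizer.normalize as shown above; the sketch only shows why such rules matter before tokenization.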
Benchmark
The following compares the official Ruby example (https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#ruby-written-by-kimoto-and-overlast) with this library. In this run, this library was roughly 3.6 times faster in wall-clock time:
                          user     system      total        real
original normalizer:  4.200670   0.032004   4.232674 (  4.274573)
this library:         1.158801   0.005238   1.164039 (  1.170226)
The benchmark script is here: ./scripts/benchmark.rb
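The contents of that script are not reproduced here. As a rough sketch of how such a comparison might be structured with Ruby's standard `benchmark` library, the following uses two hypothetical stand-in lambdas (`baseline` and `candidate`) in place of the real normalizers; the sample text and iteration count are also assumptions. To measure this library, the `candidate` body would be replaced with a call to Neologdish::Normalizer.normalize.

```ruby
require "benchmark"

# Hypothetical stand-ins for the two normalizers being compared.
# Swap these for the official example and Neologdish::Normalizer.normalize.
baseline  = ->(text) { text.gsub(/\s+/, " ").strip }
candidate = ->(text) { text.squeeze(" ").strip }

SAMPLE = "南アルプスの 天然水- Sparking* Lemon+ レモン一絞り"
ITERATIONS = 1_000 # assumed value; larger counts give more stable timings

# Benchmark.bm prints user, system, total, and real (wall-clock) times
# for each labelled report and returns the measurements.
results = Benchmark.bm(22) do |x|
  x.report("original normalizer:") { ITERATIONS.times { baseline.call(SAMPLE) } }
  x.report("this library:")        { ITERATIONS.times { candidate.call(SAMPLE) } }
end
```

The label width passed to `Benchmark.bm` only affects column alignment in the printed report.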
Installation
Install the gem and add it to the application's Gemfile by executing:
bundle add 'neologdish-normalizer'
If bundler is not being used to manage dependencies, install the gem by executing:
gem install 'neologdish-normalizer'
Development
After checking out the repo, run bin/setup to install dependencies. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/moznion/neologdish-normalizer.