Byk


Ruby gem for fast transliteration of Serbian Cyrillic ↔ Latin


Installation

Byk can be used as a standalone console utility or as a String extension in your Ruby programs. It has no dependencies beyond a stock Ruby installation and the toolchain for building native gems (see note 1).

You can install it directly:

$ gem install byk

or add it as a dependency in your application's Gemfile:

gem "byk"

Note 1: For Windows, you might want to check out DevKit.

Usage

As a standalone utility

Here's the help banner with all the available options:

usage: byk [options] [files]

options:
  -c, --cyrillic       convert input to Cyrillic (default)
  -l, --latin          convert input to Latin
  -a, --ascii          convert input to "ASCII Latin"
  -v, --version        show version

Transliterated output goes to stdout, so you can redirect or pipe it as you see fit. Let's take a look at some common scenarios.

To transliterate files to Cyrillic:

$ byk in1.txt in2.txt > out.txt

To transliterate files to Latin and search for a phrase:

$ byk -l file.txt | grep stvar

Ad hoc conversion:

$ echo "Вук Стефановић Караџић" | byk -a
Vuk Stefanovic Karadzic

or simply omit args and type away:

$ byk
a u ruke Mandušića Vuka
biće svaka puška ubojita!
^D
а у руке Мандушића Вука
биће свака пушка убојита!

Here ^D stands for Ctrl+D, which signals end of input.

As a String extension

Unless you're using Bundler, make sure to require the gem in your initializer:

require "byk"

This will extend String with a couple of simple methods:

"Šeširdžija".to_cyrillic    # => "Шеширџија"
"Шеширџија".to_latin        # => "Šeširdžija"
"Шеширџија".to_ascii_latin  # => "Sesirdzija"

These do not modify the receiver. For that, there's a destructive variant of each:

text = "Šeširdžija"
text.to_cyrillic!     # => "Шеширџија"
text.to_latin!        # => "Šeširdžija"
text.to_ascii_latin!  # => "Sesirdzija"
text                  # => "Sesirdzija"

Note that both latinization methods observe digraph capitalization rules:

"ЉИЉА Љиљановић".to_latin        # => "LJILJA Ljiljanović"
"ĐORĐE Đorđević".to_ascii_latin  # => "DJORDJE Djordjevic"
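To make the digraph rule concrete, here is a minimal pure-Ruby sketch of how such a capitalization decision could be made. It is not Byk's native implementation: the module name `DigraphSketch` and the exact neighbor rule (all caps when the next letter is uppercase, or when no letter follows and the previous one is uppercase) are assumptions for illustration. It relies on Unicode-aware case mapping, so it needs Ruby 2.4 or newer.

```ruby
# Hypothetical sketch of digraph-aware latinization; NOT Byk's implementation.
# Requires Ruby 2.4+ for Unicode-aware upcase / downcase / capitalize.
module DigraphSketch
  # Lowercase Serbian Cyrillic => Latin. љ, њ and џ map to digraphs.
  LATIN = {
    "а" => "a", "б" => "b", "в" => "v", "г" => "g", "д" => "d", "ђ" => "đ",
    "е" => "e", "ж" => "ž", "з" => "z", "и" => "i", "ј" => "j", "к" => "k",
    "л" => "l", "љ" => "lj", "м" => "m", "н" => "n", "њ" => "nj", "о" => "o",
    "п" => "p", "р" => "r", "с" => "s", "т" => "t", "ћ" => "ć", "у" => "u",
    "ф" => "f", "х" => "h", "ц" => "c", "ч" => "č", "џ" => "dž", "ш" => "š"
  }.freeze

  # True for characters that have a distinct lowercase form.
  def self.upper?(char)
    !char.nil? && char != char.downcase
  end

  def self.letter?(char)
    !char.nil? && char.match?(/\p{L}/)
  end

  def self.to_latin(text)
    chars = text.chars
    chars.each_with_index.map { |char, i|
      latin = LATIN[char.downcase]
      next char if latin.nil?              # pass through anything outside the table
      next latin unless upper?(char)       # lowercase input stays lowercase
      next latin.upcase if latin.length == 1

      # Uppercase digraph: pick "LJ" or "Lj" from the neighboring letters (assumed rule).
      prev_char = i.zero? ? nil : chars[i - 1]
      next_char = chars[i + 1]
      all_caps = upper?(next_char) || (!letter?(next_char) && upper?(prev_char))
      all_caps ? latin.upcase : latin.capitalize
    }.join
  end
end

puts DigraphSketch.to_latin("ЉИЉА Љиљановић")  # prints "LJILJA Ljiljanović"
```

A lone capital digraph letter, with no neighbors to consult, comes out in title case under this rule ("Љ" becomes "Lj").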

Safe require

If you prefer not to monkey patch String, you can do a "safe" require in your Gemfile:

gem "byk", :require => "byk/safe"

or initializer:

require "byk/safe"

Then, you should rely on module methods:

text = "Жвазбука"

Byk.to_latin(text)   # => "Žvazbuka"
text                 # => "Жвазбука"

Byk.to_latin!(text)  # => "Žvazbuka"
text                 # => "Žvazbuka"

# etc.
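For readers curious how a module-method API can offer both flavors without touching String, the following is a small sketch of the pattern. The module `AsciiSketch` and its methods are hypothetical and unrelated to Byk's internals; it only handles Latin letters with diacritics and, as a simplification, ignores the all-caps digraph rule for Đ.

```ruby
# Hypothetical sketch of the module-method style; NOT Byk's implementation.
module AsciiSketch
  # Serbian Latin letters with diacritics => ASCII approximations.
  # Simplification: "Đ" always becomes "Dj", even inside an all-caps word.
  ASCII = {
    "č" => "c", "ć" => "c", "š" => "s", "ž" => "z", "đ" => "dj",
    "Č" => "C", "Ć" => "C", "Š" => "S", "Ž" => "Z", "Đ" => "Dj"
  }.freeze
  PATTERN = Regexp.union(ASCII.keys)

  # Non-destructive: returns a new string and leaves the argument alone.
  def self.to_ascii(text)
    text.gsub(PATTERN, ASCII)
  end

  # Destructive: overwrites the argument in place and returns it.
  def self.to_ascii!(text)
    text.replace(to_ascii(text))
  end
end

text = "Šeširdžija".dup       # dup guards against frozen string literals

AsciiSketch.to_ascii(text)    # => "Sesirdzija"
text                          # => "Šeširdžija"

AsciiSketch.to_ascii!(text)   # => "Sesirdzija"
text                          # => "Sesirdzija"
```

The destructive variant is built on `String#replace`, so it raises on frozen strings, as in-place String methods usually do.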

How fast is "fast" transliteration?

Here's a quick test:

$ wget https://sr.wikipedia.org/ -O sample
$ du -h sample
128K

$ time byk -l sample > /dev/null
0.08s user 0.04s system 96% cpu 0.126 total

Let's up the ante:

$ for i in {1..800}; do cat sample; done > big
$ du -h big
97M

$ time byk -l big > /dev/null
1.71s user 0.13s system 99% cpu 1.846 total

So, ~100 MB in under 2 s, or roughly 50 MB/s. Fast enough, I suppose. You can expect the running time to scale linearly with input size.

Compared to the pure Ruby implementation, it is about 10-30x faster, depending on the input composition and the transliteration method applied.
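If you want a rough pure-Ruby baseline of your own to compare against, a timing harness along these lines may help. The `TrBaseline` module is a hypothetical stand-in, not the pure Ruby implementation referred to above: it uses `String#tr` and covers only letters with a one-to-one mapping, so treat its numbers as a loose reference point at best.

```ruby
require "benchmark"

# Hypothetical pure-Ruby baseline; covers only one-to-one letter mappings.
module TrBaseline
  CYRILLIC = "абвгдезијклмнопрстуфхц"
  LATIN    = "abvgdezijklmnoprstufhc"

  def self.to_latin(text)
    text.tr(CYRILLIC, LATIN)
  end
end

# Synthetic input; substitute your own text for a meaningful measurement.
sample = "азбука и буквар\n" * 100_000

seconds = Benchmark.realtime { TrBaseline.to_latin(sample) }
puts format("%.1f MB in %.3f s", sample.bytesize / 1_000_000.0, seconds)
```

To time Byk on the same input, require the gem and swap the block body for `sample.to_latin`.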

Testing

To test the gem, clone the repo and run:

$ bundle && bundle exec rake

Compatibility

Byk is supported under MRI 1.9.2+. I might try my hand at writing a JRuby extension in a future release.

License

This gem is released under the MIT License.

Уздравље!