Tokenizers Ruby
🙂 Fast state-of-the-art tokenizers for Ruby
Installation
Add this line to your application’s Gemfile:
gem "tokenizers"
Getting Started
Load a pretrained tokenizer
tokenizer = Tokenizers.from_pretrained("bert-base-cased")
Encode
encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.tokens
encoded.ids
Decode
tokenizer.decode(encoded.ids)
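The encode and decode steps can be combined into a quick round-trip check. This is a minimal sketch, not part of the gem: `round_trip` is a hypothetical helper name, and it relies only on the `encode`, `tokens`, `ids`, and `decode` calls shown above.

```ruby
# Hypothetical helper (not provided by the gem): encodes text, then decodes
# the resulting ids, so the tokens, ids, and decoded string can be compared.
def round_trip(tokenizer, text)
  encoded = tokenizer.encode(text)
  {
    tokens: encoded.tokens,                 # array of token strings
    ids: encoded.ids,                       # array of integer ids
    decoded: tokenizer.decode(encoded.ids)  # string rebuilt from the ids
  }
end
```

With a tokenizer loaded as above, `round_trip(tokenizer, "I can feel the magic, can you?")` returns all three values in one hash. The decoded string may differ from the input in spacing or casing, depending on how the chosen tokenizer normalizes text.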
Load a tokenizer from files
tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")
Training
Check out the Quicktour and equivalent Ruby code
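As a rough idea of what training looks like, here is a minimal sketch modeled on the upstream Quicktour. The helper name `train_bpe_tokenizer` and the file paths are hypothetical, and the class names (`Tokenizers::Models::BPE`, `Tokenizers::Trainers::BpeTrainer`, `Tokenizers::PreTokenizers::Whitespace`) are assumed to match your installed version of the gem, so check them against the linked Ruby code before relying on this.

```ruby
# Hypothetical helper (not provided by the gem): trains a BPE tokenizer on
# plain-text files and saves it as a single JSON file.
def train_bpe_tokenizer(files, output_path)
  # Required here so that defining the helper does not load the native extension.
  require "tokenizers"

  # Start from an empty BPE model with an unknown-token placeholder.
  tokenizer = Tokenizers::Tokenizer.new(Tokenizers::Models::BPE.new(unk_token: "[UNK]"))

  # Split input on whitespace and punctuation before learning merges.
  tokenizer.pre_tokenizer = Tokenizers::PreTokenizers::Whitespace.new

  trainer = Tokenizers::Trainers::BpeTrainer.new(
    special_tokens: ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
  )

  tokenizer.train(files, trainer)
  tokenizer.save(output_path)
  tokenizer
end
```

A call might look like `train_bpe_tokenizer(["corpus.txt"], "tokenizer.json")`, where both paths are placeholders for your own files.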
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec rake compile
bundle exec rake download:files
bundle exec rake test