TreeTagger for Ruby
RubyGems | RTT Project Page | Source Code | Bug Tracker
DESCRIPTION
A Ruby based wrapper for the TreeTagger by Helmut Schmid.
Check it out if you are interested in Natural Language Processing (NLP) and/or Human Language Technology (HLT).
This library provides bindings for the TreeTagger, a statistical, language-independent tool for POS tagging and chunking.
TreeTagger is language agnostic: it never guesses which language you are going to use, so you have to supply the appropriate language model yourself.
TODO:
- References to Schmid's publications;
- How to use TreeTagger in the wild;
- Input and output format, tokenization;
- …
- The current German parameter file has been estimated on one-byte encoded data.
Implemented Features
Simple tagging.
Please have a look at the CHANGELOG file for details on implemented and planned features.
INSTALLATION
Before you install the treetagger-ruby package, please ensure you have downloaded and installed the TreeTagger itself.
The TreeTagger is copyrighted software by Helmut Schmid and IMS. Please read the license agreement before you download the TreeTagger package and language models.
After installing the TreeTagger, set the environment variable TREETAGGER_BINARY to the location of the tree-tagger binary. Usually this binary is located in the bin directory under the main installation directory of the TreeTagger.
You also have to set the variable TREETAGGER_MODEL to the location of the appropriate language model (a downloaded parameter file or one you have acquired in a training step).
For instance, you may add the following lines to your .profile file:
export TREETAGGER_BINARY='/path/to/your/TreeTagger/bin/tree-tagger'
export TREETAGGER_MODEL='/path/to/your/TreeTagger/lib/german.par'
It is convenient to work with a default language model, but you can override it whenever you instantiate a new tagger.
If you want to feed a lexicon file into your tagger, you can do it globally through the environment variable TREETAGGER_LEXICON.
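It can help to check the environment before creating a tagger. The following is a minimal sketch and not part of treetagger-ruby: the helper name missing_treetagger_vars is hypothetical, and treating only TREETAGGER_BINARY and TREETAGGER_MODEL as required is an assumption based on the description above.

```ruby
# Variables described above. TREETAGGER_LEXICON is optional, so it is not checked.
REQUIRED_TREETAGGER_VARS = %w[TREETAGGER_BINARY TREETAGGER_MODEL].freeze

# Returns the names of required variables that are unset or empty.
# `env` defaults to the process environment; pass a Hash to try it out.
def missing_treetagger_vars(env = ENV)
  REQUIRED_TREETAGGER_VARS.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_treetagger_vars
warn "Please set: #{missing.join(', ')}" unless missing.empty?
```

The check only looks at whether the variables are set; it does not verify that the paths exist or that the binary is executable.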
treetagger-ruby is provided as a .gem package and can be installed via RubyGems. To install it, issue the following command:
$ gem install treetagger-ruby
If you want to do a system-wide installation, do this as root (possibly using sudo).
Alternatively, use your Gemfile for dependency management.
SYNOPSIS
Basic Usage
Basic usage is very simple:
require 'treetagger'
# Instantiate a tagger with default values.
tagger = TreeTagger::Tagger.new
# Process an array of tokens.
tagger.process(%w{Ich gehe in die Schule})
# Flush the pipeline.
tagger.flush
# Get the processed data.
tagger.get_output
Input Format
Basically, you have to provide a tokenized sequence, possibly with additional information on the lexical classes of tokens and their probabilities. Every token has to be on a separate line. Due to technical limitations, SGML tags (i.e. sequences with a leading < and a trailing >) cannot be valid tokens, since they are used internally to delimit meaningful content from flush tokens. This implies the use of the -sgml option, which cannot be changed by the user. It is a limitation of this library. If you do need to process tags, fall back to using the TreeTagger as a standalone program, possibly employing temporary files to store your input and output. This behaviour will also be implemented in future versions of treetagger-ruby.
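Since tag-like tokens cannot be processed, you may want to filter them out before handing tokens to the tagger. The following is a minimal sketch under the definition given above (a leading < and a trailing >); the helper names sgml_like? and partition_sgml_tokens are hypothetical and not part of treetagger-ruby.

```ruby
# True if the token starts with "<" and ends with ">", i.e. looks like an SGML tag.
def sgml_like?(token)
  token.match?(/\A<.*>\z/m)
end

# Splits plain string tokens into [usable, rejected] so that the rejected
# ones can be handled separately.
def partition_sgml_tokens(tokens)
  tokens.partition { |token| !sgml_like?(token) }
end

usable, rejected = partition_sgml_tokens(%w[Ich gehe <b> in die Schule </b>])
puts usable.join(' ')   # prints "Ich gehe in die Schule"
puts rejected.join(' ') # prints "<b> </b>"
```

The sketch handles plain string tokens only; what to do with the rejected tokens (drop them, escape them, or process them with the standalone TreeTagger) is left to the caller.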
Every token may occur alone on its line or be followed by additional information:
- token;
- token (\tab tag)+;
- token (\tab tag \space lemma)+;
- token (\tab tag \space probability)+;
- token (\tab tag \space probability \space lemma)+.
Your input may look like the following sentence:
Die ART 0.99
neuen ADJA neu
Hunde NN NP
stehen VVFIN 0.99 stehen
an
den
Mauern NN Mauer
.
This wrapper accepts the input as String or Array.
If you want to use strings, you are responsible for the proper delimiters inside the string: "Die\tART 0.99\nneuen\tADJA neu\nHunde\tNN NP\nstehen\t VVFIN 0.99 stehen\nan\nden\nMauern\tNN Mauer\n.\n"
Currently treetagger-ruby does not check your markup for correctness and will possibly raise a TreeTagger::ExternalError if the TreeTagger process dies due to input errors.
Using arrays is more convenient since they can be built programmatically.
Arrays should have the following structure:
- ['token', 'token', 'token'];
- ['token', ['token', ['POS', 'lemma'], ['POS', 'lemma']], 'token'];
- ['token', ['token', ['POS', prob], ['POS', 'prob']], 'token'];
- ['token', ['token', ['POS', prob, 'lemma'], ['POS', 'prob', 'lemma']]].
It is internally converted into the sequence token\ntoken\tPOS lemma\t POS lemma\ntoken\n, i.e. in the enriched version alternatives are tab-separated and entries are blank-separated.
Note that probabilities may be strings or integers.
The lexicon lookup is not implemented for now; that is, the latter three forms of input arrays are not supported yet.
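The conversion from the array structure to the line-based text format can be sketched as follows. This is an illustration of the format described above, not the library's actual implementation; the helper names to_tagger_line and to_tagger_input are hypothetical, and a single tab between alternatives is assumed.

```ruby
# Converts one entry into one line of tagger input.
# A plain token is a String; an enriched token is
# [token, [tag, ...], [tag, ...], ...].
def to_tagger_line(entry)
  return entry.to_s unless entry.is_a?(Array)

  token, *alternatives = entry
  # Alternatives are tab-separated; the fields of one alternative
  # are blank-separated. Numeric probabilities are stringified by join.
  [token, *alternatives.map { |fields| fields.join(' ') }].join("\t")
end

# Converts the whole array, one token per line, each line newline-terminated.
def to_tagger_input(tokens)
  tokens.map { |entry| "#{to_tagger_line(entry)}\n" }.join
end

input = ['Die', ['neuen', ['ADJA', 'neu']], ['stehen', ['VVFIN', 0.99, 'stehen']], '.']
puts to_tagger_input(input)
```

A string built this way has the same shape as the hand-written string input shown earlier, so the same caveat applies: nothing checks that tags or probabilities are meaningful to the TreeTagger.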
Output Format
For now you will get an array of strings. However, the precise string structure depends on the command-line arguments you provided when instantiating the tagger.
For instance, for the input ["Veruntreute", "die", "AWO", "Spendengeld", "?"] you will get the following output with the default command-line arguments:
["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>", "Spendengeld\tNN\tSpendengeld", "?\t$.\t?"]
See documentation in the TreeTagger::Tagger class for details on particular methods.
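Each output string can be split into its columns for further processing. The following is a minimal sketch that assumes the default three-column layout shown above (token, tag, lemma, tab-separated); the helper name parse_tagged_line is hypothetical, and other command-line arguments may produce a different layout.

```ruby
# Splits one output line into a Hash.
# Assumes the default layout: token, tag and lemma separated by tabs.
def parse_tagged_line(line)
  token, tag, lemma = line.split("\t", 3)
  { token: token, tag: tag, lemma: lemma }
end

output = ["Veruntreute\tNN\tVeruntreute", "die\tART\td", "AWO\tNN\t<unknown>"]
parsed = output.map { |line| parse_tagged_line(line) }

puts parsed.first[:tag]  # prints "NN"
puts parsed.last[:lemma] # prints "<unknown>"
```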
EXCEPTION HIERARCHY
While using TreeTagger you may encounter the following errors:
- TreeTagger::UserError;
- TreeTagger::RuntimeError;
- TreeTagger::ExternalError.
These three kinds of errors all subclass TreeTagger::Error, which in turn is a subclass of StandardError. For an end user this means that it is possible to intercept all errors from treetagger-ruby with a simple rescue clause.
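The effect of a common base class can be sketched as follows. The module TaggerSketch is a hypothetical stand-in that only mirrors the hierarchy described above, so that the example runs without the gem or a TreeTagger installation; with the real library you would rescue TreeTagger::Error instead.

```ruby
# Stand-in hierarchy mirroring the one described above (hypothetical module name).
module TaggerSketch
  class Error < StandardError; end
  class UserError < Error; end
  class RuntimeError < Error; end
  class ExternalError < Error; end
end

# Runs the block and turns any error of the hierarchy into a short message.
# Errors outside the hierarchy are not rescued and propagate as usual.
def describe_failure
  yield
  'ok'
rescue TaggerSketch::Error => e
  "#{e.class.name.split('::').last}: #{e.message}"
end

puts describe_failure { raise TaggerSketch::ExternalError, 'tagger process died' }
# prints "ExternalError: tagger process died"
puts describe_failure { :fine }
# prints "ok"
```

Rescuing the base class keeps calling code short; rescue the individual subclasses instead if user errors and external failures need different handling.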
SUPPORT
If you have questions, bug reports or suggestions, please drop me an email :)
HOW TO CONTRIBUTE
Please contact me to suggest ideas, report bugs, or talk about features you would like to implement in future releases of this library.
Please don't feel offended if I cannot accept all your pull requests: I have to review them and find the appropriate time and place in the code base to incorporate your valuable changes.
Any help is deeply appreciated!
CHANGELOG
For details on future plans and work in progress see CHANGELOG.
CAUTION
This library is a work in progress! Though the interface is mostly complete, you might encounter some features that are not implemented yet.
Please contact me with your suggestions, bug reports and feature requests.
LICENSE
RTT is copyrighted software by Andrei Beliankou, 2011-
You may use, redistribute and change it under the terms provided in the LICENSE file.