Build Status

English, Spanish, Dutch, Italian, French POS Tagger

This repository contains the source code for the English & Spanish POS tagger of the OpeNER project.

  • English perceptron models have been trained and evaluated using the WSJ treebank as explained in K. Toutanova, D. Klein, and C. D. Manning. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL’03, 2003. Currently we obtain a performance of 96.87% vs 97.24% obtained by Toutanova et al. (2003).

  • Spanish Maximum Entropy models have been trained and evaluated using the Ancora
    corpus; it was randomly divided in 90% for training (450K words) and 10% testing (50K words), obtaining a performance of 98.88%.

  • French Maximum Entropy models trained with the ESTER corpus.

  • Italian Perceptron models trained with the TUT Treebank.

  • Dutch Perceptron model publicly available at Apache OpenNLP website: (http://opennlp.sourceforge.net/models-1.5/)

Requirements

  • Java 1.7 or newer
  • Ruby 1.9.2 or newer
  • Maven
  • Bundler

Installation

Using RubyGems:

gem install opener-pos-tagger-en-es

Using Bundler:

gem 'opener-pos-tagger-en-es',
  :git    => '[email protected]/opener-project/pos-tagger-en-es.git',
  :branch => 'master'

Using specific install:

gem install specific_install
gem specific_install opener-pos-tagger-en-es \
    -l https://github.com/opener-project/pos-tagger-en-es.git

Usage

cat some_input_file.kaf | pos-tagger-en-es

Contributing

First make sure all the required dependencies are installed:

bundle install

Then compile the required Java code:

bundle exec rake java:compile

For this you'll need to have Java 1.7 and Maven installed. These requirements are verified for you before the Rake task calls Maven.

Testing

To run the tests (which are powered by Cucumber), simply run the following:

bundle exec rake

This will take care of verifying the requirements, installing the required Java packages and running the tests.

For more information on the available Rake tasks run the following:

bundle exec rake -T

Structure

This repository comes in two parts: a collection of Java source files and Ruby source files. The Java code can be found in the core/ directory, everything else will be Ruby source code.