Module: CRFPP

Defined in:
lib/crfpp/data.rb,
lib/crfpp/macro.rb,
lib/crfpp/model.rb,
lib/crfpp/token.rb,
lib/crfpp/errors.rb,
lib/crfpp/feature.rb,
lib/crfpp/version.rb,
lib/crfpp/filelike.rb,
lib/crfpp/template.rb,
lib/crfpp/utilities.rb

Defined Under Namespace

Modules: Filelike

Classes: Data, Error, Feature, Macro, Model, NativeError, Template, Token

Constant Summary

VERSION = '0.0.4'.freeze

Class Method Summary

Class Method Details

.learn(template, data, options = {}) ⇒ Object

Creates a new Model based on a template and training data.

:threads: False or the number of threads to use (default is 2).

:algorithm: L1 or L2 (default). Note that the source listing for this method leaves the corresponding --algorithm argument commented out, so this option is not currently forwarded to CRF++.

:cost: Sets the hyper-parameter C for the CRF. With a larger C value, the CRF tends to overfit the given training corpus. This parameter balances overfitting against underfitting, and the results are significantly influenced by it. You can find an optimal value by using held-out data or a more general model selection method such as cross validation. The default value is 1.0.

:frequency: Sets the cut-off threshold for features. CRF++ uses only the features that occur no less than NUM times in the given training data. The default value is 1. When you apply CRF++ to large data, the number of unique features can amount to several millions; this option is useful in such cases.
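The advice on choosing :cost with held-out data or cross validation can be sketched in plain Ruby. This is only an illustration: each_fold and select_cost are hypothetical helpers that are not part of this module, and the scoring block stands in for whatever training and evaluation step you use (for example, training with a given :cost and measuring accuracy on the held-out fold).

```ruby
# Hypothetical helper (not part of CRFPP): distribute the items round-robin
# into k folds and yield each (training, held_out) pair in turn.
def each_fold(sentences, k)
  folds = Array.new(k) { [] }
  sentences.each_with_index { |sentence, i| folds[i % k] << sentence }

  folds.each_index do |i|
    held_out = folds[i]
    # Everything outside fold i becomes the training portion.
    training = folds.each_with_index.reject { |_, j| j == i }.flat_map(&:first)
    yield training, held_out
  end
end

# Hypothetical helper: return the candidate cost with the highest mean
# held-out score. The block receives (cost, training, held_out) and must
# return a numeric score where higher is better.
def select_cost(candidates, sentences, k = 5)
  candidates.max_by do |cost|
    scores = []
    each_fold(sentences, k) do |training, held_out|
      scores << yield(cost, training, held_out)
    end
    scores.sum / scores.size.to_f
  end
end

# Toy scoring block for demonstration only: it simply prefers costs near 1.0.
best = select_cost([0.1, 1.0, 10.0], (1..10).to_a, 5) do |cost, _training, _held_out|
  -(cost - 1.0).abs
end
puts best
```

In real use the block would write the training fold to a file, train a model on it with the candidate cost, and score the model on the held-out fold.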


# File 'lib/crfpp/utilities.rb', line 23

def learn(template, data, options = {})
  options = { :threads => 2, :algorithm => :L2, :cost => 1.0, :frequency => 1}.merge(options)
  
  model = Model.new    
  arguments = []
  
  # TODO check algorithm names
  # arguments << "--algorithm=#{options[:algorithm]}"
  
  arguments << "--cost=#{options[:cost]}"
  arguments << "--thread=#{options[:threads]}"
  arguments << "--freq=#{options[:frequency]}"
  
  arguments << (template.respond_to?(:path) ? template.path : template)
  arguments << (data.respond_to?(:path) ? data.path : data)
  arguments << model.path

  Native.learn(arguments.join(' '))
  
  model
rescue => error
  raise NativeError, error.message
end
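To make the argument handling in the listing above easier to inspect, the following sketch rebuilds the same argument string as a pure function, without calling the native library. The names build_learn_arguments, DEFAULT_LEARN_OPTIONS, and model_path are assumptions introduced for illustration; only the defaults, flag names, and ordering are taken from the listing.

```ruby
# Defaults copied from the listing above. The constant name is hypothetical.
DEFAULT_LEARN_OPTIONS = {
  :threads => 2, :algorithm => :L2, :cost => 1.0, :frequency => 1
}.freeze

# Hypothetical helper: builds the string that the listing passes to
# Native.learn, given an explicit model path instead of a Model instance.
def build_learn_arguments(template, data, model_path, options = {})
  options = DEFAULT_LEARN_OPTIONS.merge(options)

  # Accept either a file-like object responding to #path or a plain string.
  path_of = lambda { |source| source.respond_to?(:path) ? source.path : source }

  [
    "--cost=#{options[:cost]}",
    "--thread=#{options[:threads]}",
    "--freq=#{options[:frequency]}",
    path_of.call(template),
    path_of.call(data),
    model_path
  ].join(' ')
end

puts build_learn_arguments('template.txt', 'train.data', 'model.crf', :cost => 4.0)
# prints "--cost=4.0 --thread=2 --freq=1 template.txt train.data model.crf"
```

As in the listing, :algorithm is merged into the options but never added to the string. Passing :threads => false yields --thread=false through plain interpolation; how CRF++ treats that value is not covered here.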

.train ⇒ Object

Creates a new Model based on a template and training data.

:threads: False or the number of threads to use (default is 2).

:algorithm: L1 or L2 (default). Note that the source listing for this method leaves the corresponding --algorithm argument commented out, so this option is not currently forwarded to CRF++.

:cost: Sets the hyper-parameter C for the CRF. With a larger C value, the CRF tends to overfit the given training corpus. This parameter balances overfitting against underfitting, and the results are significantly influenced by it. You can find an optimal value by using held-out data or a more general model selection method such as cross validation. The default value is 1.0.

:frequency: Sets the cut-off threshold for features. CRF++ uses only the features that occur no less than NUM times in the given training data. The default value is 1. When you apply CRF++ to large data, the number of unique features can amount to several millions; this option is useful in such cases.
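The effect of the :frequency cut-off can be pictured with a small stand-alone sketch. It is not how CRF++ implements the threshold internally; apply_frequency_cutoff is a hypothetical helper and the feature strings are made up for illustration. It only shows the rule stated above: a feature is kept when it occurs no less than the threshold number of times.

```ruby
# Hypothetical helper: keep only the features that occur at least
# `threshold` times, preserving first-seen order.
def apply_frequency_cutoff(features, threshold = 1)
  counts = features.each_with_object(Hash.new(0)) do |feature, tally|
    tally[feature] += 1
  end
  counts.select { |_, count| count >= threshold }.keys
end

# Illustrative feature strings only.
observed = %w[U00:the U00:cat U00:the U01:sat U00:the U01:sat]

p apply_frequency_cutoff(observed, 1) # keeps every distinct feature
p apply_frequency_cutoff(observed, 2) # drops the feature seen only once
```

With the default threshold of 1 nothing is discarded, which matches the default documented above; raising it shrinks the feature set.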


# File 'lib/crfpp/utilities.rb', line 47

def learn(template, data, options = {})
  options = { :threads => 2, :algorithm => :L2, :cost => 1.0, :frequency => 1}.merge(options)
  
  model = Model.new    
  arguments = []
  
  # TODO check algorithm names
  # arguments << "--algorithm=#{options[:algorithm]}"
  
  arguments << "--cost=#{options[:cost]}"
  arguments << "--thread=#{options[:threads]}"
  arguments << "--freq=#{options[:frequency]}"
  
  arguments << (template.respond_to?(:path) ? template.path : template)
  arguments << (data.respond_to?(:path) ? data.path : data)
  arguments << model.path

  Native.learn(arguments.join(' '))
  
  model
rescue => error
  raise NativeError, error.message
end