Class: Natto::MeCab
- Inherits:
- Object
  - Object
  - Natto::MeCab
- Includes:
- Binding, OptionParse
- Defined in:
- lib/natto/natto.rb
Overview
MeCab is a class providing an interface to the MeCab library.
Options to the MeCab Model, Tagger and Lattice are passed in
as a string (MeCab command-line style) or as a Ruby-style hash at
initialization.
Usage
require 'natto'
text = '凡人にしか見えねえ風景ってのがあるんだよ。'
nm = Natto::MeCab.new
=> #<Natto::MeCab:0x0000080318d278 \
@model=#<FFI::Pointer address=0x000008039174c0>, \
@tagger=#<FFI::Pointer address=0x0000080329ba60>, \
@lattice=#<FFI::Pointer address=0x000008045bd140>, \
@libpath="/usr/local/lib/libmecab.so", \
@options={}, \
@dicts=[#<Natto::DictionaryInfo:0x0000080318ce90 \
@filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
charset=utf8, \
type=0>], \
@version=0.996>
# print entire MeCab result to stdout
#
puts nm.parse(text)
凡人 名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン
に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
しか 助詞,係助詞,*,*,*,*,しか,シカ,シカ
見え 動詞,自立,*,*,一段,未然形,見える,ミエ,ミエ
ねえ 助動詞,*,*,*,特殊・ナイ,音便基本形,ない,ネエ,ネー
風景 名詞,一般,*,*,*,*,風景,フウケイ,フーケイ
って 助詞,格助詞,連語,*,*,*,って,ッテ,ッテ
の 名詞,非自立,一般,*,*,*,の,ノ,ノ
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
ある 動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
ん 名詞,非自立,一般,*,*,*,ん,ン,ン
だ 助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
よ 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
。 記号,句点,*,*,*,*,。,。,。
EOS
# pass a block to iterate over each MeCabNode instance
#
nm.parse(text) do |n|
puts "#{n.surface},#{n.feature}" if !n.is_eos?
end
凡人,名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン
に,助詞,格助詞,一般,*,*,*,に,ニ,ニ
しか,助詞,係助詞,*,*,*,*,しか,シカ,シカ
見え,動詞,自立,*,*,一段,未然形,見える,ミエ,ミエ
ねえ,助動詞,*,*,*,特殊・ナイ,音便基本形,ない,ネエ,ネー
風景,名詞,一般,*,*,*,*,風景,フウケイ,フーケイ
って,助詞,格助詞,連語,*,*,*,って,ッテ,ッテ
の,名詞,非自立,一般,*,*,*,の,ノ,ノ
が,助詞,格助詞,一般,*,*,*,が,ガ,ガ
ある,動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
ん,名詞,非自立,一般,*,*,*,ん,ン,ン
だ,助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
よ,助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
。,記号,句点,*,*,*,*,。,。,。
# customize MeCabNode feature attribute with node-formatting
# %m ... morpheme surface
# %F, ... comma-delimited ChaSen feature values
# reading (index 7)
# part-of-speech (index 0)
# %h ... part-of-speech ID (IPADIC)
#
nm = Natto::MeCab.new('-F%m,%F,[7,0],%h')
# Enumerator effectively iterates the MeCabNodes
#
enum = nm.enum_parse(text)
=> #<Enumerator: #<Enumerator::Generator:0x29cc5f8>:each>
# output the feature attribute of each MeCabNode
# only output normal nodes, ignoring any end-of-sentence
# or unknown nodes
#
enum.map.with_index {|n,i| puts "#{i}: #{n.feature}" if n.is_nor?}
0: 凡人,ボンジン,名詞,38
1: に,ニ,助詞,13
2: しか,シカ,助詞,16
3: 見え,ミエ,動詞,31
4: ねえ,ネー,助動詞,25
5: 風景,フーケイ,名詞,38
6: って,ッテ,助詞,15
7: の,ノ,名詞,63
8: が,ガ,助詞,13
9: ある,アル,動詞,31
10: ん,ン,名詞,63
11: だ,ダ,助動詞,25
12: よ,ヨ,助詞,17
13: 。,。,記号,7
# Boundary constraint parsing with output formatting.
# %m ... morpheme surface
# %f[0] ... ChaSen feature value at index 0 (part-of-speech)
# %s ... MeCab node status value (1 unknown)
#
nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
enum = nm.enum_parse(text, boundary_constraints: /見えねえ風景/)
=> #<Enumerator: #<Enumerator::Generator:0x00000801d7aa38>:each>
# output the feature attribute of each MeCabNode
# ignoring any beginning- or end-of-sentence nodes
#
enum.each do |n|
puts n.feature if !(n.is_bos? or n.is_eos?)
end
凡人, 名詞, 0
に, 助詞, 0
しか, 助詞, 0
見えねえ風景, 名詞, 1
って, 助詞, 0
の, 名詞, 0
が, 助詞, 0
ある, 動詞, 0
ん, 名詞, 0
だ, 助動詞, 0
よ, 助詞, 0
。, 記号, 0
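The index numbers used in the node formats above (reading at index 7, part-of-speech at index 0) refer to positions in the comma-delimited feature string. As a rough illustration that does not need MeCab installed, the sketch below splits one line of the default output shown earlier. It assumes the surface and the feature string are separated by a tab, as in MeCab's default output format.

```ruby
# One line of default MeCab output, copied from the listing above.
# Assumption: surface and feature are separated by a single tab.
line = "凡人\t名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン"

surface, feature = line.split("\t", 2)
fields = feature.split(',')

pos     = fields[0]  # part-of-speech (index 0)
reading = fields[7]  # reading (index 7)

puts "#{surface}: #{pos} (#{reading})"
```

With natto itself, the same values are available per node through surface and feature, so the split on the tab is only needed when working from the string output.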
Constant Summary
- MECAB_LATTICE_ONE_BEST =
1
- MECAB_LATTICE_NBEST =
2
- MECAB_LATTICE_PARTIAL =
4
- MECAB_LATTICE_MARGINAL_PROB =
8
- MECAB_LATTICE_ALTERNATIVE =
16
- MECAB_LATTICE_ALL_MORPHS =
32
- MECAB_LATTICE_ALLOCATE_SENTENCE =
64
- MECAB_ANY_BOUNDARY =
0
- MECAB_TOKEN_BOUNDARY =
1
- MECAB_INSIDE_TOKEN =
2
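The MECAB_LATTICE_* values are powers of two, which suggests they are meant to be combined as bit flags. The sketch below mirrors the option checks in the constructor source, under the assumption that adding a request type is equivalent to a bitwise OR. The helper request_type_for is hypothetical and not part of natto; the constant values are copied from the list above.

```ruby
# Values copied from the constant summary above.
MECAB_LATTICE_ONE_BEST      = 1
MECAB_LATTICE_NBEST         = 2
MECAB_LATTICE_PARTIAL       = 4
MECAB_LATTICE_MARGINAL_PROB = 8
MECAB_LATTICE_ALL_MORPHS    = 32

# Hypothetical helper (not part of natto): computes the combined request
# type for an options hash, assuming "add request type" means bitwise OR.
def request_type_for(options)
  type = options[:nbest].to_i > 1 ? MECAB_LATTICE_NBEST : MECAB_LATTICE_ONE_BEST
  type |= MECAB_LATTICE_PARTIAL       if options[:partial]
  type |= MECAB_LATTICE_MARGINAL_PROB if options[:marginal]
  type |= MECAB_LATTICE_ALL_MORPHS    if options[:all_morphs]
  type
end

combined = request_type_for(nbest: 3, partial: true)
puts combined                                    # 6 (NBEST | PARTIAL)
puts((combined & MECAB_LATTICE_PARTIAL) != 0)    # true: the PARTIAL bit is set
```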
Constants included from OptionParse
OptionParse::SUPPORTED_OPTS, OptionParse::WARNING_LATTICE_LEVEL
Constants included from Binding
Instance Attribute Summary
-
#dicts ⇒ Array
readonly
Listing of all dictionaries referenced.
-
#lattice ⇒ FFI::Pointer
readonly
Pointer to MeCab Lattice.
-
#libpath ⇒ String
readonly
Absolute filepath to MeCab library.
-
#model ⇒ FFI::Pointer
readonly
Pointer to MeCab Model.
-
#options ⇒ Hash
readonly
MeCab options as key-value pairs.
-
#tagger ⇒ FFI::Pointer
readonly
Pointer to MeCab Tagger.
-
#version ⇒ String
readonly
MeCab version.
Class Method Summary
-
.create_free_proc(mptr, tptr, lptr) ⇒ Proc
Returns a Proc that will properly free resources when this instance is garbage collected.
Instance Method Summary
-
#enum_parse(text, constraints = {}) ⇒ Enumerator
Parses the given string text, returning an Enumerator that may be used to iterate over the resulting MeCabNode objects.
-
#initialize(options = {}) ⇒ MeCab
constructor
Initializes the wrapped Tagger instance with the given options.
-
#inspect ⇒ String
Overrides Object#inspect.
-
#parse(text, constraints = {}) ⇒ String
Parses the given text, returning the MeCab output as a single string.
-
#to_s ⇒ String
Returns human-readable details for the wrapped MeCab library.
Methods included from Binding
Constructor Details
#initialize(options = {}) ⇒ MeCab
Initializes the wrapped Tagger instance with the given options.
Options supported are:
- :rcfile -- resource file
- :dicdir -- system dicdir
- :userdic -- user dictionary
- :lattice_level -- lattice information level (DEPRECATED)
- :output_format_type -- output format type (wakati, chasen, yomi, etc.)
- :all_morphs -- output all morphs (default false)
- :nbest -- output N best results (integer, default 1), requires lattice level >= 1
- :partial -- partial parsing mode
- :marginal -- output marginal probability
- :max_grouping_size -- maximum grouping size for unknown words (default 24)
- :node_format -- user-defined node format
- :unk_format -- user-defined unknown node format
- :bos_format -- user-defined beginning-of-sentence format
- :eos_format -- user-defined end-of-sentence format
- :eon_format -- user-defined end-of-NBest format
- :unk_feature -- feature for unknown word
- :input_buffer_size -- set input buffer size (default 8192)
- :allocate_sentence -- allocate new memory for input sentence
- :theta -- temperature parameter theta (float, default 0.75)
- :cost_factor -- cost factor (integer, default 700)
MeCab command-line arguments, either short (-F) or long (--node-format), may be used in addition to Ruby-style hashes.
Use single-quotes to preserve format options that contain escape chars.
e.g.
nm = Natto::MeCab.new(node_format: '%m\t%f[7]\n')
=> #<Natto::MeCab:0x00000803503ee8 \
@model=#<FFI::Pointer address=0x00000802b6d9c0>, \
@tagger=#<FFI::Pointer address=0x00000802ad3ec0>, \
@lattice=#<FFI::Pointer address=0x000008035f3980>, \
@libpath="/usr/local/lib/libmecab.so", \
@options={:node_format=>"%m\\t%f[7]\\n"}, \
@dicts=[#<Natto::DictionaryInfo:0x000008035038f8 \
@filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
charset=utf8, \
type=0>], \
@version=0.996>
puts nm.parse('才能とは求める人間に与えられるものではない。')
才能 サイノウ
と ト
は ハ
求める モトメル
人間 ニンゲン
に ニ
与え アタエ
られる ラレル
もの モノ
で デ
は ハ
ない ナイ
。 。
EOS
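The advice about single quotes matters because Ruby expands escape sequences such as \t inside double-quoted strings before MeCab ever sees them. The sketch below shows the difference, and then illustrates how a Ruby-style hash key might correspond to a MeCab long option. The helper to_long_options is hypothetical and only mirrors the naming pattern (underscores become hyphens); it is not natto's own option-building code.

```ruby
single = '%m\t%f[7]\n'   # backslash sequences are kept literally for MeCab
double = "%m\t%f[7]\n"   # Ruby converts them to a real tab and newline

puts single.length       # 11
puts double.length       # 9

# Hypothetical helper (not natto's implementation): renders a Ruby-style
# options hash as MeCab long options, e.g. :node_format => --node-format.
def to_long_options(options)
  options.map { |key, value| "--#{key.to_s.tr('_', '-')}=#{value}" }.join(' ')
end

puts to_long_options(node_format: single)
```

The last line prints the long-option form of the node format with the backslash sequences still intact, which is the form MeCab's own format parser expects.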
# File 'lib/natto/natto.rb', line 228
def initialize(options={})
  @options = self.class.parse_mecab_options(options)
  opt_str = self.class.build_options_str(@options)

  @model = self.class.mecab_model_new2(opt_str)
  if @model.address == 0x0
    raise MeCabError.new("Could not initialize Model with options: '#{opt_str}'")
  end

  @tagger = self.class.mecab_model_new_tagger(@model)
  if @tagger.address == 0x0
    raise MeCabError.new("Could not initialize Tagger with options: '#{opt_str}'")
  end

  @lattice = self.class.mecab_model_new_lattice(@model)
  if @lattice.address == 0x0
    raise MeCabError.new("Could not initialize Lattice with options: '#{opt_str}'")
  end

  @libpath = self.class.find_library

  if @options[:nbest] && @options[:nbest] > 1
    self.mecab_lattice_set_request_type(@lattice, MECAB_LATTICE_NBEST)
  else
    self.mecab_lattice_set_request_type(@lattice, MECAB_LATTICE_ONE_BEST)
  end
  if @options[:partial]
    self.mecab_lattice_add_request_type(@lattice, MECAB_LATTICE_PARTIAL)
  end
  if @options[:marginal]
    self.mecab_lattice_add_request_type(@lattice, MECAB_LATTICE_MARGINAL_PROB)
  end
  if @options[:all_morphs]
    # required when node parsing
    #self.mecab_lattice_add_request_type(@lattice, MECAB_LATTICE_NBEST)
    self.mecab_lattice_add_request_type(@lattice, MECAB_LATTICE_ALL_MORPHS)
  end
  if @options[:allocate_sentence]
    self.mecab_lattice_add_request_type(@lattice, MECAB_LATTICE_ALLOCATE_SENTENCE)
  end

  if @options[:theta]
    self.mecab_lattice_set_theta(@lattice, @options[:theta])
  end

  @parse_tostr = ->(text, constraints) {
    begin
      if @options[:nbest] && @options[:nbest] > 1
        n = @options[:nbest]
      else
        n = 1
      end

      if constraints[:boundary_constraints]
        tokens = tokenize_by_pattern(text, constraints[:boundary_constraints])
        text = tokens.map {|t| t.first}.join
        self.mecab_lattice_set_sentence(@lattice, text)

        bpos = 0
        tokens.each do |token|
          c = token.first.bytes.count

          self.mecab_lattice_set_boundary_constraint(@lattice, bpos, MECAB_TOKEN_BOUNDARY)
          bpos += 1

          mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
          (c-1).times do
            self.mecab_lattice_set_boundary_constraint(@lattice, bpos, mark)
            bpos += 1
          end
        end
      elsif constraints[:feature_constraints]
        features = constraints[:feature_constraints]
        tokens = tokenize_by_features(text, features.keys)
        text = tokens.map {|t| t.first}.join
        self.mecab_lattice_set_sentence(@lattice, text)

        bpos = 0
        tokens.each do |token|
          chunk = token.first
          c = chunk.bytes.count
          if token.last
            self.mecab_lattice_set_feature_constraint(@lattice, bpos, bpos+c, features[chunk])
          end
          bpos += c
        end
      else
        self.mecab_lattice_set_sentence(@lattice, text)
      end

      self.mecab_parse_lattice(@tagger, @lattice)

      if n > 1
        retval = self.mecab_lattice_nbest_tostr(@lattice, n)
      else
        retval = self.mecab_lattice_tostr(@lattice)
      end
      retval.force_encoding(Encoding.default_external)
    rescue => ex
      message = self.mecab_lattice_strerror(@lattice)
      raise ex if message == ''
      raise MeCabError.new(message)
    end
  }

  @parse_tonodes = ->(text, constraints) {
    Enumerator.new do |y|
      begin
        if @options[:nbest] && @options[:nbest] > 1
          n = @options[:nbest]
        else
          n = 1
        end

        if constraints[:boundary_constraints]
          tokens = tokenize_by_pattern(text, constraints[:boundary_constraints])
          text = tokens.map {|t| t.first}.join
          self.mecab_lattice_set_sentence(@lattice, text)

          bpos = 0
          tokens.each do |token|
            c = token.first.bytes.count

            self.mecab_lattice_set_boundary_constraint(@lattice, bpos, MECAB_TOKEN_BOUNDARY)
            bpos += 1

            mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
            (c-1).times do
              self.mecab_lattice_set_boundary_constraint(@lattice, bpos, mark)
              bpos += 1
            end
          end
        elsif constraints[:feature_constraints]
          features = constraints[:feature_constraints]
          tokens = tokenize_by_features(text, features.keys)
          text = tokens.map {|t| t.first}.join
          self.mecab_lattice_set_sentence(@lattice, text)

          bpos = 0
          tokens.each do |token|
            chunk = token.first
            c = chunk.bytes.count
            if token.last
              self.mecab_lattice_set_feature_constraint(@lattice, bpos, bpos+c, features[chunk])
            end
            bpos += c
          end
        else
          self.mecab_lattice_set_sentence(@lattice, text)
        end

        self.mecab_parse_lattice(@tagger, @lattice)

        n.times do
          check = self.mecab_lattice_next(@lattice)
          if check
            nptr = self.mecab_lattice_get_bos_node(@lattice)
            while nptr && nptr.address!=0x0
              mn = Natto::MeCabNode.new(nptr)
              if !mn.is_bos?
                surf = mn[:surface].bytes.to_a.slice(0,mn.length).pack('C*')
                mn.surface = surf.force_encoding(Encoding.default_external)
                if @options[:output_format_type] || @options[:node_format]
                  mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
                end
                y.yield mn
              end
              nptr = mn[:next]
            end
          end
        end
        nil
      rescue => ex
        message = self.mecab_lattice_strerror(@lattice)
        raise ex if message == ''
        raise MeCabError.new(message)
      end
    end
  }

  @dicts = []
  @dicts << Natto::DictionaryInfo.new(self.mecab_model_dictionary_info(@model))
  while @dicts.last.next.address != 0x0
    @dicts << Natto::DictionaryInfo.new(@dicts.last.next)
  end

  @version = self.mecab_version

  ObjectSpace.define_finalizer(self, self.class.create_free_proc(@model, @tagger, @lattice))
end
Instance Attribute Details
#dicts ⇒ Array (readonly)
Returns listing of all dictionaries referenced.
# File 'lib/natto/natto.rb', line 164
def dicts
  @dicts
end
#lattice ⇒ FFI::Pointer (readonly)
Returns pointer to MeCab Lattice.
# File 'lib/natto/natto.rb', line 158
def lattice
  @lattice
end
#libpath ⇒ String (readonly)
Returns absolute filepath to MeCab library.
# File 'lib/natto/natto.rb', line 160
def libpath
  @libpath
end
#model ⇒ FFI::Pointer (readonly)
Returns pointer to MeCab Model.
# File 'lib/natto/natto.rb', line 154
def model
  @model
end
#options ⇒ Hash (readonly)
Returns MeCab options as key-value pairs.
# File 'lib/natto/natto.rb', line 162
def options
  @options
end
#tagger ⇒ FFI::Pointer (readonly)
Returns pointer to MeCab Tagger.
# File 'lib/natto/natto.rb', line 156
def tagger
  @tagger
end
#version ⇒ String (readonly)
Returns MeCab version.
# File 'lib/natto/natto.rb', line 166
def version
  @version
end
Class Method Details
.create_free_proc(mptr, tptr, lptr) ⇒ Proc
Returns a Proc that will properly free resources
when this instance is garbage collected.
# File 'lib/natto/natto.rb', line 573
def self.create_free_proc(mptr, tptr, lptr)
  Proc.new do
    self.mecab_lattice_destroy(lptr)
    self.mecab_destroy(tptr)
    self.mecab_model_destroy(mptr)
  end
end
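Building the finalizer Proc in a class method, as done here, is a common Ruby idiom: a Proc created inside an instance method would capture the instance itself and could keep it from ever being collected. The following is a minimal, self-contained sketch of the same pattern with a stand-in class. Handle and the log array are hypothetical and only replace the native pointers for illustration.

```ruby
# Hypothetical stand-in for an object that owns a native resource.
class Handle
  attr_reader :name

  def initialize(name, log)
    @name = name
    # Pass only the values the cleanup needs; the Proc must not capture self.
    ObjectSpace.define_finalizer(self, self.class.create_free_proc(name, log))
  end

  # Built at class level, so the Proc closes over name and log only.
  def self.create_free_proc(name, log)
    Proc.new { log << "freed #{name}" }
  end
end

log = []
free = Handle.create_free_proc('lattice', log)
free.call   # invoke the cleanup directly, instead of waiting for GC
p log
```

Calling the Proc by hand shows what the garbage collector would eventually run; the actual timing of finalizers is up to the Ruby runtime.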
Instance Method Details
#enum_parse(text, constraints = {}) ⇒ Enumerator
Parses the given string text, returning an Enumerator that may be
used to iterate over the resulting Natto::MeCabNode objects. This is more
efficient than parsing to a simple string, since each node's
information is not materialized all at once as it is with
string output.
MeCab nodes contain much more detailed information about
the morpheme. Node-formatting may also be used to customize
the resulting node's feature attribute.
Boundary constraint parsing is available by passing the
boundary_constraints key in the constraints hash. Boundary constraint
parsing provides hints to MeCab on where the morpheme boundaries in the
given text are located. The boundary_constraints value may be either a
Regexp or a String; please see String#scan.
Feature constraint parsing is available by passing the
feature_constraints key in the constraints hash. Feature constraint
parsing instructs MeCab to use the indicated feature
for any morpheme that is an exact match for the given key.
feature_constraints is a hash mapping a specific morpheme (String)
to a corresponding feature value (String).
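To make the boundary hints more concrete, the sketch below reproduces the per-byte marking loop from the constructor source in plain Ruby, without calling MeCab. The helper split_by_pattern is a hypothetical stand-in for natto's tokenize_by_pattern, whose implementation is not shown on this page; the constant values are taken from the constant summary.

```ruby
MECAB_ANY_BOUNDARY   = 0
MECAB_TOKEN_BOUNDARY = 1
MECAB_INSIDE_TOKEN   = 2

# Hypothetical helper: splits text into [chunk, matched?] pairs.
def split_by_pattern(text, pattern)
  tokens = []
  rest = text
  until rest.empty?
    before, match, after = rest.partition(pattern)
    tokens << [before, false] unless before.empty?
    if match.empty?            # no further (non-empty) match
      tokens << [after, false] unless after.empty?
      break
    end
    tokens << [match, true]
    rest = after
  end
  tokens
end

# One mark per byte: every chunk starts a token; the remaining bytes of a
# matched chunk are forced inside that token, the others are left open.
def boundary_marks(tokens)
  marks = []
  tokens.each do |chunk, matched|
    marks << MECAB_TOKEN_BOUNDARY
    inner = matched ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
    (chunk.bytesize - 1).times { marks << inner }
  end
  marks
end

text  = 'ab見えcd'
marks = boundary_marks(split_by_pattern(text, /見え/))
p marks
```

The marks are counted in bytes rather than characters, which is why a two-character match in UTF-8 Japanese text occupies six positions.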
# File 'lib/natto/natto.rb', line 517
def enum_parse(text, constraints={})
  if text.nil?
    raise ArgumentError.new 'Text to parse cannot be nil'
  elsif constraints[:boundary_constraints]
    if !(constraints[:boundary_constraints].is_a?(Regexp) ||
         constraints[:boundary_constraints].is_a?(String))
      raise ArgumentError.new 'boundary constraints must be a Regexp or String'
    end
  elsif constraints[:feature_constraints] && !constraints[:feature_constraints].is_a?(Hash)
    raise ArgumentError.new 'feature constraints must be a Hash'
  elsif @options[:partial] && !text.end_with?("\n")
    raise ArgumentError.new 'partial parsing requires new-line char at end of text'
  end

  @parse_tonodes.call(text, constraints)
end
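The Enumerator returned here produces items only as they are requested. The sketch below demonstrates that behaviour with plain strings standing in for MeCabNode objects. It is an illustration of Ruby's Enumerator, not of natto's internals, and the produced array exists only to record when each item is generated.

```ruby
# Stand-in for MeCab nodes: plain strings, produced one at a time.
produced = []
nodes = Enumerator.new do |y|
  %w[凡人 に しか 見え].each do |surface|
    produced << surface   # record when each item is actually generated
    y.yield surface
  end
end

first_two = nodes.take(2)
p first_two
p produced   # only the items that were actually requested
```

Note that nothing runs until the Enumerator is iterated; in the constructor source the same holds for the block passed to Enumerator.new, so the parse itself is deferred until the first node is requested.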
#inspect ⇒ String
Overrides Object#inspect.
# File 'lib/natto/natto.rb', line 563
def inspect
  self.to_s
end
#parse(text, constraints = {}) ⇒ String
Parses the given text, returning the MeCab output as a single string.
If a block is passed to this method, then node parsing will be used
and each node yielded to the given block.
Boundary constraint parsing is available by passing the
boundary_constraints key in the constraints hash. Boundary constraint
parsing provides hints to MeCab on where the morpheme boundaries in the
given text are located. The boundary_constraints value may be either a
Regexp or a String; please see String#scan.
The boundary constraint parsed output will be returned as a single
string, unless a block is passed to this method for node parsing.
Feature constraint parsing is available by passing the
feature_constraints key in the constraints hash. Feature constraint
parsing instructs MeCab to use the indicated feature
for any morpheme that is an exact match for the given key.
feature_constraints is a hash mapping a specific morpheme (String)
to a corresponding feature value (String).
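For feature constraints, the constructor source passes MeCab a byte range and a feature string for each exactly matching chunk. The sketch below computes those ranges in plain Ruby. The helper split_by_keys is a hypothetical stand-in for natto's tokenize_by_features, which is not shown on this page, and the feature value used is only an example.

```ruby
# Hypothetical helper: splits text into [chunk, constrained?] pairs, where
# constrained chunks are exact matches of one of the given keys.
def split_by_keys(text, keys)
  pattern = Regexp.union(keys)
  tokens = []
  rest = text
  until rest.empty?
    before, match, after = rest.partition(pattern)
    tokens << [before, false] unless before.empty?
    if match.empty?
      tokens << [after, false] unless after.empty?
      break
    end
    tokens << [match, true]
    rest = after
  end
  tokens
end

# Returns [begin_byte, end_byte, feature] for every constrained chunk,
# following the bpos / bpos+c arithmetic in the constructor source.
def feature_ranges(text, features)
  ranges = []
  bpos = 0
  split_by_keys(text, features.keys).each do |chunk, constrained|
    c = chunk.bytesize
    ranges << [bpos, bpos + c, features[chunk]] if constrained
    bpos += c
  end
  ranges
end

ranges = feature_ranges('ab見参cd', '見参' => '感動詞')
p ranges
```

As with boundary constraints, positions are byte offsets, so the end offset depends on the text's encoding.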
# File 'lib/natto/natto.rb', line 465
def parse(text, constraints={})
  if text.nil?
    raise ArgumentError.new 'Text to parse cannot be nil'
  elsif constraints[:boundary_constraints]
    if !(constraints[:boundary_constraints].is_a?(Regexp) ||
         constraints[:boundary_constraints].is_a?(String))
      raise ArgumentError.new 'boundary constraints must be a Regexp or String'
    end
  elsif constraints[:feature_constraints] && !constraints[:feature_constraints].is_a?(Hash)
    raise ArgumentError.new 'feature constraints must be a Hash'
  elsif @options[:partial] && !text.end_with?("\n")
    raise ArgumentError.new 'partial parsing requires new-line char at end of text'
  end

  if block_given?
    @parse_tonodes.call(text, constraints).each {|n| yield n }
  else
    @parse_tostr.call(text, constraints)
  end
end
#to_s ⇒ String
Returns human-readable details for the wrapped MeCab library.
Overrides Object#to_s. The returned string includes:
- encoded object id
- underlying FFI pointer to the MeCab Model
- underlying FFI pointer to the MeCab Tagger
- underlying FFI pointer to the MeCab Lattice
- real file path to MeCab library
- options hash
- list of dictionaries
- MeCab version
# File 'lib/natto/natto.rb', line 548
def to_s
  [ super.chop,
    "@model=#{@model},",
    "@tagger=#{@tagger},",
    "@lattice=#{@lattice},",
    "@libpath=\"#{@libpath}\",",
    "@options=#{@options.inspect},",
    "@dicts=#{@dicts.to_s},",
    "@version=#{@version.to_s}>" ].join(' ')
end
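The super.chop call is what produces the leading encoded object id: Object#to_s returns a string such as #<ClassName:0x...>, and chop removes the closing bracket so further fields can be appended. A minimal sketch of the same idea with a hypothetical Wrapper class and two made-up fields:

```ruby
# Hypothetical class illustrating the super.chop idiom.
class Wrapper
  def initialize
    @libpath = '/usr/local/lib/libmecab.so'
    @version = '0.996'
  end

  def to_s
    [ super.chop,                     # "#<Wrapper:0x..." without the ">"
      "@libpath=\"#{@libpath}\",",
      "@version=#{@version}>" ].join(' ')
  end

  # Same delegation as in the documented class.
  def inspect
    to_s
  end
end

puts Wrapper.new
```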