Introduction
Rseg is a Chinese Word Segmentation(中文分词) routine in pure Ruby.
The algorithm is based on this article: xiecc.blog.163.com/blog/static/14032200671110224190/
Usage
Rseg now support two modes: inline and C/S mode.
-
Inline mode
> require ‘rubygems’ > require ‘rseg’ > Rseg.segment(“需要分词的文章”)
- ‘需要’, ‘分词’, ‘的’, ‘文章’
-
The first call to Rseg#segment will need about 30 seconds to load the dictionary, the second call will be very fast, you can also call Rseg#load to load dictionaries manually.
-
C/S mode
$ rseg_server
Sinatra/0.9.4 has taken the stage on 4100
This will start rseg server on localhost:4100
You can visit it via your browser or the rseg command.
$ rseg ‘需要分词的文章’ 需要 分词 的 文章
You can also access server with the Rseg#remote_segment
$ irb > require ‘rubygems’ > require ‘rseg’ > Rseg.remote_segment(“需要分词的文章”) # This will be very fast
- ‘需要’, ‘分词’, ‘的’, ‘文章’
-
Performance
About 5M character/s on my Macbook (Intel Core 2 Duo 2GHz/4G mem).
License
Rseg includes two built-in dictionaries:
-
CC-CEDICT (cc-cedict.org/wiki/) with Creative Commons Attribution-Share Alike 3.0 License (creativecommons.org/licenses/by-sa/3.0/)
-
Wikipedia Chinese article title list (download.wikimedia.org/zhwiki/) with Creative Commons Attribution-Share Alike 3.0 License(creativecommons.org/licenses/by-sa/3.0/)
The codes and others in Rseg are licensed under MIT license.
Feedback
All feedback are welcome, Yuanyi Zhang(zhangyuanyi#gmail.com)
-
-