analects.rb


Public datasets on the Chinese language, accessible from Ruby

Download the data

With Rake

# Rakefile
require 'analects/rake_tasks'

Analects.init_rake_tasks do
  data_dir '/tmp/analects' # defaults to ~/.analects

  task :import_cedict do
    library.cedict.each do |entry|
      # ..
    end
  end
end

Then run the download tasks from the shell:

rake analects:download:all        # download all sources
rake analects:download:cedict     # download CC-CEDICT
rake analects:download:chise_ids  # download Chise-IDS
rake analects:download:hsk        # download HSK data
rake analects:download:unihan     # download Unihan database

Or from Ruby

analects = Analects::Library.new(data_dir: '/tmp/analects')
analects.cedict.retrieve
analects.chise_ids.retrieve

Use the data

analects = Analects::Library.new(data_dir: '/tmp/analects')
analects.cedict.take(3)
# => [["AA制", "AA制", "A A zhi4", "/to split the bill/to go Dutch/"], ["A咖", "A咖", "A ka1", "/class \"A\"/top grade/"], ["A片", "A片", "A pian4", "/adult movie/pornography/"]]
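Each entry in the output above is a four-element array. Going by the CC-CEDICT file format, the fields are presumably the traditional form, the simplified form, numbered pinyin, and a slash-delimited definition string. Here is a minimal sketch of turning such arrays into hashes, using the sample output as literal data. The `parse_entry` helper and the hash keys are hypothetical and not part of Analects.

```ruby
# Sample rows copied from the output above; in practice they would come
# from analects.cedict.
entries = [
  ["AA制", "AA制", "A A zhi4", "/to split the bill/to go Dutch/"],
  ["A咖", "A咖", "A ka1", "/class \"A\"/top grade/"],
  ["A片", "A片", "A pian4", "/adult movie/pornography/"]
]

# Hypothetical helper (not part of Analects): name the fields and split
# the slash-delimited definition string into separate glosses.
def parse_entry(entry)
  traditional, simplified, pinyin, definitions = entry
  {
    traditional: traditional,
    simplified: simplified,
    pinyin: pinyin,
    definitions: definitions.split("/").reject(&:empty?)
  }
end

parsed = entries.map { |entry| parse_entry(entry) }
puts parsed.first[:definitions].inspect
```

Splitting on `/` leaves an empty leading element, which is why the empty strings are rejected.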

analects.chise_ids.to_a.sample(3)
# => e.g. [["U+59BF", "妿", "⿱加女"], ["U-0002441B", "𤐛", "⿰火閙"], ["U+83A1", "莡", "⿱艹足"]]
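Each Chise-IDS row pairs a code point label and a character with its Ideographic Description Sequence. Here is a rough sketch of indexing characters by their components, with rows copied from the sample above. `IDC_RANGE` and `component_index` are illustrative names, not part of Analects. The sketch flattens nested sequences instead of parsing them as a tree.

```ruby
# Sample rows copied from the output above; in practice they would come
# from analects.chise_ids.
rows = [
  ["U+59BF", "妿", "⿱加女"],
  ["U-0002441B", "𤐛", "⿰火閙"],
  ["U+83A1", "莡", "⿱艹足"]
]

# The original twelve Ideographic Description Characters occupy
# U+2FF0..U+2FFB; they describe layout and are not components themselves.
IDC_RANGE = (0x2FF0..0x2FFB)

# Map each component to the characters whose sequence contains it.
component_index = Hash.new { |hash, key| hash[key] = [] }
rows.each do |_codepoint, char, ids|
  ids.each_char do |part|
    next if IDC_RANGE.cover?(part.ord)
    component_index[part] << char
  end
end

puts component_index["女"].inspect
```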

Other stuff

Analects wraps RMMSeg for easy segmentation of Chinese text:

Analects::Tokenizer.new.tokenize("为待那个朋友拿哟出来,咿呀噢哎…")
# => ["为", "待", "那个", "朋友", "拿", "哟", "出来", ",", "咿", "呀", "噢", "哎", "…"]
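The tokenizer output is a plain array of strings, so ordinary Ruby enumerable methods apply. Here is a small sketch, using the token list above as literal data, that drops punctuation and counts word frequencies. The variable names are illustrative only.

```ruby
# Token list copied from the output above; in practice it would come from
# Analects::Tokenizer.new.tokenize(text).
tokens = ["为", "待", "那个", "朋友", "拿", "哟", "出来", ",", "咿", "呀", "噢", "哎", "…"]

# Keep only tokens made up entirely of Han characters, which discards
# punctuation such as the comma and the ellipsis.
words = tokens.select { |token| token.match?(/\A\p{Han}+\z/) }

# Count how often each word occurs.
frequencies = words.each_with_object(Hash.new(0)) do |word, counts|
  counts[word] += 1
end

puts words.length
puts frequencies["朋友"]
```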

If you have Chinese text in GB or Big5 encoding, these helpers are available:

Analects::Encoding.valid_cjk(str)
Analects::Encoding.from_gb(str)   # returns UTF-8
Analects::Encoding.from_big5(str) # returns UTF-8
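For context, Ruby's standard `String#encode` can perform this kind of conversion on its own. The sketch below is not Analects' implementation. It only illustrates converting GB and Big5 bytes to UTF-8, assuming the input really is GBK or Big5. The helper names `gb_to_utf8` and `big5_to_utf8` are hypothetical.

```ruby
# Hypothetical helpers (not Analects' API): reinterpret the raw bytes in
# the given legacy encoding, then transcode them to UTF-8.
def gb_to_utf8(bytes)
  bytes.dup.force_encoding("GBK").encode("UTF-8")
end

def big5_to_utf8(bytes)
  bytes.dup.force_encoding("Big5").encode("UTF-8")
end

# Build sample legacy-encoded input by transcoding a UTF-8 literal, then
# strip the encoding label with String#b to mimic bytes read from a file.
gb_bytes   = "中文".encode("GBK").b
big5_bytes = "中文".encode("Big5").b

puts gb_to_utf8(gb_bytes)
puts big5_to_utf8(big5_bytes)
```

With this approach, bytes that are invalid in the assumed encoding raise an exception during `encode`, which is one reason to validate input first.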

License

Copyright ⓒ Arne Brasseur 2012-2014

Licensed under the GNU General Public License, version 3 (GPLv3)