UnihanLang

unihan_lang is a Ruby library that identifies whether text is written in Traditional or Simplified Chinese and provides a set of checks on the Chinese characters it contains.

This document can also be read in Japanese.

Installation

Add this line to your application's Gemfile:

gem 'unihan_lang'

And then execute:

bundle install

Or install it yourself as:

gem install unihan_lang

Usage

require 'unihan_lang'

unihan = UnihanLang::Unihan.new

# Language determination
puts unihan.determine_language("這是繁體中文") # => "ZH_TW"
puts unihan.determine_language("这是简体中文") # => "ZH_CN"

# Check if text is Traditional Chinese
puts unihan.zh_tw?("這是繁體中文") # => true
puts unihan.zh_tw?("这不是繁体中文") # => false

# Check if text is Simplified Chinese
puts unihan.zh_cn?("这是简体中文") # => true
puts unihan.zh_cn?("這不是簡體中文") # => false

# Check if text contains Chinese characters
puts unihan.contains_chinese?("This text contains 中文") # => true
puts unihan.contains_chinese?("This text has no Chinese") # => false

# Extract Chinese characters from text
puts unihan.extract_chinese_characters("This text contains 中文").join # => "中文"

# Check if text consists only of Traditional Chinese characters
puts unihan.only_zh_tw?("繁體") # => true
puts unihan.only_zh_tw?("繁體简体") # => false

# Check if text consists only of Simplified Chinese characters
puts unihan.only_zh_cn?("简体") # => true
puts unihan.only_zh_cn?("简体繁體") # => false

# Check if text contains Traditional Chinese characters
puts unihan.contains_zh_tw?("這個text包含繁體字") # => true
puts unihan.contains_zh_tw?("这个text不包含繁体字") # => false

# Check if text contains Simplified Chinese characters
puts unihan.contains_zh_cn?("这个text包含简体字") # => true
puts unihan.contains_zh_cn?("這個text不包含簡體字") # => false
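
The calls above can be combined to sort a batch of strings by detected language. What follows is a minimal sketch, not part of unihan_lang: group_by_language is a hypothetical helper, and it assumes only that the object passed in responds to determine_language as shown above.

```ruby
# Hypothetical helper (not part of unihan_lang).
# Groups strings by whatever label detector.determine_language returns.
# `detector` is assumed to be any object with a determine_language(text)
# method, for example an instance of UnihanLang::Unihan.
def group_by_language(detector, texts)
  texts.group_by { |text| detector.determine_language(text) }
end
```

Passing UnihanLang::Unihan.new as the detector returns a hash keyed by the labels that determine_language produces, with each key mapping to the strings that received that label.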

Features

  • determine_language(text): Returns the detected language of the text as one of "ZH_TW", "ZH_CN", "JA", or "Unknown".
  • zh_tw?(text): Checks if the text is in Traditional Chinese.
  • zh_cn?(text): Checks if the text is in Simplified Chinese.
  • contains_chinese?(text): Checks if the text contains Chinese characters.
  • extract_chinese_characters(text): Extracts Chinese characters from the text.
  • only_zh_tw?(text): Checks if the text consists only of Traditional Chinese characters.
  • only_zh_cn?(text): Checks if the text consists only of Simplified Chinese characters.
  • contains_zh_tw?(text): Checks if the text contains Traditional Chinese characters.
  • contains_zh_cn?(text): Checks if the text contains Simplified Chinese characters.
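
For intuition about what contains_chinese? and extract_chinese_characters do, the sketch below approximates them with Ruby's built-in \p{Han} regex property. This is an illustration under stated assumptions, not the library's implementation: \p{Han} matches any Han-script character, including Japanese kanji, and it cannot tell Traditional from Simplified. The names han_present? and han_characters are hypothetical.

```ruby
# Approximation only: \p{Han} matches every Han-script character,
# so it cannot distinguish Traditional, Simplified, or Japanese usage.

# True if the text contains at least one Han-script character.
def han_present?(text)
  text.match?(/\p{Han}/)
end

# Returns the Han-script characters of the text as an array, in order.
def han_characters(text)
  text.scan(/\p{Han}/)
end

puts han_present?("This text contains 中文")        # => true
puts han_present?("This text has no Chinese")       # => false
puts han_characters("This text contains 中文").join # => "中文"
```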

Note

This library does not guarantee 100% accuracy in language identification. Results may be unreliable for short texts and for texts that mix languages, in part because many characters are shared by Traditional and Simplified Chinese. The distinction between Traditional and Simplified Chinese is based on the Unihan database.
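
Because short and mixed texts are the hard cases, a caller may prefer to flag them rather than trust a single label. The sketch below shows one way to do that, assuming only the contains_zh_tw? and contains_zh_cn? methods listed under Features. script_profile is a hypothetical helper, and the symbols it returns are illustrative.

```ruby
# Hypothetical helper (not part of unihan_lang).
# Classifies text by which script-specific characters the detector reports.
# `detector` is assumed to respond to contains_zh_tw?(text) and
# contains_zh_cn?(text), as UnihanLang::Unihan does in the Usage section.
def script_profile(detector, text)
  has_tw = detector.contains_zh_tw?(text)
  has_cn = detector.contains_zh_cn?(text)

  if has_tw && has_cn
    :mixed          # both predicates fired; treat the label with caution
  elsif has_tw
    :traditional
  elsif has_cn
    :simplified
  else
    :undetermined   # neither predicate fired, e.g. no script-specific characters
  end
end
```

Text for which both predicates return true is reported as :mixed, so the caller can route it to manual review or a fallback instead of relying on determine_language alone.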