Module: WenlinDbScanner::Dicts
- Defined in: lib/wenlin_db_scanner/dict.rb
Overview
Parses the data in the dictionary databases.
Class Method Summary
- .en_zh(db_root) ⇒ Enumerator<DictEntry>
  The entries in the English->Chinese dictionary.
- .entries(db_file) ⇒ Enumerator<DictEntry>
  Generic decoder for a database of dictionary entries.
- .key_frequency(key) ⇒ Integer?
  The frequency information expressed in a dictionary key.
- .key_latin_frequency(key) ⇒ Boolean
  The Latin frequency information expressed in a dictionary key.
- .key_latin_term(key) ⇒ String
  The term defined by a dictionary key, spelled using Latin characters.
- .key_term(key) ⇒ String
  The term defined by a dictionary key.
- .zh_en(db_root) ⇒ Enumerator<DictEntry>
  The entries in the Chinese->English dictionary.
Class Method Details
.en_zh(db_root) ⇒ Enumerator<DictEntry>
The entries in the English->Chinese dictionary.
    # File 'lib/wenlin_db_scanner/dict.rb', line 11
    def self.en_zh(db_root)
      entries File.join(db_root, 'yinghan.db')
    end
.entries(db_file) ⇒ Enumerator<DictEntry>
Generic decoder for a database of dictionary entries.
    # File 'lib/wenlin_db_scanner/dict.rb', line 27
    def self.entries(db_file)
      Enumerator.new do |yielder|
        db = Db.new db_file
        db.records.each do |record|
          next if record.binary?
          lines = record.text.split("\n").map(&:strip).reject(&:empty?)

          key = lines[0]
          entry = DictEntry.new
          entry.key = key
          entry.term = key_term key
          entry.latin_term = key_latin_term key
          entry.term_frequency = key_frequency key
          entry.latin_frequency_boost = key_latin_frequency key

          collect_values = false
          lines[1..-1].each do |line|
            tag, data = *line.split(' ', 2)
            tag_parts = /^(\d*)(\w+)(\@.*)?$/.match tag
            unless tag_parts
              raise "Unknown tag format #{tag} in dictionary entry!\n#{record.text}"
            end

            case tag_parts[2]
            when 'ipa'
              prop = :ipa
            when 'a'
              prop = :abbreviates
            when 'c'
              prop = nil
              prop1 = :zh
              data1 = data.gsub(/\[[^\]]*\]/, '').strip
              prop2 = :zh_tw
              data2 = data.scan(/\[([^\]]*)\]/).map(&:first).join('; ').strip
              if data2.empty?
                data2 = data1
              else
                if data2.index '-'
                  # Handle entries like
                  data2 = data2.chars.map.with_index { |char, index|
                    (char == '-') ? data1[index] : char
                  }.join ''
                end
              end
            when 'd'
              prop = :defn
            when 'b'
              # NOTE: base of?
              prop = nil
              prop1 = :used_in_terms
              prop2 = :used_in_serials
              data1 = data.gsub(/\[[^\]]*\]/, '').strip
              data2 = data.scan(/\[([^\]]*)\]/).map(&:first).join('; ').strip
              collect_values = true
            when 'e'
              # NOTE: equivalent?
              prop = nil
              prop1 = :linked_terms
              prop2 = :linked_serials
              data1 = data.gsub(/\[[^\]]*\]/, '').strip
              data2 = data.scan(/\[([^\]]*)\]/).map(&:first).join('; ').strip
              collect_values = true
            when 'f'
              # e.g. 2.2 [XHPC:4]
              prop = :freq
              data = data.split('[', 2).first.strip
            when 'gr'
              prop = :grade
            when 'h'
              # NOTE: guessing this means it shows up in the application's help.
              #       it seems to only be set for technical terms
              prop = false
            when 'hz'
              prop = :example_zh
            when 'infl'
              prop = :inflection
            when 'j'
              # NOTE: jump?
              prop = :see_serial
            when 'k'
              prop = :see_term
            when 'm'
              prop = :measure_word
              # NOTE: stripping the complex hanzi, as it can be found by
              #       cross-referencing the measure word's key
              data = data.gsub(/\[[^\]]*\]/, '').strip
              data = data.split('/').map(&:strip)
            when 'n'
              # NOTE: the field of reference sometimes looks like "mus.[音]"
              data2 = data.scan(/\[([^\]]*)\]/).map(&:first).join('; ').strip
              if data2.empty?
                prop = :field
              else
                prop = nil
                prop1 = :field
                prop2 = :field_zh
                data1 = data.gsub(/\[[^\]]*\]/, '').strip
                data2 = data.scan(/\[([^\]]*)\]/).map(&:first).join('; ').strip
              end
            when 'note'
              prop = :note
            when 'o'
              prop = :construction
            when 'p'
              prop = :speech_part
            when 'q'
              prop = :usage
            when 'r', 'rem'
              # NOTE: skipping remarks / revisions for now; they might be
              #       interesting for research
              prop = false
            when 's'
              prop = :serial
            when 'sub'
              prop = nil
              prop1 = :extend
              prop2 = :extend_serial
              data1 = data.gsub(/\[[^\]]*\]/, '').strip
              data2 = data.scan(/\[([^\]]*)\]/).map(&:first).join('; ').strip
              collect_values = true
            when 'subof'
              prop = nil
              prop1 = :extended_from
              prop2 = :extended_from_serial
              data1 = data.gsub(/\[[^\]]*\]/, '').strip
              data2 = data.scan(/\[([^\]]*)\]/).map(&:first).join('; ').strip
              collect_values = true
            when 't'
              prop = :example_translation
            when 'u'
              prop = :unverified
              data = true
            when 'v'
              # NOTE: no idea what this is, only shows up once
              prop = false
            when 'w'
              prop = :reference
            when 'x'
              prop = :example
            when 'y'
              prop = :years
            else
              raise "Unknown tag #{tag} in dictionary entry!\n#{record.text}"
            end
            next if prop == false

            ops = if prop
              [[prop, data]]
            else
              [[prop1, data1], [prop2, data2]]
            end

            ops.each do |k, v|
              if tag_parts[1].empty?
                if collect_values
                  entry[k] ||= []
                  entry[k] << v
                else
                  entry[k] = v
                end
              else
                # Example: 31x means example: [blah, blah, [value]]
                indexes = tag_parts[1].chars.map do |char|
                  ((char == ?0) ? 10 : char.to_i) - 1
                end
                if indexes.any? { |i| i < 0 }
                  puts "Broken tag #{tag} #{tag_parts[1]} #{indexes.inspect}\n#{record.text}"
                end

                entry[k] ||= []
                unless entry[k].kind_of?(Array)
                  # Fix entries listing props x and 2x instead of 1x, 2x.
                  entry[k] = [entry[k]]
                end
                array = entry[k]
                indexes[0...-1].each do |i|
                  array[i] ||= []
                  unless array[i].kind_of?(Array)
                    # Fix entries listing props 1x and 12x instead of 11x, 12x.
                    array[i] = [array[i]]
                  end
                  array = array[i]
                end
                if collect_values
                  array[indexes.last] ||= []
                  array[indexes.last] << v
                else
                  array[indexes.last] = v
                end
              end
            end
          end
          yielder << entry
        end
      end
    end
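The decoder's two recurring steps, splitting bracketed data and turning a numeric tag prefix into a slot in a nested array, can be tried in isolation. The following is a minimal sketch, not part of the library: the helper names `split_bracketed`, `tag_indexes` and `store_nested` are hypothetical, and only the regular expressions and the index arithmetic are taken from the listing above.

```ruby
# Hypothetical helpers that mirror two steps of Dicts.entries.
# Only the regular expressions and index arithmetic come from the source.

TAG_FORMAT = /^(\d*)(\w+)(\@.*)?$/

# Splits "outside [inside]" data into the text outside the brackets and
# the bracketed parts joined with '; '.
def split_bracketed(data)
  outside = data.gsub(/\[[^\]]*\]/, '').strip
  inside = data.scan(/\[([^\]]*)\]/).map(&:first).join('; ').strip
  [outside, inside]
end

# Returns the tag name and the zero-based indexes encoded by its digit
# prefix. Digits are 1-based and '0' stands for 10.
def tag_indexes(tag)
  parts = TAG_FORMAT.match(tag)
  raise ArgumentError, "Unknown tag format #{tag}" unless parts
  indexes = parts[1].chars.map { |char| (char == '0' ? 10 : char.to_i) - 1 }
  [parts[2], indexes]
end

# Stores value at the nested position given by indexes, creating
# intermediate arrays (and wrapping stray scalars) along the way.
def store_nested(array, indexes, value)
  indexes[0...-1].each do |i|
    array[i] ||= []
    array[i] = [array[i]] unless array[i].is_a?(Array)
    array = array[i]
  end
  array[indexes.last] = value
end

field, field_zh = split_bracketed('mus.[音]')
name, indexes = tag_indexes('31x')
examples = []
store_nested(examples, indexes, 'value')

p field, field_zh   # "mus." and "音"
p name, indexes     # "x" and [2, 0]
p examples          # [nil, nil, ["value"]]
```

The last line matches the comment in the listing: a `31x` tag places its value in the first slot of the third nested array.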
.key_frequency(key) ⇒ Integer?
The frequency information expressed in a dictionary key.
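This is the relative frequency of the term among all terms with exactly the same spelling. For Chinese terms, the spelling is pinyin. The value is read from the superscript digits at the start of the key; the method returns nil when the key has no such prefix.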
    # File 'lib/wenlin_db_scanner/dict.rb', line 242
    def self.key_frequency(key)
      match = /^[^\p{L}]+/.match(key)
      return nil unless match
      match[0].tr('⁰¹²³⁴⁵⁶⁷⁸⁹' , '0123456789').to_i
    end
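As a quick illustration of the prefix decoding, the sketch below restates the method body as a standalone function and runs it on a few made-up keys. The keys are assumptions chosen for the example, not entries taken from a Wenlin database.

```ruby
# Standalone copy of the method body shown above, for experimentation.
def key_frequency(key)
  # Leading run of non-letter characters, if any.
  match = /^[^\p{L}]+/.match(key)
  return nil unless match
  # Map superscript digits to ASCII digits, then parse as an integer.
  match[0].tr('⁰¹²³⁴⁵⁶⁷⁸⁹', '0123456789').to_i
end

# Hypothetical keys, written with escapes so the code points are explicit.
p key_frequency("\u00B2hao")        # superscript two prefix => 2
p key_frequency("\u00B9\u2070hao")  # superscript one and zero => 10
p key_frequency('hao')              # no prefix => nil
```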
.key_latin_frequency(key) ⇒ Boolean
The Latin frequency information expressed in a dictionary key.
True if the term is the most frequent among all terms with the same Latin spelling, which the key marks with a trailing asterisk. For Chinese terms, the Latin spelling is pinyin with tone information removed.
    # File 'lib/wenlin_db_scanner/dict.rb', line 256
    def self.key_latin_frequency(key)
      key[-1] == ?*
    end
.key_latin_term(key) ⇒ String
The term defined by a dictionary key, spelled using Latin characters.
    # File 'lib/wenlin_db_scanner/dict.rb', line 231
    def self.key_latin_term(key)
      Chars. key_term(key)
    end
.key_term(key) ⇒ String
The term defined by a dictionary key.
    # File 'lib/wenlin_db_scanner/dict.rb', line 223
    def self.key_term(key)
      key.gsub(/[^\p{L}]/, '')
    end
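Because the pattern removes every character that is not a Unicode letter, the frequency prefix and the trailing asterisk are dropped, and so are spaces and punctuation inside the key. The sketch below restates `key_term` and the asterisk check from `key_latin_frequency` as standalone functions; both sample keys are made up for illustration.

```ruby
# Standalone copies of two method bodies from this page.
def key_term(key)
  # Keep only characters in the Unicode Letter category.
  key.gsub(/[^\p{L}]/, '')
end

def key_latin_frequency(key)
  key[-1] == '*'
end

# Hypothetical key: superscript one, "hao" with a precomposed a-caron
# (U+01CE), then an asterisk.
key = "\u00B9h\u01CEo*"

p key_term(key)               # the three letters only
p key_latin_frequency(key)    # true
p key_term('ice cream')       # "icecream", the space is stripped too
p key_latin_frequency('hao')  # false
```

One consequence worth keeping in mind when matching terms: a multi-word key loses its word boundaries, so comparisons against user input would need the same stripping applied.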
.zh_en(db_root) ⇒ Enumerator<DictEntry>
The entries in the Chinese->English dictionary.
    # File 'lib/wenlin_db_scanner/dict.rb', line 19
    def self.zh_en(db_root)
      entries File.join(db_root, 'cidian.db')
    end