Class: Chinese::Vocab
- Inherits: Object
- Includes: HelperMethods, WithValidations
- Defined in: lib/chinese_vocab/vocab.rb
Constant Summary
- OPTIONS =
Mandatory constant for the [WithValidations](rubydoc.info/github/bytesource/with_validations/file/README.md) module. Each key-value pair has the form:
`option_key => [default_value, validation]`

{:compact      => [false, lambda { |value| is_boolean?(value) }],
 :with_pinyin  => [true,  lambda { |value| is_boolean?(value) }],
 :thread_count => [8,     lambda { |value| value.kind_of?(Integer) }]}
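A standalone sketch of how this default-plus-validation pattern can be applied. The actual logic lives in the WithValidations module; `validate_option` below is a hypothetical helper written only for illustration:

```ruby
def is_boolean?(value)
  # Only true for exactly true or false.
  !!value == value
end

OPTIONS = {
  :compact      => [false, lambda { |value| is_boolean?(value) }],
  :with_pinyin  => [true,  lambda { |value| is_boolean?(value) }],
  :thread_count => [8,     lambda { |value| value.kind_of?(Integer) }]
}

# Return the user-supplied value if it passes validation,
# fall back to the default if the key is absent.
def validate_option(key, options)
  default, validation = OPTIONS.fetch(key)
  return default unless options.key?(key)
  value = options[key]
  raise ArgumentError, "Invalid value for #{key}: #{value.inspect}" unless validation.call(value)
  value
end

puts validate_option(:thread_count, {})               # default: 8
puts validate_option(:compact, { :compact => true })  # validated user value: true
```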
Instance Attribute Summary
-
#compact ⇒ Boolean
readonly
The value of the `:compact` option key.
-
#not_found ⇒ Array<String>
readonly
Stores all words that could not be found in any of the supported online dictionaries during a call to either #sentences or #min_sentences.
-
#stored_sentences ⇒ Array<Hash>
readonly
Holds the return value of either #sentences or #min_sentences, whichever was called last.
-
#with_pinyin ⇒ Boolean
readonly
The value of the `:with_pinyin` option key.
-
#words ⇒ Array<String>
readonly
The list of Chinese words after calling #edit_vocab.
Class Method Summary
-
.parse_words(path_to_csv, word_col, options = {}) ⇒ Array<String>
Extracts the vocabulary column from a CSV file as an array of strings.
-
.within_range?(column, row) ⇒ Boolean
Input: `column`: the word column number (counting from 1); `row`: a row of the parsed CSV data, used to check that the column exists.
Instance Method Summary
- #add_key(hash_array, key, &block) ⇒ Object
- #add_target_words(hash_array, words) ⇒ Object
- #alternate_source(sources, selection) ⇒ Object
- #contains_all_target_words?(selected_rows, sentence_key) ⇒ Boolean
- #convert(text) ⇒ Object
-
#edit_vocab(word_array) ⇒ Object
Remove all non-word characters.
- #find_minimum_sentences(sentences, words) ⇒ Object
-
#initialize(word_array, options = {}) ⇒ Vocab
constructor
Initializes an object.
- #is_boolean?(value) ⇒ Boolean
- #make_hash(*data) ⇒ Object
- #min_sentences(options) ⇒ Array<Hash>, []
- #occurrence_count(word_array, frequency) ⇒ Object
- #remove_er_character_from_end(word) ⇒ Object
- #remove_keys(hash_array, *keys) ⇒ Object
-
#remove_parens(word) ⇒ Object
Helper functions.
-
#remove_redundant_single_char_words(words) ⇒ Object
Input: ["看", "书", "看书"] Output: ["看书"].
- #remove_slash(word) ⇒ Object
-
#select_minimum_necessary_sentences(sentences) ⇒ Object
Deprecated. This method has been replaced by #find_minimum_sentences.
-
#select_sentence(word, options) ⇒ Object
Uses options passed from #sentences.
-
#sentences(options) ⇒ Hash
For every Chinese word in #words, fetches a Chinese sentence and its English translation from an online dictionary. The return value is also stored in #stored_sentences.
-
#sentences_unique_chars(sentences) ⇒ Array<String>
Finds the unique Chinese characters from either the data in #stored_sentences or an array of Chinese sentences passed as an argument.
- #sort_by_target_word_count(with_target_words) ⇒ Object
- #target_words_per_sentence(sentence, words) ⇒ Object
-
#to_csv(path_to_file, options = {}) ⇒ void
Saves the data stored in #stored_sentences to disk.
- #try_alternate_download_sources(alternate_sources, word, options) ⇒ Object
- #uwc_tag(string) ⇒ Object
-
#word_frequency ⇒ Hash
Calculates the number of occurrences of every word of #words in #stored_sentences.
Methods included from HelperMethods
#distinct_words, #include_every_char?, included, #is_unicode?
Constructor Details
#initialize(word_array, options) ⇒ Vocab #initialize(word_array) ⇒ Vocab
Words that are composite expressions must be written with at least one non-word character (such as whitespace) between each sub-expression. Example: “除了 以外” or “除了。。以外” instead of “除了以外”.
Initializes an object.
# File 'lib/chinese_vocab/vocab.rb', line 63

def initialize(word_array, options = {})
  @compact          = validate { :compact }
  @words            = edit_vocab(word_array)
  @words            = remove_redundant_single_char_words(@words) if @compact
  @chinese          = is_unicode?(@words[0])
  @not_found        = []
  @stored_sentences = []
end
Instance Attribute Details
#compact ⇒ Boolean (readonly)
Returns the value of the `:compact` option key.
# File 'lib/chinese_vocab/vocab.rb', line 30

def compact
  @compact
end
#not_found ⇒ Array<String> (readonly)
Stores all words that could not be found in any of the supported online dictionaries during a call to either #sentences or #min_sentences. Defaults to `[]`.
# File 'lib/chinese_vocab/vocab.rb', line 34

def not_found
  @not_found
end
#stored_sentences ⇒ Array<Hash> (readonly)
Holds the return value of either #sentences or #min_sentences, whichever was called last. Defaults to `[]`.
# File 'lib/chinese_vocab/vocab.rb', line 39

def stored_sentences
  @stored_sentences
end
#with_pinyin ⇒ Boolean (readonly)
Returns the value of the `:with_pinyin` option key.
# File 'lib/chinese_vocab/vocab.rb', line 36

def with_pinyin
  @with_pinyin
end
#words ⇒ Array<String> (readonly)
The list of Chinese words after calling #edit_vocab. Editing includes:
* Removing parentheses (with the content inside each parenthesis).
* Removing any slash (/) and only keeping the longest part.
* Removing trailing '儿' from any word longer than two characters.
* Removing non-word characters such as points and commas.
* Removing any duplicate words.
# File 'lib/chinese_vocab/vocab.rb', line 28

def words
  @words
end
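The editing steps above can be sketched as follows. This is a simplified standalone illustration, not the gem's actual #edit_vocab (which additionally normalizes composite expressions via HelperMethods#distinct_words):

```ruby
def remove_parens(word)
  # Strip ASCII and full-width parentheses, including their content.
  word.gsub(/\(.*?\)/, '').gsub(/（.*?）/, '')
end

def remove_slash(word)
  # Keep only the longest alternative of a slash-separated entry.
  word.match(/\//) ? word.split(/\//).sort_by { |w| w.size }.last : word
end

def remove_er_character_from_end(word)
  # Drop a trailing 儿 only from words longer than two characters,
  # so words like 女儿 stay intact.
  word.size > 2 ? word.gsub(/儿$/, '') : word
end

def edit_vocab(word_array)
  word_array.map { |word|
    edited = remove_parens(word)
    edited = remove_slash(edited)
    remove_er_character_from_end(edited)
  }.uniq
end

p edit_vocab(["除了（以外）", "玩儿", "一点儿", "哪/哪里", "书", "书"])
# => ["除了", "玩儿", "一点", "哪里", "书"]
```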
Class Method Details
.parse_words(path_to_csv, word_col, options) ⇒ Array<String> .parse_words(path_to_csv, word_col) ⇒ Array<String>
Words that are composite expressions must be written with at least one non-word character (such as whitespace) between each sub-expression. Example: “除了 以外” or “除了。。以外” instead of “除了以外”.
Extracts the vocabulary column from a CSV file as an array of strings. The array is normally provided as an argument to {#initialize}.
# File 'lib/chinese_vocab/vocab.rb', line 86

def self.parse_words(path_to_csv, word_col, options = {})
  # Enforced options:
  #  encoding:    utf-8 (necessary for parsing Chinese characters)
  #  skip_blanks: true
  options.merge!({:encoding => 'utf-8', :skip_blanks => true})
  csv = CSV.read(path_to_csv, options)

  raise ArgumentError, "Column number (#{word_col}) out of range."  unless within_range?(word_col, csv[0])
  # 'word_col' counting starts at 1, but CSV.read returns an array
  # where counting starts at 0.
  col = word_col - 1
  csv.reduce([]) {|words, row|
    word = row[col]
    # If word_col contains no data, CSV::read returns nil.
    # We also want to skip empty strings or strings that only contain whitespace.
    words << word unless word.nil? || word.strip.empty?
    words
  }
end
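A standalone illustration of the extraction logic, replicating the method's behavior without the gem (the sample CSV data is made up):

```ruby
require 'csv'
require 'tempfile'

# Read a CSV, pick one column (counting from 1), skip empty cells.
def parse_words(path_to_csv, word_col)
  csv = CSV.read(path_to_csv, :encoding => 'utf-8', :skip_blanks => true)
  col = word_col - 1  # CSV rows are 0-indexed internally
  csv.each_with_object([]) do |row, words|
    word = row[col]
    words << word unless word.nil? || word.strip.empty?
  end
end

file = Tempfile.new(['vocab', '.csv'])
file.write("1,看书,reading\n2,,empty\n3,朋友,friend\n")
file.close

p parse_words(file.path, 2)  # => ["看书", "朋友"]
```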
.within_range?(column, row) ⇒ Boolean
Input: `column`: the word column number (counting from 1); `row`: a row of the parsed CSV data, used to check that the column exists.
# File 'lib/chinese_vocab/vocab.rb', line 666

def self.within_range?(column, row)
  no_of_cols = row.size
  column >= 1 && column <= no_of_cols
end
Instance Method Details
#add_key(hash_array, key, &block) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 613

def add_key(hash_array, key, &block)
  hash_array.map do |row|
    if block
      row.merge({key => block.call(row)})
    else
      row
    end
  end
end
#add_target_words(hash_array, words) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 512

def add_target_words(hash_array, words)
  from_queue = Queue.new
  to_queue   = Queue.new
  # semaphore = Mutex.new
  result     = []
  # words    = @words
  hash_array.each {|hash| from_queue << hash}

  10.times.map {
    Thread.new(words) do
      while(row = from_queue.pop!)
        sentence     = row[:chinese]
        target_words = target_words_per_sentence(sentence, words)

        to_queue << row.merge(:target_words => target_words)
      end
    end
  }.map {|thread| thread.join}

  to_queue.to_a
end
#alternate_source(sources, selection) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 672

def alternate_source(sources, selection)
  sources = sources.dup
  sources.delete(selection)
  sources.pop
end
#contains_all_target_words?(selected_rows, sentence_key) ⇒ Boolean
# File 'lib/chinese_vocab/vocab.rb', line 635

def contains_all_target_words?(selected_rows, sentence_key)

  matched_words = @words.reduce([]) do |acc, word|

    result = selected_rows.find do |row|
      sentence = row[sentence_key]
      include_every_char?(word, sentence)
    end

    if result
      acc << word
    end
    acc
  end

  # matched_words.size == @words.size
  if matched_words.size == @words.size
    true
  else
    puts "Words not found in sentences:"
    p @words - matched_words
    false
  end
end
#convert(text) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 507

def convert(text)
  eval(text.chomp)
end
#edit_vocab(word_array) ⇒ Object
Remove all non-word characters
# File 'lib/chinese_vocab/vocab.rb', line 408

def edit_vocab(word_array)
  word_array.map {|word|
    edited = remove_parens(word)
    edited = remove_slash(edited)
    edited = remove_er_character_from_end(edited)
    distinct_words(edited).join(' ')
  }.uniq
end
#find_minimum_sentences(sentences, words) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 295

def find_minimum_sentences(sentences, words)
  min_sentences   = []
  # At the start the variable 'remaining_words' contains all
  # target words - minus those with no sentence found.
  remaining_words = Set.new(words.dup)

  # On every round:
  # Find the sentence with the most target words ('best sentence').
  # Add that sentence to the result array.
  # Delete all target words from the remaining words that are part of
  # the best sentence.
  while(!remaining_words.empty?) do
    puts "Number of remaining_words: #{remaining_words.size}"
    # puts "Take five: #{remaining_words.take(5)}"

    # Put the sentence with the largest number of target words first.
    sentences = sentences.sort_by do |row|
      # Returns a new array containing elements common to
      # the two arrays, with no duplicates.
      words_left = remaining_words.intersection(row[:target_words])

      # Sort by size of words left first (in descending order);
      # if equal, sort by length of the Chinese sentence (in ascending order).
      [-words_left.size, row[:chinese].size]
    end

    best_sentence = sentences.first

    # Add the sentence with the largest number of
    # target words to the result array.
    min_sentences << best_sentence

    # Remove the target words that are part of the
    # best sentence from the remaining words.
    remaining_words = remaining_words - best_sentence[:target_words]
  end

  # puts "Number of minimum sentences: #{min_sentences.size}"
  min_sentences
end
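The greedy strategy is easier to see on a small example. The self-contained sketch below mirrors the method's selection loop; the sample sentences and target words are made up, and unlike the real method it assumes every word is covered by at least one sentence:

```ruby
require 'set'

def find_minimum_sentences(sentences, words)
  min_sentences   = []
  remaining_words = Set.new(words)

  until remaining_words.empty?
    # Put the sentence covering the most remaining words first;
    # among equals, prefer the shorter Chinese sentence.
    sentences = sentences.sort_by do |row|
      words_left = remaining_words.intersection(row[:target_words])
      [-words_left.size, row[:chinese].size]
    end
    best = sentences.first
    min_sentences << best
    remaining_words -= best[:target_words]
  end
  min_sentences
end

sentences = [
  { :chinese => "我看书",   :target_words => ["我", "看书"] },
  { :chinese => "我有朋友", :target_words => ["我", "朋友"] },
  { :chinese => "好朋友",   :target_words => ["朋友"] }
]
result = find_minimum_sentences(sentences, ["我", "看书", "朋友"])
p result.map { |r| r[:chinese] }  # => ["我看书", "好朋友"]
```

Two sentences suffice here because "我看书" covers both 我 and 看书, leaving only 朋友 for the second round.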
#is_boolean?(value) ⇒ Boolean
# File 'lib/chinese_vocab/vocab.rb', line 401

def is_boolean?(value)
  # Only true for either 'false' or 'true'
  !!value == value
end
#make_hash(*data) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 437

def make_hash(*data)
  require 'digest'
  data = data.reduce("") { |acc, item| acc << item.to_s }
  Digest::SHA2.hexdigest(data)[0..6]
end
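This digest-truncation produces the short id used for the temporary file name in #sentences. Run standalone (the sample inputs are made up):

```ruby
require 'digest'

# Concatenate the inputs' string forms and keep the first
# seven hex characters of a SHA-2 digest.
def make_hash(*data)
  joined = data.reduce("") { |acc, item| acc << item.to_s }
  Digest::SHA2.hexdigest(joined)[0..6]
end

id = make_hash(["我", "看书"], { :compact => false }.to_a.sort)
puts id         # a deterministic 7-character hex string
puts id.length  # => 7
```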
#min_sentences(options) ⇒ Array<Hash>, []
In case of a network error while downloading the sentences, the data fetched so far is automatically copied to a file after several retries. This data is read and processed on the next run to reduce the time spent downloading the sentences (which is by far the most time-consuming part).
Regardless of the download source chosen (whether the default or one set via the `:source` option), if a word was not found on the first site, the second site is used as an alternative.
For every Chinese word in #words, fetches a Chinese sentence and its English translation from an online dictionary, then calculates the minimum number of sentences necessary to cover every word in #words at least once. The calculation is based on the fact that many words occur in more than one sentence.
The return value is also stored in {#stored_sentences}.
# File 'lib/chinese_vocab/vocab.rb', line 268

def min_sentences(options = {})
  @with_pinyin = validate { :with_pinyin }
  # Always run this method.
  thread_count = validate { :thread_count }
  sentences    = sentences(options)

  # Remove those words that don't have a sentence
  words     = @words - @not_found
  puts "Determining the target words for every sentence..."
  sentences = add_target_words(sentences, words)

  minimum_sentences = find_minimum_sentences(sentences, words)

  # :uwc = 'unique words count'
  with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
  # :uws = 'unique words string'
  with_uws_tag = add_key(with_uwc_tag, :uws) do |row|
    words = row[:target_words].sort.join(', ')
    "[" + words + "]"
  end
  # Remove those keys we don't need anymore
  result = remove_keys(with_uws_tag, :target_words, :word)

  @stored_sentences = result
  @stored_sentences
end
#occurrence_count(word_array, frequency) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 601

def occurrence_count(word_array, frequency)
  word_array.reduce(0) do |acc, word|
    acc + frequency[word]
  end
end
#remove_er_character_from_end(word) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 419

def remove_er_character_from_end(word)
  if word.size > 2
    word.gsub(/儿$/, '')
  else # Don't remove "儿" from words like 女儿
    word
  end
end
#remove_keys(hash_array, *keys) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 608

def remove_keys(hash_array, *keys)
  hash_array.map { |row| row.delete_keys(*keys) }
end
#remove_parens(word) ⇒ Object
Helper functions
# File 'lib/chinese_vocab/vocab.rb', line 394

def remove_parens(word)
  # 1) Remove all ASCII parens and all data in between.
  # 2) Remove all Chinese (full-width) parens and all data in between.
  word.gsub(/\(.*?\)/, '').gsub(/（.*?）/, '')
end
#remove_redundant_single_char_words(words) ⇒ Object
Input: ["看", "书", "看书"] Output: ["看书"].
# File 'lib/chinese_vocab/vocab.rb', line 446

def remove_redundant_single_char_words(words)
  puts "Removing redundant single character words from the vocabulary..."

  single_char_words, multi_char_words = words.partition {|word| word.length == 1 }
  return single_char_words if multi_char_words.empty?

  non_redundant_single_char_words = single_char_words.reduce([]) do |acc, single_c|
    already_found = multi_char_words.find do |multi_c|
      multi_c.include?(single_c)
    end
    # Add single char word to array if it is not part of any of the multi char words.
    acc << single_c unless already_found
    acc
  end

  non_redundant_single_char_words + multi_char_words
end
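A standalone illustration of the documented behavior (mirroring, not calling, the gem's method): a single-character word is dropped when it already appears inside a multi-character word.

```ruby
def remove_redundant_single_char_words(words)
  single, multi = words.partition { |word| word.length == 1 }
  return single if multi.empty?

  # Keep a single-character word only if no multi-character word contains it.
  kept_singles = single.reject { |s| multi.any? { |m| m.include?(s) } }
  kept_singles + multi
end

p remove_redundant_single_char_words(["看", "书", "走", "看书"])
# => ["走", "看书"]
```

看 and 书 are both covered by 看书, while 走 appears in no multi-character word and is kept.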
#remove_slash(word) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 428

def remove_slash(word)
  if word.match(/\//)
    word.split(/\//).sort_by { |w| w.size }.last
  else
    word
  end
end
#select_minimum_necessary_sentences(sentences) ⇒ Object
This method has been replaced by #find_minimum_sentences.
# File 'lib/chinese_vocab/vocab.rb', line 574

def select_minimum_necessary_sentences(sentences)
  words = @words - @not_found

  with_target_words = add_target_words(sentences, words)
  rows              = sort_by_target_word_count(with_target_words)

  selected_rows   = []
  unmatched_words = @words.dup
  matched_words   = []

  rows.each do |row|
    words = row[:target_words].dup
    # Delete all words from 'words' that have already been encountered
    # (and are included in 'matched_words').
    words = words - matched_words

    if words.size > 0  # Words that were not deleted above have to be part of 'unmatched_words'.
      selected_rows << row  # Select this row.

      # When a row is selected, its 'words' are no longer unmatched but matched.
      unmatched_words = unmatched_words - words
      matched_words   = matched_words + words
    end
  end
  selected_rows
end
#select_sentence(word, options) ⇒ Object
Uses options passed from #sentences
# File 'lib/chinese_vocab/vocab.rb', line 467

def select_sentence(word, options)
  sentence_pair = Scraper.sentence(word, options)

  sources = Scraper::Sources.keys
  sentence_pair = try_alternate_download_sources(sources, word, options) if sentence_pair.empty?

  if sentence_pair.empty?
    @not_found << word
    return nil
  else
    chinese, english = sentence_pair

    result = Hash.new
    result.merge!(word: word)
    result.merge!(chinese: chinese)
    result.merge!(pinyin: chinese.to_pinyin) if @with_pinyin
    result.merge!(english: english)
  end
end
#sentences(options) ⇒ Hash
Normally you only call this method directly if you really need one sentence per Chinese word (even if these words might appear in more than one of the sentences).
In case of a network error while downloading the sentences, the data fetched so far is automatically copied to a file after several retries. This data is read and processed on the next run to reduce the time spent downloading the sentences (which is by far the most time-consuming part).
Regardless of the download source chosen (whether the default or one set via the `:source` option), if a word was not found on the first site, the second site is used as an alternative.
For every Chinese word in #words, fetches a Chinese sentence and its English translation from an online dictionary.
The return value is also stored in {#stored_sentences}.
# File 'lib/chinese_vocab/vocab.rb', line 144

def sentences(options = {})
  puts "Fetching sentences..."
  # Always run this method.

  # We assign all options to a variable here (also those that are passed on)
  # as we need them in order to calculate the id.
  @with_pinyin = validate { :with_pinyin }
  thread_count = validate { :thread_count }
  id           = make_hash(@words, options.to_a.sort)
  words        = @words

  from_queue = Queue.new
  to_queue   = Queue.new
  file_name  = id

  if File.exist?(file_name)
    puts "Examining file..."
    words, sentences, not_found = File.open(file_name) { |f| f.readlines }
    words = convert(words)
    convert(sentences).each { |s| to_queue << s }
    @not_found = convert(not_found)
    size_a = words.size
    size_b = to_queue.size
    puts "Size(@not_found)  = #{@not_found.size}"
    puts "Size(words)       = #{size_a}"
    puts "Size(to_queue)    = #{size_b}"
    puts "Size(words+queue) = #{size_a + size_b}"
    puts "Size(sentences)   = #{to_queue.size}"

    # Remove file
    File.unlink(file_name)
  end

  words.each {|word| from_queue << word }
  result = []

  Thread.abort_on_exception = false

  1.upto(thread_count).map {
    Thread.new do

      while(word = from_queue.pop!) do
        begin
          local_result = select_sentence(word, options)
          puts "Processing word: #{word} (#{from_queue.size} words left)"
        # rescue SocketError, Timeout::Error, Errno::ETIMEDOUT,
        #        Errno::ECONNREFUSED, Errno::ECONNRESET, EOFError => e
        rescue Exception => e
          puts "ERROR: #{e.message}."
          puts "Please DO NOT abort, but wait for either the program to continue or all threads"
          puts "to terminate (in which case the data will be saved to disk for fast retrieval on the next run.)"
          puts "Number of running threads: #{Thread.list.size - 1}."
          raise
        ensure
          from_queue << word                 if $!
          puts "Wrote '#{word}' back to queue" if $!
        end

        to_queue << local_result unless local_result.nil?
      end
    end
  }.each {|thread| thread.join }

  @stored_sentences = to_queue.to_a
  @stored_sentences

ensure
  if $!
    while(Thread.list.size > 1) do # Wait for all child threads to terminate.
      sleep 5
    end

    File.open(file_name, 'w') do |f|
      p "============================="
      p "Writing data to file..."
      f.write from_queue.to_a
      f.puts
      f.write to_queue.to_a
      f.puts
      f.write @not_found
      puts "Finished writing data."
      puts "Please run the program again after solving the (connection) problem."
    end
  end
end
#sentences_unique_chars(sentences) ⇒ Array<String>
If no argument is passed, the data from #stored_sentences is used as input.
Finds the unique Chinese characters from either the data in #stored_sentences or an array of Chinese sentences passed as an argument.
# File 'lib/chinese_vocab/vocab.rb', line 363

def sentences_unique_chars(sentences = stored_sentences)
  # If the argument is an array of hashes, then it must be the data from @stored_sentences
  sentences = sentences.map { |hash| hash[:chinese] } if sentences[0].kind_of?(Hash)

  sentences.reduce([]) do |acc, row|
    acc = acc | row.scan(/\p{Word}/) # only return characters, skip punctuation marks
    acc
  end
end
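The character extraction can be illustrated standalone (the sample sentences are made up): scan each sentence for word characters, skipping punctuation, and accumulate the unique ones with a set union.

```ruby
def sentences_unique_chars(sentences)
  sentences.reduce([]) do |acc, sentence|
    # \p{Word} matches CJK characters but not punctuation such as 。 or ！
    acc | sentence.scan(/\p{Word}/)
  end
end

p sentences_unique_chars(["我看书。", "书很好！"])
# => ["我", "看", "书", "很", "好"]
```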
#sort_by_target_word_count(with_target_words) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 543

def sort_by_target_word_count(with_target_words)

  # First sort by size of the unique word array (from large to small).
  # If the unique word count is equal, sort by the length of the sentence (from small to large).
  with_target_words.sort_by {|row|
    [-row[:target_words].size, row[:chinese].size] }

  # The above is the same as:
  # with_target_words.sort {|a,b|
  #   first = -(a[:target_words].size <=> b[:target_words].size)
  #   first.nonzero? || (a[:chinese].size <=> b[:chinese].size) }
end
#target_words_per_sentence(sentence, words) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 538

def target_words_per_sentence(sentence, words)
  words.select {|w| include_every_char?(w, sentence) }
end
#to_csv(path_to_file, options) ⇒ void #to_csv(path_to_file) ⇒ void
This method returns an undefined value.
Saves the data stored in #stored_sentences to disk.
# File 'lib/chinese_vocab/vocab.rb', line 382

def to_csv(path_to_file, options = {})
  CSV.open(path_to_file, "w", options) do |csv|
    @stored_sentences.each do |row|
      csv << row.values
    end
  end
end
#try_alternate_download_sources(alternate_sources, word, options) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 488

def try_alternate_download_sources(alternate_sources, word, options)
  sources = alternate_sources.dup
  sources.delete(options[:source])

  result = sources.find do |s|
    options  = options.merge(:source => s)
    sentence = Scraper.sentence(word, options)
    sentence.empty? ? nil : sentence
  end

  if result
    options = options.merge(:source => result)
    Scraper.sentence(word, options)
  else
    []
  end
end
#uwc_tag(string) ⇒ Object
# File 'lib/chinese_vocab/vocab.rb', line 624

def uwc_tag(string)
  size = string.length
  case size
  when 1
    "1_word"
  else
    "#{size}_words"
  end
end
#word_frequency ⇒ Hash
Calculates the number of occurrences of every word of #words in #stored_sentences.
# File 'lib/chinese_vocab/vocab.rb', line 559

def word_frequency

  words.reduce({}) do |acc, word|
    acc[word] = 0 # Set key with a default value of zero.

    stored_sentences.each do |row|
      sentence = row[:chinese]
      acc[word] += 1 if include_every_char?(word, sentence)
    end
    acc
  end
end
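A standalone illustration of the counting logic. Note that `include_every_char?` actually comes from HelperMethods; the simplified version below is an assumption about its per-character semantics, and the sample data is made up:

```ruby
# A word counts as present when every one of its characters
# occurs somewhere in the sentence.
def include_every_char?(word, sentence)
  word.chars.all? { |char| sentence.include?(char) }
end

def word_frequency(words, stored_sentences)
  words.each_with_object({}) do |word, acc|
    acc[word] = stored_sentences.count do |row|
      include_every_char?(word, row[:chinese])
    end
  end
end

freq = word_frequency(["我", "看书"],
                      [{ :chinese => "我看书" }, { :chinese => "我很好" }])
p freq  # => {"我"=>2, "看书"=>1}
```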