Class: Chinese::Vocab

Inherits: Object
Includes:
HelperMethods, WithValidations
Defined in:
lib/chinese_vocab/vocab.rb

Constant Summary

OPTIONS =

Mandatory constant for the [WithValidations](rubydoc.info/github/bytesource/with_validations/file/README.md) module. Each key-value pair has the following form:

`option_key => [default_value, validation]`
{:compact      => [false, lambda {|value| is_boolean?(value) }],
:with_pinyin  => [true,  lambda {|value| is_boolean?(value) }],
:thread_count => [8,     lambda {|value| value.kind_of?(Integer) }]}
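
The boolean validation used for `:compact` and `:with_pinyin` can be exercised on its own. A minimal sketch, mirroring the `is_boolean?` helper shown further down on this page:

```ruby
# The boolean check is only true for the exact values true and false.
is_boolean = lambda { |value| !!value == value }

is_boolean.call(true)   # => true
is_boolean.call(false)  # => true
is_boolean.call("yes")  # => false
```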

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from HelperMethods

#distinct_words, #include_every_char?, included, #is_unicode?

Constructor Details

#initialize(word_array, options) ⇒ Vocab
#initialize(word_array) ⇒ Vocab

Note:

Words that are composite expressions must be written with at least one non-word character (such as whitespace) between each sub-expression. Example: “除了 以外” or “除了。。以外” instead of “除了以外”.

Initializes a new Vocab object.

Examples:

require 'chinese_vocab'

# Extract the Chinese words from a CSV file.
words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)

# Initialize Chinese::Vocab with the word array.
# :compact => true removes single-character words that also appear in
# multi-character words (["看", "看书"] => ["看书"]).
vocabulary = Chinese::Vocab.new(words, :compact => true)

# Return the minimum necessary sentences.
my_sentences = vocabulary.min_sentences(:size => :short)

# List the unique characters in all these sentences.
vocabulary.sentences_unique_chars(my_sentences)
# => ["我", "们", "跟", "他", "是", "好", "朋", "友", ...]

# Save to file
vocabulary.to_csv('path/to_file/vocab_sentences.csv')

Overloads:

  • #initialize(word_array, options) ⇒ Vocab

    Parameters:

    • word_array (Array<String>)

An array of Chinese words that is stored in #words after all non-word characters have been stripped and duplicate entries removed.

    • options (Hash)

The options to customize the following features.

    Options Hash (options):

    • :compact (Boolean)

Whether or not to remove all single-character words that also appear in at least one multi-character word. Example: `["看", "看书"] => ["看书"]`. The reason behind this option is to remove redundancy in meaning and focus on learning distinct words. Defaults to `false`.

Parameters:

  • word_array (Array<String>)

An array of Chinese words that is stored in #words after all non-word characters have been stripped and duplicate entries removed.



# File 'lib/chinese_vocab/vocab.rb', line 63

def initialize(word_array, options={})
  @compact = validate { :compact }
  @words    = edit_vocab(word_array)
  @words    = remove_redundant_single_char_words(@words)  if @compact
  @chinese  = is_unicode?(@words[0])
  @not_found        = []
  @stored_sentences = []
end

Instance Attribute Details

#compact ⇒ Boolean (readonly)

Returns the value of the `:compact` option key.

Returns:

  • (Boolean)

    the value of the `:compact` option key.



# File 'lib/chinese_vocab/vocab.rb', line 30

def compact
  @compact
end

#not_found ⇒ Array<String> (readonly)

Holds those Chinese words from #words that could not be found in any of the supported online dictionaries during a call to either #sentences or #min_sentences. Defaults to `[]`.

Returns:

  • (Array<String>)

    the words from #words for which no sentence was found.



# File 'lib/chinese_vocab/vocab.rb', line 34

def not_found
  @not_found
end

#stored_sentences ⇒ Array<Hash> (readonly)

Holds the return value of either #sentences or #min_sentences, whichever was called last. Defaults to `[]`.

Returns:

  • (Array<Hash>)

    the sentences fetched by the last call.



# File 'lib/chinese_vocab/vocab.rb', line 39

def stored_sentences
  @stored_sentences
end

#with_pinyin ⇒ Boolean (readonly)

Returns the value of the `:with_pinyin` option key.

Returns:

  • (Boolean)

    the value of the `:with_pinyin` option key.



# File 'lib/chinese_vocab/vocab.rb', line 36

def with_pinyin
  @with_pinyin
end

#words ⇒ Array<String> (readonly)

The list of Chinese words after calling #edit_vocab. Editing includes:

* Removing parentheses (including the content inside each parenthesis).
* Removing any slash (/) and keeping only the longest part.
* Removing a trailing '儿' from any word longer than two characters.
* Removing non-word characters such as periods and commas.
* Removing duplicate words.

Returns:

  • (Array<String>)

    the edited word list.



# File 'lib/chinese_vocab/vocab.rb', line 28

def words
  @words
end

Class Method Details

.parse_words(path_to_csv, word_col, options) ⇒ Array<String>
.parse_words(path_to_csv, word_col) ⇒ Array<String>

Note:

Words that are composite expressions must be written with at least one non-word character (such as whitespace) between each sub-expression. Example: “除了 以外” or “除了。。以外” instead of “除了以外”.

Extracts the vocabulary column from a CSV file as an array of strings. The array is normally provided as an argument to {#initialize}.

Examples:

require 'chinese_vocab'

# Extract the Chinese words from a CSV file.
words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)

# Initialize Chinese::Vocab with the word array.
# :compact => true removes single-character words that also appear in
# multi-character words (["看", "看书"] => ["看书"]).
vocabulary = Chinese::Vocab.new(words, :compact => true)

# Return the minimum necessary sentences.
my_sentences = vocabulary.min_sentences(:size => :short)

# List the unique characters in all these sentences.
vocabulary.sentences_unique_chars(my_sentences)
# => ["我", "们", "跟", "他", "是", "好", "朋", "友", ...]

# Save to file
vocabulary.to_csv('path/to_file/vocab_sentences.csv')

Overloads:

  • .parse_words(path_to_csv, word_col, options) ⇒ Array<String>

    Parameters:

    • path_to_csv (String)

      The relative or full path to the CSV file.

    • word_col (Integer)

      The column number of the vocabulary column (counting starts at 1).

    • options (Hash)

The [supported options](ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new) of Ruby's CSV library as well as the `:encoding` parameter. Exceptions: `:encoding` is always set to `utf-8` and `:skip_blanks` to `true` internally.

  • .parse_words(path_to_csv, word_col) ⇒ Array<String>

    Parameters:

    • path_to_csv (String)

      The relative or full path to the CSV file.

    • word_col (Integer)

      The column number of the vocabulary column (counting starts at 1).

Returns:

  • (Array<String>)

The vocabulary (Chinese words).

Raises:

  • (ArgumentError)


# File 'lib/chinese_vocab/vocab.rb', line 86

def self.parse_words(path_to_csv, word_col, options={})
  # Enforced options:
  # encoding: utf-8 (necessary for parsing Chinese characters)
  # skip_blanks: true
  options.merge!({:encoding => 'utf-8', :skip_blanks => true})
  csv = CSV.read(path_to_csv, options)

  raise ArgumentError, "Column number (#{word_col}) out of range."  unless within_range?(word_col, csv[0])
  # 'word_col counting starts at 1, but CSV.read returns an array,
  # where counting starts at 0.
  col = word_col-1
  csv.reduce([]) {|words, row|
    word = row[col]
    # If word_col contains no data, CSV::read returns nil.
    # We also want to skip empty strings or strings that only contain whitespace.
    words << word  unless word.nil? || word.strip.empty?
    words
  }
end
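
The column handling above can be sketched against an in-memory CSV string (the sample data is invented; the real method reads from a file path):

```ruby
require 'csv'

# Sample rows: the Chinese word sits in column 2 (counting from 1).
csv_data = "1,一,yī\n2,二,èr\n3,,sān\n"
rows     = CSV.parse(csv_data, :skip_blanks => true)

word_col = 2
col      = word_col - 1  # CSV rows are arrays, counting from 0.
words = rows.reduce([]) do |acc, row|
  word = row[col]
  # Skip missing cells (nil) and blank strings.
  acc << word unless word.nil? || word.strip.empty?
  acc
end
# words == ["一", "二"]
```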

.within_range?(column, row) ⇒ Boolean

Input:
column: word column number (counting from 1).
row:    a row (Array) of the processed CSV data that contains our word column.

Returns:

  • (Boolean)


# File 'lib/chinese_vocab/vocab.rb', line 666

def self.within_range?(column, row)
  no_of_cols = row.size
  column >= 1 && column <= no_of_cols
end

Instance Method Details

#add_key(hash_array, key, &block) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 613

def add_key(hash_array, key, &block)
  hash_array.map do |row|
    if block
      row.merge({key => block.call(row)})
    else
      row
    end
  end
end

#add_target_words(hash_array, words) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 512

def add_target_words(hash_array, words)
  from_queue  = Queue.new
  to_queue    = Queue.new
  # semaphore = Mutex.new
  result      = []
  # words       = @words
  hash_array.each {|hash| from_queue << hash}

  10.times.map {
    Thread.new(words) do

      while(row = from_queue.pop!)
        sentence     = row[:chinese]
        target_words = target_words_per_sentence(sentence, words)

        to_queue << row.merge(:target_words => target_words)

      end
    end
  }.map {|thread| thread.join}

  to_queue.to_a

end

#alternate_source(sources, selection) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 672

def alternate_source(sources, selection)
  sources = sources.dup
  sources.delete(selection)
  sources.pop
end

#contains_all_target_words?(selected_rows, sentence_key) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/chinese_vocab/vocab.rb', line 635

def contains_all_target_words?(selected_rows, sentence_key)

  matched_words = @words.reduce([]) do |acc, word|

    result = selected_rows.find do |row|
      sentence = row[sentence_key]
      include_every_char?(word, sentence)
    end

    if result
      acc << word
    end

    acc
  end

  # matched_words.size == @words.size

  if matched_words.size == @words.size
    true
  else
    puts "Words not found in sentences:"
    p @words - matched_words
    false
  end
end

#convert(text) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 507

def convert(text)
  eval(text.chomp)
end

#edit_vocab(word_array) ⇒ Object

Remove all non-word characters



# File 'lib/chinese_vocab/vocab.rb', line 408

def edit_vocab(word_array)

  word_array.map {|word|
    edited = remove_parens(word)
    edited = remove_slash(edited)
    edited = remove_er_character_from_end(edited)
    distinct_words(edited).join(' ')
  }.uniq
end
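
The individual editing steps can be sketched with standalone lambdas that mirror the helper methods documented further below (`remove_parens`, `remove_slash`, `remove_er_character_from_end`); the sample word is invented:

```ruby
# Standalone versions of the editing steps, for illustration only.
remove_parens = ->(w) { w.gsub(/\(.*?\)/, '').gsub(/（.*?）/, '') }
remove_slash  = ->(w) { w.include?('/') ? w.split('/').max_by(&:size) : w }
remove_er     = ->(w) { w.size > 2 ? w.gsub(/儿$/, '') : w }

word   = "花(儿)/花朵"
edited = remove_er.call(remove_slash.call(remove_parens.call(word)))
# edited == "花朵"
```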

#find_minimum_sentences(sentences, words) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 295

def find_minimum_sentences(sentences, words)
  min_sentences   = []
  # At the start the variable 'remaining words' contains all
  # target words - minus those with no sentence found.
  remaining_words = Set.new(words.dup)


  # On every round:
  # Finds the sentence with the most target words ('best sentence').
  # Adds that sentence to the result array.
  # Deletes all target words from the remaining words that are part of
  # the best sentence.
  while(!remaining_words.empty?) do
    puts "Number of remaining_words: #{remaining_words.size}"
    # puts "Take five: #{remaining_words.take(5)}"

    # Return the sentence with the largest number of target words.
    sentences = sentences.sort_by do |row|
      # Returns a new array containing elements common to
      # the two arrays, with no duplicates.
      words_left = remaining_words.intersection(row[:target_words])

      # Sort by size of words left first (in descending order);
      # if equal, sort by length of the Chinese sentence (in ascending order).
      [-words_left.size, row[:chinese].size]
    end

    best_sentence = sentences.first

    # Add the sentence with the largest number of
    # target words to the result array.
    min_sentences << best_sentence
    # Remove the target words that are part of the
    # best sentence from the remaining words.
    remaining_words = remaining_words - best_sentence[:target_words]
  end

  # puts "Number of minimum sentences: #{min_sentences.size}"
  min_sentences
end
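
The loop above is a greedy set-cover heuristic. A compact sketch of the same strategy on invented data:

```ruby
require 'set'

# Greedy strategy: repeatedly pick the sentence covering the most remaining
# target words (ties broken by the shorter Chinese sentence).
sentences = [
  { :chinese => "我看书",   :target_words => ["看", "书"] },
  { :chinese => "我去书店", :target_words => ["书", "店"] },
  { :chinese => "他去了",   :target_words => ["去"] }
]
remaining = Set.new(["看", "书", "店", "去"])
minimum   = []

# Assumes every remaining word occurs in at least one sentence
# (words without sentences are removed beforehand in #min_sentences).
until remaining.empty?
  best = sentences.min_by do |row|
    left = remaining.intersection(row[:target_words])
    [-left.size, row[:chinese].size]
  end
  minimum << best
  remaining -= best[:target_words]
end
# minimum covers all four words with three sentences.
```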

#is_boolean?(value) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/chinese_vocab/vocab.rb', line 401

def is_boolean?(value)
  # Only true for either 'false' or 'true'
  !!value == value
end

#make_hash(*data) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 437

def make_hash(*data)
  require 'digest'
  data = data.reduce("") { |acc, item| acc << item.to_s }
  Digest::SHA2.hexdigest(data)[0..6]
end
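
The id used as the cache file name in #sentences is derived this way. A sketch with hypothetical inputs (a word list plus sorted option pairs):

```ruby
require 'digest'

# The id is the first seven hex digits of SHA-256 over the inputs' string forms.
data = [["看书"], {:size => :short}.to_a.sort].reduce("") { |acc, item| acc + item.to_s }
id   = Digest::SHA2.hexdigest(data)[0..6]
# id is a stable seven-character hex string for these inputs.
```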

#min_sentences(options) ⇒ Array<Hash>, []

Note:

In case of a network error while downloading the sentences, the data fetched so far is automatically saved to a file after several retries. This data is read and processed on the next run to reduce the time spent downloading the sentences (which is by far the most time-consuming part).

Note:

Regardless of the download source chosen (the default or the one set via the `:source` option), if a word is not found on the first site, the second site is tried as an alternative.

For every Chinese word in #words, fetches a Chinese sentence and its English translation from an online dictionary, then calculates the minimum number of sentences necessary to cover every word in #words at least once. The calculation exploits the fact that many words occur in more than one sentence.

The return value is also stored in {#stored_sentences}.

Examples:

require 'chinese_vocab'

# Extract the Chinese words from a CSV file.
words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)

# Initialize Chinese::Vocab with the word array.
# :compact => true removes single-character words that also appear in
# multi-character words (["看", "看书"] => ["看书"]).
vocabulary = Chinese::Vocab.new(words, :compact => true)

# Return the minimum necessary sentences.
my_sentences = vocabulary.min_sentences(:size => :short)

# List the unique characters in all these sentences.
vocabulary.sentences_unique_chars(my_sentences)
# => ["我", "们", "跟", "他", "是", "好", "朋", "友", ...]

# Save to file
vocabulary.to_csv('path/to_file/vocab_sentences.csv')

Parameters:

  • options (Hash)

    The options to customize the following features.

Options Hash (options):

  • :source (Symbol)

The online dictionary to download the sentences from, either [:nciku](www.nciku.com) or [:jukuu](www.jukuu.com). Defaults to `:nciku`.

  • :size (Symbol)

The size of the sentence to return from a possible set of several sentences. Supports the values `:short`, `:average`, and `:long`. Defaults to `:short`.

  • :with_pinyin (Boolean)

Whether or not to return the pinyin representation of a sentence. Defaults to `true`.

  • :thread_count (Integer)

The number of threads used to download the sentences. Defaults to `8`.

Returns:

  • (Array<Hash>, [])

By default each hash holds the following key-value pairs (the return value is also stored in #stored_sentences):

    • :chinese => Chinese sentence

    • :english => English translation

    • :pinyin => Pinyin

    • :uwc => Unique words count tag (String) of the form “x_words”, where x denotes the number of unique words from #words found in the sentence.

    • :uws => Unique words string tag (String) of the form “[词语1,词语2,…]”, where *词语* denotes the actual word(s) from #words found in the sentence.



# File 'lib/chinese_vocab/vocab.rb', line 268

def min_sentences(options = {})
  @with_pinyin = validate { :with_pinyin }
  # Always run this method.
  thread_count = validate { :thread_count }
  sentences    = sentences(options)

  # Remove those words that don't have a sentence
  words             = @words - @not_found
  puts "Determining the target words for every sentence..."
  sentences         = add_target_words(sentences, words)

  minimum_sentences = find_minimum_sentences(sentences, words)

  # :uwc = 'unique words count'
  with_uwc_tag = add_key(minimum_sentences, :uwc) {|row| uwc_tag(row[:target_words]) }
  # :uws = 'unique words string'
  with_uwc_uws_tags = add_key(with_uwc_tag, :uws) do |row|
    words = row[:target_words].sort.join(', ')
    "[" + words + "]"
  end
  # Remove those keys we don't need anymore
  result = remove_keys(with_uwc_uws_tags, :target_words, :word)
  @stored_sentences = result
  @stored_sentences
end

#occurrence_count(word_array, frequency) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 601

def occurrence_count(word_array, frequency)
  word_array.reduce(0) do |acc, word|
    acc + frequency[word]
  end
end

#remove_er_character_from_end(word) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 419

def remove_er_character_from_end(word)
  if word.size > 2
    word.gsub(/儿$/, '')
  else # Don't remove "儿" from words like 女儿
    word
  end
end

#remove_keys(hash_array, *keys) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 608

def remove_keys(hash_array, *keys)
  hash_array.map { |row| row.delete_keys(*keys) }
end

#remove_parens(word) ⇒ Object

Helper functions




# File 'lib/chinese_vocab/vocab.rb', line 394

def remove_parens(word)
  # 1) Remove all ASCII parens and all data in between.
  # 2) Remove all Chinese parens and all data in between.
  word.gsub(/\(.*?\)/, '').gsub(/（.*?）/, '')
end

#remove_redundant_single_char_words(words) ⇒ Object

Input:  ["看", "书", "看书"]
Output: ["看书"]



# File 'lib/chinese_vocab/vocab.rb', line 446

def remove_redundant_single_char_words(words)
  puts "Removing redundant single character words from the vocabulary..."

  single_char_words, multi_char_words = words.partition {|word| word.length == 1 }
  return single_char_words  if multi_char_words.empty?

  non_redundant_single_char_words = single_char_words.reduce([]) do |acc, single_c|

    already_found = multi_char_words.find do |multi_c|
      multi_c.include?(single_c)
    end
    # Add single char word to array if it is not part of any of the multi char words.
    acc << single_c  unless already_found
    acc
  end

  non_redundant_single_char_words + multi_char_words
end
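
The same compaction can be expressed as a standalone method for quick experimentation (a sketch; `compact_words` is a hypothetical name, not part of the gem's API):

```ruby
# Standalone version of the compaction step, for illustration only.
def compact_words(words)
  single, multi = words.partition { |w| w.length == 1 }
  return single if multi.empty?

  # Keep a single-character word only if no multi-character word contains it.
  kept = single.reject { |s| multi.any? { |m| m.include?(s) } }
  kept + multi
end

compact_words(["看", "书", "画", "看书"])  # => ["画", "看书"]
```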

#remove_slash(word) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 428

def remove_slash(word)
  if word.match(/\//)
    word.split(/\//).sort_by { |w| w.size }.last
  else
    word
  end
end

#select_minimum_necessary_sentences(sentences) ⇒ Object

Deprecated.

This method has been replaced by #find_minimum_sentences.



# File 'lib/chinese_vocab/vocab.rb', line 574

def select_minimum_necessary_sentences(sentences)
  words = @words - @not_found
  with_target_words = add_target_words(sentences, words)
  rows              = sort_by_target_word_count(with_target_words)

  selected_rows   = []
  unmatched_words = @words.dup
  matched_words   = []

  rows.each do |row|
    words = row[:target_words].dup
    # Delete all words from 'words' that have already been encountered
    # (and are included in 'matched_words').
    words = words - matched_words

    if words.size > 0  # Words that were not deleted above have to be part of 'unmatched_words'.
      selected_rows << row  # Select this row.

      # When a row is selected, its 'words' are no longer unmatched but matched.
      unmatched_words = unmatched_words - words
      matched_words   = matched_words + words
    end
  end
  selected_rows
end

#select_sentence(word, options) ⇒ Object

Uses options passed from #sentences



# File 'lib/chinese_vocab/vocab.rb', line 467

def select_sentence(word, options)
  sentence_pair = Scraper.sentence(word, options)

  sources = Scraper::Sources.keys
  sentence_pair = try_alternate_download_sources(sources, word, options)  if sentence_pair.empty?

  if sentence_pair.empty?
    @not_found << word
    return nil
  else
    chinese, english = sentence_pair

    result = Hash.new
    result.merge!(word:    word)
    result.merge!(chinese: chinese)
    result.merge!(pinyin:  chinese.to_pinyin)  if @with_pinyin
    result.merge!(english: english)
  end
end

#sentences(options) ⇒ Hash

Note:

Normally you only call this method directly if you really need one sentence per Chinese word (even if these words might appear in more than one of the sentences).

Note:

In case of a network error while downloading the sentences, the data fetched so far is automatically saved to a file after several retries. This data is read and processed on the next run to reduce the time spent downloading the sentences (which is by far the most time-consuming part).

Note:

Regardless of the download source chosen (the default or the one set via the `:source` option), if a word is not found on the first site, the second site is tried as an alternative.

For every Chinese word in #words, fetches a Chinese sentence and its English translation from an online dictionary.

The return value is also stored in {#stored_sentences}.

Examples:

require 'chinese_vocab'

# Extract the Chinese words from a CSV file.
words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)

# Initialize Chinese::Vocab with the word array.
# :compact => true removes single-character words that also appear in
# multi-character words (["看", "看书"] => ["看书"]).
vocabulary = Chinese::Vocab.new(words, :compact => true)

# Return a sentence for each word.
vocabulary.sentences(:size => :short)

Parameters:

  • options (Hash)

    The options to customize the following features.

Options Hash (options):

  • :source (Symbol)

    The online dictionary to download the sentences from, either [:nciku](www.nciku.com) or [:jukuu](www.jukuu.com). Defaults to :nciku.

  • :size (Symbol)

    The size of the sentence to return from a possible set of several sentences. Supports the values :short, :average, and :long. Defaults to :short.

  • :with_pinyin (Boolean)

Whether or not to return the pinyin representation of a sentence. Defaults to `true`.

  • :thread_count (Integer)

The number of threads used to download the sentences. Defaults to `8`.

Returns:

  • (Hash)

By default each hash holds the following key-value pairs (the return value is also stored in #stored_sentences):

    • :chinese => Chinese sentence

    • :english => English translation

    • :pinyin => Pinyin



# File 'lib/chinese_vocab/vocab.rb', line 144

def sentences(options={})
  puts "Fetching sentences..."
  # Always run this method.

  # We assign all options to a variable here (also those that are passed on)
  # as we need them in order to calculate the id.
  @with_pinyin = validate { :with_pinyin }
  thread_count = validate { :thread_count }
  id           = make_hash(@words, options.to_a.sort)
  words        = @words

  from_queue  = Queue.new
  to_queue    = Queue.new
  file_name   = id

  if File.exist?(file_name)
    puts "Examining file..."
    words, sentences, not_found = File.open(file_name) { |f| f.readlines }
    words = convert(words)
    convert(sentences).each { |s| to_queue << s }
    @not_found = convert(not_found)
    size_a = words.size
    size_b = to_queue.size
    puts "Size(@not_found)  = #{@not_found.size}"
    puts "Size(words)       = #{size_a}"
    puts "Size(to_queue)    = #{size_b}"
    puts "Size(words+queue) = #{size_a+size_b}"
    puts "Size(sentences)   = #{to_queue.size}"

    # Remove file
    File.unlink(file_name)
  end

  words.each {|word| from_queue << word }
  result = []

  Thread.abort_on_exception = false

  1.upto(thread_count).map {
    Thread.new do

      while(word = from_queue.pop!) do

        begin
          local_result = select_sentence(word, options)
          puts "Processing word: #{word} (#{from_queue.size} words left)"
          # rescue SocketError, Timeout::Error, Errno::ETIMEDOUT,
          # Errno::ECONNREFUSED, Errno::ECONNRESET, EOFError => e
        rescue Exception => e
          puts " #{e.message}."
          puts "Please DO NOT abort, but wait for either the program to continue or all threads"
          puts "to terminate (in which case the data will be saved to disk for fast retrieval on the next run.)"
          puts "Number of running threads: #{Thread.list.size - 1}."
          raise

        ensure
          from_queue << word  if $!
          puts "Wrote '#{word}' back to queue"  if $!
        end

        to_queue << local_result  unless local_result.nil?

      end
    end
  }.each {|thread| thread.join }

  @stored_sentences = to_queue.to_a
  @stored_sentences

ensure
  if $!
    while(Thread.list.size > 1) do # Wait for all child threads to terminate.
      sleep 5
    end

    File.open(file_name, 'w') do |f|
      p "============================="
      p "Writing data to file..."
      f.write from_queue.to_a
      f.puts
      f.write to_queue.to_a
      f.puts
      f.write @not_found
      puts "Finished writing data."
      puts "Please run the program again after solving the (connection) problem."
    end
  end
end

#sentences_unique_chars(sentences) ⇒ Array<String>

Note:

If no argument is passed, the data from #stored_sentences is used as input.

Finds the unique Chinese characters from either the data in #stored_sentences or an array of Chinese sentences passed as an argument.

Examples:

require 'chinese_vocab'

# Extract the Chinese words from a CSV file.
words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)

# Initialize Chinese::Vocab with the word array.
# :compact => true removes single-character words that also appear in
# multi-character words (["看", "看书"] => ["看书"]).
vocabulary = Chinese::Vocab.new(words, :compact => true)

# Return the minimum necessary sentences.
my_sentences = vocabulary.min_sentences(:size => :short)

# List the unique characters in all these sentences.
vocabulary.sentences_unique_chars(my_sentences)
# => ["我", "们", "跟", "他", "是", "好", "朋", "友", ...]

# Save to file
vocabulary.to_csv('path/to_file/vocab_sentences.csv')

Parameters:

  • sentences (Array<String>, Array<Hash>)

An array of Chinese sentences or an array of hashes with the key `:chinese`.

Returns:

  • (Array<String>)

The unique Chinese characters.



# File 'lib/chinese_vocab/vocab.rb', line 363

def sentences_unique_chars(sentences = stored_sentences)
  # If the argument is an array of hashes, then it must be the data from @stored_sentences
  sentences = sentences.map { |hash| hash[:chinese] }  if sentences[0].kind_of?(Hash)

  sentences.reduce([]) do |acc, row|
    acc = acc | row.scan(/\p{Word}/) # only return characters, skip punctuation marks
    acc
  end
end
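
The `/\p{Word}/` scan is what keeps characters while dropping punctuation. A sketch on invented sentence data:

```ruby
# /\p{Word}/ matches word characters (including Chinese) and skips
# punctuation such as 。 and ！; Array#| accumulates without duplicates.
sentences = ["我们是朋友。", "他是好人！"]
unique = sentences.reduce([]) do |acc, s|
  acc | s.scan(/\p{Word}/)
end
# unique == ["我", "们", "是", "朋", "友", "他", "好", "人"]
```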

#sort_by_target_word_count(with_target_words) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 543

def sort_by_target_word_count(with_target_words)

  # First sort by size of unique word array (from large to short)
  # If the unique word count is equal, sort by the length of the sentence (from small to large)
  with_target_words.sort_by {|row|
    [-row[:target_words].size, row[:chinese].size] }

    #  The above is the same as:
    #   with_target_words.sort {|a,b|
    #     first = -(a[:target_words].size <=> b[:target_words].size)
    #     first.nonzero? || (a[:chinese].size <=> b[:chinese].size) }
end

#target_words_per_sentence(sentence, words) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 538

def target_words_per_sentence(sentence, words)
   words.select {|w| include_every_char?(w, sentence) }
end

#to_csv(path_to_file, options) ⇒ void
#to_csv(path_to_file) ⇒ void

This method returns an undefined value.

Saves the data stored in #stored_sentences to disk.

Examples:

require 'chinese_vocab'

# Extract the Chinese words from a CSV file.
words = Chinese::Vocab.parse_words('path/to/file/hsk.csv', 4)

# Initialize Chinese::Vocab with the word array.
# :compact => true removes single-character words that also appear in
# multi-character words (["看", "看书"] => ["看书"]).
vocabulary = Chinese::Vocab.new(words, :compact => true)

# Return the minimum necessary sentences.
my_sentences = vocabulary.min_sentences(:size => :short)

# List the unique characters in all these sentences.
vocabulary.sentences_unique_chars(my_sentences)
# => ["我", "们", "跟", "他", "是", "好", "朋", "友", ...]

# Save to file
vocabulary.to_csv('path/to_file/vocab_sentences.csv')

Overloads:

  • #to_csv(path_to_file, options) ⇒ void

    Parameters:

    • path_to_file (String)

      file name and path of where to save the file.

    • options (Hash)

      The supported options of Ruby's CSV library (passed on to CSV.open).

  • #to_csv(path_to_file) ⇒ void

    Parameters:

    • path_to_file (String)

      file name and path of where to save the file.



# File 'lib/chinese_vocab/vocab.rb', line 382

def to_csv(path_to_file, options = {})

  CSV.open(path_to_file, "w", options) do |csv|
    @stored_sentences.each do |row|
      csv << row.values
    end
  end
end

#try_alternate_download_sources(alternate_sources, word, options) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 488

def try_alternate_download_sources(alternate_sources, word, options)
  sources = alternate_sources.dup
  sources.delete(options[:source])

  result = sources.find do |s|
    options  = options.merge(:source => s)
    sentence = Scraper.sentence(word, options)
    sentence.empty? ? nil : sentence
  end

  if result
    options = options.merge(:source => result)
    Scraper.sentence(word, options)
  else
    []
  end
end

#uwc_tag(string) ⇒ Object



# File 'lib/chinese_vocab/vocab.rb', line 624

def uwc_tag(string)
  size = string.length
  case size
  when 1
    "1_word"
  else
    "#{size}_words"
  end
end

#word_frequency ⇒ Hash

Calculates the number of occurrences of every word of #words in #stored_sentences.

Returns:

  • (Hash)

    a mapping from each word to its occurrence count.



# File 'lib/chinese_vocab/vocab.rb', line 559

def word_frequency

  words.reduce({}) do |acc, word|
    acc[word] = 0 # Set key with a default value of zero.

    stored_sentences.each do |row|
      sentence = row[:chinese]
      acc[word] += 1 if include_every_char?(word, sentence)
    end
    acc
  end
end
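
A sketch of the same count on invented data. Here `include?` stands in for `include_every_char?`, which is equivalent because every sample word is a single character:

```ruby
# Count how often each target word occurs in the stored sentences.
words  = ["看", "书"]
stored = [{ :chinese => "我看书" }, { :chinese => "看电视" }]

freq = words.reduce({}) do |acc, word|
  acc[word] = stored.count { |row| row[:chinese].include?(word) }
  acc
end
# freq == { "看" => 2, "书" => 1 }
```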