Module: OcrChallenge::NameParser
- Included in: IBusinessCardParser
- Defined in: lib/ocr_challenge/name_parser.rb
Overview
Identifying names in a blob of text turns out to be hard. I decided to combine a dictionary of names with a filter that eliminates any line containing digits.
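As a rough illustration of that approach, the sketch below applies the two steps (drop lines with digits, then look for a dictionary name) to an in-memory example. The constant `SAMPLE_NAMES`, the method `candidate_name_lines`, and the sample card are hypothetical and are not part of this module, which reads its dictionary from files.

```ruby
# Hypothetical dictionary; the real module reads names from files.
SAMPLE_NAMES = %w[arthur lisa mike].freeze

# Returns the stripped lines that contain no digits and mention
# at least one dictionary name as a whole word (case-insensitive).
def candidate_name_lines(lines, names = SAMPLE_NAMES)
  digit_free = lines.map(&:strip).reject { |line| line.match?(/\d/) }

  digit_free.select do |line|
    names.any? { |name| line.match?(/\b#{Regexp.escape(name)}\b/i) }
  end
end

# Made-up business card text, one entry per OCR line.
card = [
  'Acme Widgets Ltd',
  'Lisa Haung',
  'Senior Engineer',
  '(410) 555-1234',
  'info@example.com'
]

p candidate_name_lines(card) # prints ["Lisa Haung"]
```

The digit filter removes the phone number cheaply before any dictionary lookup. A digit-free line that happens to contain a dictionary word would still be kept, which is one reason the problem is hard.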
Instance Method Summary
- #parse_names(dir_path) ⇒ Object
  Note: the name files are expected to contain newline-separated names.
Instance Method Details
#parse_names(dir_path) ⇒ Object
Note: the name files are expected to contain newline-separated names.
# File 'lib/ocr_challenge/name_parser.rb', line 8

def parse_names(dir_path)
  # TODO: catch IO exception
  names_dir = Pathname.new(dir_path)
  name_files = names_dir.children

  preprocessed_lines = lines.map(&:strip).reject do |line|
    line =~ /\d/ # names shouldn't have digits in them
  end

  # compare the current line with all the names available in the name files
  preprocessed_lines.select do |line|
    name_files.any? do |file|
      name_lines = file.readlines
      name_lines.any? do |name_line|
        line.downcase =~ /\b#{name_line.downcase.chomp}\b/
      end
    end
  end
end
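The listing above re-reads every name file for each candidate line and interpolates each name into a regular expression without escaping it, so a blank line in a name file would match almost any text. One possible variant, sketched here under assumptions, loads the dictionary once, skips blank entries, and escapes each name. The methods `load_names` and `match_names` are hypothetical and do not exist in this module; the lines are passed in explicitly because the source of `lines` is not shown on this page.

```ruby
require 'pathname'

# Hypothetical helper: read every regular file in dir_path once and
# return the unique, non-blank names it contains.
def load_names(dir_path)
  files = Pathname.new(dir_path).children.select(&:file?)
  names = files.flat_map { |file| file.readlines.map(&:strip) }
  names.reject(&:empty?).uniq
end

# Hypothetical variant of parse_names that takes the lines explicitly.
def match_names(lines, dir_path)
  names = load_names(dir_path)
  return [] if names.empty?

  # One case-insensitive alternation instead of one regexp per name and line.
  alternatives = names.map { |name| Regexp.escape(name) }.join('|')
  pattern = /\b(?:#{alternatives})\b/i

  lines.map(&:strip)
       .reject { |line| line.match?(/\d/) } # names shouldn't have digits in them
       .select { |line| line.match?(pattern) }
end
```

A call such as `match_names(card_lines, 'path/to/names')` would return the stripped lines that look like names; both arguments here are placeholders. Errors such as a missing directory are still not handled, in line with the TODO in the original listing.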