Module: OcrChallenge::NameParser

Included in:
IBusinessCardParser
Defined in:
lib/ocr_challenge/name_parser.rb

Overview

It turns out that identifying names in a blob of text is hard. I decided to combine a dictionary of known names with a filter that discards any line containing digits.

Instance Method Summary

Instance Method Details

#parse_names(dir_path) ⇒ Object

Note: each name file is expected to contain one name per line (newline-separated).



# File 'lib/ocr_challenge/name_parser.rb', line 8

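def parse_names(dir_path)

  #TODO: catch IO exception
  names_dir = Pathname.new(dir_path)

  # read every name file once, instead of re-reading it for each candidate line
  names = names_dir.children.select(&:file?).flat_map(&:readlines)
  # normalize; blank entries are dropped because /\b\b/ would match almost any line
  names = names.map { |name| name.strip.downcase }.reject(&:empty?).uniq

  # NOTE: `lines` is not defined in this method; it is assumed to come from the including class
  preprocessed_lines = lines.map(&:strip).reject do |line|
    line =~ /\d/    # names shouldn't have digits in them
  end

  # keep the lines that contain any known name as a whole word
  preprocessed_lines.select do |line|
    names.any? do |name|
      # escape the name so characters such as "." are matched literally
      line.downcase =~ /\b#{Regexp.escape(name)}\b/
    end
  end
end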
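
Usage sketch: the listing above relies on a `lines` method that is not defined in the module itself, so it presumably comes from the including class. The example below is a minimal, self-contained illustration under that assumption. `NameParserSketch`, `CardTextSketch`, `demo_parse_names`, and the file name `first_names.txt` are hypothetical names made up for the example; they are not part of OcrChallenge, and the real IBusinessCardParser may look quite different.

```ruby
require 'pathname'
require 'tmpdir'

# Simplified stand-in for OcrChallenge::NameParser so the sketch runs on its own.
module NameParserSketch
  def parse_names(dir_path)
    # Load every name from every file in the directory, one name per line.
    names = Pathname.new(dir_path).children.select(&:file?)
                    .flat_map(&:readlines)
                    .map { |name| name.strip.downcase }
                    .reject(&:empty?)

    # Assumption: `lines` is provided by the including class.
    candidates = lines.map(&:strip).reject { |line| line =~ /\d/ }

    # Keep candidate lines that contain a known name as a whole word.
    candidates.select do |line|
      names.any? { |name| line.downcase =~ /\b#{Regexp.escape(name)}\b/ }
    end
  end
end

# Hypothetical host class that supplies `lines` from raw OCR text.
class CardTextSketch
  include NameParserSketch
  attr_reader :lines

  def initialize(text)
    @lines = text.lines
  end
end

# Builds a temporary name directory and parses a small sample card.
def demo_parse_names
  Dir.mktmpdir do |dir|
    # One name per line, as the note above requires.
    File.write(File.join(dir, 'first_names.txt'), "alice\nbob\n")
    card = CardTextSketch.new("Acme Corp\nAlice Smith\nbob 555-0100\n")
    card.parse_names(dir)
  end
end

p demo_parse_names
```

In this sample, "bob 555-0100" is discarded by the digit filter even though "bob" is in the dictionary, and "Acme Corp" is discarded because it contains no known name, which leaves only "Alice Smith".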