Class: Preprocessor::Simple

Inherits:
Object
Includes:
ParallelHelper
Defined in:
lib/svm_helper/preprocessors/simple.rb

Overview

Preprocessor which just cleans the text

Author:

  • Andreas Eger

Direct Known Subclasses

IDMapping, Stemming

Constant Summary

GENDER_FILTER =

filters most gender markers, e.g. "(m/w)", "/in", "/-in", "(in)"

%r{(\(*(m|w)(\/|\|)(w|m)\)*)|(/-*in)|\(in\)}
SYMBOL_FILTER =

filters most weird symbols

%r{/|-|–|:|\+|!|,|\.|\*|\?|/|·|\"|„|•||\||(\S*(&|;)\S*)}
URL_FILTER =

URL and email filters

/(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/
EMAIL_FILTER =
/([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/
NEW_LINES =

filter for new lines

/(\r\n)|\r|\n/
WORDS_IN_BRACKETS =

extract words from brackets

/\(([a-zA-Z]+)\)/
WHITESPACE =

filters multiple whitespace

/(\s| )+/
XML_TAG_FILTER =

filters all kinds of XML/HTML tags

/<(.*?)>/
CODE_TOKEN_FILTER =

filters job code tokens: text in square brackets, parentheses, or braces, and tokens containing digits

/\[[^\]]*\]|\([^\)]*\)|\{[^\}]*\}|\S*\d+\w+/
STOPWORD_LOCATION =

stopword directory (TODO: use File.expand_path)

File.join(File.dirname(__FILE__),'..','stopwords')
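
As a rough illustration of what some of these patterns match, the sketch below copies four of the constants from the summary above and applies them to sample strings. The sample inputs are invented for this example and are not taken from the library or its tests.

```ruby
# Patterns copied from the constant summary; the sample strings are invented.
GENDER_FILTER     = %r{(\(*(m|w)(\/|\|)(w|m)\)*)|(/-*in)|\(in\)}
WORDS_IN_BRACKETS = /\(([a-zA-Z]+)\)/
NEW_LINES         = /(\r\n)|\r|\n/
XML_TAG_FILTER    = /<(.*?)>/

# Gender markers such as "/in" and "(m/w)" are removed.
puts "Entwickler/in (m/w)".gsub(GENDER_FILTER, '').strip   # prints "Entwickler"

# A single bracketed word is kept; only the brackets are dropped.
puts "Manager (Sales)".gsub(WORDS_IN_BRACKETS, '\1')       # prints "Manager Sales"

# Tags are replaced by a space so neighbouring words do not fuse.
puts "<p>Hello</p>".gsub(XML_TAG_FILTER, ' ').strip        # prints "Hello"

# Line breaks are removed outright.
puts "a\r\nb\nc".gsub(NEW_LINES, '')                       # prints "abc"
```

SYMBOL_FILTER and WHITESPACE are deliberately left out of this sketch because their listings contain characters that may not have survived extraction.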

Constants included from ParallelHelper

ParallelHelper::THREAD_COUNT

Instance Attribute Summary

Instance Method Summary

Methods included from ParallelHelper

#p_map, #p_map_with_index, #parallel?

Constructor Details

#initialize(args = {}) ⇒ Simple

Returns a new instance of Simple.



# File 'lib/svm_helper/preprocessors/simple.rb', line 34

def initialize args={}
  @language = args.fetch(:language){'en'}
  @parallel = args.fetch(:parallel){false}
  @stopwords ||= IO.read(File.join(STOPWORD_LOCATION,@language)).split
end
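
The constructor relies on `Hash#fetch` with a block to supply defaults, and reads a stopword file named after the language. A minimal sketch of that pattern follows. The helper `load_stopwords` and the temporary stopword directory are hypothetical and exist only to make the example self-contained.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of the library): mirrors the constructor's
# default handling and reads a whitespace-separated stopword file that is
# named after the language.
def load_stopwords(stopword_dir, args = {})
  language = args.fetch(:language) { 'en' }   # block supplies the default
  File.read(File.join(stopword_dir, language)).split
end

# Build a throwaway stopword directory so the example runs anywhere.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'en'), "the\nand\nof\n")
  File.write(File.join(dir, 'de'), "der die das\n")

  p load_stopwords(dir)                   # => ["the", "and", "of"]
  p load_stopwords(dir, language: 'de')   # => ["der", "die", "das"]
end
```

Because `fetch` only runs its block when the key is missing, an explicit `language: nil` would not fall back to `'en'` in this sketch.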

Instance Attribute Details

#language ⇒ Object

Returns the value of attribute language.



# File 'lib/svm_helper/preprocessors/simple.rb', line 31

def language
  @language
end

Instance Method Details

#clean_description(desc) ⇒ Array<String>

converts a job description into a list of cleaned tokens

Parameters:

  • desc (String)

    job description

Returns:

  • (Array<String>)

    clean, lowercase tokens of the input, with stopwords and tokens of two characters or fewer removed



# File 'lib/svm_helper/preprocessors/simple.rb', line 93

def clean_description desc
  strip_stopwords(
    desc.gsub(XML_TAG_FILTER,' ')
        .gsub(EMAIL_FILTER,'')
        .gsub(URL_FILTER,'')
        .gsub(GENDER_FILTER,'')
        .gsub(NEW_LINES,'')
        .gsub(SYMBOL_FILTER,' ')
        .gsub(WHITESPACE,' ')
        .gsub(WORDS_IN_BRACKETS, '\1')
        .gsub(CODE_TOKEN_FILTER,'')
        .downcase
        .strip
    )
end

#clean_title(title) ⇒ String

converts string into a cleaner version

Parameters:

  • title (String)

    job title

Returns:

  • (String)

    clean and lowercase version of input



# File 'lib/svm_helper/preprocessors/simple.rb', line 79

def clean_title title
  title.gsub(GENDER_FILTER,'').
        gsub(SYMBOL_FILTER,'').
        gsub(WORDS_IN_BRACKETS, '\1').
        gsub(CODE_TOKEN_FILTER,'').
        gsub(WHITESPACE,' ').
        downcase.
        strip
end
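
The effect of the chain can be seen with a reduced sketch. It assumes two simplifications that are not in the original: SYMBOL_FILTER is omitted and WHITESPACE is replaced by a plain `/\s+/`. The module name `TitleSketch` and the sample titles are likewise invented for illustration.

```ruby
# Reduced sketch of the clean_title chain. SYMBOL_FILTER is left out and
# WHITESPACE is simplified to /\s+/; both simplifications are assumptions.
module TitleSketch
  GENDER_FILTER     = %r{(\(*(m|w)(\/|\|)(w|m)\)*)|(/-*in)|\(in\)}
  WORDS_IN_BRACKETS = /\(([a-zA-Z]+)\)/
  CODE_TOKEN_FILTER = /\[[^\]]*\]|\([^\)]*\)|\{[^\}]*\}|\S*\d+\w+/
  WHITESPACE        = /\s+/

  def self.clean_title(title)
    title.gsub(GENDER_FILTER, '')          # drop "(m/w)", "/in", ...
         .gsub(WORDS_IN_BRACKETS, '\1')    # "(Java)" becomes "Java"
         .gsub(CODE_TOKEN_FILTER, '')      # drop "[REF-123]" and similar
         .gsub(WHITESPACE, ' ')            # collapse runs of whitespace
         .downcase
         .strip
  end
end

puts TitleSketch.clean_title("Senior Developer (m/w) [REF-123]")  # prints "senior developer"
puts TitleSketch.clean_title("Entwickler/in (Java)")              # prints "entwickler java"
```

The order matters: WORDS_IN_BRACKETS has to run before CODE_TOKEN_FILTER, otherwise a bracketed word such as "(Java)" would be deleted together with its parentheses.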

#label ⇒ Object



# File 'lib/svm_helper/preprocessors/simple.rb', line 40

def label
  "simple"
end

#process(jobs, classification) ⇒ Array<PreprocessedData>

cleans provided jobs

Overloads:

  • #process(jobs, classification) ⇒ Array<PreprocessedData>

    Parameters:

    • jobs (Hash)

      single Job

    • classification (Symbol)

      in :industry, :function, :career_level

  • #process(jobs, classification) ⇒ Array<PreprocessedData>

    Parameters:

    • jobs (Array<Hash>)

      list of Jobs

    • classification (Symbol)

      in :industry, :function, :career_level

Returns:



# File 'lib/svm_helper/preprocessors/simple.rb', line 57

def process jobs
  if jobs.is_a? Array
    p_map(jobs) {|job| process_job job }
  else
    process_job jobs
  end
end
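
The single-job versus list dispatch can be sketched in isolation. In the sketch below, `map` stands in for `ParallelHelper#p_map` and `process_job` is a hypothetical placeholder, since neither body appears in this listing. The class name `DispatchSketch` is also invented.

```ruby
# Stand-ins: `map` replaces ParallelHelper#p_map, and `process_job` is a
# hypothetical placeholder that only lowercases the title.
class DispatchSketch
  def process(jobs)
    if jobs.is_a? Array
      jobs.map { |job| process_job(job) }   # one result per job
    else
      process_job(jobs)                     # a single result, not wrapped
    end
  end

  private

  def process_job(job)
    job.fetch(:title).downcase
  end
end

sketch = DispatchSketch.new
p sketch.process({ title: 'Developer' })                        # => "developer"
p sketch.process([{ title: 'Developer' }, { title: 'Nurse' }])  # => ["developer", "nurse"]
```

With this dispatch a single Hash yields a single result rather than a one-element Array, so callers need to handle both shapes.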

#strip_stopwords(text) ⇒ Array<String>

splits the text into tokens and removes stopwords and tokens of two characters or fewer

Parameters:

  • text (String)

    text to filter

Returns:

  • (Array<String>)

    remaining tokens



# File 'lib/svm_helper/preprocessors/simple.rb', line 70

def strip_stopwords(text)
  (text.split - @stopwords).delete_if { |e| e.size <= 2 }
end
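
A standalone sketch of the same filtering step is shown below. It assumes the stopword list is passed in explicitly rather than read from `@stopwords`; the method name `strip_stopwords_sketch` and the sample input are invented for illustration.

```ruby
# Standalone version of the token filter. The stopword list is an explicit
# argument here (an assumption made to keep the example self-contained).
def strip_stopwords_sketch(text, stopwords)
  # Array difference removes every stopword occurrence; delete_if then drops
  # the remaining tokens that are two characters or shorter.
  (text.split - stopwords).delete_if { |e| e.size <= 2 }
end

p strip_stopwords_sketch("an it expert for the team", %w[for the])
# => ["expert", "team"]
```

The comparison against the stopword list is case-sensitive, which is one reason the calling code lowercases the text first.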