Class: Preprocessor::Simple
- Inherits:
- Object
- Object
- Preprocessor::Simple
- Includes:
- ParallelHelper
- Defined in:
- lib/svm_helper/preprocessors/simple.rb
Overview
Preprocessor which just cleans the text
Constant Summary
- GENDER_FILTER =
filters most gender markers, e.g. (m/w), /in, (in)
%r{(\(*(m|w)(\/|\|)(w|m)\)*)|(/-*in)|\(in\)}
- SYMBOL_FILTER =
filters most weird symbols
%r{/|-|–|:|\+|!|,|\.|\*|\?|/|·|\"|„|•||\||(\S*(&|;)\S*)}
- URL_FILTER =
filters URLs; e-mail addresses are handled by EMAIL_FILTER
/(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/
- EMAIL_FILTER =
/([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/
- NEW_LINES =
filter for new lines
/(\r\n)|\r|\n/
- WORDS_IN_BRACKETS =
extracts purely alphabetic words from parentheses, e.g. (Java) becomes Java
/\(([a-zA-Z]+)\)/
- WHITESPACE =
filters multiple whitespace
/(\s| )+/
- XML_TAG_FILTER =
filters all kinds of XML/HTML tags
/<(.*?)>/
- CODE_TOKEN_FILTER =
filters job code tokens: text in [], () or {} and tokens in which digits are followed by word characters
/\[[^\]]*\]|\([^\)]*\)|\{[^\}]*\}|\S*\d+\w+/
- STOPWORD_LOCATION =
stopword directory (one file per language). TODO: use File.expand_path
File.join(File.dirname(__FILE__),'..','stopwords')
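To make the effect of the individual filters concrete, here is a minimal sketch that applies five of the patterns listed above to short sample strings. The patterns are copied from this page; the helper apply_filter, the SAMPLES hash and the sample strings are hypothetical and only serve the illustration.

```ruby
# Patterns copied from the constant summary above.
GENDER_FILTER     = %r{(\(*(m|w)(\/|\|)(w|m)\)*)|(/-*in)|\(in\)}
EMAIL_FILTER      = /([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/
NEW_LINES         = /(\r\n)|\r|\n/
WORDS_IN_BRACKETS = /\(([a-zA-Z]+)\)/
XML_TAG_FILTER    = /<(.*?)>/

# Hypothetical helper: apply one filter to one sample string.
def apply_filter(text, pattern, replacement)
  text.gsub(pattern, replacement)
end

# Sample inputs are made up for this illustration.
SAMPLES = {
  gender:   apply_filter("Entwickler/in (m/w)", GENDER_FILTER, ''),
  email:    apply_filter("mail jobs@example.com now", EMAIL_FILTER, ''),
  newlines: apply_filter("a\r\nb\nc", NEW_LINES, ' '),
  brackets: apply_filter("Developer (Java)", WORDS_IN_BRACKETS, '\1'),
  tags:     apply_filter("<p>Hello <b>world</b></p>", XML_TAG_FILTER, ' ')
}

SAMPLES.each { |name, result| puts "#{name}: #{result.inspect}" }
```

Most filters leave stray spaces behind, which is presumably why the cleaning methods run WHITESPACE and strip afterwards.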
Constants included from ParallelHelper
Instance Attribute Summary
-
#language ⇒ Object
Returns the value of attribute language.
Instance Method Summary
-
#clean_description(desc) ⇒ String
converts string into a cleaner version.
-
#clean_title(title) ⇒ String
converts string into a cleaner version.
-
#initialize(args = {}) ⇒ Simple
constructor
A new instance of Simple.
- #label ⇒ Object
-
#process(jobs) ⇒ Array<PreprocessedData>
cleans provided jobs.
-
#strip_stopwords(text) ⇒ Array<String>
removes stop words and tokens of two characters or fewer; the stop word list itself is loaded in the constructor.
Methods included from ParallelHelper
#p_map, #p_map_with_index, #parallel?
Constructor Details
#initialize(args = {}) ⇒ Simple
Returns a new instance of Simple.
# File 'lib/svm_helper/preprocessors/simple.rb', line 34
def initialize args={}
  @language = args.fetch(:language){'en'}
  @parallel = args.fetch(:parallel){false}
  @stopwords ||= IO.read(File.join(STOPWORD_LOCATION,@language)).split
end
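The constructor combines Hash#fetch with block defaults and a per-language stop word file. The following is a minimal sketch of that pattern, not the library's class: SketchPreprocessor, the :stopword_dir option (standing in for STOPWORD_LOCATION) and demo_stopwords are hypothetical, and the stop word files are created in a temporary directory.

```ruby
require 'tmpdir'

# Hypothetical stand-in for Preprocessor::Simple. The :stopword_dir option is
# an assumption that replaces the STOPWORD_LOCATION constant.
class SketchPreprocessor
  attr_reader :language, :stopwords

  def initialize(args = {})
    # fetch with a block only evaluates the default when the key is missing
    @language = args.fetch(:language) { 'en' }
    @parallel = args.fetch(:parallel) { false }
    dir = args.fetch(:stopword_dir)
    # one whitespace-separated stop word file per language code
    @stopwords = File.read(File.join(dir, @language)).split
  end

  def parallel?
    @parallel
  end
end

# Hypothetical demo: writes two tiny stop word files and loads one of them.
def demo_stopwords(language = nil)
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, 'en'), "the and\nfor")
    File.write(File.join(dir, 'de'), "der die das")
    args = { stopword_dir: dir }
    args[:language] = language if language
    SketchPreprocessor.new(args).stopwords
  end
end
```

Calling demo_stopwords without an argument falls back to the 'en' file, while demo_stopwords('de') reads the German one.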
Instance Attribute Details
#language ⇒ Object
Returns the value of attribute language.
# File 'lib/svm_helper/preprocessors/simple.rb', line 31
def language
  @language
end
Instance Method Details
#clean_description(desc) ⇒ String
converts string into a cleaner version and strips stop words, so the code shown returns the remaining tokens
# File 'lib/svm_helper/preprocessors/simple.rb', line 93
def clean_description desc
  strip_stopwords(
    desc.gsub(XML_TAG_FILTER,' ')
      .gsub(EMAIL_FILTER,'')
      .gsub(URL_FILTER,'')
      .gsub(GENDER_FILTER,'')
      .gsub(NEW_LINES,'')
      .gsub(SYMBOL_FILTER,' ')
      .gsub(WHITESPACE,' ')
      .gsub(WORDS_IN_BRACKETS, '\1')
      .gsub(CODE_TOKEN_FILTER,'')
      .downcase
      .strip
  )
end
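The order of the filters matters: tags go first, e-mail addresses are removed before the greedier URL pattern runs, and bracketed words are unwrapped before CODE_TOKEN_FILTER would delete them. A minimal sketch of the whole chain on a made-up description follows. Most patterns are copied from this page; SYMBOLS is a simplified ASCII-only stand-in for SYMBOL_FILTER, and STOPWORDS, clean_description_sketch and the sample text are assumptions for illustration.

```ruby
# Patterns copied from the constant summary on this page.
XML_TAG_FILTER    = /<(.*?)>/
EMAIL_FILTER      = /([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})/
URL_FILTER        = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/
GENDER_FILTER     = %r{(\(*(m|w)(\/|\|)(w|m)\)*)|(/-*in)|\(in\)}
NEW_LINES         = /(\r\n)|\r|\n/
WHITESPACE        = /(\s| )+/
WORDS_IN_BRACKETS = /\(([a-zA-Z]+)\)/
CODE_TOKEN_FILTER = /\[[^\]]*\]|\([^\)]*\)|\{[^\}]*\}|\S*\d+\w+/

# Assumption: simplified, ASCII-only stand-in for SYMBOL_FILTER.
SYMBOLS = /[\/\-:+!,.*?"|]|\S*[&;]\S*/

# Assumption: tiny stop word list; the real one is read from a language file.
STOPWORDS = %w[we are for our or a]

def clean_description_sketch(desc, stopwords = STOPWORDS)
  cleaned = desc.gsub(XML_TAG_FILTER, ' ')   # tags become spaces
                .gsub(EMAIL_FILTER, '')      # e-mails before URLs
                .gsub(URL_FILTER, '')
                .gsub(GENDER_FILTER, '')
                .gsub(NEW_LINES, '')
                .gsub(SYMBOLS, ' ')
                .gsub(WHITESPACE, ' ')       # collapse runs of whitespace
                .gsub(WORDS_IN_BRACKETS, '\1')
                .gsub(CODE_TOKEN_FILTER, '')
                .downcase
                .strip
  # same idea as strip_stopwords: drop stop words and very short tokens
  (cleaned.split - stopwords).delete_if { |token| token.size <= 2 }
end

DESCRIPTION = "<p>We are hiring a Senior Developer (Java) for our team</p>\r\n" \
              "Apply: jobs@example.com or www.example.com"

puts clean_description_sketch(DESCRIPTION).inspect
```

Note that URL_FILTER also consumes word characters and spaces that follow a URL, so the sample keeps the URL at the end of the text.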
#clean_title(title) ⇒ String
converts string into a cleaner version
# File 'lib/svm_helper/preprocessors/simple.rb', line 79
def clean_title title
  title.gsub(GENDER_FILTER,'').
    gsub(SYMBOL_FILTER,'').
    gsub(WORDS_IN_BRACKETS, '\1').
    gsub(CODE_TOKEN_FILTER,'').
    gsub(WHITESPACE,' ').
    downcase.
    strip
end
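One difference from clean_description is easy to miss: titles replace symbols with an empty string, while descriptions replace them with a space. The sketch below isolates that difference. SYMBOLS is a simplified ASCII-only stand-in for SYMBOL_FILTER, and title_style and description_style are hypothetical helpers, not methods of this class.

```ruby
# Assumption: simplified, ASCII-only stand-in for SYMBOL_FILTER.
SYMBOLS = /[\/\-:+!,.*?"|]/

# Title style: symbols are deleted, so hyphenated words are joined.
def title_style(text)
  text.gsub(SYMBOLS, '').downcase.strip
end

# Description style: symbols become spaces, so hyphenated words are split.
def description_style(text)
  text.gsub(SYMBOLS, ' ').gsub(/\s+/, ' ').downcase.strip
end

puts title_style("Front-End Developer")
puts description_style("Front-End Developer")
```

The same input therefore yields different tokens depending on whether it arrives as a title or as part of a description.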
#label ⇒ Object
# File 'lib/svm_helper/preprocessors/simple.rb', line 40
def label
  "simple"
end
#process(jobs) ⇒ Array<PreprocessedData>
cleans the provided jobs. An Array of jobs is mapped through p_map; a single job is passed straight to process_job.
# File 'lib/svm_helper/preprocessors/simple.rb', line 57
def process jobs
  if jobs.is_a? Array
    p_map(jobs) {|job| process_job job }
  else
    process_job jobs
  end
end
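The dispatch on the argument type can be sketched in isolation. In this minimal example DispatchSketch is hypothetical, p_map is assumed to behave like a sequential map (the real one comes from ParallelHelper), and process_job is a placeholder, since the real method is not shown on this page.

```ruby
# Hypothetical stand-in that mirrors the Array-or-single dispatch of #process.
class DispatchSketch
  def process(jobs)
    if jobs.is_a?(Array)
      p_map(jobs) { |job| process_job(job) }
    else
      process_job(jobs)
    end
  end

  private

  # Assumption: a plain sequential map instead of ParallelHelper#p_map.
  def p_map(items, &block)
    items.map(&block)
  end

  # Assumption: placeholder cleaning step, here just downcase and strip.
  def process_job(job)
    job.to_s.downcase.strip
  end
end

sketch = DispatchSketch.new
puts sketch.process([" Ruby Developer ", "QA Engineer"]).inspect
puts sketch.process(" Data Analyst ").inspect
```

With a single job the result is a single value rather than a one-element array, so callers need to handle both shapes.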
#strip_stopwords(text) ⇒ Array<String>
removes stop words and tokens of two characters or fewer; the stop word list itself is loaded in the constructor
# File 'lib/svm_helper/preprocessors/simple.rb', line 70
def strip_stopwords(text)
  (text.split - @stopwords).delete_if { |e| e.size <= 2 }
end
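A minimal sketch of the same two steps, array difference followed by a length filter, using a made-up stop word list. STOPWORDS and the extra stopwords parameter are assumptions for this illustration; the class itself reads the list from a language file into @stopwords.

```ruby
# Assumption: tiny stop word list for illustration only.
STOPWORDS = %w[the and for with]

def strip_stopwords(text, stopwords = STOPWORDS)
  # Array difference removes every stop word; delete_if then drops
  # tokens of two characters or fewer.
  (text.split - stopwords).delete_if { |token| token.size <= 2 }
end

puts strip_stopwords("the senior ruby developer for an it team").inspect
```

Because the comparison is exact, the text has to be lowercased before this step for the stop word list to match, which is what clean_description does.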