Module: PROIEL::Tokenization
- Defined in:
- lib/proiel/tokenization.rb
Constant Summary
- WORD_PATTERN = /([^[\u{E000}-\u{F8FF}][[:word:]]]+)/.freeze
Class Method Summary
- .is_splitable?(form) ⇒ true, false
  Tests if a token form is splitable.
- .load_patterns(filename) ⇒ Hash
  Loads tokenization patterns from a configuration file.
- .make_regex(pattern) ⇒ Regexp
  Makes a regular expression from a pattern given in the configuration file.
- .split_form(language_tag, form) ⇒ Array<String>
  Splits a token form using the tokenization patterns that apply to the specified language.
Class Method Details
.is_splitable?(form) ⇒ true, false
Tests if a token form is splitable. Any form with more than one character is splitable.
    # File 'lib/proiel/tokenization.rb', line 56
    def self.is_splitable?(form)
      raise ArgumentError, 'invalid form' unless form.is_a?(String) or form.nil?

      form and form.length > 1
    end
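The behaviour described above can be tried out with a standalone copy of the predicate. This is a minimal sketch that does not load the gem; the module name `TokenizationSketch` is hypothetical and exists only so the snippet runs on its own.

```ruby
# Standalone copy of the predicate from the listing above (not the gem itself).
# TokenizationSketch is a hypothetical name used only for this illustration.
module TokenizationSketch
  def self.is_splitable?(form)
    raise ArgumentError, 'invalid form' unless form.is_a?(String) || form.nil?

    form and form.length > 1
  end
end

p TokenizationSketch.is_splitable?('ab')  # => true
p TokenizationSketch.is_splitable?('a')   # => false
p TokenizationSketch.is_splitable?(nil)   # => nil
```

Note that for a nil form the expression evaluates to nil rather than false, which is still falsy in a conditional.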
.load_patterns(filename) ⇒ Hash
Loads tokenization patterns from a configuration file.
The configuration file should be a JSON file. The keys should be language tags and the values tokenization patterns.
The method can be called multiple times. On the first invocation patterns will be loaded, on subsequent invocations patterns will be updated. Only patterns for languages that are defined in the configuration file will be updated, other patterns will remain unchanged.
    # File 'lib/proiel/tokenization.rb', line 22
    def self.load_patterns(filename)
      raise ArgumentError, 'invalid filename' unless filename.is_a?(String)

      patterns = JSON.parse(File.read(filename))

      regexes = patterns.map { |l, p| [l, make_regex(p)] }.to_h

      @@regexes ||= {}
      @@regexes.merge!(regexes)
    end
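The update-on-reload behaviour can be illustrated with a standalone sketch of the same logic. The module name `PatternSketch`, the helper `load_hash`, the language tags `xx` and `yy`, and the patterns themselves are all hypothetical; real configuration files will use the project's own tags and patterns.

```ruby
require 'json'
require 'tempfile'

# Standalone sketch of the loading logic from the listing above.
# PatternSketch and load_hash are hypothetical names for this illustration.
module PatternSketch
  def self.make_regex(pattern)
    Regexp.new("^#{pattern}$", Regexp::MULTILINE)
  end

  def self.load_patterns(filename)
    patterns = JSON.parse(File.read(filename))
    @@regexes ||= {}
    @@regexes.merge!(patterns.map { |l, p| [l, make_regex(p)] }.to_h)
  end

  # Writes a hash to a temporary JSON file and loads it.
  def self.load_hash(hash)
    Tempfile.create(['patterns', '.json']) do |f|
      f.write(JSON.generate(hash))
      f.flush
      load_patterns(f.path)
    end
  end
end

# First call loads two hypothetical languages, second call updates only 'xx'.
PatternSketch.load_hash('xx' => '(a)(b)', 'yy' => '(c)(d)')
LOADED = PatternSketch.load_hash('xx' => '(ab)(c)')

p LOADED.keys          # => ["xx", "yy"]
p LOADED['xx'].source  # => "^(ab)(c)$"
p LOADED['yy'].source  # => "^(c)(d)$"
```

The second call replaces the entry for `xx` and leaves `yy` untouched, which is the merge behaviour the description refers to.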
.make_regex(pattern) ⇒ Regexp
Makes a regular expression from a pattern given in the configuration file.
The regular expression is anchored to avoid partial matches. Multi-line matches are allowed in case characters that are interpreted as line separators occur in the data.
    # File 'lib/proiel/tokenization.rb', line 43
    def self.make_regex(pattern)
      raise ArgumentError, 'invalid pattern' unless pattern.is_a?(String)

      Regexp.new("^#{pattern}$", Regexp::MULTILINE)
    end
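A short sketch of what the anchoring and the MULTILINE flag do in practice. The module name `MakeRegexSketch` and the pattern `(a.)(b)` are hypothetical and chosen only to make the effect visible.

```ruby
# Standalone copy of make_regex from the listing above.
# MakeRegexSketch is a hypothetical name used only for this illustration.
module MakeRegexSketch
  def self.make_regex(pattern)
    raise ArgumentError, 'invalid pattern' unless pattern.is_a?(String)

    Regexp.new("^#{pattern}$", Regexp::MULTILINE)
  end
end

# Hypothetical pattern with two capture groups.
REGEX = MakeRegexSketch.make_regex('(a.)(b)')

p REGEX.match('axb')&.captures   # => ["ax", "b"]
p REGEX.match('zaxb')            # => nil
p REGEX.match("a\nb")&.captures  # => ["a\n", "b"]
```

In Ruby, Regexp::MULTILINE lets `.` match a newline, which is why the third form matches. Ruby's `^` and `$` anchor at line boundaries rather than at the start and end of the whole string, so a form that itself contains a line break may match on a single line of it.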
.split_form(language_tag, form) ⇒ Array<String>
Splits a token form using the tokenization patterns that apply to the specified language. Tokenization patterns must already have been loaded.
    # File 'lib/proiel/tokenization.rb', line 74
    def self.split_form(language_tag, form)
      raise ArgumentError, 'invalid language tag' unless language_tag.is_a?(String)
      raise ArgumentError, 'invalid form' unless form.is_a?(String)

      if form[WORD_PATTERN]
        # Split on any non-word character like a space or punctuation
        form.split(WORD_PATTERN)
      elsif @@regexes.key?(language_tag) and form[@@regexes[language_tag]]
        # Apply language-specific pattern
        form.match(@@regexes[language_tag]).captures
      elsif form == ''
        ['']
      else
        # Give up and split by character
        form.split(/()/)
      end
    end
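The order of the branches can be followed with a standalone sketch. It copies WORD_PATTERN from the constant summary and substitutes a fixed hash for the loaded patterns. The module name `SplitSketch`, the language tag `xx` and the pattern `(ne)(que)` are hypothetical stand-ins for real configuration.

```ruby
# Standalone sketch of the branching in split_form; the gem is not required.
# SplitSketch is a hypothetical name used only for this illustration.
module SplitSketch
  WORD_PATTERN = /([^[\u{E000}-\u{F8FF}][[:word:]]]+)/.freeze

  # Hypothetical language tag and pattern standing in for loaded configuration.
  REGEXES = { 'xx' => Regexp.new('^(ne)(que)$', Regexp::MULTILINE) }.freeze

  def self.split_form(language_tag, form)
    if form[WORD_PATTERN]
      # The capture group in WORD_PATTERN keeps the separators in the result
      form.split(WORD_PATTERN)
    elsif REGEXES.key?(language_tag) && form[REGEXES[language_tag]]
      # One element per capture group of the language-specific pattern
      form.match(REGEXES[language_tag]).captures
    elsif form == ''
      ['']
    else
      # Character-level fallback
      form.split(/()/)
    end
  end
end

p SplitSketch.split_form('xx', 'a b')    # => ["a", " ", "b"]
p SplitSketch.split_form('xx', 'neque')  # => ["ne", "que"]
p SplitSketch.split_form('xx', '')       # => [""]
p SplitSketch.split_form('xx', 'abc')    # fallback branch
```

A form containing any non-word character is always handled by the first branch, so the language-specific pattern is only consulted for forms made up entirely of word or private-use characters.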