Class: Linguist::Tokenizer
- Inherits: Object (Linguist::Tokenizer < Object)
- Defined in: lib/linguist/tokenizer.rb
Overview
Generic programming language tokenizer.
Tokens are designed for use in the language Bayes classifier. The tokenizer strips data strings and comments and preserves significant language symbols.
Constant Summary
- SINGLE_LINE_COMMENTS =
[
  '//', # C
  '#',  # Ruby
  '%',  # Tex
]
- MULTI_LINE_COMMENTS =
[
  ['/*', '*/'],    # C
  ['<!--', '-->'], # XML
  ['{-', '-}'],    # Haskell
  ['(*', '*)']     # Coq
]
- START_SINGLE_LINE_COMMENT =
Regexp.compile(SINGLE_LINE_COMMENTS.map { |c| "^\s*#{Regexp.escape(c)} " }.join("|"))
- START_MULTI_LINE_COMMENT =
Regexp.compile(MULTI_LINE_COMMENTS.map { |c| Regexp.escape(c[0]) }.join("|"))
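To see what these two patterns accept, the sketch below rebuilds them outside the class from the same definitions. The helper method names and the sample strings are hypothetical, introduced only so the snippet runs with plain Ruby and no linguist gem. One detail worth noting: inside a double-quoted Ruby string, `\s` is a literal space, so the single-line pattern as written allows leading spaces rather than arbitrary whitespace.

```ruby
# Hypothetical helpers: local rebuilds of the constants above.
def single_line_comment_start
  markers = ['//', '#', '%']
  Regexp.compile(markers.map { |c| "^\s*#{Regexp.escape(c)} " }.join("|"))
end

def multi_line_comments
  [['/*', '*/'], ['<!--', '-->'], ['{-', '-}'], ['(*', '*)']]
end

def multi_line_comment_start
  Regexp.compile(multi_line_comments.map { |c| Regexp.escape(c[0]) }.join("|"))
end

# Hypothetical helper: find the first opening delimiter in text and
# look up its closing partner with Array#assoc.
def closing_delimiter_for(text)
  opener = text[multi_line_comment_start]
  opener && multi_line_comments.assoc(opener)[1]
end

# The marker must start the line and be followed by a space.
p single_line_comment_start.match?("# note")        # => true
p single_line_comment_start.match?("  // note")     # => true
p single_line_comment_start.match?("#note")         # => false
p single_line_comment_start.match?("x = 1 # note")  # => false

p closing_delimiter_for("/* hi */")                 # => "*/"
```

Because the single-line pattern requires a trailing space, a bare `#note` is not treated as a comment start and would fall through to the other scanner branches.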
Class Method Summary
- .tokenize(data) ⇒ Object
Public: Extract tokens from data.
Instance Method Summary
- #extract_sgml_tokens(data) ⇒ Object
Internal: Extract tokens from inside SGML tag.
- #extract_tokens(data) ⇒ Object
Internal: Extract generic tokens from data.
Class Method Details
.tokenize(data) ⇒ Object
Public: Extract tokens from data.
data - String to tokenize
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 13

def self.tokenize(data)
  new.extract_tokens(data)
end
Instance Method Details
#extract_sgml_tokens(data) ⇒ Object
Internal: Extract tokens from inside SGML tag.
data - SGML tag String.
Examples
extract_sgml_tokens("<a href='' class=foo>")
# => ["<a>", "href=", "class="]
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 108

def extract_sgml_tokens(data)
  s = StringScanner.new(data)

  tokens = []

  until s.eos?
    # Emit start token
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"

    # Emit attributes with trailing =
    elsif token = s.scan(/\w+=/)
      tokens << token

      # Then skip over attribute value
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)
      else
        s.skip_until(/\w+/)
      end

    # Emit lone attributes
    elsif token = s.scan(/\w+/)
      tokens << token

    # Stop at the end of the tag
    elsif s.scan(/>/)
      s.terminate

    else
      s.getch
    end
  end

  tokens
end
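The attribute branch relies on two StringScanner behaviours: `scan` only matches at the current position, and `skip_until` moves the pointer past the first match or leaves it untouched when nothing matches. The sketch below isolates that step. `attribute_name_only` is a hypothetical helper, not part of Linguist, and the inputs are made up.

```ruby
require 'strscan'

# Hypothetical helper: read one attribute name, then try to skip its
# double-quoted value. Returns the name and whatever is left unscanned.
def attribute_name_only(text)
  s = StringScanner.new(text)
  name = s.scan(/\w+=/)                   # anchored at the current position
  s.skip_until(/[^\\]"/) if s.scan(/"/)   # jump past the closing quote, if found
  [name, s.rest]
end

p attribute_name_only('class="a b" id=x')   # => ["class=", " id=x"]

# With an empty value there is no non-backslash character before the
# closing quote, so skip_until finds nothing and the pointer stays put.
p attribute_name_only('href="" class=foo')  # => ["href=", "\" class=foo"]
```

In the full method the outer loop then steps over the leftover quote with `getch`, so scanning continues with the next attribute.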
#extract_tokens(data) ⇒ Object
Internal: Extract generic tokens from data.
data - String to scan.
Examples
extract_tokens("printf('Hello')")
# => ['printf', '(', ')']
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 48

def extract_tokens(data)
  s = StringScanner.new(data)

  tokens = []
  until s.eos?
    # Single line comment
    if token = s.scan(START_SINGLE_LINE_COMMENT)
      tokens << token.strip
      s.skip_until(/\n|\Z/)

    # Multiline comments
    elsif token = s.scan(START_MULTI_LINE_COMMENT)
      tokens << token
      close_token = MULTI_LINE_COMMENTS.assoc(token)[1]
      s.skip_until(Regexp.compile(Regexp.escape(close_token)))
      tokens << close_token

    # Skip single or double quoted strings
    elsif s.scan(/"/)
      s.skip_until(/[^\\]"/)
    elsif s.scan(/'/)
      s.skip_until(/[^\\]'/)

    # Skip number literals
    elsif s.scan(/(0x)?\d+/)

    # SGML style brackets
    elsif token = s.scan(/<[^\s<>][^<>]*>/)
      extract_sgml_tokens(token).each { |t| tokens << t }

    # Common programming punctuation
    elsif token = s.scan(/;|\{|\}|\(|\)/)
      tokens << token

    # Regular token
    elsif token = s.scan(/[\w\.@#\/\*]+/)
      tokens << token

    # Common operators
    elsif token = s.scan(/<<?|\+|\-|\*|\/|%|&&?|\|\|?/)
      tokens << token

    else
      s.getch
    end
  end

  tokens
end
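As a rough illustration of how the branch order drops literals while keeping structure, here is a reduced sketch of the same loop. `tiny_tokens` is a hypothetical name, and the comment, SGML, and operator branches are left out on purpose, so it is not a substitute for the method above.

```ruby
require 'strscan'

# Hypothetical, reduced version of the loop above: keeps words and
# brackets, drops string and number literals.
def tiny_tokens(data)
  s = StringScanner.new(data)
  tokens = []

  until s.eos?
    if s.scan(/"/)
      s.skip_until(/[^\\]"/)     # skip a double-quoted string
    elsif s.scan(/'/)
      s.skip_until(/[^\\]'/)     # skip a single-quoted string
    elsif s.scan(/(0x)?\d+/)
      # number literal: consumed, nothing emitted
    elsif token = s.scan(/;|\{|\}|\(|\)/)
      tokens << token            # punctuation is kept
    elsif token = s.scan(/\w+/)
      tokens << token            # identifiers and keywords are kept
    else
      s.getch                    # anything else: advance one character
    end
  end

  tokens
end

p tiny_tokens(%q{puts("hi", 42);})  # => ["puts", "(", ")", ";"]
```

The string and number checks come before the word check, so a literal is consumed whole instead of being emitted as a token.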