Class: Linguist::Tokenizer
- Inherits: Object
- Defined in: lib/linguist/tokenizer.rb
Overview
Generic programming language tokenizer.
Tokens are designed for use in the language Bayes classifier. The tokenizer strips data strings and comments and preserves significant language symbols.
Constant Summary
- SINGLE_LINE_COMMENTS =
  [
    '//', # C
    '#',  # Ruby
    '%',  # Tex
  ]
- MULTI_LINE_COMMENTS =
  [
    ['/*', '*/'],    # C
    ['<!--', '-->'], # XML
    ['{-', '-}'],    # Haskell
    ['(*', '*)']     # Coq
  ]
- START_SINGLE_LINE_COMMENT =
Regexp.compile(SINGLE_LINE_COMMENTS.map { |c| "^\s*#{Regexp.escape(c)} " }.join("|"))
- START_MULTI_LINE_COMMENT =
Regexp.compile(MULTI_LINE_COMMENTS.map { |c| Regexp.escape(c[0]) }.join("|"))
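As a quick check of how these constants combine, here is a self-contained sketch with the two arrays rebuilt locally (variable names here are illustrative, not the library's). Note that the single-line pattern requires a space after the comment marker:

```ruby
# Rebuilt locally from the constant definitions above, to show what the
# generated comment regexes actually match.
single = ['//', '#', '%']
start_single = Regexp.compile(single.map { |c| '^\s*' + Regexp.escape(c) + ' ' }.join('|'))

multi = [['/*', '*/'], ['<!--', '-->'], ['{-', '-}'], ['(*', '*)']]
start_multi = Regexp.compile(multi.map { |c| Regexp.escape(c[0]) }.join('|'))

start_single.match?('// a C comment')  # matches: marker plus trailing space
start_single.match?('//no-space')      # no match: a space must follow the marker
start_multi.match?('/* block */')      # matches the opening delimiter
```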
Class Method Summary
-
.tokenize(data) ⇒ Object
Public: Extract tokens from data.
Instance Method Summary
-
#extract_sgml_tokens(data) ⇒ Object
Internal: Extract tokens from inside SGML tag.
-
#extract_shebang(data) ⇒ Object
Internal: Extract normalized shebang command token.
-
#extract_tokens(data) ⇒ Object
Internal: Extract generic tokens from data.
Class Method Details
.tokenize(data) ⇒ Object
Public: Extract tokens from data.
data - String to tokenize
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 15

def self.tokenize(data)
  new.extract_tokens(data)
end
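The class method is a thin convenience wrapper: it builds a fresh instance per call and delegates to #extract_tokens. A minimal sketch of the same pattern (the Tokenizer class and the trivial extract_tokens body below are stand-ins, not the real implementation):

```ruby
class Tokenizer
  # Class-level entry point delegating to a per-call instance,
  # mirroring Linguist::Tokenizer.tokenize above.
  def self.tokenize(data)
    new.extract_tokens(data)
  end

  # Stand-in body; the real method drives a StringScanner.
  def extract_tokens(data)
    data.scan(/\w+/)
  end
end

Tokenizer.tokenize("printf(msg)")  # => ["printf", "msg"]
```

Creating a new instance per call keeps the tokenizer stateless across inputs.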
Instance Method Details
#extract_sgml_tokens(data) ⇒ Object
Internal: Extract tokens from inside SGML tag.
data - SGML tag String.
Examples
extract_sgml_tokens("<a href='' class=foo>")
# => ["<a>", "href="]
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 150

def extract_sgml_tokens(data)
  s = StringScanner.new(data)

  tokens = []

  until s.eos?
    # Emit start token
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"

    # Emit attributes with trailing =
    elsif token = s.scan(/\w+=/)
      tokens << token

      # Then skip over attribute value
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)
      else
        s.skip_until(/\w+/)
      end

    # Emit lone attributes
    elsif token = s.scan(/\w+/)
      tokens << token

    # Stop at the end of the tag
    elsif s.scan(/>/)
      s.terminate

    else
      s.getch
    end
  end

  tokens
end
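Copied into a standalone script with the StringScanner require it assumes, the method can be exercised directly; the tag below is a hypothetical input chosen to hit the quoted-attribute and lone-attribute branches:

```ruby
require 'strscan'

# Standalone copy of extract_sgml_tokens from the listing above.
def extract_sgml_tokens(data)
  s = StringScanner.new(data)
  tokens = []
  until s.eos?
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"             # emit normalized start token
    elsif token = s.scan(/\w+=/)
      tokens << token                   # keep attribute name, drop its value
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)
      else
        s.skip_until(/\w+/)
      end
    elsif token = s.scan(/\w+/)
      tokens << token                   # lone (valueless) attribute
    elsif s.scan(/>/)
      s.terminate                       # done with this tag
    else
      s.getch
    end
  end
  tokens
end

extract_sgml_tokens('<input type="text" required>')
# => ["<input>", "type=", "required"]
```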
#extract_shebang(data) ⇒ Object
Internal: Extract normalized shebang command token.
Examples
extract_shebang("#!/usr/bin/ruby")
# => "ruby"
extract_shebang("#!/usr/bin/env node")
# => "node"
Returns String token, or nil if it couldn’t be parsed.
# File 'lib/linguist/tokenizer.rb', line 124

def extract_shebang(data)
  s = StringScanner.new(data)

  if path = s.scan(/^#!\s*\S+/)
    script = path.split('/').last
    if script == 'env'
      s.scan(/\s+/)
      script = s.scan(/\S+/)
    end
    script = script[/[^\d]+/, 0]
    return script
  end

  nil
end
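The final `script[/[^\d]+/, 0]` also strips trailing version digits, so versioned interpreters normalize to a bare name. A standalone copy with illustrative inputs:

```ruby
require 'strscan'

# Standalone copy of extract_shebang from the listing above.
def extract_shebang(data)
  s = StringScanner.new(data)
  if path = s.scan(/^#!\s*\S+/)
    script = path.split('/').last
    if script == 'env'            # "#!/usr/bin/env node" -> take the next word
      s.scan(/\s+/)
      script = s.scan(/\S+/)
    end
    script = script[/[^\d]+/, 0]  # "python2.7" -> "python"
    return script
  end
  nil
end

extract_shebang("#!/usr/bin/python2.7")  # => "python"
extract_shebang("#!/usr/bin/env node")   # => "node"
extract_shebang("no shebang here")       # => nil
```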
#extract_tokens(data) ⇒ Object
Internal: Extract generic tokens from data.
data - String to scan.
Examples
extract_tokens("printf('Hello')")
# => ['printf', '(', ')']
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 50

def extract_tokens(data)
  s = StringScanner.new(data)

  tokens = []
  until s.eos?
    if token = s.scan(/^#!.+$/)
      if name = extract_shebang(token)
        tokens << "SHEBANG#!#{name}"
      end

    # Single line comment
    elsif token = s.scan(START_SINGLE_LINE_COMMENT)
      tokens << token.strip
      s.skip_until(/\n|\Z/)

    # Multiline comments
    elsif token = s.scan(START_MULTI_LINE_COMMENT)
      tokens << token
      close_token = MULTI_LINE_COMMENTS.assoc(token)[1]
      s.skip_until(Regexp.compile(Regexp.escape(close_token)))
      tokens << close_token

    # Skip single or double quoted strings
    elsif s.scan(/"/)
      if s.peek(1) == "\""
        s.getch
      else
        s.skip_until(/[^\\]"/)
      end
    elsif s.scan(/'/)
      if s.peek(1) == "'"
        s.getch
      else
        s.skip_until(/[^\\]'/)
      end

    # Skip number literals
    elsif s.scan(/(0x)?\d(\d|\.)*/)

    # SGML style brackets
    elsif token = s.scan(/<[^\s<>][^<>]*>/)
      extract_sgml_tokens(token).each { |t| tokens << t }

    # Common programming punctuation
    elsif token = s.scan(/;|\{|\}|\(|\)|\[|\]/)
      tokens << token

    # Regular token
    elsif token = s.scan(/[\w\.@#\/\*]+/)
      tokens << token

    # Common operators
    elsif token = s.scan(/<<?|\+|\-|\*|\/|%|&&?|\|\|?/)
      tokens << token

    else
      s.getch
    end
  end

  tokens
end
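A reduced, self-contained sketch of the same scanning loop, covering just the comment-stripping, string-skipping, and punctuation branches (the tiny_tokens helper name is invented for illustration; the real method handles many more cases):

```ruby
require 'strscan'

# Reduced sketch of the scanning loop above: strips '// ' comments and
# double-quoted strings, keeps punctuation and identifiers.
def tiny_tokens(data)
  s = StringScanner.new(data)
  tokens = []
  until s.eos?
    if s.scan(%r{// })
      s.skip_until(/\n|\Z/)        # drop the rest of the comment line
    elsif s.scan(/"/)
      # empty string literal is just a second quote; otherwise skip to
      # the first unescaped closing quote
      s.peek(1) == '"' ? s.getch : s.skip_until(/[^\\]"/)
    elsif token = s.scan(/[;{}()\[\]]/)
      tokens << token              # punctuation is a significant symbol
    elsif token = s.scan(/[\w\.]+/)
      tokens << token              # ordinary identifier token
    else
      s.getch                      # skip anything unrecognized
    end
  end
  tokens
end

tiny_tokens(%q{printf("Hello"); // greet})
# => ["printf", "(", ")", ";"]
```

The string literal's contents vanish while the surrounding call structure survives, which is exactly the property the classifier relies on.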
# File 'lib/linguist/tokenizer.rb', line 50 def extract_tokens(data) s = StringScanner.new(data) tokens = [] until s.eos? if token = s.scan(/^#!.+$/) if name = extract_shebang(token) tokens << "SHEBANG#!#{name}" end # Single line comment elsif token = s.scan(START_SINGLE_LINE_COMMENT) tokens << token.strip s.skip_until(/\n|\Z/) # Multiline comments elsif token = s.scan(START_MULTI_LINE_COMMENT) tokens << token close_token = MULTI_LINE_COMMENTS.assoc(token)[1] s.skip_until(Regexp.compile(Regexp.escape(close_token))) tokens << close_token # Skip single or double quoted strings elsif s.scan(/"/) if s.peek(1) == "\"" s.getch else s.skip_until(/[^\\]"/) end elsif s.scan(/'/) if s.peek(1) == "'" s.getch else s.skip_until(/[^\\]'/) end # Skip number literals elsif s.scan(/(0x)?\d(\d|\.)*/) # SGML style brackets elsif token = s.scan(/<[^\s<>][^<>]*>/) extract_sgml_tokens(token).each { |t| tokens << t } # Common programming punctuation elsif token = s.scan(/;|\{|\}|\(|\)|\[|\]/) tokens << token # Regular token elsif token = s.scan(/[\w\.@#\/\*]+/) tokens << token # Common operators elsif token = s.scan(/<<?|\+|\-|\*|\/|%|&&?|\|\|?/) tokens << token else s.getch end end tokens end |