Class: Linguist::Tokenizer

Inherits: Object
Defined in: lib/linguist/tokenizer.rb
Overview

Generic programming language tokenizer.

Tokens are designed for use in the language Bayes classifier. The tokenizer strips any data strings or comments and preserves significant language symbols.
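A quick usage sketch; the call form is the public entry point documented below, and the input/output pair is taken verbatim from the extract_tokens example further down:

require 'linguist/tokenizer'

Linguist::Tokenizer.tokenize("printf('Hello')")
# => ["printf", "(", ")"]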
Constant Summary
- BYTE_LIMIT =
  Read up to 100KB

  100_000

- SINGLE_LINE_COMMENTS =
  Start state on token, ignore anything till the next newline

  [
    '//', # C
    '#',  # Ruby
    '%',  # Tex
  ]

- MULTI_LINE_COMMENTS =
  Start state on opening token, ignore anything until the closing token is reached.

  [
    ['/*', '*/'],    # C
    ['<!--', '-->'], # XML
    ['{-', '-}'],    # Haskell
    ['(*', '*)']     # Coq
  ]

- START_SINGLE_LINE_COMMENT =

  Regexp.compile(SINGLE_LINE_COMMENTS.map { |c| "\s*#{Regexp.escape(c)} " }.join("|"))

- START_MULTI_LINE_COMMENT =

  Regexp.compile(MULTI_LINE_COMMENTS.map { |c| Regexp.escape(c[0]) }.join("|"))
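For orientation: extract_tokens (below) uses Array#assoc on MULTI_LINE_COMMENTS to map an opening comment token to its closing counterpart. A minimal illustration using the constants above:

MULTI_LINE_COMMENTS.assoc('/*')       # => ["/*", "*/"]
MULTI_LINE_COMMENTS.assoc('<!--')[1]  # => "-->"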
Class Method Summary
- .tokenize(data) ⇒ Object
  Public: Extract tokens from data.
Instance Method Summary
- #extract_sgml_tokens(data) ⇒ Object
  Internal: Extract tokens from inside SGML tag.

- #extract_shebang(data) ⇒ Object
  Internal: Extract normalized shebang command token.

- #extract_tokens(data) ⇒ Object
  Internal: Extract generic tokens from data.
Class Method Details
.tokenize(data) ⇒ Object
Public: Extract tokens from data.
data - String to tokenize
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 15

def self.tokenize(data)
  new.extract_tokens(data)
end
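.tokenize is a thin wrapper around a fresh instance, so the two calls below are equivalent. The expected output follows from the shebang branch of extract_tokens (a sketch, not a documented example):

Linguist::Tokenizer.tokenize("#!/usr/bin/env node\n")
# => ["SHEBANG#!node"]

# equivalent to:
Linguist::Tokenizer.new.extract_tokens("#!/usr/bin/env node\n")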
Instance Method Details
#extract_sgml_tokens(data) ⇒ Object
Internal: Extract tokens from inside SGML tag.
data - SGML tag String.
Examples
extract_sgml_tokens("<a href='' class=foo>")
# => ["<a>", "href="]
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 158

def extract_sgml_tokens(data)
  s = StringScanner.new(data)

  tokens = []

  until s.eos?
    # Emit start token
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"

    # Emit attributes with trailing =
    elsif token = s.scan(/\w+=/)
      tokens << token

      # Then skip over attribute value
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)
      else
        s.skip_until(/\w+/)
      end

    # Emit lone attributes
    elsif token = s.scan(/\w+/)
      tokens << token

    # Stop at the end of the tag
    elsif s.scan(/>/)
      s.terminate

    else
      s.getch
    end
  end

  tokens
end
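The method is built on Ruby's StringScanner; the following hand-run walks the same regexes over a small tag to show how each branch consumes input (the whitespace step is simplified here, the method itself falls through to getch):

require 'strscan'

s = StringScanner.new("<a href='x'>")
s.scan(/<\/?[^\s>]+/)   # => "<a"    (tag name; the method emits "<a>")
s.scan(/\s+/)           # => " "
s.scan(/\w+=/)          # => "href=" (attribute with trailing =)
s.scan(/'/)             # => "'"
s.skip_until(/[^\\]'/)  # skips the quoted value, stopping after the closing '
s.scan(/>/)             # => ">"
s.eos?                  # => true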
#extract_shebang(data) ⇒ Object
Internal: Extract normalized shebang command token.
Examples
extract_shebang("#!/usr/bin/ruby")
# => "ruby"
extract_shebang("#!/usr/bin/env node")
# => "node"
Returns String token or nil if it couldn't be parsed.
# File 'lib/linguist/tokenizer.rb', line 132

def extract_shebang(data)
  s = StringScanner.new(data)

  if path = s.scan(/^#!\s*\S+/)
    script = path.split('/').last
    if script == 'env'
      s.scan(/\s+/)
      script = s.scan(/\S+/)
    end
    script = script[/[^\d]+/, 0] if script
    return script
  end

  nil
end
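The documented examples, plus two hypothetical inputs: one showing how script[/[^\d]+/, 0] trims a trailing version number, and one showing the nil fallback:

t = Linguist::Tokenizer.new
t.extract_shebang("#!/usr/bin/ruby")       # => "ruby"
t.extract_shebang("#!/usr/bin/env node")   # => "node"
t.extract_shebang("#!/usr/bin/python2.7")  # => "python" (version digits trimmed)
t.extract_shebang("no shebang here")       # => nil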
#extract_tokens(data) ⇒ Object
Internal: Extract generic tokens from data.
data - String to scan.
Examples
extract_tokens("printf('Hello')")
# => ['printf', '(', ')']
Returns Array of token Strings.
# File 'lib/linguist/tokenizer.rb', line 56

def extract_tokens(data)
  s = StringScanner.new(data)

  tokens = []
  until s.eos?
    break if s.pos >= BYTE_LIMIT

    if token = s.scan(/^#!.+$/)
      if name = extract_shebang(token)
        tokens << "SHEBANG#!#{name}"
      end

    # Single line comment
    elsif s.beginning_of_line? && token = s.scan(START_SINGLE_LINE_COMMENT)
      # tokens << token.strip
      s.skip_until(/\n|\Z/)

    # Multiline comments
    elsif token = s.scan(START_MULTI_LINE_COMMENT)
      # tokens << token
      close_token = MULTI_LINE_COMMENTS.assoc(token)[1]
      s.skip_until(Regexp.compile(Regexp.escape(close_token)))
      # tokens << close_token

    # Skip single or double quoted strings
    elsif s.scan(/"/)
      if s.peek(1) == "\""
        s.getch
      else
        s.skip_until(/[^\\]"/)
      end
    elsif s.scan(/'/)
      if s.peek(1) == "'"
        s.getch
      else
        s.skip_until(/[^\\]'/)
      end

    # Skip number literals
    elsif s.scan(/(0x)?\d(\d|\.)*/)

    # SGML style brackets
    elsif token = s.scan(/<[^\s<>][^<>]*>/)
      extract_sgml_tokens(token).each { |t| tokens << t }

    # Common programming punctuation
    elsif token = s.scan(/;|\{|\}|\(|\)|\[|\]/)
      tokens << token

    # Regular token
    elsif token = s.scan(/[\w\.@#\/\*]+/)
      tokens << token

    # Common operators
    elsif token = s.scan(/<<?|\+|\-|\*|\/|%|&&?|\|\|?/)
      tokens << token

    else
      s.getch
    end
  end

  tokens
end
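An end-to-end sketch of what these branches imply for a small C-flavoured snippet; the comment body and the string contents are skipped, while identifiers and punctuation survive (expected output, not a documented example):

Linguist::Tokenizer.tokenize(<<~'CODE')
  // greet the user
  printf("Hello, %s!", name);
CODE
# => ["printf", "(", "name", ")", ";"]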