Class: Linguist::Tokenizer

Inherits:
Object
Defined in:
lib/linguist/tokenizer.rb

Overview

Generic programming language tokenizer.

Tokens are designed for use in the Bayesian language classifier. The tokenizer strips string and comment contents and preserves significant language symbols.
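
For example, string contents are dropped while the comment marker survives as a token (a hand-traced sketch using the entry point documented below; the sample source line is illustrative):

Linguist::Tokenizer.tokenize("foo = \"bar\"\n# a comment\n")
# => ["foo", "#"]

Note that the bare = is dropped as well: it appears in neither the tokenizer's punctuation set nor its operator set.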

Constant Summary

SINGLE_LINE_COMMENTS =
[
  '//', # C
  '#',  # Ruby
  '%',  # TeX
]
MULTI_LINE_COMMENTS =
[
  ['/*', '*/'],    # C
  ['<!--', '-->'], # XML
  ['{-', '-}'],    # Haskell
  ['(*', '*)']     # Coq
]
START_SINGLE_LINE_COMMENT =
Regexp.compile(SINGLE_LINE_COMMENTS.map { |c|
  "^\s*#{Regexp.escape(c)} "
}.join("|"))
START_MULTI_LINE_COMMENT =
Regexp.compile(MULTI_LINE_COMMENTS.map { |c|
  Regexp.escape(c[0])
}.join("|"))
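
A quick illustration of how these patterns behave (hand-checked against the definitions above; note that the single-line patterns require a space after the comment marker, since each source string ends in " "):

START_SINGLE_LINE_COMMENT =~ "# a Ruby comment"
# => 0
START_SINGLE_LINE_COMMENT =~ "#no-space"
# => nil
START_MULTI_LINE_COMMENT =~ "/* a C comment */"
# => 0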

Class Method Summary

.tokenize(data) ⇒ Object
  Public: Extract tokens from data.

Instance Method Summary

#extract_sgml_tokens(data) ⇒ Object
  Internal: Extract tokens from inside an SGML tag.

#extract_shebang(data) ⇒ Object
  Internal: Extract normalized shebang command token.

#extract_tokens(data) ⇒ Object
  Internal: Extract generic tokens from data.

Class Method Details

.tokenize(data) ⇒ Object

Public: Extract tokens from data.

data - String to tokenize.

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 15

def self.tokenize(data)
  new.extract_tokens(data)
end
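
Since .tokenize simply delegates to #extract_tokens, the example documented for #extract_tokens below applies directly:

Linguist::Tokenizer.tokenize("printf('Hello')")
# => ["printf", "(", ")"]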

Instance Method Details

#extract_sgml_tokens(data) ⇒ Object

Internal: Extract tokens from inside an SGML tag.

data - SGML tag String.

Examples

extract_sgml_tokens("<a href='' class=foo>")
# => ["<a>", "href=", "class="]

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 150

def extract_sgml_tokens(data)
  s = StringScanner.new(data)

  tokens = []

  until s.eos?
    # Emit start token
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"

    # Emit attributes with trailing =
    elsif token = s.scan(/\w+=/)
      tokens << token

      # Then skip over attribute value
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)
      else
        s.skip_until(/\w+/)
      end

    # Emit lone attributes
    elsif token = s.scan(/\w+/)
      tokens << token

    # Stop at the end of the tag
    elsif s.scan(/>/)
      s.terminate

    else
      s.getch
    end
  end

  tokens
end
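
A further hand-traced example (a sketch, not canonical output): quoted and bare attribute values are both skipped, while the tag name and attribute names are kept:

extract_sgml_tokens('<img src="foo.png" width=100>')
# => ["<img>", "src=", "width="]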

#extract_shebang(data) ⇒ Object

Internal: Extract normalized shebang command token.

Examples

extract_shebang("#!/usr/bin/ruby")
# => "ruby"

extract_shebang("#!/usr/bin/env node")
# => "node"

Returns String token, or nil if it couldn't be parsed.



# File 'lib/linguist/tokenizer.rb', line 124

def extract_shebang(data)
  s = StringScanner.new(data)

  # Match the "#!/path/to/command" line
  if path = s.scan(/^#!\s*\S+/)
    script = path.split('/').last
    # "#!/usr/bin/env foo" names the actual command in the next word
    if script == 'env'
      s.scan(/\s+/)
      script = s.scan(/\S+/)
    end
    # Keep only the leading non-digit characters of the command name
    script = script[/[^\d]+/, 0]
    return script
  end

  nil
end
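
Because the final step keeps only the leading run of non-digit characters, versioned interpreters are normalized to their base name (a hand-traced example, not from the original docs):

extract_shebang("#!/usr/bin/python2.7")
# => "python"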

#extract_tokens(data) ⇒ Object

Internal: Extract generic tokens from data.

data - String to scan.

Examples

extract_tokens("printf('Hello')")
# => ['printf', '(', ')']

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 50

def extract_tokens(data)
  s = StringScanner.new(data)

  tokens = []
  until s.eos?
    if token = s.scan(/^#!.+$/)
      if name = extract_shebang(token)
        tokens << "SHEBANG#!#{name}"
      end

    # Single line comment
    elsif token = s.scan(START_SINGLE_LINE_COMMENT)
      tokens << token.strip
      s.skip_until(/\n|\Z/)

    # Multiline comments
    elsif token = s.scan(START_MULTI_LINE_COMMENT)
      tokens << token
      close_token = MULTI_LINE_COMMENTS.assoc(token)[1]
      s.skip_until(Regexp.compile(Regexp.escape(close_token)))
      tokens << close_token

    # Skip single or double quoted strings
    elsif s.scan(/"/)
      if s.peek(1) == "\""
        s.getch
      else
        s.skip_until(/[^\\]"/)
      end
    elsif s.scan(/'/)
      if s.peek(1) == "'"
        s.getch
      else
        s.skip_until(/[^\\]'/)
      end

    # Skip number literals
    elsif s.scan(/(0x)?\d(\d|\.)*/)

    # SGML style brackets
    elsif token = s.scan(/<[^\s<>][^<>]*>/)
      extract_sgml_tokens(token).each { |t| tokens << t }

    # Common programming punctuation
    elsif token = s.scan(/;|\{|\}|\(|\)|\[|\]/)
      tokens << token

    # Regular token
    elsif token = s.scan(/[\w\.@#\/\*]+/)
      tokens << token

    # Common operators
    elsif token = s.scan(/<<?|\+|\-|\*|\/|%|&&?|\|\|?/)
      tokens << token

    else
      s.getch
    end
  end

  tokens
end
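
Putting the branches together (a hand-traced sketch, not canonical output): multi-line comment delimiters are emitted as tokens, while the comment body and number literals are dropped:

extract_tokens("/* header */ count + 1")
# => ["/*", "*/", "count", "+"]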