Class: Linguist::Tokenizer

Inherits:
Object

Defined in:
lib/linguist/tokenizer.rb

Overview

Generic programming language tokenizer.

Tokens are designed for use in the language Bayes classifier. The tokenizer strips string literals and comment bodies from the data while preserving significant language symbols.
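For example, a call like the following (a hypothetical snippet, traced against the scanner rules shown below) keeps the comment marker, identifiers, and punctuation while dropping the comment body, the string contents, and the number literal:

  Linguist::Tokenizer.tokenize("# setup\nputs('hi', 42)")
  # => ["#", "puts", "(", ")"]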

Constant Summary

SINGLE_LINE_COMMENTS = [
  '//', # C
  '#',  # Ruby
  '%',  # TeX
]

MULTI_LINE_COMMENTS = [
  ['/*', '*/'],    # C
  ['<!--', '-->'], # XML
  ['{-', '-}'],    # Haskell
  ['(*', '*)']     # Coq
]

# NB: "\s" inside a double-quoted Ruby string is an escape for a
# literal space, so each alternative matches optional spaces at the
# start of a line, then the comment marker and a trailing space.
START_SINGLE_LINE_COMMENT = Regexp.compile(SINGLE_LINE_COMMENTS.map { |c|
  "^\s*#{Regexp.escape(c)} "
}.join("|"))

START_MULTI_LINE_COMMENT = Regexp.compile(MULTI_LINE_COMMENTS.map { |c|
  Regexp.escape(c[0])
}.join("|"))
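Given the escapes above, the comment patterns only match when the marker begins a line; roughly (an IRB-style sketch):

  START_SINGLE_LINE_COMMENT =~ "# note"        # => 0   (line starts with a comment)
  START_SINGLE_LINE_COMMENT =~ "x = 1 # note"  # => nil (marker is mid-line)
  START_MULTI_LINE_COMMENT  =~ "/* banner */"  # => 0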

Class Method Summary

.tokenize(data) ⇒ Object
  Public: Extract tokens from data.

Instance Method Summary

#extract_sgml_tokens(data) ⇒ Object
  Internal: Extract tokens from inside an SGML tag.

#extract_tokens(data) ⇒ Object
  Internal: Extract generic tokens from data.

Class Method Details

.tokenize(data) ⇒ Object

Public: Extract tokens from data.

data - String to tokenize.

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 13

def self.tokenize(data)
  new.extract_tokens(data)
end
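.tokenize is a convenience wrapper around a fresh instance, so the two calls below are interchangeable (src stands for any hypothetical input String):

  Linguist::Tokenizer.tokenize(src)
  # ...is shorthand for:
  Linguist::Tokenizer.new.extract_tokens(src)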

Instance Method Details

#extract_sgml_tokens(data) ⇒ Object

Internal: Extract tokens from inside an SGML tag.

data - SGML tag String.

Examples

extract_sgml_tokens("<a href='' class=foo>")
# => ["<a>", "href="]

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 108

def extract_sgml_tokens(data)
  s = StringScanner.new(data)

  tokens = []

  until s.eos?
    # Emit start token
    if token = s.scan(/<\/?[^\s>]+/)
      tokens << "#{token}>"

    # Emit attributes with trailing =
    elsif token = s.scan(/\w+=/)
      tokens << token

      # Then skip over attribute value
      if s.scan(/"/)
        s.skip_until(/[^\\]"/)
      elsif s.scan(/'/)
        s.skip_until(/[^\\]'/)
      else
        s.skip_until(/\w+/)
      end

    # Emit lone attributes
    elsif token = s.scan(/\w+/)
      tokens << token

    # Stop at the end of the tag
    elsif s.scan(/>/)
      s.terminate

    else
      s.getch
    end
  end

  tokens
end
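Tracing the branches above on a slightly richer tag (a hypothetical example): quoted attribute values are skipped once their name= token is emitted, unquoted values are skipped via \w+, and bare attributes come through as lone tokens:

  Linguist::Tokenizer.new.extract_sgml_tokens('<img src="a.png" width=10 async>')
  # => ["<img>", "src=", "width=", "async"]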

#extract_tokens(data) ⇒ Object

Internal: Extract generic tokens from data.

data - String to scan.

Examples

extract_tokens("printf('Hello')")
# => ['printf', '(', ')']

Returns Array of token Strings.



# File 'lib/linguist/tokenizer.rb', line 48

def extract_tokens(data)
  s = StringScanner.new(data)

  tokens = []
  until s.eos?
    # Single line comment
    if token = s.scan(START_SINGLE_LINE_COMMENT)
      tokens << token.strip
      s.skip_until(/\n|\Z/)

    # Multiline comments
    elsif token = s.scan(START_MULTI_LINE_COMMENT)
      tokens << token
      close_token = MULTI_LINE_COMMENTS.assoc(token)[1]
      s.skip_until(Regexp.compile(Regexp.escape(close_token)))
      tokens << close_token

    # Skip single or double quoted strings
    elsif s.scan(/"/)
      s.skip_until(/[^\\]"/)
    elsif s.scan(/'/)
      s.skip_until(/[^\\]'/)

    # Skip number literals
    elsif s.scan(/(0x)?\d+/)

    # SGML style brackets
    elsif token = s.scan(/<[^\s<>][^<>]*>/)
      extract_sgml_tokens(token).each { |t| tokens << t }

    # Common programming punctuation
    elsif token = s.scan(/;|\{|\}|\(|\)/)
      tokens << token

    # Regular token
    elsif token = s.scan(/[\w\.@#\/\*]+/)
      tokens << token

    # Common operators
    elsif token = s.scan(/<<?|\+|\-|\*|\/|%|&&?|\|\|?/)
      tokens << token

    else
      s.getch
    end
  end

  tokens
end
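As a final sketch (traced against the branches above), a multi-line comment keeps only its delimiters; the closing delimiter is looked up via MULTI_LINE_COMMENTS.assoc and the body is skipped:

  Linguist::Tokenizer.new.extract_tokens("/* header */ int x;")
  # => ["/*", "*/", "int", "x", ";"]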