Class: Spacy::Doc

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/ruby-spacy.rb

Overview

See also the spaCy Python API documentation for [`Doc`](spacy.io/api/doc).

Constructor Details

#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0) ⇒ Doc

Using Language#read is the recommended way to create a doc. If you need to call #initialize directly, there are two method signatures: `Spacy::Doc.new(nlp, py_doc: Object)` and `Spacy::Doc.new(nlp, text: String)`.

Parameters:

  • nlp (Language)

    an instance of the Language class

  • py_doc (Object) (defaults to: nil)

    an instance of the Python `Doc` class

  • text (String) (defaults to: nil)

    the text string to be analyzed



# File 'lib/ruby-spacy.rb', line 76

def initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL,
               retrial: 0)
  @py_nlp = nlp
  @py_doc = py_doc || @py_doc = nlp.call(text)
  @text = @py_doc.text
rescue StandardError
  retrial += 1
  raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

  sleep 0.5
  initialize(nlp, py_doc: py_doc, text: text, max_retrial: max_retrial, retrial: retrial)
end
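
The retry behavior of the constructor above can be sketched in isolation. This is a minimal illustration, not the library's code: `FlakyPipeline` and `RetryingDoc` are hypothetical stand-ins, the default of 5 retries is an assumption (the value of MAX_RETRIAL is not shown here), and the `delay` keyword is added only so the sketch runs without waiting.

```ruby
# Hypothetical stand-in for a Language object: fails a fixed number of
# times before returning a result.
class FlakyPipeline
  attr_reader :attempts

  def initialize(failures)
    @failures = failures
    @attempts = 0
  end

  def call(text)
    @attempts += 1
    raise IOError, "pipeline not ready" if @attempts <= @failures

    text.strip
  end
end

# Mirrors the shape of the constructor: on any StandardError, retry until
# the retry budget is spent, then give up with a RuntimeError.
class RetryingDoc
  attr_reader :text

  # max_retrial: 5 is an assumed default for this sketch only.
  def initialize(nlp, text:, max_retrial: 5, retrial: 0, delay: 0)
    @text = nlp.call(text)
  rescue StandardError
    retrial += 1
    raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

    sleep delay
    initialize(nlp, text: text, max_retrial: max_retrial, retrial: retrial, delay: delay)
  end
end

pipeline = FlakyPipeline.new(2)
doc = RetryingDoc.new(pipeline, text: " Hello world. ")
puts doc.text
puts pipeline.attempts
```

With two simulated failures the third attempt succeeds. If the number of failures exceeds `max_retrial`, the constructor raises instead of returning.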

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(name, *args) ⇒ Object

Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.



# File 'lib/ruby-spacy.rb', line 339

def method_missing(name, *args)
  @py_doc.send(name, *args)
end
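
The delegation idea can be tried without Python. In the sketch below a plain Ruby string stands in for the wrapped PyCall object, and `Wrapper` is a hypothetical class. Note one deliberate difference: this `respond_to_missing?` only answers true for methods the target actually has, which is a stricter variant than the one documented further down this page.

```ruby
# Hypothetical wrapper that forwards unknown calls to a wrapped object.
class Wrapper
  def initialize(target)
    @target = target
  end

  # Forward the call if the target understands it; otherwise fall back to
  # the default behavior, which raises NoMethodError.
  def method_missing(name, *args)
    return super unless @target.respond_to?(name)

    @target.public_send(name, *args)
  end

  # Keep respond_to? consistent with what method_missing can handle.
  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name) || super
  end
end

wrapped = Wrapper.new("Hello world")
length = wrapped.length
upper = wrapped.upcase
puts length
puts upper
```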

Instance Attribute Details

#py_doc ⇒ Object (readonly)

Returns a Python `Doc` instance accessible via `PyCall`.

Returns:

  • (Object)

    a Python `Doc` instance accessible via `PyCall`



# File 'lib/ruby-spacy.rb', line 59

def py_doc
  @py_doc
end

#py_nlp ⇒ Object (readonly)

Returns a Python `Language` instance accessible via `PyCall`.

Returns:

  • (Object)

    a Python `Language` instance accessible via `PyCall`



# File 'lib/ruby-spacy.rb', line 56

def py_nlp
  @py_nlp
end

#text ⇒ String (readonly)

Returns a text string of the document.

Returns:

  • (String)

    a text string of the document



# File 'lib/ruby-spacy.rb', line 62

def text
  @text
end

Instance Method Details

#[](range) ⇒ Object

Returns a span if given a range object; or returns a token if given an integer representing a position in the doc.

Parameters:

  • range (Range, Integer)

    an ordinary Ruby range object such as `0..3`, `1...4`, or `3..-1`, or an integer giving a token position in the doc



# File 'lib/ruby-spacy.rb', line 192

def [](range)
  if range.is_a?(Range)
    py_span = @py_doc[range]
    Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
  else
    Token.new(@py_doc[range])
  end
end
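
The `py_span.end - 1` step above turns an exclusive end offset into an inclusive end index. The same arithmetic can be sketched on a plain Ruby array. `inclusive_bounds` is a hypothetical helper written for this illustration, and it assumes a range with explicit start and end values (no endless or beginless ranges).

```ruby
# Resolve a Ruby range against a sequence of the given length and return
# [start_index, inclusive_end_index].
def inclusive_bounds(range, length)
  start_index = range.begin
  start_index += length if start_index.negative?

  end_index = range.end
  end_index += length if end_index.negative?
  end_index -= 1 if range.exclude_end? # `...` excludes the end position

  [start_index, end_index]
end

tokens = %w[I love coffee in the morning]

[0..3, 1...4, 3..-1].each do |range|
  first, last = inclusive_bounds(range, tokens.size)
  puts "#{range.inspect}: #{first}..#{last} => #{tokens[range].join(' ')}"
end
```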

#displacy(style: "dep", compact: false) ⇒ String

Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).

Parameters:

  • style (String) (defaults to: "dep")

    either `dep` or `ent`

  • compact (Boolean) (defaults to: false)

    only relevant to the `dep` style

Returns:

  • (String)

    an SVG string for the `dep` style, or an HTML string for the `ent` style



# File 'lib/ruby-spacy.rb', line 212

def displacy(style: "dep", compact: false)
  PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
end

#each ⇒ Object

Iterates over the elements in the doc yielding a token instance each time.



# File 'lib/ruby-spacy.rb', line 128

def each
  PyCall::List.call(@py_doc).each do |py_token|
    yield Token.new(py_token)
  end
end
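
Because the class includes Enumerable, defining `each` is what makes methods such as `map`, `select`, and `first` available. A minimal sketch with a hypothetical `TokenList` class holding plain strings instead of Token objects follows. The `enum_for` line is an addition for this sketch and is not part of the method shown above.

```ruby
# Hypothetical container: only #each is written by hand, the rest comes
# from Enumerable.
class TokenList
  include Enumerable

  def initialize(words)
    @words = words
  end

  def each
    # Return an Enumerator when no block is given (added for this sketch).
    return enum_for(:each) unless block_given?

    @words.each { |word| yield word }
  end
end

list = TokenList.new(%w[Apple is looking at buying a startup])
capitalized = list.select { |word| word == word.capitalize }
lengths = list.map(&:length)
puts capitalized.inspect
puts lengths.inspect
```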

#ents ⇒ Array<Span>

Returns an array of spans each representing a named entity.

Returns:

  • (Array<Span>)



# File 'lib/ruby-spacy.rb', line 178

def ents
  # so that ents can be "each"-ed in Ruby
  ent_array = []
  PyCall::List.call(@py_doc.ents).each do |ent|
    ent.define_singleton_method :label do
      label_
    end
    ent_array << ent
  end
  ent_array
end
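
In the code above each entity is appended as received from PyCall, with a `label` reader added that forwards to `label_`. The per-object technique, `define_singleton_method`, can be sketched with a plain Struct. `Entity` and its sample values are hypothetical stand-ins for the Python span objects.

```ruby
# Hypothetical stand-in for a Python entity span exposing `label_`.
Entity = Struct.new(:text, :label_)

entities = [Entity.new("Apple", "ORG"), Entity.new("U.K.", "GPE")]

# Add a `label` reader to each individual object; the block runs with the
# object as self, so `label_` resolves to that object's own attribute.
entities.each do |ent|
  ent.define_singleton_method(:label) { label_ }
end

labels = entities.map(&:label)
puts labels.inspect
```

Only the objects that went through the loop gain the extra method. A freshly created `Entity` does not respond to `label`.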

#noun_chunks ⇒ Array<Span>

Returns an array of spans representing noun chunks.

Returns:

  • (Array<Span>)



# File 'lib/ruby-spacy.rb', line 156

def noun_chunks
  chunk_array = []
  py_chunks = PyCall::List.call(@py_doc.noun_chunks)
  py_chunks.each do |py_chunk|
    chunk_array << Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
  end
  chunk_array
end

#openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-3.5-turbo-0613") ⇒ Object



# File 'lib/ruby-spacy.rb', line 294

def openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-3.5-turbo-0613")
  messages = [
    { role: "system", content: "Complete the text input by the user." },
    { role: "user", content: @text }
  ]
  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  begin
    response = Spacy.openai_client(access_token: access_token).chat(
      parameters: {
        model: model,
        messages: messages,
        max_tokens: max_tokens,
        temperature: temperature
      }
    )
    response.dig("choices", 0, "message", "content")
  rescue StandardError => e
    puts "Error: OpenAI API call failed."
    pp e.message
    pp e.backtrace
  end
end

#openai_embeddings(access_token: nil, model: "text-embedding-ada-002") ⇒ Object



# File 'lib/ruby-spacy.rb', line 319

def openai_embeddings(access_token: nil, model: "text-embedding-ada-002")
  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  begin
    response = Spacy.openai_client(access_token: access_token).embeddings(
      parameters: {
        model: model,
        input: @text
      }
    )
    response.dig("data", 0, "embedding")
  rescue StandardError => e
    puts "Error: OpenAI API call failed."
    pp e.message
    pp e.backtrace
  end
end

#openai_query(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-3.5-turbo-0613", messages: [], prompt: nil) ⇒ Object



# File 'lib/ruby-spacy.rb', line 216

def openai_query(access_token: nil,
                 max_tokens: 1000,
                 temperature: 0.7,
                 model: "gpt-3.5-turbo-0613",
                 messages: [],
                 prompt: nil)
  if messages.empty?
    messages = [
      { role: "system", content: prompt },
      { role: "user", content: @text }
    ]
  end

  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  begin
    response = Spacy.openai_client(access_token: access_token).chat(
      parameters: {
        model: model,
        messages: messages,
        max_tokens: max_tokens,
        temperature: temperature,
        function_call: "auto",
        stream: false,
        functions: [
          {
            name: "get_tokens",
            description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
            "parameters": {
              "type": "object",
              "properties": {
                "text": {
                  "type": "string",
                  "description": "text to be tokenized"
                }
              },
              "required": ["text"]
            }
          }
        ]
      }
    )

    message = response.dig("choices", 0, "message")

    if message["role"] == "assistant" && message["function_call"]
      messages << message
      function_name = message.dig("function_call", "name")
      _args = JSON.parse(message.dig("function_call", "arguments"))
      case function_name
      when "get_tokens"
        res = tokens.map do |t|
          {
            "surface": t.text,
            "lemma": t.lemma,
            "pos": t.pos,
            "tag": t.tag,
            "dep": t.dep,
            "ent_type": t.ent_type,
            "morphology": t.morphology
          }
        end.to_json
      end
      messages << { role: "system", content: res }
      openai_query(access_token: access_token, max_tokens: max_tokens,
                   temperature: temperature, model: model,
                   messages: messages, prompt: prompt)
    else
      message["content"]
    end
  rescue StandardError => e
    puts "Error: OpenAI API call failed."
    pp e.message
    pp e.backtrace
  end
end
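
The control flow of the method above (call the chat endpoint, answer a `get_tokens` function call with token data, then call again) can be sketched offline. Everything here is a stand-in: `FakeChatClient` is a hypothetical object that imitates only the response shape used above, and `query` is a reduced version of the loop, not the library's method.

```ruby
require "json"

# Hypothetical client: the first call requests the get_tokens function,
# the second returns a plain answer based on the token data it was sent.
class FakeChatClient
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def chat(parameters:)
    @calls += 1
    if @calls == 1
      function_call = { "name" => "get_tokens", "arguments" => '{"text":"Hi there"}' }
      message = { "role" => "assistant", "function_call" => function_call }
    else
      count = JSON.parse(parameters[:messages].last[:content]).size
      message = { "role" => "assistant", "content" => "#{count} tokens" }
    end
    { "choices" => [{ "message" => message }] }
  end
end

# Reduced loop: if the model asks for a function call, append the request
# and the token data to the conversation, then query again.
def query(client, messages, tokens)
  message = client.chat(parameters: { messages: messages }).dig("choices", 0, "message")
  return message["content"] unless message["function_call"]

  messages << message
  messages << { role: "system", content: tokens.to_json }
  query(client, messages, tokens)
end

client = FakeChatClient.new
tokens = [{ surface: "Hi" }, { surface: "there" }]
answer = query(client, [{ role: "user", content: "Hi there" }], tokens)
puts answer
```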

#respond_to_missing?(sym) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/ruby-spacy.rb', line 343

def respond_to_missing?(sym)
  sym ? true : super
end

#retokenize(start_index, end_index, attributes = {}) ⇒ Object

Retokenizes the text merging a span into a single token.

Parameters:

  • start_index (Integer)

    the start position of the span to be retokenized in the document

  • end_index (Integer)

    the end position of the span to be retokenized in the document

  • attributes (Hash) (defaults to: {})

    attributes to set on the merged token



# File 'lib/ruby-spacy.rb', line 93

def retokenize(start_index, end_index, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    retokenizer.merge(@py_doc[start_index..end_index], attrs: attributes)
  end
end
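
The effect of merging on the token sequence can be pictured with a plain array. `merge_tokens` is a hypothetical helper for illustration only. It treats `end_index` as inclusive, matching the `start_index..end_index` range in the code above, and it joins the pieces with single spaces, which is a simplification.

```ruby
# Replace tokens[start_index..end_index] with one merged token.
def merge_tokens(tokens, start_index, end_index)
  merged = tokens[start_index..end_index].join(" ")
  tokens[0...start_index] + [merged] + tokens[(end_index + 1)..]
end

tokens = %w[I like New York City]
merged = merge_tokens(tokens, 2, 4)
puts merged.inspect
```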

#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object

Retokenizes the text splitting the specified token.

Parameters:

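  • pos_in_doc (Integer)

    the position of the token to be split in the document

  • split_array (Array<String>)

    text strings of the split results

  • head_pos_in_split (Integer)

    the position within the split results of the element used as head (paired with the original token in the `heads` argument passed to spaCy)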

  • ancestor_pos (Integer)

    the position of the immediate ancestor element of the split elements in the document

  • attributes (Hash) (defaults to: {})

    the attributes of the split elements



# File 'lib/ruby-spacy.rb', line 104

def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    heads = [[@py_doc[pos_in_doc], head_pos_in_split], @py_doc[ancestor_pos]]
    retokenizer.split(@py_doc[pos_in_doc], split_array, heads: heads, attrs: attributes)
  end
end
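
Splitting is the inverse operation on the token sequence. The sketch below uses a plain array and a hypothetical `split_token` helper. The check that the pieces concatenate back to the original token reflects how spaCy's retokenizer is understood to behave and should be read as an assumption. Head and attribute handling are left out.

```ruby
# Replace the token at pos_in_doc with the given pieces.
def split_token(tokens, pos_in_doc, split_array)
  unless split_array.join == tokens[pos_in_doc]
    raise ArgumentError, "split pieces must concatenate to the original token"
  end

  tokens[0...pos_in_doc] + split_array + tokens[(pos_in_doc + 1)..]
end

tokens = %w[I live in NewYork]
result = split_token(tokens, 3, %w[New York])
puts result.inspect
```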

#sents ⇒ Array<Span>

Returns an array of spans each representing a sentence.

Returns:

  • (Array<Span>)



# File 'lib/ruby-spacy.rb', line 167

def sents
  sentence_array = []
  py_sentences = PyCall::List.call(@py_doc.sents)
  py_sentences.each do |py_sent|
    sentence_array << Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
  end
  sentence_array
end

#similarity(other) ⇒ Float

Returns a semantic similarity estimate.

Parameters:

  • other (Doc)

    the other doc to compare against

Returns:

  • (Float)


# File 'lib/ruby-spacy.rb', line 204

def similarity(other)
  py_doc.similarity(other.py_doc)
end
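
The estimate itself is computed on the Python side. spaCy's documentation describes the default as cosine similarity over averaged word vectors, so the underlying measure can be sketched in plain Ruby. `cosine_similarity` and the sample vectors are hypothetical and are not part of ruby-spacy.

```ruby
# Cosine similarity of two equal-length numeric vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
puts same_direction
puts orthogonal
```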

#span(range_or_start, optional_size = nil) ⇒ Span

Returns a span of the specified range within the doc. The method can be called in either of two ways: `Doc#span(range)` or `Doc#span(start_pos, size_of_span)`.

Parameters:

  • range_or_start (Range, Integer)

    a range object, or, alternatively, an integer that represents the start position of the span

  • optional_size (Integer) (defaults to: nil)

    an integer representing the size of the span

Returns:

  • (Span)



# File 'lib/ruby-spacy.rb', line 139

def span(range_or_start, optional_size = nil)
  if optional_size
    start_index = range_or_start
    temp = tokens[start_index...start_index + optional_size]
  else
    start_index = range_or_start.first
    range = range_or_start
    temp = tokens[range]
  end

  end_index = start_index + temp.size - 1

  Span.new(self, start_index: start_index, end_index: end_index)
end
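
Both call forms reduce to the same pair of start and inclusive end indices. The sketch below repeats the index arithmetic of the method above on a plain array. `span_bounds` is a hypothetical helper that returns the two indices instead of building a Span.

```ruby
# Compute [start_index, end_index] the way the method above does.
def span_bounds(tokens, range_or_start, optional_size = nil)
  if optional_size
    start_index = range_or_start
    selected = tokens[start_index...start_index + optional_size]
  else
    start_index = range_or_start.first
    selected = tokens[range_or_start]
  end

  [start_index, start_index + selected.size - 1]
end

tokens = %w[Give it back ! He pleaded .]
by_size = span_bounds(tokens, 1, 3)
by_range = span_bounds(tokens, 1..3)
puts by_size.inspect
puts by_range.inspect
```

Because the end index is derived from the size of the selected slice, a requested size that runs past the end of the doc is clipped to the last token.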

#to_s ⇒ String

String representation of the document.

Returns:

  • (String)


# File 'lib/ruby-spacy.rb', line 113

def to_s
  @text
end

#tokens ⇒ Array<Token>

Returns an array of tokens contained in the doc.

Returns:

  • (Array<Token>)



# File 'lib/ruby-spacy.rb', line 119

def tokens
  results = []
  PyCall::List.call(@py_doc).each do |py_token|
    results << Token.new(py_token)
  end
  results
end