Class: Spacy::Doc
Overview
See also the spaCy Python API documentation for [`Doc`](spacy.io/api/doc).
Instance Attribute Summary
-
#py_doc ⇒ Object
readonly
A Python `Doc` instance accessible via `PyCall`.
-
#py_nlp ⇒ Object
readonly
A Python `Language` instance accessible via `PyCall`.
-
#text ⇒ String
readonly
A text string of the document.
Instance Method Summary
-
#[](range) ⇒ Object
Returns a span if given a range, or a token if given an integer position in the doc.
-
#displacy(style: "dep", compact: false) ⇒ String
Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).
-
#each ⇒ Object
Iterates over the elements in the doc yielding a token instance each time.
-
#ents ⇒ Array<Span>
Returns an array of spans each representing a named entity.
-
#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0) ⇒ Doc
constructor
It is recommended to use the Language#read method to create a doc.
-
#method_missing(name, *args) ⇒ Object
Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.
-
#noun_chunks ⇒ Array<Span>
Returns an array of spans representing noun chunks.
- #openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-3.5-turbo-0613") ⇒ Object
- #openai_embeddings(access_token: nil, model: "text-embedding-ada-002") ⇒ Object
- #openai_query(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-3.5-turbo-0613", messages: [], prompt: nil) ⇒ Object
- #respond_to_missing?(sym) ⇒ Boolean
-
#retokenize(start_index, end_index, attributes = {}) ⇒ Object
Retokenizes the text merging a span into a single token.
-
#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object
Retokenizes the text splitting the specified token.
-
#sents ⇒ Array<Span>
Returns an array of spans each representing a sentence.
-
#similarity(other) ⇒ Float
Returns a semantic similarity estimate.
-
#span(range_or_start, optional_size = nil) ⇒ Span
Returns a span of the specified range within the doc.
-
#to_s ⇒ String
String representation of the document.
-
#tokens ⇒ Array<Token>
Returns an array of tokens contained in the doc.
Constructor Details
#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0) ⇒ Doc
It is recommended to use the Language#read method to create a doc. If you need to create one using #initialize, there are two method signatures: `Spacy::Doc.new(nlp_id, py_doc: Object)` and `Spacy::Doc.new(nlp_id, text: String)`.
    # File 'lib/ruby-spacy.rb', line 76

    def initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0)
      @py_nlp = nlp
      @py_doc = py_doc || @py_doc = nlp.call(text)
      @text = @py_doc.text
    rescue StandardError
      retrial += 1
      raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

      sleep 0.5
      initialize(nlp, py_doc: py_doc, text: text, max_retrial: max_retrial, retrial: retrial)
    end
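The constructor retries when building the doc raises, up to `max_retrial` times. The following is a minimal sketch of that retry pattern only, written so that it runs without spaCy or PyCall. `FlakyParser` and `build_with_retry` are hypothetical names invented for this illustration, and the default of 5 and the `wait` parameter are assumptions, not values taken from the library.

```ruby
# Toy stand-in for the retry logic in Doc#initialize; no spaCy or PyCall needed.
# FlakyParser and build_with_retry are hypothetical names used only in this sketch.
class FlakyParser
  attr_reader :calls

  def initialize(failures)
    @failures = failures # number of calls that raise before one succeeds
    @calls = 0
  end

  def call(text)
    @calls += 1
    raise "parser not ready" if @calls <= @failures

    text.split
  end
end

# Tries parser.call once, then retries up to max_retrial more times.
def build_with_retry(parser, text, max_retrial: 5, wait: 0.0)
  retrial = 0
  begin
    parser.call(text)
  rescue StandardError
    retrial += 1
    raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

    sleep wait
    retry
  end
end
```

The sketch uses `retry` inside a `begin`/`rescue` block where the listing above calls `initialize` recursively; both make one initial attempt plus at most `max_retrial` further attempts.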
Dynamic Method Handling
This class handles dynamic methods through the method_missing method.
#method_missing(name, *args) ⇒ Object
Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.
    # File 'lib/ruby-spacy.rb', line 339

    def method_missing(name, *args)
      @py_doc.send(name, *args)
    end
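To illustrate the forwarding idea in isolation, here is a small sketch in plain Ruby. `Inner` stands in for the wrapped Python object and `Wrapper` for the Ruby-side class; both names are hypothetical. The sketch also defines `respond_to_missing?` with Ruby's standard two-argument signature, which is a choice made for this example rather than a description of the library.

```ruby
# Toy stand-in for a PyCall-wrapped object; Inner and Wrapper are hypothetical names.
class Inner
  def lang
    "en"
  end

  def char_count(text)
    text.length
  end
end

class Wrapper
  def initialize(inner)
    @inner = inner
  end

  # An explicitly wrapped method is found first and never reaches method_missing.
  def to_s
    "wrapper"
  end

  # Anything not defined on Wrapper is forwarded to the wrapped object.
  def method_missing(name, *args)
    @inner.send(name, *args)
  end

  # Report only the methods the wrapped object can actually answer.
  def respond_to_missing?(name, include_private = false)
    @inner.respond_to?(name, include_private) || super
  end
end
```

With this setup, `Wrapper.new(Inner.new).lang` is answered by `Inner`, while a name unknown to both objects still raises `NoMethodError`.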
Instance Attribute Details
#py_doc ⇒ Object (readonly)
Returns a Python `Doc` instance accessible via `PyCall`.

    # File 'lib/ruby-spacy.rb', line 59

    def py_doc
      @py_doc
    end
#py_nlp ⇒ Object (readonly)
Returns a Python `Language` instance accessible via `PyCall`.

    # File 'lib/ruby-spacy.rb', line 56

    def py_nlp
      @py_nlp
    end
#text ⇒ String (readonly)
Returns a text string of the document.
    # File 'lib/ruby-spacy.rb', line 62

    def text
      @text
    end
Instance Method Details
#[](range) ⇒ Object
Returns a span if given a range, or a token if given an integer position in the doc.

    # File 'lib/ruby-spacy.rb', line 192

    def [](range)
      if range.is_a?(Range)
        py_span = @py_doc[range]
        Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
      else
        Token.new(@py_doc[range])
      end
    end
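The listing dispatches on the argument type and converts the exclusive end reported by the Python span into an inclusive `end_index`. A minimal sketch of the same two ideas on a plain array follows. `ToyDoc`, `ToySpan`, and `ToyToken` are hypothetical stand-ins, and the sketch assumes non-negative indices.

```ruby
# Hypothetical stand-ins for Doc, Span, and Token, backed by a plain array of words.
ToyToken = Struct.new(:text)
ToySpan = Struct.new(:start_index, :end_index)

class ToyDoc
  def initialize(words)
    @words = words
  end

  def [](index_or_range)
    if index_or_range.is_a?(Range)
      start_index = index_or_range.first
      # Exclusive end, as a Python span would report it.
      exclusive_end = start_index + @words[index_or_range].size
      # Store an inclusive end, mirroring the "- 1" in the listing.
      ToySpan.new(start_index, exclusive_end - 1)
    else
      ToyToken.new(@words[index_or_range])
    end
  end
end
```

In this sketch, `doc[1..2]` and `doc[1...3]` describe the same two words and therefore produce the same inclusive bounds.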
#displacy(style: "dep", compact: false) ⇒ String
Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).
    # File 'lib/ruby-spacy.rb', line 212

    def displacy(style: "dep", compact: false)
      PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
    end
#each ⇒ Object
Iterates over the elements in the doc yielding a token instance each time.
    # File 'lib/ruby-spacy.rb', line 128

    def each
      PyCall::List.call(@py_doc).each do |py_token|
        yield Token.new(py_token)
      end
    end
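A class that defines `each` can mix in Ruby's `Enumerable` to gain `map`, `select`, `first`, and similar methods. Whether `Spacy::Doc` does so is not shown in this section, so the sketch below only demonstrates the general pattern on a hypothetical `WordList` class.

```ruby
# Hypothetical class showing how an #each method combines with Enumerable.
class WordList
  include Enumerable

  def initialize(text)
    @words = text.split
  end

  # Yields one word at a time; returns an Enumerator when called without a block.
  def each
    return enum_for(:each) unless block_given?

    @words.each { |word| yield word }
  end
end
```

The `enum_for` guard is an addition in this sketch; the listing above yields unconditionally and therefore expects a block.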
#ents ⇒ Array<Span>
Returns an array of spans each representing a named entity.
    # File 'lib/ruby-spacy.rb', line 178

    def ents
      # so that ents can be "each"-ed in Ruby
      ent_array = []
      PyCall::List.call(@py_doc.ents).each do |ent|
        ent.define_singleton_method :label do
          label_
        end
        ent_array << ent
      end
      ent_array
    end
#noun_chunks ⇒ Array<Span>
Returns an array of spans representing noun chunks.
    # File 'lib/ruby-spacy.rb', line 156

    def noun_chunks
      chunk_array = []
      py_chunks = PyCall::List.call(@py_doc.noun_chunks)
      py_chunks.each do |py_chunk|
        chunk_array << Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
      end
      chunk_array
    end
#openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-3.5-turbo-0613") ⇒ Object
    # File 'lib/ruby-spacy.rb', line 294

    def openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-3.5-turbo-0613")
      messages = [
        { role: "system", content: "Complete the text input by the user." },
        { role: "user", content: @text }
      ]
      access_token ||= ENV["OPENAI_API_KEY"]
      raise "Error: OPENAI_API_KEY is not set" unless access_token

      begin
        response = Spacy.openai_client(access_token: access_token).chat(
          parameters: {
            model: model,
            messages: messages,
            max_tokens: max_tokens,
            temperature: temperature
          }
        )
        response.dig("choices", 0, "message", "content")
      rescue StandardError => e
        puts "Error: OpenAI API call failed."
        pp e.message
        pp e.backtrace
      end
    end
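All three OpenAI helpers resolve the access token the same way: an explicit `access_token:` argument wins, otherwise `OPENAI_API_KEY` is read from the environment, and an error is raised if neither is present. The sketch below isolates that fallback. `resolve_token` is a hypothetical helper, and the `env:` parameter exists only so the behavior can be exercised without touching the real environment.

```ruby
# Hypothetical helper isolating the token fallback used by the OpenAI methods.
# env defaults to the process environment; a plain Hash can be passed instead.
def resolve_token(access_token = nil, env: ENV)
  access_token ||= env["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  access_token
end
```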
#openai_embeddings(access_token: nil, model: "text-embedding-ada-002") ⇒ Object
    # File 'lib/ruby-spacy.rb', line 319

    def openai_embeddings(access_token: nil, model: "text-embedding-ada-002")
      access_token ||= ENV["OPENAI_API_KEY"]
      raise "Error: OPENAI_API_KEY is not set" unless access_token

      begin
        response = Spacy.openai_client(access_token: access_token).embeddings(
          parameters: {
            model: model,
            input: @text
          }
        )
        response.dig("data", 0, "embedding")
      rescue StandardError => e
        puts "Error: OpenAI API call failed."
        pp e.message
        pp e.backtrace
      end
    end
#openai_query(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-3.5-turbo-0613", messages: [], prompt: nil) ⇒ Object
    # File 'lib/ruby-spacy.rb', line 216

    def openai_query(access_token: nil,
                     max_tokens: 1000,
                     temperature: 0.7,
                     model: "gpt-3.5-turbo-0613",
                     messages: [],
                     prompt: nil)
      if messages.empty?
        messages = [
          { role: "system", content: prompt },
          { role: "user", content: @text }
        ]
      end

      access_token ||= ENV["OPENAI_API_KEY"]
      raise "Error: OPENAI_API_KEY is not set" unless access_token

      begin
        response = Spacy.openai_client(access_token: access_token).chat(
          parameters: {
            model: model,
            messages: messages,
            max_tokens: max_tokens,
            temperature: temperature,
            function_call: "auto",
            stream: false,
            functions: [
              {
                name: "get_tokens",
                description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
                "parameters": {
                  "type": "object",
                  "properties": {
                    "text": {
                      "type": "string",
                      "description": "text to be tokenized"
                    }
                  },
                  "required": ["text"]
                }
              }
            ]
          }
        )

        message = response.dig("choices", 0, "message")

        if message["role"] == "assistant" && message["function_call"]
          messages << message
          function_name = message.dig("function_call", "name")
          _args = JSON.parse(message.dig("function_call", "arguments"))
          case function_name
          when "get_tokens"
            res = tokens.map do |t|
              {
                "surface": t.text,
                "lemma": t.lemma,
                "pos": t.pos,
                "tag": t.tag,
                "dep": t.dep,
                "ent_type": t.ent_type,
                "morphology": t.morphology
              }
            end.to_json
          end
          messages << { role: "system", content: res }
          openai_query(access_token: access_token, max_tokens: max_tokens,
                       temperature: temperature, model: model,
                       messages: messages, prompt: prompt)
        else
          message["content"]
        end
      rescue StandardError => e
        puts "Error: OpenAI API call failed."
        pp e.message
        pp e.backtrace
      end
    end
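The control flow above is a two-step round trip: if the model replies with a `function_call` for `get_tokens`, the token data is appended to the conversation and the query is sent again; otherwise the reply's content is returned. The sketch below reproduces only that flow against a canned client, so no network access or API key is involved. `FakeChat` and `query` are hypothetical names, and the reply hashes are hand-written to match the shape the listing reads with `dig`.

```ruby
require "json"

# Hypothetical client that returns canned replies in order and records what it was sent.
class FakeChat
  attr_reader :last_messages

  def initialize(replies)
    @replies = replies
  end

  def chat(parameters:)
    @last_messages = parameters[:messages]
    @replies.shift
  end
end

# Simplified version of the round trip: token_rows stands in for Doc#tokens.
def query(client, text, prompt, token_rows, messages: [])
  if messages.empty?
    messages = [
      { role: "system", content: prompt },
      { role: "user", content: text }
    ]
  end

  response = client.chat(parameters: { messages: messages })
  message = response.dig("choices", 0, "message")

  if message["role"] == "assistant" && message["function_call"]
    # Keep the model's request in the history, then answer it with the token data.
    messages << message
    res = token_rows.to_json if message.dig("function_call", "name") == "get_tokens"
    messages << { role: "system", content: res }
    query(client, text, prompt, token_rows, messages: messages)
  else
    message["content"]
  end
end
```

A first reply containing `"function_call" => { "name" => "get_tokens", ... }` followed by a reply with `"content"` drives the sketch through both branches.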
#respond_to_missing?(sym) ⇒ Boolean
    # File 'lib/ruby-spacy.rb', line 343

    def respond_to_missing?(sym)
      sym ? true : super
    end
#retokenize(start_index, end_index, attributes = {}) ⇒ Object
Retokenizes the text merging a span into a single token.
    # File 'lib/ruby-spacy.rb', line 93

    def retokenize(start_index, end_index, attributes = {})
      PyCall.with(@py_doc.retokenize) do |retokenizer|
        retokenizer.merge(@py_doc[start_index..end_index], attrs: attributes)
      end
    end
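As a rough picture of what merging a span does to the token sequence, the sketch below performs the equivalent operation on a plain array of strings. `merge_tokens` is a hypothetical helper, both indices are inclusive as in the method signature, and joining with a single space is an assumption of this sketch rather than a statement about how spaCy builds the merged token's text.

```ruby
# Hypothetical helper: merge tokens[start_index..end_index] into one entry.
# Both indices are inclusive; the input array is not modified.
def merge_tokens(tokens, start_index, end_index)
  merged = tokens[start_index..end_index].join(" ")
  tokens.take(start_index) + [merged] + tokens.drop(end_index + 1)
end
```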
#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object
Retokenizes the text splitting the specified token.
    # File 'lib/ruby-spacy.rb', line 104

    def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
      PyCall.with(@py_doc.retokenize) do |retokenizer|
        heads = [[@py_doc[pos_in_doc], head_pos_in_split], @py_doc[ancestor_pos]]
        retokenizer.split(@py_doc[pos_in_doc], split_array, heads: heads, attrs: attributes)
      end
    end
#sents ⇒ Array<Span>
Returns an array of spans each representing a sentence.
    # File 'lib/ruby-spacy.rb', line 167

    def sents
      sentence_array = []
      py_sentences = PyCall::List.call(@py_doc.sents)
      py_sentences.each do |py_sent|
        sentence_array << Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
      end
      sentence_array
    end
#similarity(other) ⇒ Float
Returns a semantic similarity estimate.
    # File 'lib/ruby-spacy.rb', line 204

    def similarity(other)
      py_doc.similarity(other.py_doc)
    end
#span(range_or_start, optional_size = nil) ⇒ Span
Returns a span of the specified range within the doc. The method can be called in either of two ways: `Doc#span(range)` or `Doc#span(start_pos, size_of_span)`.

    # File 'lib/ruby-spacy.rb', line 139

    def span(range_or_start, optional_size = nil)
      if optional_size
        start_index = range_or_start
        temp = tokens[start_index...start_index + optional_size]
      else
        start_index = range_or_start.first
        range = range_or_start
        temp = tokens[range]
      end

      end_index = start_index + temp.size - 1

      Span.new(self, start_index: start_index, end_index: end_index)
    end
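Both calling forms reduce to the same pair of inclusive indices. The sketch below mirrors the index arithmetic of the listing on a plain array so the two forms can be compared directly. `span_bounds` is a hypothetical helper that returns `[start_index, end_index]` instead of constructing a `Span`.

```ruby
# Hypothetical helper mirroring the index arithmetic of Doc#span on a plain array.
# Returns [start_index, end_index], both inclusive.
def span_bounds(tokens, range_or_start, optional_size = nil)
  if optional_size
    start_index = range_or_start
    picked = tokens[start_index...start_index + optional_size]
  else
    start_index = range_or_start.first
    picked = tokens[range_or_start]
  end

  [start_index, start_index + picked.size - 1]
end
```

Because the end index is derived from the size of the slice actually obtained, a size that runs past the end of the array is clamped to the last available position.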
#to_s ⇒ String
String representation of the document.
    # File 'lib/ruby-spacy.rb', line 113

    def to_s
      @text
    end