Class: Typingpool::Amazon::HIT

Inherits:
Object
Defined in:
lib/typingpool/amazon/hit.rb,
lib/typingpool/amazon/hit/full.rb,
lib/typingpool/amazon/hit/assignment.rb,
lib/typingpool/amazon/hit/assignment/empty.rb,
lib/typingpool/amazon/hit/full/fromsearchhits.rb

Overview

Class representing an Amazon Mechanical Turk Human Intelligence Task (HIT).

We go above and beyond RTurk::Hit for several practical reasons:

  • To allow easy serialization. Caching is a very useful way of reducing network calls to Amazon, and thus of speeding up Typingpool. RTurk::Hit objects cannot be dumped via Marshal, apparently due to some Nokogiri objects they contain. Typingpool::Amazon::HIT objects, in contrast, are designed to be easily and compactly serialized. They store the minimal subset of information we need via simple attributes. (Presently we serialize via PStore.)

  • To attach convenience methods. RTurk does not make it easy, for example, to get HITs beyond the first “page” returned by Amazon. This class provides methods that make it easy to get ALL HITs returned by various operations.

  • To attach methods specific to Typingpool. For example, the url and project_id methods read params we’ve embedded in the annotation or in hidden fields on an external question, while the underlying stashed_params method optimizes its lookup of these variables based on how the app is most likely to be used. See also the ours? and cacheable? methods.

  • To simplify. Typingpool HITs are constrained such that we can assume they all contain only one assignment and thus only a maximum of one answer. Also, once we’ve determined that a HIT does not belong to Typingpool, it is safe to cache it forever and never download it again from Amazon.

  • To clearly partition methods that result in network calls. When you access an attribute under hit.full, like hit.full.status, it is clear you are doing something potentially expensive to obtain your hit status. Same thing with accessing an attribute under hit.assignment, like hit.assignment.worker_id – it is clear an assignment object will need to be created, implying a network call. Calling hit.id, in contrast, is always fast. (Caveat: Accessing partitioned attributes often, but not always, results in a network call. In some cases, hit.full is generated at the same time we create the hit, since we’ve obtained a full HIT serialization from Amazon. In other cases, we only have a HIT id, so accessing anything under hit.full generates a network call.)
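The serialization point in the list above can be illustrated with a minimal sketch. CachedHit, its attributes, and the cache file name are hypothetical stand-ins rather than Typingpool classes; the only real API used is Ruby's standard PStore. The sketch assumes that an object holding nothing but simple attributes can be stored and read back unchanged.

```ruby
require 'pstore'
require 'tmpdir'

# Hypothetical stand-in for a HIT that stores only simple attributes,
# so Marshal (and therefore PStore) can dump it.
class CachedHit
  attr_reader :id, :url

  def initialize(id, url)
    @id = id
    @url = url
  end
end

# Writes the object to a temporary PStore file and reads it back.
def round_trip(hit)
  Dir.mktmpdir do |dir|
    store = PStore.new(File.join(dir, 'hits.pstore'))
    # Writes happen inside a read-write transaction.
    store.transaction { store[hit.id] = hit }
    # A read-only transaction (true) is enough to fetch the object back.
    store.transaction(true) { store[hit.id] }
  end
end

restored = round_trip(CachedHit.new('HIT123', 'http://example.com/audio.mp3'))
puts restored.url
```

An object that held a parsed XML node among its instance variables would fail at the write step, which is the limitation the first bullet describes for RTurk::Hit.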

Defined Under Namespace

Classes: Assignment, Full

Constant Summary

@@cacheable_assignment_status =
Set.new %w(Approved Rejected)

Constructor Details

#initialize(rturk_hit) ⇒ HIT

Constructor. Takes an RTurk::Hit instance.



# File 'lib/typingpool/amazon/hit.rb', line 236

def initialize(rturk_hit)
  @id = rturk_hit.id
end

Instance Attribute Details

#id ⇒ Object (readonly)

Corresponds to the Amazon Mechanical Turk HIT#HITId.



# File 'lib/typingpool/amazon/hit.rb', line 233

def id
  @id
end

Class Method Details

.all(&filter) ⇒ Object

Returns all HITs associated with your AWS account as an array of Typingpool::Amazon::HIT instances. Takes an optional filter block (which should return true for HITs to be included in the final results). If not supplied, will filter so the returned hits are all Typingpool HITs (hit.ours? == true).



# File 'lib/typingpool/amazon/hit.rb', line 154

def all(&filter)
  hits = each_page do |page_number|
    page = RTurk::SearchHITs.create(:page_number => page_number)
    raw_hits = page.xml.xpath('//HIT')
    page.hits.map do |rturk_hit|
      annotation = raw_hits.shift.xpath('RequesterAnnotation').inner_text.strip
      full = Amazon::HIT::Full::FromSearchHITs.new(rturk_hit, annotation)
      cached_or_new_from_searchhits(rturk_hit, annotation)
    end
  end
  filter_ours(hits, &filter)
end
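The filter-block behaviour described for .all can be sketched without any network access. FilterableHit and select_hits are hypothetical names used only for illustration. The sketch assumes the documented rule: use the caller's block if one is given, otherwise keep only HITs for which ours? is true. The caching side effect of the real filter_ours is left out.

```ruby
# Hypothetical stand-in for a HIT; only the fields needed for filtering.
FilterableHit = Struct.new(:id, :project_id, :ours) do
  def ours?
    ours
  end
end

# Uses the caller's block when given; otherwise falls back to ours?.
def select_hits(hits, &filter)
  filter ||= ->(hit) { hit.ours? }
  hits.select { |hit| filter.call(hit) }
end

hits = [
  FilterableHit.new('A', 'p1', true),
  FilterableHit.new('B', 'p2', true),
  FilterableHit.new('C', nil, false)
]

# Default filter: every HIT that belongs to the app.
default_ids = select_hits(hits).map(&:id)

# Custom filter: the same shape of block that all_for_project passes.
project_ids = select_hits(hits) { |hit| hit.ours? && hit.project_id == 'p1' }.map(&:id)

puts default_ids.inspect
puts project_ids.inspect
```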

.all_approved ⇒ Object

Returns all Typingpool HITs that have been approved, as an array of Typingpool::Amazon::HIT instances.



# File 'lib/typingpool/amazon/hit.rb', line 113

def all_approved
  hits = all_reviewable do |hit|
    begin
      #optimization: we assume it is more common to have an
      #unapproved HIT than an approved HIT that does not
      #belong to this app
      hit.approved? && hit.ours? 
    rescue RestClient::ServiceUnavailable => e
      warn "Warning: Service unavailable error, skipped HIT #{hit.id}. (Error: #{e})"
      false
    end
  end
  hits
end

.all_for_project(id) ⇒ Object

Takes a Typingpool::Project::Local#id and returns all HITs associated with that project, as an array of Typingpool::Amazon::HIT instances.



# File 'lib/typingpool/amazon/hit.rb', line 144

def all_for_project(id)
  all{|hit| hit.ours? && hit.project_id == id}
end

.all_reviewable(&filter) ⇒ Object

Returns all HITs returned by Amazon’s GetReviewableHITs operation (those with HIT status == ‘Reviewable’) as an array of Typingpool::Amazon::HIT instances. Takes an optional filter block (which should return true for HITs to be included in the final results). If not supplied, will filter so the returned hits are all Typingpool HITs (hit.ours? == true).



# File 'lib/typingpool/amazon/hit.rb', line 134

def all_reviewable(&filter)
  hits = each_page do |page_number|
    RTurk.GetReviewableHITs(:page_number => page_number).hit_ids.map{|id| RTurk::Hit.new(id) }.map{|hit| cached_or_new(hit) }
  end
  filter_ours(hits, &filter)
end

.cache_key(hit_id, id_at = self.id_at, url_at = self.url_at) ⇒ Object



# File 'lib/typingpool/amazon/hit.rb', line 207

def cache_key(hit_id, id_at=self.id_at, url_at=self.url_at)
  "RESULT///#{hit_id}///#{url_at}///#{id_at}"
end

.cached_or_new(rturk_hit) ⇒ Object

Constructor. Takes an RTurk::Hit instance. Returns a Typingpool::Amazon::HIT instance, preferably from the cache.



# File 'lib/typingpool/amazon/hit.rb', line 171

def cached_or_new(rturk_hit)
  from_cache(rturk_hit.id) || new(rturk_hit)
end
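The "preferably from the cache" behaviour is a plain cache-or-build fallback. A minimal sketch, assuming a Hash in place of the PStore-backed cache and strings in place of HIT objects (both are hypothetical stand-ins):

```ruby
# Hypothetical in-memory cache standing in for the PStore-backed one.
cache = { 'A' => 'cached HIT A' }

# Records which ids had to be built from scratch.
built = []

cached_or_new = lambda do |id|
  # A cache hit short-circuits; only a miss (nil) falls through to the build step.
  cache[id] || begin
    built << id
    "new HIT #{id}"
  end
end

from_cache = cached_or_new.call('A')
fresh = cached_or_new.call('B')

puts from_cache
puts fresh
```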

.cached_or_new_from_searchhits(rturk_hit, annotation) ⇒ Object

Constructor. Same as cached_or_new, but handles peculiarities of objects returned by RTurk::SearchHITs. Such objects map two Amazon HIT fields to different names than those used by other RTurk HIT instances. They also do not bother to extract the annotation from the Amazon HIT, so we have to do that ourselves (elsewhere) and take it as a param here. Finally, on the bright side, RTurk::SearchHITs already contain a big chunk of hit.full attributes, potentially obviating the need for an additional network call to flesh out the HIT, so this method pre-fleshes-out the HIT.



# File 'lib/typingpool/amazon/hit.rb', line 185

def cached_or_new_from_searchhits(rturk_hit, annotation)
  if not (typingpool_hit = from_cache(rturk_hit.id))
    typingpool_hit = new(rturk_hit)
    typingpool_hit.full(Amazon::HIT::Full::FromSearchHITs.new(rturk_hit, annotation))
  end
  typingpool_hit
end

.create(question, config_assign) ⇒ Object

Constructor. Creates an Amazon Mechanical Turk HIT. ** Warning: This method can spend your money! **

Params

question

Typingpool::Amazon::Question instance, used not only to generate the (external) question but also parsed to provide one or more core HIT attributes. Must include a non-nil annotation attribute. Provides fallback values for HIT title and description.

config_assign

The ‘assign’ attribute of a Typingpool::Config instance (that is, a Typingpool::Config::Root::Assign instance). Must include values for reward, lifetime, deadline (used as the HIT duration), and approval. May include values for keywords and qualify (qualifications). Preferred source for HIT title and description. See Typingpool::Config documentation for further details.

Returns

Typingpool::Amazon::HIT instance corresponding to the new Mechanical Turk HIT.



# File 'lib/typingpool/amazon/hit.rb', line 74

def create(question, config_assign)
  new(RTurk::Hit.create(:title => config_assign.title || question.title) do |hit|
        hit.description = config_assign.description || question.description
        hit.question(question.url)
        hit.note = question.annotation or raise Error, "Missing annotation from question"
        hit.reward = config_assign.reward or raise Error, "Missing reward config"
        hit.assignments = 1
        hit.lifetime = config_assign.lifetime or raise Error, "Missing lifetime config"
        hit.duration = config_assign.deadline or raise Error, "Missing deadline config"
        hit.auto_approval = config_assign.approval or raise Error, "Missing approval config"
        hit.keywords = config_assign.keywords if config_assign.keywords
        config_assign.qualify.each{|q| hit.qualifications.add(*q.to_arg)} if config_assign.qualify
      end)
end

.delete_cache(hit_id, id_at = self.id_at, url_at = self.url_at) ⇒ Object



# File 'lib/typingpool/amazon/hit.rb', line 199

def delete_cache(hit_id, id_at=self.id_at, url_at=self.url_at)
  Amazon.cache.transaction do
    key = cache_key(hit_id, id_at, url_at)
    cached = Amazon.cache[key]
    Amazon.cache.delete(key) unless cached.nil?
  end
end

.each_page ⇒ Object



# File 'lib/typingpool/amazon/hit.rb', line 211

def each_page
  results = []
  page = 0
  begin
    page += 1
    new_results = yield(page)
    results.push(*new_results)
  end while new_results.count > 0
  results
end
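The paging loop can be tried without Amazon by feeding it a fake page source. This sketch restates the loop with loop/break instead of begin/end while, which is a stylistic substitution rather than the library's own code. The pages hash is a hypothetical stand-in for paged API responses.

```ruby
# Collects results page by page until a page comes back empty.
def each_page
  results = []
  page = 0
  loop do
    page += 1
    new_results = yield(page)
    break if new_results.empty?
    results.concat(new_results)
  end
  results
end

# Hypothetical paged source: two pages of ids, then nothing.
pages = { 1 => %w(A B), 2 => %w(C) }
requested = []

all_ids = each_page do |page_number|
  requested << page_number
  pages.fetch(page_number, [])
end

puts all_ids.inspect
puts requested.inspect
```

Note that the loop always requests one page past the last non-empty one, since an empty page is the only stop signal.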

.filter_ours(hits, &filter) ⇒ Object



# File 'lib/typingpool/amazon/hit.rb', line 222

def filter_ours(hits, &filter)
  filter ||= lambda{|hit| hit.ours? }
  hits.select do |hit| 
    selected = filter.call(hit)
    hit.to_cache
    selected
  end
end

.from_cache(hit_id, id_at = self.id_at, url_at = self.url_at) ⇒ Object



# File 'lib/typingpool/amazon/hit.rb', line 193

def from_cache(hit_id, id_at=self.id_at, url_at=self.url_at)
  Amazon.cache.transaction do
    Amazon.cache[cache_key(hit_id, id_at, url_at)] 
  end
end

.id_at ⇒ Object

Name of the hidden HTML form field used to provide the project_id in an external question or (form-encoded) annotation. Hard coded to typingpool_project_id but overridable in a subclass.



# File 'lib/typingpool/amazon/hit.rb', line 93

def id_at
  @@id_at ||= 'typingpool_project_id'
end

.url_at ⇒ Object

Name of the hidden HTML form field used to provide the (audio) url in an external question or (form-encoded) annotation. Hard coded to typingpool_url but overridable in a subclass.



# File 'lib/typingpool/amazon/hit.rb', line 101

def url_at
  @@url_at ||= 'typingpool_url'
end

.with_ids(ids) ⇒ Object

Takes an array of HIT ids, returns Typingpool::Amazon::HIT instances corresponding to those ids.



# File 'lib/typingpool/amazon/hit.rb', line 107

def with_ids(ids)
  ids.map{|id| cached_or_new(RTurk::Hit.new(id)) }
end

Instance Method Details

#approved? ⇒ Boolean

Returns true if this HIT has an approved assignment associated with it. (Attached to Typingpool::Amazon::HIT rather than Typingpool::Amazon::HIT::Assignment because sometimes we can tell simply from looking at hit.full that there are no approved assignments – hit.full.assignments_completed == 0. This check is only performed when hit.full has already been loaded.)

Returns:

  • (Boolean)


# File 'lib/typingpool/amazon/hit.rb', line 268

def approved?
  assignment_status_match?('Approved')
end

#assignment ⇒ Object

Returns the assignment associated with this HIT - a Typingpool::Amazon::HIT::Assignment instance. The first time this is called, an Amazon HTTP request is typically (but not always) sent.



# File 'lib/typingpool/amazon/hit.rb', line 380

def assignment
  if @assignment.nil?
    if @full && full.assignments_completed == 0
      #It would be dangerous to do this if the HIT were to be
      #cached, since we would then never check for the
      #assignment again. But we know this HIT won't be cached
      #while it is active, since we only cache approved and
      #rejected HITs.
      @assignment = Assignment::Empty.new
    else
      @assignment = Assignment.new(at_amazon) #expensive
    end
  end
  @assignment
end

#assignment_status_match?(status) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/typingpool/amazon/hit.rb', line 434

def assignment_status_match?(status)
  if @full
    return false if full.assignments_completed == 0
    return false if full.status != 'Reviewable'
  end
  assignment.status == status
end

#at_amazon ⇒ Object

Returns an RTurk::Hit instance corresponding to this HIT.



# File 'lib/typingpool/amazon/hit.rb', line 336

def at_amazon
  Amazon.rturk_hit_full(@id)
end

#cacheable? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/typingpool/amazon/hit.rb', line 444

def cacheable?
  if @ours == false
    return true
  end
  if @full
    return true if full.expired_and_overdue?
  end
  if @assignment && assignment.status
    return true if @@cacheable_assignment_status.include?(assignment.status)
  end
  return false
end
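The three checks in cacheable? can be exercised in isolation. The sketch below restates them as a standalone method that takes plain values; the keyword names are hypothetical. A nil argument stands for "not loaded yet", mirroring how the method above only consults @full and @assignment when they are already set.

```ruby
require 'set'

CACHEABLE_STATUSES = Set.new(%w(Approved Rejected))

# ours:                true, false, or nil (not yet determined)
# expired_and_overdue: true/false, or nil when the full HIT is not loaded
# assignment_status:   a status String, or nil when no assignment is loaded
def cacheable?(ours:, expired_and_overdue: nil, assignment_status: nil)
  # A HIT known not to belong to the app never needs refetching.
  return true if ours == false
  return true if expired_and_overdue
  return true if assignment_status && CACHEABLE_STATUSES.include?(assignment_status)
  false
end

puts cacheable?(ours: false)
puts cacheable?(ours: true, assignment_status: 'Submitted')
puts cacheable?(ours: true, assignment_status: 'Approved')
```

The comparison against false matters: a HIT whose ownership has not been determined yet (nil) is not treated as foreign.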

#full(full_hit = nil) ⇒ Object

Returns “the full hit” - a Typingpool::Amazon::HIT::Full instance associated with this HIT. If the instance is being created for the first time, this will trigger an HTTP request to Amazon’s servers. “Full” hit fields are segregated because accessing any one of them is expensive if we only have a hit id (but after fetching one, all are cheap). Accepts an optional Typingpool::Amazon::HIT::Full (or subclass) instance to set for this attribute, which avoids the need to create one. This is useful when extensive HIT data was already returned by an Amazon operation (for example, SearchHITs returns lots of HIT data).



# File 'lib/typingpool/amazon/hit.rb', line 369

def full(full_hit=nil)
  if @full.nil?
    @full = full_hit || Full.new(at_amazon)
  end
  @full
end
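The memoize-or-inject pattern used by full can be sketched with a stand-in class. LazyHit, fetch_full, and fetch_count are hypothetical; fetch_count simply counts simulated network calls so the saving is visible.

```ruby
# Hypothetical stand-in; fetch_count tracks simulated network calls.
class LazyHit
  attr_reader :fetch_count

  def initialize(id)
    @id = id
    @fetch_count = 0
  end

  # Returns the memoized details, building them on first access unless
  # the caller supplies a prebuilt value.
  def full(prebuilt = nil)
    @full ||= prebuilt || fetch_full
  end

  private

  def fetch_full
    @fetch_count += 1 # stands in for an HTTP request
    { status: 'Reviewable' }
  end
end

lazy = LazyHit.new('A')
lazy.full
lazy.full # second access reuses the memoized value

seeded = LazyHit.new('B')
seeded.full({ status: 'Assignable' }) # prebuilt data, no fetch needed
seeded.full

puts lazy.fetch_count
puts seeded.fetch_count
```

As in the method above, a value passed after the first access is ignored, because the memoized one wins.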

#ours? ⇒ Boolean

Returns true if this HIT is associated with Typingpool. One Amazon account can be used for many tasks, so it’s important to check whether the HIT belongs to this software. (Presently, this is determined by looking for a stashed param like url or project_id).

Returns:

  • (Boolean)


# File 'lib/typingpool/amazon/hit.rb', line 294

def ours?
  @ours ||= not(url.to_s.empty?)
end

#project_id ⇒ Object

The Typingpool::Project::Local#id associated with this HIT. Extracted as described for the url method.



# File 'lib/typingpool/amazon/hit.rb', line 251

def project_id
  @project_id ||= stashed_param(self.class.id_at)
end

#project_title_from_url(url = self.url) ⇒ Object

Returns the Typingpool::Project#name associated with this HIT by parsing the #url. May be dropped in a future release.



# File 'lib/typingpool/amazon/hit.rb', line 257

def project_title_from_url(url=self.url)
  matches = Project.url_regex.match(url) or raise Error::Argument::Format, "Unexpected format to url '#{url}'"
  URI.unescape(matches[2])
end

#rejected? ⇒ Boolean

Returns true if this HIT has a rejected assignment associated with it. (For an explanation of why this is not attached to Typingpool::Amazon::HIT::Assignment, see the documentation for approved?.)

Returns:

  • (Boolean)


# File 'lib/typingpool/amazon/hit.rb', line 276

def rejected?
  assignment_status_match?('Rejected')
end

#remove_from_amazon ⇒ Object

Deletes the HIT from Amazon’s servers. Examines the HIT and assignment status to determine whether calling the DisposeHIT or DisableHIT operation is most appropriate. If the HIT has been submitted but not approved or rejected, will raise an exception of type Typingpool::Error::Amazon::UnreviewedContent. Catch this exception in your own code if you’d like to automatically approve such HITs before removing them.



# File 'lib/typingpool/amazon/hit.rb', line 348

def remove_from_amazon
  if full.status == 'Reviewable'
    if assignment.status == 'Submitted'
      raise Error::Amazon::UnreviewedContent, "There is an unreviewed submission for #{url}"
    end
    at_amazon.dispose!
  else
    at_amazon.disable!
  end
end
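The advice to catch the exception and approve before removing can be sketched as a rescue-and-retry wrapper. Everything here is a stand-in so the sketch runs without Amazon credentials: UnreviewedContent mimics Typingpool::Error::Amazon::UnreviewedContent, RemovableHit mimics the HIT, and approve! is a hypothetical approval step, not a documented Typingpool method.

```ruby
# Stand-in for Typingpool::Error::Amazon::UnreviewedContent.
class UnreviewedContent < StandardError; end

# Hypothetical HIT that records what was done to it.
class RemovableHit
  attr_reader :log

  def initialize(assignment_status)
    @assignment_status = assignment_status
    @log = []
  end

  # Hypothetical approval step; the real approval call is not shown
  # in this document.
  def approve!
    @assignment_status = 'Approved'
    @log << :approved
  end

  def remove_from_amazon
    raise UnreviewedContent, 'unreviewed submission' if @assignment_status == 'Submitted'
    @log << :removed
  end
end

# Tries to remove; on unreviewed content, approves and tries once more.
def remove_with_auto_approve(hit)
  hit.remove_from_amazon
rescue UnreviewedContent
  hit.approve!
  hit.remove_from_amazon
end

pending = RemovableHit.new('Submitted')
remove_with_auto_approve(pending)
puts pending.log.inspect
```

Whether automatic approval is appropriate depends on the project, since approving pays the worker without a review.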

#stashed_param(param) ⇒ Object

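Private. Returns the value of a param stashed in the assignment answers, the annotation, or the external question, checking these sources in the order shown below to avoid unnecessary network calls.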



# File 'lib/typingpool/amazon/hit.rb', line 399

def stashed_param(param)
  if @assignment && assignment.answers[param]
    return assignment.answers[param]
  elsif full.annotation[param]
    #A question assigned through this software. May be
    #expensive: May result in HTTP request to fetch HIT
    #fields. We choose to fetch (sometimes) the HIT rather than
    #the assignment on the assumption it will be MORE common to
    #encounter HITs with no answers and LESS common to encounter
    #HITs assigned through the RUI (and thus lacking in an
    #annotation from this software and thus rendering the HTTP
    #request to fetch the HIT fields pointless).
    return full.annotation[param]
  elsif full.assignments_completed.to_i >= 1
    #A question assigned through Amazon's RUI, with an answer
    #submitted. If the HIT belongs to this software, this
    #assignment's answers will include our param.  We prefer
    #fetching the assignment to fetching the external question
    #(as below) because fetching the assignment will potentially
    #save us an HTTP request down the line -- for example, if we
    #need other assignment data (e.g. assignment status).
    #Fetching the external question only serves to give us
    #access to params. If the answers do not include our param,
    #we know the HIT does not belong to this software, since we
    #know the param was also not in the annotation. So we are
    #safe returning nil in that case.
    return assignment.answers[param]
  else
    #A question assigned via Amazon's RUI, with no answer
    #submitted.  Expensive: Results in HTTP request to fetch
    #external question.
    return full.external_question_param(param)
  end
end
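The lookup order in stashed_param can be traced with plain hashes and lambdas. lookup_param and its keyword names are hypothetical. The two lambdas stand in for the expensive fetches (assignment answers and external question), so the calls array shows which fetch was actually triggered.

```ruby
# Walks the same four steps as the method above, using stand-in sources.
def lookup_param(param, loaded_answers:, annotation:, completed:, fetch_answers:, fetch_question:)
  # 1. Answers already in memory cost nothing.
  return loaded_answers[param] if loaded_answers && loaded_answers[param]
  # 2. The annotation carries the param for HITs assigned through the app.
  return annotation[param] if annotation[param]
  # 3. With a completed assignment, fetch its answers; nil here means
  #    the HIT does not belong to the app.
  return fetch_answers.call[param] if completed >= 1
  # 4. Last resort: fetch and parse the external question.
  fetch_question.call[param]
end

calls = []
value = lookup_param(
  'typingpool_url',
  loaded_answers: nil,
  annotation: {},
  completed: 0,
  fetch_answers: -> { calls << :answers; {} },
  fetch_question: -> { calls << :question; { 'typingpool_url' => 'http://example.com/a.mp3' } }
)

puts value
puts calls.inspect
```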

#submitted? ⇒ Boolean

Returns true if this HIT has a submitted assignment associated with it. (For an explanation of why this is not attached to Typingpool::Amazon::HIT::Assignment, see the documentation for approved?.)

Returns:

  • (Boolean)


# File 'lib/typingpool/amazon/hit.rb', line 284

def submitted?
  assignment_status_match?('Submitted')
end

#to_cache ⇒ Object

If this HIT is cacheable, serializes it to the cache file specified in the config passed to Amazon.setup, or specified in the default config file. In short, a HIT is cacheable if it does not belong to Typingpool (ours? == false), if it is approved or rejected (approved? || rejected?), or if it is expired (full.expired_and_overdue?). See also cacheable? code.

When available, cached HITs are used by Typingpool::Amazon::HIT.all, Typingpool::Amazon::HIT.all_approved, and all the other class methods that retrieve HITs. These methods call to_cache for you at logical times (after downloading and filtering, when the HIT is most fleshed out), so you should not need to call this yourself. But if you have an operation that makes network calls to further flesh out the HIT, calling to_cache may be worthwhile.



# File 'lib/typingpool/amazon/hit.rb', line 325

def to_cache
  #any obj containing a Nokogiri object cannot be stored in pstore - do
  #not forget this (again)
  if cacheable?
    Amazon.cache.transaction do
      Amazon.cache[self.class.cache_key(@id)] = self 
    end
  end
end

#transcript ⇒ Object

Returns a Typingpool::Transcript::Chunk instance built using this HIT and its associated assignment.



# File 'lib/typingpool/amazon/hit.rb', line 300

def transcript
  transcript = Transcript::Chunk.new(assignment.body)
  transcript.url = url
  transcript.project = project_id
  transcript.worker = assignment.worker_id
  transcript.hit = @id
  transcript
end

#url ⇒ Object

URL of the audio file associated with this HIT (the audio file to be transcribed). Extracted from the annotation (when the HIT was assigned via Typingpool) or from a hidden field in the HTML form on the external question (when the HIT was assigned via the Amazon Mechanical Turk RUI).



# File 'lib/typingpool/amazon/hit.rb', line 245

def url
  @url ||= stashed_param(self.class.url_at)
end