Class: Typingpool::Transcript::Chunk

Inherits:
Object
Includes:
Gem::Text
Defined in:
lib/typingpool/transcript/chunk.rb

Overview

Transcript::Chunk is the model class for a single transcription, by a single Mechanical Turk worker, of one “chunk” (a file) of audio. Each chunk is a portion of a larger recording (for example, one minute of a 60-minute interview). A Chunk roughly parallels an Amazon::HIT instance. Transcript is a container for these chunks, which know how to render themselves as text and HTML.

Constructor Details

#initialize(body) ⇒ Chunk

Constructor. Takes the raw text of the transcription.



# File 'lib/typingpool/transcript/chunk.rb', line 57

def initialize(body)
  @body = body
end

Instance Attribute Details

#body ⇒ Object

Get/set the raw text of the transcript



# File 'lib/typingpool/transcript/chunk.rb', line 16

def body
  @body
end

#filename ⇒ Object (readonly)

Returns the name of the remote audio file corresponding to this chunk. The remote file has the project ID and pseudo random characters added to it.



# File 'lib/typingpool/transcript/chunk.rb', line 46

def filename
  @filename
end

#filename_local ⇒ Object (readonly)

Returns the name of the local audio file corresponding to this chunk.



# File 'lib/typingpool/transcript/chunk.rb', line 50

def filename_local
  @filename_local
end

#hit ⇒ Object

Get/set the id of the Amazon::HIT associated with this chunk



# File 'lib/typingpool/transcript/chunk.rb', line 23

def hit
  @hit
end

#offset ⇒ Object (readonly)

Returns the offset associated with the chunk, in MM:SS format. The offset is the time at which the associated audio file starts within the larger recording. For example, a chunk running from 1:00 to 2:00 has an offset of 1:00, and the following chunk has an offset of 2:00.

This should be updated to return HH:MM:SS and MM:SS.sss when appropriate, since in Project#interval we use that format and allow audio to be divided into such units. (TODO)



# File 'lib/typingpool/transcript/chunk.rb', line 38

def offset
  @offset
end

#offset_seconds ⇒ Object (readonly)

Returns the offset in seconds. For example, an offset of 1:00 returns 60.



# File 'lib/typingpool/transcript/chunk.rb', line 41

def offset_seconds
  @offset_seconds
end
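
The relationship between the MM:SS offset and its value in seconds can be sketched as follows. This is a minimal illustration only: offset_to_seconds is a hypothetical helper, not a Typingpool method, and the real class may compute the value differently. It also accepts HH:MM:SS, the format mentioned in the TODO above.

```ruby
# Hypothetical helper, not part of Typingpool: converts an offset string
# such as "1:00" (MM:SS) or "1:02:03" (HH:MM:SS) into whole seconds.
def offset_to_seconds(offset)
  # Fold left over the colon-separated fields: each step multiplies the
  # running total by 60 before adding the next field.
  offset.split(":").reduce(0) do |total, field|
    total * 60 + Integer(field, 10)
  end
end

offset_to_seconds("1:00")    # => 60
offset_to_seconds("1:02:03") # => 3723
```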

#project ⇒ Object

Get/set the id of the Project#local associated with this chunk



# File 'lib/typingpool/transcript/chunk.rb', line 26

def project
  @project
end

#url ⇒ Object

Returns the URL of the remote audio transcribed in the body of this chunk.



# File 'lib/typingpool/transcript/chunk.rb', line 54

def url
  @url
end

#worker ⇒ Object

Get/set the Amazon ID of the Mechanical Turk worker who transcribed the audio into text



# File 'lib/typingpool/transcript/chunk.rb', line 20

def worker
  @worker
end

Instance Method Details

#<=>(other) ⇒ Object

Sorts by offset seconds.



# File 'lib/typingpool/transcript/chunk.rb', line 62

def <=>(other)
  self.offset_seconds <=> other.offset_seconds
end
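
Because Array#sort relies only on #<=>, a list of chunks can be put into recording order with a plain sort. The sketch below uses DemoChunk, a hypothetical stand-in defined here so the example runs without the gem; only its #<=> mirrors the method above.

```ruby
# Hypothetical stand-in for Typingpool::Transcript::Chunk. Only the
# #<=> method mirrors the documented class; the constructor is invented
# for this example.
class DemoChunk
  attr_reader :offset_seconds, :body

  def initialize(offset_seconds, body)
    @offset_seconds = offset_seconds
    @body = body
  end

  # Compare chunks by their offset in seconds.
  def <=>(other)
    offset_seconds <=> other.offset_seconds
  end
end

chunks = [
  DemoChunk.new(120, "third"),
  DemoChunk.new(0, "first"),
  DemoChunk.new(60, "second")
]

# Array#sort calls #<=> on the elements, so chunks come back in offset order.
sorted_bodies = chunks.sort.map(&:body)
# => ["first", "second", "third"]
```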

#body_as_html(wrap = 72) ⇒ Object

Takes an optional count of how many characters to wrap at (default 72). Returns the body, presumed to be raw text, as HTML. Any HTML tags in the body are escaped. Text blocks separated by double newlines are converted to HTML paragraphs, while single newlines are converted to HTML BR tags. Newlines are normalized as in body_as_text, and lines in the HTML source are automatically wrapped as specified.



# File 'lib/typingpool/transcript/chunk.rb', line 107

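def body_as_html(wrap=72)
  text = body_as_text
  text = CGI::escapeHTML(text)
  text = Utility.newlines_to_html(text)
  # Wrap each line of the HTML source at the requested width
  # (the wrap argument, rather than a hard-coded 72).
  text = text.split("\n").map do |line|
    wrap_text(line, wrap).chomp
  end.join("\n")
  text
end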
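
The escape-then-convert order described above can be illustrated with the standard library alone. In this sketch, demo_newlines_to_html is a hypothetical stand-in for Utility.newlines_to_html, whose actual output (tag style, spacing) is not shown on this page and may differ. Line wrapping is left out.

```ruby
require "cgi"

# Hypothetical stand-in for Utility.newlines_to_html: blocks separated
# by a blank line become paragraphs, single newlines become BR tags.
# The real helper's exact markup is an assumption here.
def demo_newlines_to_html(text)
  text.split("\n\n").map do |paragraph|
    "<p>#{paragraph.gsub("\n", "<br/>\n")}</p>"
  end.join("\n\n")
end

raw = "Q: Is 1 < 2?\nA: Yes.\n\nNext paragraph."

# Escape first, so that markup typed by the transcriber is neutralized
# before any tags are added.
html = demo_newlines_to_html(CGI.escapeHTML(raw))
# => "<p>Q: Is 1 &lt; 2?<br/>\nA: Yes.</p>\n\n<p>Next paragraph.</p>"
```

Escaping before conversion matters: reversing the two steps would also escape the paragraph and BR tags that the conversion adds.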

#body_as_text(indent = nil, wrap = nil) ⇒ Object
Also known as: to_s, to_str

Takes an optional specification of how many spaces to indent the text by (default 0) and an optional specification of how many characters to wrap at (default no wrapping).

Returns the text with newlines normalized to Unix format, runs of newlines shortened to a maximum of two newlines, leading and trailing whitespace removed from each line, and the text wrapped/indented as specified.



# File 'lib/typingpool/transcript/chunk.rb', line 88

def body_as_text(indent=nil, wrap=nil)
  text = self.body
  text = Utility.normalize_newlines(text)
  text.gsub!(/\n\n+/, "\n\n")
  text = text.split("\n").map{|line| line.strip }.join("\n")
  text = wrap_text(text, wrap) if wrap
  text = indent_text(text, indent) if indent
  text
end
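
The normalization steps can be tried in isolation with plain Ruby. In this sketch, demo_normalize is a hypothetical function; Utility.normalize_newlines is approximated with a single gsub, which is an assumption about its behavior. Wrapping and indenting are omitted.

```ruby
# Hypothetical stand-in for the normalization part of body_as_text.
def demo_normalize(text)
  text = text.gsub(/\r\n?/, "\n")           # CRLF and bare CR become LF (assumed behavior)
  text = text.gsub(/\n\n+/, "\n\n")         # cap runs of newlines at two
  text.split("\n").map(&:strip).join("\n")  # trim whitespace on each line
end

demo_normalize("  Hello  \r\n\r\n\r\n  world \r\n")
# => "Hello\n\nworld"
```

Note that String#split drops trailing empty fields, so a trailing newline in the input does not survive the round trip.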