Class: Typingpool::Transcript::Chunk
- Inherits: Object
- Includes: Gem::Text
- Defined in: lib/typingpool/transcript/chunk.rb
Overview
Transcript::Chunk is the model class for one transcription, by one Mechanical Turk worker, of one “chunk” (a file) of audio, which is in turn a portion of a larger recording (for example, one minute of a 60-minute interview). It roughly parallels an Amazon::HIT instance. Transcript is a container for these chunks, which know how to render themselves as text and HTML.
Instance Attribute Summary
- #body ⇒ Object
  Get/set the raw text of the transcript.
- #filename ⇒ Object (readonly)
  Returns the name of the remote audio file corresponding to this chunk.
- #filename_local ⇒ Object (readonly)
  Returns the name of the local audio file corresponding to this chunk.
- #hit ⇒ Object
  Get/set the id of the Amazon::HIT associated with this chunk.
- #offset ⇒ Object (readonly)
  Return the offset associated with the chunk, in MM:SS format.
- #offset_seconds ⇒ Object (readonly)
  Returns the offset in seconds.
- #project ⇒ Object
  Get/set the id of the Project#local associated with this chunk.
- #url ⇒ Object
  Returns the URL of the remote audio transcribed in the body of this chunk.
- #worker ⇒ Object
  Get/set the Amazon ID of the Mechanical Turk worker who transcribed the audio into text.
Instance Method Summary
- #<=>(other) ⇒ Object
  Sorts by offset seconds.
- #body_as_html(wrap = 72) ⇒ Object
  Takes an optional count of how many characters to wrap at (default 72).
- #body_as_text(indent = nil, wrap = nil) ⇒ Object (also: #to_s, #to_str)
  Takes an optional specification of how many spaces to indent the text by (default 0) and an optional specification of how many characters to wrap at (default no wrapping).
- #initialize(body) ⇒ Chunk (constructor)
  Constructor.
Constructor Details
#initialize(body) ⇒ Chunk
Constructor. Takes the raw text of the transcription.
# File 'lib/typingpool/transcript/chunk.rb', line 57
def initialize(body)
  @body = body
end
Instance Attribute Details
#body ⇒ Object
Get/set the raw text of the transcript.
# File 'lib/typingpool/transcript/chunk.rb', line 16
def body
  @body
end
#filename ⇒ Object (readonly)
Returns the name of the remote audio file corresponding to this chunk. The remote file has the project ID and pseudo random characters added to it.
# File 'lib/typingpool/transcript/chunk.rb', line 46
def filename
  @filename
end
#filename_local ⇒ Object (readonly)
Returns the name of the local audio file corresponding to this chunk.
# File 'lib/typingpool/transcript/chunk.rb', line 50
def filename_local
  @filename_local
end
#hit ⇒ Object
Get/set the id of the Amazon::HIT associated with this chunk.
# File 'lib/typingpool/transcript/chunk.rb', line 23
def hit
  @hit
end
#offset ⇒ Object (readonly)
Return the offset associated with the chunk, in MM:SS format. This corresponds to the associated audio file, which is a chunk of a larger recording and which starts at a particular time offset, for example from 1:00 (the offset) to 2:00 (the next offset).
This should be updated to return HH:MM:SS and MM:SS.sss when appropriate, since in Project#interval we use that format and allow audio to be divided into such units. (TODO)
# File 'lib/typingpool/transcript/chunk.rb', line 38
def offset
  @offset
end
#offset_seconds ⇒ Object (readonly)
Returns the offset in seconds. For example, an offset of 1:00 returns 60.
# File 'lib/typingpool/transcript/chunk.rb', line 41
def offset_seconds
  @offset_seconds
end
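The relationship between #offset and #offset_seconds can be illustrated with a small conversion. This is a minimal sketch, not the library's own code: `sketch_offset_to_seconds` is a hypothetical helper name, and it assumes the MM:SS format described above (the HH:MM:SS and MM:SS.sss forms mentioned in the TODO are not handled).

```ruby
# Hypothetical helper (not part of Typingpool): converts an "MM:SS"
# offset string into a whole number of seconds.
def sketch_offset_to_seconds(offset)
  # Base 10 is given explicitly so that parts like "08" are not
  # rejected as invalid octal literals.
  minutes, seconds = offset.split(":").map { |part| Integer(part, 10) }
  minutes * 60 + seconds
end

puts sketch_offset_to_seconds("1:00")
puts sketch_offset_to_seconds("02:30")
```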
#project ⇒ Object
Get/set the id of the Project#local associated with this chunk.
# File 'lib/typingpool/transcript/chunk.rb', line 26
def project
  @project
end
#url ⇒ Object
Returns the URL of the remote audio transcribed in the body of this chunk.
# File 'lib/typingpool/transcript/chunk.rb', line 54
def url
  @url
end
#worker ⇒ Object
Get/set the Amazon ID of the Mechanical Turk worker who transcribed the audio into text.
# File 'lib/typingpool/transcript/chunk.rb', line 20
def worker
  @worker
end
Instance Method Details
#<=>(other) ⇒ Object
Sorts by offset seconds.
# File 'lib/typingpool/transcript/chunk.rb', line 62
def <=>(other)
  self.offset_seconds <=> other.offset_seconds
end
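Because #<=> compares offset_seconds, a plain Array#sort puts chunks back into recording order. The sketch below uses a hypothetical stand-in, `ChunkSketch`, built with Struct so it runs without Typingpool installed. Only the comparison itself mirrors the listing above; the sample offsets and bodies are invented for illustration.

```ruby
# Hypothetical stand-in for Transcript::Chunk: only the fields needed
# to show the ordering are modelled.
ChunkSketch = Struct.new(:offset, :offset_seconds, :body) do
  # Same comparison as the documented method: order by offset in seconds.
  def <=>(other)
    offset_seconds <=> other.offset_seconds
  end
end

chunks = [
  ChunkSketch.new("2:00", 120, "third minute"),
  ChunkSketch.new("0:00", 0, "first minute"),
  ChunkSketch.new("1:00", 60, "second minute")
]

# Array#sort calls <=> on the elements, so no block is needed.
ordered = chunks.sort
ordered.each { |chunk| puts "#{chunk.offset} #{chunk.body}" }
```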
#body_as_html(wrap = 72) ⇒ Object
Takes an optional count of how many characters to wrap at (default 72). Returns the body, presumed to be raw text, as HTML. Any HTML tags in the body are escaped. Text blocks separated by double newlines are converted to HTML paragraphs, while single newlines are converted to HTML BR tags. Newlines are normalized as in body_as_text, and lines in the HTML source are automatically wrapped. Note that the listing below passes the literal 72 to wrap_text rather than the wrap argument, so as shown the argument has no effect on the wrap width.
# File 'lib/typingpool/transcript/chunk.rb', line 107
def body_as_html(wrap=72)
  text = body_as_text
  text = CGI::escapeHTML(text)
  text = Utility.newlines_to_html(text)
  text = text.split("\n").map do |line|
    wrap_text(line, 72).chomp
  end.join("\n")
  text
end
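The escape-then-convert order described above can be sketched as follows. CGI.escapeHTML is the standard library call used in the listing. `sketch_newlines_to_html` is a hypothetical stand-in for Utility.newlines_to_html, whose real implementation is not shown on this page, so the exact tags and spacing it emits are an assumption. Line wrapping is left out of this sketch.

```ruby
require 'cgi'

# Hypothetical stand-in for Utility.newlines_to_html (assumed behavior):
# blocks separated by blank lines become <p> elements, and single
# newlines inside a block become <br> tags.
def sketch_newlines_to_html(text)
  text.split(/\n\n+/).map do |paragraph|
    "<p>" + paragraph.gsub("\n", "<br>\n") + "</p>"
  end.join("\n\n")
end

raw = "A <b>bold</b> claim\nsecond line\n\nNext & last"

# Escape first, so markup typed by the transcriber is shown literally
# and only the tags added afterwards are live HTML.
escaped = CGI.escapeHTML(raw)
html = sketch_newlines_to_html(escaped)
puts html
```

Escaping before adding the paragraph and BR tags matters: doing it in the other order would escape the generated tags as well.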
#body_as_text(indent = nil, wrap = nil) ⇒ Object
Also known as: to_s, to_str
Takes an optional specification of how many spaces to indent the text by (default 0) and an optional specification of how many characters to wrap at (default no wrapping).
Returns the text with newlines normalized to Unix format, runs of newlines shortened to a maximum of two newlines, leading and trailing whitespace removed from each line, and the text wrapped/indented as specified.
# File 'lib/typingpool/transcript/chunk.rb', line 88
def body_as_text(indent=nil, wrap=nil)
  text = self.body
  text = Utility.normalize_newlines(text)
  text.gsub!(/\n\n+/, "\n\n")
  text = text.split("\n").map{|line| line.strip }.join("\n")
  text = wrap_text(text, wrap) if wrap
  text = indent_text(text, indent) if indent
  text
end
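The normalization steps above can be tried in isolation with the sketch below. It is an approximation under stated assumptions, not the library's code: `sketch_body_as_text` and `sketch_wrap_line` are hypothetical names, the newline normalization assumes Utility.normalize_newlines maps CRLF and bare CR to LF, and the greedy word wrap and per-line indent stand in for wrap_text and indent_text, whose implementations are not shown on this page.

```ruby
# Hypothetical greedy word wrap for a single line (stand-in for wrap_text).
# Runs of spaces inside the line are collapsed by split(" ").
def sketch_wrap_line(line, width)
  lines = []
  current = ""
  line.split(" ").each do |word|
    if current.empty?
      current = word.dup
    elsif current.length + 1 + word.length <= width
      current << " " << word
    else
      lines << current
      current = word.dup
    end
  end
  lines << current
  lines.join("\n")
end

# Hypothetical free-standing version of the documented method.
def sketch_body_as_text(body, indent = nil, wrap = nil)
  text = body.gsub(/\r\n?/, "\n")      # assumed normalization: CRLF/CR to LF
  text = text.gsub(/\n\n+/, "\n\n")    # at most one blank line in a row
  text = text.split("\n").map { |line| line.strip }.join("\n")
  if wrap
    text = text.split("\n", -1).map { |line| sketch_wrap_line(line, wrap) }.join("\n")
  end
  if indent
    # Blank lines are left empty rather than padded with spaces.
    text = text.split("\n", -1).map { |line| line.empty? ? line : (" " * indent) + line }.join("\n")
  end
  text
end

raw = "  Hello there. \r\n\r\n\r\n\r\nSecond  paragraph.\r"
puts sketch_body_as_text(raw, 2)
```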