Class: Bio::Velvet::Graph

Inherits:
Object
  • Object
show all
Includes:
Logging
Defined in:
lib/bio-velvet/graph.rb

Overview

Parser for a velvet assembler’s graph file (Graph or LastGraph) output from velvetg

The definition of this file is given in the velvet manual, at www.ebi.ac.uk/~zerbino/velvet/Manual.pdf

Defined Under Namespace

Classes: Arc, ArcArray, Node, NodeArray, NodedRead, NodedReadArray

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Methods included from Logging

#log

Instance Attribute Details

#arcsObject

ArcArray of Arc objects



23
24
25
# File 'lib/bio-velvet/graph.rb', line 23

def arcs
  @arcs
end

#hash_lengthObject

Taken directly from the graph, statistics and information about the Graph i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”



17
18
19
# File 'lib/bio-velvet/graph.rb', line 17

def hash_length
  @hash_length
end

#nodesObject

NodeArray object of all the graph’s node objects



20
21
22
# File 'lib/bio-velvet/graph.rb', line 20

def nodes
  @nodes
end

#number_of_nodesObject

Taken directly from the graph, statistics and information about the Graph i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”



17
18
19
# File 'lib/bio-velvet/graph.rb', line 17

def number_of_nodes
  @number_of_nodes
end

#number_of_sequencesObject

Taken directly from the graph, statistics and information about the Graph i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”



17
18
19
# File 'lib/bio-velvet/graph.rb', line 17

def number_of_sequences
  @number_of_sequences
end

Class Method Details

.logObject



25
26
27
# File 'lib/bio-velvet/graph.rb', line 25

def self.log
  self.new.log
end

.parse_from_file(path_to_graph_file, options = {}) ⇒ Object

Parse a graph file from a Graph, Graph2 or LastGraph output file from velvet into a Bio::Velvet::Graph object

Options:

  • :dont_parse_noded_reads: if true, then parsing of the NR section is skipped

  • :interesting_read_ids: If not nil, is a Set of nodes that we are interested in. Reads

not of interest will not be parsed in (the NR part of the velvet LastGraph file). Regardless all nodes and edges are parsed in. Using this options saves both memory and CPU.

  • :interesting_node_ids: like :interesting_read_ids except it allows targeting of particular nodes

rather than particular reads.

  • :grep_hack: to make the parsing of read associations go even faster, a grep-based, rather

hacky method is applied to the graph file, so only NR data of interesting_read_ids is presented to the parser. This can save days of parsing time, but is a bit of a hack and its usage may not be particularly future-proof. The value of this option is the amount of context coming out of grep (the -A and -B flags). Using 500 should probably work for most circumstances, if not an Exception will be raised.



44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
# File 'lib/bio-velvet/graph.rb', line 44

def self.parse_from_file(path_to_graph_file, options={})
  graph = self.new
  state = :header

  current_node = nil
  graph.nodes = NodeArray.new
  graph.arcs = ArcArray.new
  current_node_direction = nil

  line_number = 0
  Hopcsv.foreach(path_to_graph_file,"\t") do |row|
    line_number += 1

    if state == :header
      raise "parse exception on header line, this line #{line_number}: #{row.inspect}" unless row.length >= 3
      graph.number_of_nodes = row[0].to_i
      graph.number_of_sequences = row[1].to_i
      graph.hash_length = row[2].to_i
      #Not quite sure what the function of the 4th column is
      state = :nodes_0
      log.debug "Now parsing velvet graph nodes" if log.debug?
      next
    end

    if state == :nodes_0
      # NODE $NODE_ID $COV_SHORT1 $O_COV_SHORT1 $COV_SHORT2 $O_COV_SHORT2
      # $ENDS_OF_KMERS_OF_NODE
      # $ENDS_OF_KMERS_OF_TWIN_NODE
      if row[0] == 'NODE'
        raise unless row.length > 2
        current_node = Node.new
        current_node.node_id = row[1].to_i
        current_node.length = row[2].to_i
        current_node.coverages = row[3...row.length].collect{|c| c.to_i}
        current_node.parent_graph = graph
        state = :nodes_1
        raise "Duplicate node name" unless graph.nodes[current_node.node_id].nil?
        graph.nodes[current_node.node_id] = current_node
        next
      else
        state = :arc
        log.debug "Now parsing velvet graph arcs" if log.debug?
        # No next in the loop so that this line gets parsed as an ARC further down the loop
      end
    elsif state == :nodes_1
      # Sometimes nodes can be empty
      row[0] ||= ''
      current_node.ends_of_kmers_of_node = row[0]
      raise "Unexpected nodes_1 type line on line #{line_number}: #{row.inspect}" if row.length != 1
      state = :nodes_2
      next
    elsif state == :nodes_2
      # Sometimes nodes can be empty
      row[0] ||= ''
      raise if row.length != 1
      current_node.ends_of_kmers_of_twin_node = row[0]
      state = :nodes_0
      next
    end

    if state == :arc
      if row[0] == 'ARC'
        # ARC $START_NODE $END_NODE $MULTIPLICITY
        arc = Arc.new
        raise unless row.length == 4
        arc.begin_node_id = row[1].to_i.abs
        arc.end_node_id = row[2].to_i.abs
        arc.multiplicity = row[3].to_i
        arc.begin_node_direction = (row[1].to_i > 0)
        arc.end_node_direction = (row[2].to_i > 0)
        graph.arcs.push arc
        next
      else
        state = :nr
        log.debug "Finished parsing velvet graph arcs. Now parsing the rest of the file" if log.debug?
      end
    end

    if state == :nr
      if row[0] == 'SEQ'
        log.warn "velvet graph parse warning: SEQ lines in the Graph file parsing not implemented yet, tracking of reads now not parsed either"
        break
      end

      # If short reads are tracked, for every node a block of read identifiers:
      # NR $NODE_ID $NUMBER_OF_SHORT_READS
      # $READ_ID $OFFSET_FROM_START_OF_NODE $START_COORD
      # $READ_ID2 etc.
      #p row
      if row[0] == 'NR'
        break if options[:dont_parse_noded_reads] # We are done if NR things aren't parsed
        if options[:grep_hack]
          unless options[:interesting_read_ids] or options[:interesting_node_ids]
            raise "Programming error using bio-velvet: if :grep_hack is specified, then :interesting_read_ids or :interesting_node_ids must also be"
          end
          apply_grep_hack graph, path_to_graph_file, options[:interesting_read_ids], options[:interesting_node_ids], options[:grep_hack]
          break #no more parsing is required
        else
          raise unless row.length == 3
          node_pm = row[1].to_i
          current_node_direction = node_pm > 0
          current_node = graph.nodes[node_pm.abs]
          current_node.number_of_short_reads ||= 0
          current_node.number_of_short_reads += row[2].to_i
          next
        end
      else
        raise unless row.length == 3
        read_id = row[0].to_i
        if (options[:interesting_node_ids] and !options[:interesting_node_ids].include?(current_node.node_id)) or
          (options[:interesting_read_ids] and !options[:interesting_read_ids].include?(read_id))
          # We have come across an uninteresting read. Ignore it.
          next
        end
        nr = NodedRead.new
        nr.read_id = read_id
        nr.offset_from_start_of_node = row[1].to_i
        nr.start_coord = row[2].to_i
        nr.direction = current_node_direction
        current_node.short_reads ||= NodedReadArray.new
        current_node.short_reads.push nr
        next
      end
    end
  end
  log.debug "Finished parsing velvet graph file" if log.debug?

  return graph
end

Instance Method Details

#delete_nodes_if(&block) ⇒ Object

Deletes nodes and associated arcs from the graph if the block passed evaluates to true (as in Array#delete_if). Actually the associated arcs are deleted first, and then the node, so that the graph remains sane at all times - there is never dangling arcs, as such.

Returns a [deleted_nodes, deleted_arcs] tuple, which are both enumerables, each in no particular order.



225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
# File 'lib/bio-velvet/graph.rb', line 225

def delete_nodes_if(&block)
  deleted_nodes = []
  deleted_arcs = []
  nodes.each do |node|
    if yield(node)
      deleted_nodes.push node

      # delete associated arcs
      arcs_to_del = @arcs.get_arcs_by_node_id(node.node_id)
      deleted_arcs.push arcs_to_del
      arcs_to_del.each do |arc|
        @arcs.delete arc
      end

      # delete the arc itself
      nodes.delete node
    end
  end
  return deleted_nodes, deleted_arcs.flatten
end

#get_arcs_by_node(node1, node2) ⇒ Object

Return an array of Arc objects between two nodes (specified by node objects), or an empty array if none exists. There is four possible arcs between two nodes, connecting their beginnings and ends



184
185
186
# File 'lib/bio-velvet/graph.rb', line 184

def get_arcs_by_node(node1, node2)
  @arcs.get_arcs_by_node_id(node1.node_id, node2.node_id)
end

#get_arcs_by_node_id(node_id1, node_id2) ⇒ Object

Return an array of Arc objects between two nodes (specified by integer IDs), or an empty array if none exists. There is four possible arcs between two nodes, connecting their beginnings and ends



177
178
179
# File 'lib/bio-velvet/graph.rb', line 177

def get_arcs_by_node_id(node_id1, node_id2)
  @arcs.get_arcs_by_node_id(node_id1, node_id2)
end

#neighbours_into_start(node) ⇒ Object

Return the adjacent nodes in the graph that connect to the end of a node



204
205
206
207
208
209
210
211
212
213
214
215
# File 'lib/bio-velvet/graph.rb', line 204

def neighbours_into_start(node)
  # Find all arcs that include this node in the right place
  passable_nodes = []
  @arcs.get_arcs_by_node_id(node.node_id).each do |arc|
    if arc.end_node_id == node.node_id and arc.end_node_direction
      passable_nodes.push nodes[arc.begin_node_id]
    elsif arc.begin_node_id == node.node_id and !arc.begin_node_direction
      passable_nodes.push nodes[arc.end_node_id]
    end
  end
  return passable_nodes.uniq
end

#neighbours_off_end(node) ⇒ Object

Return the adjacent nodes in the graph that connect to the end of a node



189
190
191
192
193
194
195
196
197
198
199
200
201
# File 'lib/bio-velvet/graph.rb', line 189

def neighbours_off_end(node)
  # Find all arcs that include this node in the right place
  passable_nodes = []
  @arcs.get_arcs_by_node_id(node.node_id).each do |arc|
    if arc.begin_node_id == node.node_id and arc.begin_node_direction
      # The most intuitive case
      passable_nodes.push nodes[arc.end_node_id]
    elsif arc.end_node_id == node.node_id and !arc.end_node_direction
      passable_nodes.push nodes[arc.begin_node_id]
    end
  end
  return passable_nodes.uniq
end

#parse_additional_noded_reads(path_to_graph_file, options) ⇒ Object

Add more noded reads to this already parsed graph. There is no gaurantee that old NodedRead information is preserved, or removed.

Options:

  • :interesting_read_ids: If not nil, is a Set of nodes that we are interested in. Reads

not of interest will not be parsed in (the NR part of the velvet LastGraph file). Regardless all nodes and edges are parsed in. Using this options saves both memory and CPU.

  • :interesting_node_ids: like :interesting_read_ids except it allows targeting of particular nodes

rather than particular reads.

  • :grep_hack: to make the parsing of read associations go even faster, a grep-based, rather

hacky method is applied to the graph file, so only NR data of interesting_read_ids is presented to the parser. This can save days of parsing time, but is a bit of a hack and its usage may not be particularly future-proof. The value of this option is the amount of context coming out of grep (the -A and -B flags). Using 500 should probably work for most circumstances, if not an Exception will be raised.



260
261
262
263
264
265
266
267
268
269
270
271
# File 'lib/bio-velvet/graph.rb', line 260

def parse_additional_noded_reads(path_to_graph_file, options)
  grep_context = options[:grep_hack]
  if grep_context.nil?
    raise "Calling Graph#parse_additional_noded_reads without specifying :grep_hack is currently not implemented"
  end
  self.class.apply_grep_hack(self,
    path_to_graph_file,
    options[:interesting_read_ids],
    options[:interesting_node_ids],
    grep_context
    )
end