bio-vcf

Build Status

Yet another VCF parser. This one may give better performance because of lazy parsing and useful combinations of (fancy) command line filtering. For example, to filter somatic data

  bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bq[rec.alt]>30 and rec.tumor.mq>20' < file.vcf

The VCF format is commonly used for variant calling between NGS samples. The fast parser needs to carry some state, recorded for each file in VcfHeader, which contains the VCF file header. Individual lines (variant calls) first go through a raw parser returning an array of fields. Further (lazy) parsing is handled through VcfRecord.

At this point the filter is pretty generic with multi-sample support. If something is not working, check out the feature descriptions and the source code. It is not hard to add features. Otherwise, send me a short example of a VCF statement you need to work on.

Installation

gem install bio-vcf

Quick start

Command line interface (CLI)

Get the version of the VCF file

  bio-vcf -q --eval-once header.version < file.vcf
  4.1

Get the column headers

  bio-vcf -q --eval-once 'header.column_names.join(",")' < file.vcf
  CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,NORMAL,TUMOR

Get the sample names

  bio-vcf -q --eval-once 'header.samples.join(",")' < file.vcf
  NORMAL,TUMOR

The 'fields' array contains unprocessed data (strings). Print first five raw fields

  bio-vcf --eval 'fields[0..4].join("\t")' < file.vcf 

Add a filter to display the fields on chromosome 12

  bio-vcf --filter 'fields[0]=="12"' --eval 'fields[0..4].join("\t")' < file.vcf 

It gets better when we start using processed data, represented by an object named 'rec'. Position is a value, so we can filter a range

  bio-vcf --filter 'rec.chrom=="12" and rec.pos>96_641_270 and rec.pos<96_641_276' < file.vcf 

Info fields are referenced by

  bio-vcf --filter 'rec.info.dp>100 and rec.info.readposranksum<=0.815' < file.vcf 

With subfields defined by rec.format

  bio-vcf --filter 'rec.tumor.ss != 2' < file.vcf 

Output

  bio-vcf --filter 'rec.tumor.gq>30' 
    --eval '[rec.ref,rec.alt,rec.tumor.bcount,rec.tumor.gq,rec.normal.gq].join("\t")' 
    < file.vcf

Show the count of the bases that were scored as somatic

  bio-vcf --eval 'rec.alt+"\t"+rec.tumor.bcount.split(",")[["A","C","G","T"].index(rec.alt)]+
    "\t"+rec.tumor.gq.to_s' < file.vcf

Actually, we have a convenience implementation for bcount, so this is the same

  bio-vcf --eval 'rec.alt+"\t"+rec.tumor.bcount[rec.alt].to_s+"\t"+rec.tumor.gq.to_s' 
    < file.vcf

Filter on the somatic results that were scored at least 4 times

  bio-vcf --filter 'rec.alt.size==1 and rec.tumor.bcount[rec.alt]>4' < test.vcf 

Similar for base quality scores

  bio-vcf --filter 'rec.alt.size==1 and rec.tumor.amq[rec.alt]>30' < test.vcf 

If your samples have other names you can fetch genotypes for that sample with

  bio-vcf --eval "rec.sample['BIOPSY17513D'].gt" < file.vcf

Or read depth for another

  bio-vcf --eval "rec.sample['subclone46'].dp" < file.vcf

Better even, you can access samples directly with

  bio-vcf --eval "rec.sample.biopsy17513d.gt" < file.vcf
  bio-vcf --eval "rec.sample.subclone46.dp" < file.vcf

For more examples see the feature section.

API

BioVcf can also be used as an API. The following code is basically what the command line interface uses (see ./bin/bio-vcf)

  FILE.each_line do | line |
    if line =~ /^##fileformat=/
      # ---- We have a new file header
      header = VcfHeader.new
      header.add(line)
      STDIN.each_line do | headerline |
        if headerline !~ /^#/
          line = headerline
          break # end of header
        end
        header.add(headerline)
      end
    end
    # ---- Parse VCF record line
    # fields = VcfLine.parse(line,header.columns)
    fields = VcfLine.parse(line)
    rec = VcfRecord.new(fields,header)
    #
    # Do something with rec
    #
  end

Project home page

Information on the source tree, documentation, examples, issues and how to contribute, see

http://github.com/pjotrp/bioruby-vcf

Cite

If you use this software, please cite one of

Biogems.info

This Biogem is published at (http://biogems.info/index.html#bio-vcf)

Copyright (c) 2014 Pjotr Prins. See LICENSE.txt for further details.