bio-locus

Bio-locus is a tool for fast querying of genome locations. Many file formats in bioinformatics contain records that start with a chromosome name and a position for a SNP, or a start-end position for indels.

This tool essentially allows you to store this information in a Hash or database:

  bio-locus --store < one.vcf

which creates or adds to a cache file or database with unique entries for all listed positions (chr+pos) AND for all listed positions with listed alt alleles. To find positions in another dataset which match those in the database:

  bio-locus --match < two.vcf

The point is that this is a two-step process: first create the indexed database, then query it. It is also possible to remove entries with the --delete switch.
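The two steps can be pictured with a plain Ruby Hash. This is only an illustrative sketch: the helper names (locus_key, data_lines, store, match), the tab-joined key layout and the in-memory Hash are assumptions made for this example and are not taken from the bio-locus source, which keeps its keys in a cache file or database.

```ruby
# Illustrative sketch only; all names and the key layout are hypothetical.

# Build a chr+pos key from one VCF data line (columns are tab separated).
def locus_key(line)
  chr, pos = line.chomp.split("\t")
  "#{chr}\t#{pos}"
end

# Drop VCF header lines (starting with '#') and blank lines.
def data_lines(text)
  text.each_line.reject { |l| l.start_with?('#') || l.strip.empty? }
end

# Step 1: store every position of the first data set.
def store(db, vcf_text)
  data_lines(vcf_text).each { |l| db[locus_key(l)] = true }
  db
end

# Step 2: return the lines of the second data set whose position was stored.
def match(db, vcf_text)
  data_lines(vcf_text).select { |l| db.key?(locus_key(l)) }
end

one = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tG\n2\t200\t.\tC\tT\n"
two = "1\t100\t.\tA\tC\n3\t300\t.\tG\tA\n"

db = store({}, one)
matched = match(db, two)
puts matched
```

Only the record at chromosome 1, position 100 is reported, because that is the only position present in both inputs. The ALT allele differs between the two records, which does not matter when matching on chr+pos alone.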

To match on chr+pos+alt rather than on chr+pos alone, use

  bio-locus --match --include-alt < two.vcf

Why would you use bio-locus?

To reduce the size of large SNP databases before storage/querying
To gain performance
To filter on chr+pos (default)
To filter on chr+pos+field (where field can be a VCF ALT)

Use cases are

To filter for annotated variants
To remove common variants from a set

In short, bio-locus offers a more targeted approach that lets you work with less data. The tool is decently fast. For example, looking for 130 positions in 20 million SNPs in GoNL takes 0.11s to store and 1.5 minutes to match on my laptop:

cat my_130_variants.vcf | ./bin/bio-locus --store
  Stored 130 positions out of 130 in locus.db
  real    0m0.119s
  user    0m0.108s
  sys     0m0.012s

cat gonl.*.vcf | ./bin/bio-locus --match
  Matched 3 out of 20736323 lines in locus.db!
  real    1m34.577s
  user    1m33.602s
  sys     0m1.868s

Note: storage is handled by the moneta gem, currently with localmemcache.

Note: the ALT field is split into components for matching, so A,C becomes two chr+pos records, one for A and one for C.
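The splitting of the ALT field can be sketched as follows. The alt_keys helper and the tab-joined key layout are hypothetical and only illustrate the idea; the actual key format used by bio-locus may differ.

```ruby
# Hypothetical helper: expand one VCF data line into one key per ALT allele.
def alt_keys(line)
  # VCF columns: CHROM, POS, ID, REF, ALT, ...
  chr, pos, _id, _ref, alt = line.chomp.split("\t")
  # A multi-allelic ALT such as "A,C" yields one key for each allele.
  alt.split(',').map { |allele| [chr, pos, allele].join("\t") }
end

keys = alt_keys("1\t100\t.\tG\tA,C")
keys.each { |k| puts k.inspect }
```

With such keys, a record in another file matches when it shares the position and at least one of its ALT alleles.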

Installation

gem install bio-locus

Command line

In addition to --store and --match mentioned above there are a number of options available through

bio-locus --help

Deleting keys

To delete entries use

  bio-locus --delete < two.vcf

To delete entries that include the alt use

  bio-locus --delete --include-alt < two.vcf

You may need to run both with and without alt, depending on your needs!

Parsing

It is possible to use any line-based format. For example, parsing the alt from

X       107976940       G/C     -1      5       5       0.75    H879D   0      IRS4     CCDS14544       Cat/Gat rs1801164       missense_variant        ENST00000372129.2:c.2635C>G

can be done with

bio-locus --store --eval-alt 'field[2].split(/\//)[1]'
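To see what that expression yields, it can be tried in plain Ruby. This sketch assumes that field holds the columns of the current line, split on tabs and indexed from zero; that is consistent with the example above but is an assumption about how the tool sets it up. The sample line is shortened to its first ten columns.

```ruby
# Shortened copy of the sample record above (first ten columns only).
line = "X\t107976940\tG/C\t-1\t5\t5\t0.75\tH879D\t0\tIRS4"

# Assumption: the tool exposes the tab-split columns as `field`.
field = line.split("\t")

# Same expression as passed to --eval-alt: take the part after the slash.
alt = field[2].split(/\//)[1]
puts alt
```

Here field[2] is "G/C", so the expression evaluates to "C".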

COSMIC

COSMIC is pretty large, so it can be useful to cut the database down to the variants that you have. The locus information is combined in the second-to-last column as chr:start-end, e.g., 19:58861911-58861911. This will work:

bio-locus -i --match --eval-chr='field[13] =~ /^([^:]+)/ ; $1' --eval-pos='field[13] =~ /:(\d+)-/ ; $1 ' < CosmicMutantExportIncFus_v68.tsv

Note the -i switch is needed to skip records that lack position information.
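The two regular expressions can be checked in isolation. In this sketch they are wrapped in hypothetical helper methods (locus_chr and locus_pos are not part of bio-locus) and applied directly to a locus string instead of to field[13].

```ruby
# Hypothetical wrappers around the two expressions used on the command line.

# Everything before the first colon is the chromosome name.
def locus_chr(locus)
  locus =~ /^([^:]+)/ ; $1
end

# The digits between the colon and the dash are the start position.
def locus_pos(locus)
  locus =~ /:(\d+)-/ ; $1
end

chr = locus_chr('19:58861911-58861911')
pos = locus_pos('19:58861911-58861911')

# An empty locus column does not match, so $1 is nil.
missing = locus_pos('')

p chr, pos, missing
```

For the example locus this gives "19" and "58861911". For an empty column the position is nil, which is presumably the kind of record that -i skips.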

Usage

require 'bio-locus'

The API doc is online. For more code examples see the test files in the source tree.

Project home page

For information on the source tree, documentation, examples, issues and how to contribute, see

http://github.com/pjotrp/bioruby-locus

The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.

Cite

If you use this software, please cite one of

Biogems.info

This Biogem is published at (http://biogems.info/index.html#bio-locus)