bio-locus
Bio-locus is a tool for fast querying of genome locations. Many file formats in bioinformatics contain records that start with a chromosome name and a position for a SNP, or a start-end position for indels.
This tool essentially allows you to store this information in a Hash or database:
bio-locus --store < one.vcf
which creates or adds to a cache file or database with unique entries for all listed positions (chr+pos) AND for all listed positions with listed alt alleles. To find positions in another dataset which match those in the database:
bio-locus --match < two.vcf
The point is that this is a two-step process: first create the indexed database, then query it. Entries can also be removed with the --delete switch.
To also match on the ALT allele, use
bio-locus --match --include-alt < two.vcf
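The two-step store/match flow can be pictured with a small Ruby sketch. This is an illustration only, not the gem's implementation: the helper names (`locus_key`, `store`, `match`), the in-memory `Set`, and the tab-joined key layout are all assumptions made for the example.

```ruby
require 'set'

# Hypothetical helper: build a chr+pos key from a tab-delimited, VCF-style line.
# The key layout that bio-locus really uses is not shown in this README.
def locus_key(line)
  chr, pos = line.chomp.split("\t")
  "#{chr}\t#{pos}"
end

# Step 1 (--store): record every position of the first dataset.
def store(lines, db = Set.new)
  lines.each do |line|
    next if line.start_with?('#') # skip VCF header lines
    db << locus_key(line)
  end
  db
end

# Step 2 (--match): keep only lines whose position was stored earlier.
def match(lines, db)
  lines.reject { |l| l.start_with?('#') }
       .select { |l| db.include?(locus_key(l)) }
end

one = ["#CHROM\tPOS\tID\tREF\tALT", "1\t100\t.\tA\tG", "2\t200\t.\tC\tT"]
two = ["1\t100\t.\tA\tC", "3\t300\t.\tG\tA"]

db = store(one)
puts match(two, db) # prints the single line at chr 1, pos 100
```

Because the key here is chr+pos only, the line from the second dataset matches even though its ALT differs. Adding the ALT to the key would correspond to the stricter --include-alt behaviour.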
Why would you use bio-locus?
- To reduce the size of large SNP databases before storage/querying
- To gain performance
- To filter on chr+pos (default)
- To filter on chr+pos+field (where field can be a VCF ALT)
Use cases are
- To filter for annotated variants
- To remove common variants from a set
In short, it is a more targeted approach that lets you work with less data. The tool is reasonably fast. For example, looking for 130 positions in 20 million SNPs in GoNL takes 0.11s to store and about 1.5 minutes to match on my laptop:
cat my_130_variants.vcf | ./bin/bio-locus --store
Stored 130 positions out of 130 in locus.db
real 0m0.119s
user 0m0.108s
sys 0m0.012s
cat gonl.*.vcf | ./bin/bio-locus --match
Matched 3 out of 20736323 lines in locus.db!
real 1m34.577s
user 1m33.602s
sys 0m1.868s
Note: storage is handled by the moneta gem, currently with localmemcache.
Note: the ALT field is split into components for matching, so A,C becomes two chr+pos records, one for A and one for C.
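As a rough sketch of that splitting, assuming a hypothetical `alt_keys` helper and a tab-joined key (neither is taken from the gem's source), a multi-allelic ALT field expands into one key per allele:

```ruby
# Hypothetical key builder: one entry per ALT allele at the same chr+pos.
# The separator and key layout are illustrative assumptions.
def alt_keys(chr, pos, alt_field)
  alt_field.split(',').map { |alt| [chr, pos, alt].join("\t") }
end

alt_keys('1', '100', 'A,C').each { |key| puts key.inspect }
```

A record with ALT `A,C` therefore yields two entries, so a later lookup for either allele at that position can succeed independently.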
Installation
gem install bio-locus
Command line
In addition to --store and --match mentioned above there are a number of options available through
bio-locus --help
Deleting keys
To delete entries use
bio-locus --delete < two.vcf
To delete with alt use
bio-locus --delete --include-alt < two.vcf
You may need to run both with and without alt, depending on your needs!
Parsing
Any line-based format can be used. For example, parsing the alt from
X 107976940 G/C -1 5 5 0.75 H879D 0 IRS4 CCDS14544 Cat/Gat rs1801164 missense_variant ENST00000372129.2:c.2635C>G
can be done with
bio-locus --store --eval-alt 'field[2].split(/\//)[1]'
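To see what that expression yields, it can be tried in plain Ruby. The sketch below assumes that `field` holds the columns of the line split on tabs; how bio-locus actually builds `field` is not described here, and `eval_alt` is a hypothetical wrapper for the example.

```ruby
# Hypothetical wrapper around the --eval-alt expression from above.
def eval_alt(line)
  field = line.chomp.split("\t") # assumption: tab-delimited columns
  field[2].split(/\//)[1]        # "G/C" -> ["G", "C"] -> "C"
end

line = %w[X 107976940 G/C -1 5 5 0.75 H879D 0 IRS4 CCDS14544 Cat/Gat
          rs1801164 missense_variant ENST00000372129.2:c.2635C>G].join("\t")

puts eval_alt(line) # => C
```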
COSMIC
COSMIC is pretty large, so it can be useful to cut the database down to the variants that you have. The locus information is combined in the second-to-last column as chr:start-end, e.g., 19:58861911-58861911. This will work:
bio-locus -i --match --eval-chr='field[13] =~ /^([^:]+)/ ; $1' --eval-pos='field[13] =~ /:(\d+)-/ ; $1 ' < CosmicMutantExportIncFus_v68.tsv
Note that the -i switch is needed to skip records that lack position information.
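The two regular expressions can be checked in isolation. The following sketch applies them to a made-up 15-column row in which only column 13 is filled in; the `cosmic_locus` helper and the row contents are assumptions for illustration, not real COSMIC data.

```ruby
# Hypothetical helper applying the --eval-chr and --eval-pos patterns.
def cosmic_locus(field)
  locus = field[13].to_s   # to_s guards against a missing column
  locus =~ /^([^:]+)/      # everything before the first colon
  chr = $1
  locus =~ /:(\d+)-/       # digits between ':' and '-'
  pos = $1
  [chr, pos]               # either value is nil when the pattern fails
end

row = Array.new(15, '')
row[13] = '19:58861911-58861911'
p cosmic_locus(row)                 # => ["19", "58861911"]

# A record with an empty locus column yields no chr or pos.
p cosmic_locus(Array.new(15, ''))   # => [nil, nil]
```

The second call suggests why records without position information need to be skipped: both expressions evaluate to nil for them.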
Usage
require 'bio-locus'
The API doc is online. For more code examples see the test files in the source tree.
Project home page
For information on the source tree, documentation, examples, issues and how to contribute, see
http://github.com/pjotrp/bioruby-locus
The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
Cite
If you use this software, please cite one of
- BioRuby: bioinformatics software for the Ruby programming language
- Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics
Biogems.info
This Biogem is published at http://biogems.info/index.html#bio-locus
Copyright
Copyright (c) 2014 Pjotr Prins. See LICENSE.txt for further details.