GnCrossmap
This gem crossmaps a checklist of scientific names to names from a data source in GN Resolver.
The checklist has to be in CSV format.
Compatibility
This gem is compatible with Ruby 2.1.0 and higher.
Installation
Add this line to your application's Gemfile:
gem 'gn_crossmap'
And then execute:
bundle
Or install it yourself as:
gem install gn_crossmap
Usage
Usage as a Web Application
See the [gn_crossmap_web] project.
Usage from command line
# to see help
crossmap --help
# to compare with default source (Catalogue of Life)
crossmap -i my_list.csv -o my_list_col.csv
# to compare with other source (Index Fungorum in this example)
crossmap -i my_list.csv -o my_list_if.csv -d 5
# to use standard input and/or output
cat my_list.csv | crossmap -i - -o - > output
# to keep only taxonID (if given) from original input
# no original fields will be kept without taxonID
crossmap -i my_list.csv -s
# to show classification from the source
crossmap -i my_list.csv -w
Usage as Ruby Library (API description)
GnCrossmap.run
Compares an input list to a data source from GN Resolver and writes the result to an output file.
opts = { input: input, output: output, data_source_id: 1,
         skip_original: true, alt_headers: [] }
GnCrossmap.run(opts)
input
: (string) Either a path to a csv file with list of names, or "-" which
designates STDIN
output
: (string) Either a path to the output file, or "-" which designates STDOUT
data_source_id
: (integer) id of a data source from GN resolver
skip_original
: (boolean) if true, only taxonID (if given) is preserved from the original data; if there is no taxonID, no original data is preserved. Otherwise all original data is preserved.
alt_headers
: (array) empty array by default. If alt_headers is not empty, its values are used instead of the headers supplied with the file
resolver_url
: URL to globalnames' resolver. Default is http://resolver.globalnames.org
with_classification
: (boolean) if true, adds classification path to the output
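The options above can be collected with a small helper. This is a minimal sketch: `crossmap_opts` and `CROSSMAP_DEFAULTS` are hypothetical names that are not part of gn_crossmap, and the default values are assumptions chosen for illustration rather than the gem's own defaults.

```ruby
# Hypothetical helper, not part of gn_crossmap. It fills in the option keys
# documented above; the default values here are assumptions for illustration.
CROSSMAP_DEFAULTS = {
  output: "-",               # "-" designates STDOUT
  data_source_id: 1,         # id used in the example above
  skip_original: false,
  alt_headers: [],
  with_classification: false
}.freeze

def crossmap_opts(input:, **overrides)
  # Reject keys that are not among the documented options.
  unknown = overrides.keys - CROSSMAP_DEFAULTS.keys - [:resolver_url]
  raise ArgumentError, "unknown options: #{unknown.join(', ')}" unless unknown.empty?

  CROSSMAP_DEFAULTS.merge(overrides).merge(input: input)
end

opts = crossmap_opts(input: "my_list.csv", data_source_id: 5)
# With the gem loaded, the hash would then be passed as GnCrossmap.run(opts)
```

Rejecting unknown keys early catches misspelled option names before a long resolution run starts.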
GnCrossmap.logger=
Sets a custom logger (by default, logging goes to STDERR)
Usage Example
require "gn_crossmap"
# If you want to change logger -- default Logging is to standard error
GnCrossmap.logger = MyCustomLogger.new
opts = { input: "path/to/input.csv", output: "path/to/output.csv",
         data_source_id: 5, skip_original: true }
GnCrossmap.run(opts)
# if you want to use alternative headers instead of ones supplied in a file
# (note that %w takes space-separated words, without commas)
opts = { input: "path/to/input.csv", output: "path/to/output.csv",
         data_source_id: 5, skip_original: true,
         alt_headers: %w(taxonId scientificName rank) }
GnCrossmap.run(opts)
If you want to get intermediate statistics for each resolution cycle use a block:
GnCrossmap.run(opts) do |stats|
puts stats
puts "Matches:"
stats[:matches].each do |key, value|
puts "#{GnCrossmap::MATCH_TYPES[key]}: #{value}"
end
end
To terminate the resolution before it completes, return the string "STOP" from the block:
GnCrossmap.run(opts) do
# do something and then return STOP string from the block
"STOP"
end
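As an illustration of the termination mechanism, a block can be built that returns "STOP" only after a given number of names has been processed. This is a sketch under assumptions: `stop_after` is a hypothetical helper, and it assumes the stats hash exposes `resolved_records` as a number under a symbol key, as the `stats[:matches]` example suggests.

```ruby
# Hypothetical helper, not part of gn_crossmap. Returns a proc that yields
# "STOP" once at least `limit` names are processed, and nil before that.
def stop_after(limit)
  proc do |stats|
    "STOP" if stats[:resolved_records].to_i >= limit
  end
end

stopper = stop_after(500)
# With the gem loaded, the proc would be passed as a block:
#   GnCrossmap.run(opts, &stopper)
```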
Intermediate stats format
Field | Description |
---|---|
status | current phase: (init, ingested |
total_records | total number of names in original list |
ingestion_start | time when the reading from csv started |
ingestion_span | time of intermediate checkpoint of reading csv |
ingested_records | number of ingested records at an intermediate checkpoint |
resolution_start | time when resolution of names started |
resolution_stop | time when resolution of names stopped |
resolution_span | time of intermediate checkpoint of resolving names |
resolved_records | number of names already processed |
speed | weighted speed average |
current_speed | speed of resolution for the last batch |
matches | Distribution of processed data by match type (see below) |
errors | First 0-10 errors found during the csv file processing |
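The fields above could be turned into a one-line progress report. The following is a minimal sketch: `progress_line` is a hypothetical helper, the sample hash contains invented values, and the assumption is that the keys are symbols holding numbers and an array of errors.

```ruby
# Hypothetical formatter for the intermediate stats hash described above.
def progress_line(stats)
  total = stats[:total_records].to_i
  done = stats[:resolved_records].to_i
  percent = total.zero? ? 0.0 : (100.0 * done / total)
  format("%s: %d/%d names (%.1f%%), %d errors",
         stats[:status], done, total, percent, Array(stats[:errors]).size)
end

# Invented sample values, for illustration only.
sample = { status: "ingested", total_records: 200, resolved_records: 50, errors: [] }
puts progress_line(sample)
```

With the gem loaded, the same method could be called from the block, for example `GnCrossmap.run(opts) { |stats| puts progress_line(stats) }`.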
Match types
The match types dictionary can be accessed with the GnCrossmap::MATCH_TYPES constant
Match code | Match type |
---|---|
0 | No match |
1 | Exact string match |
2 | Canonical form exact match |
3 | Canonical form fuzzy match |
4 | Partial canonical form match |
5 | Partial canonical form fuzzy match |
6 | Genus part match |
7 | Error in matching |
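The table above can be used to label a distribution of matches. This sketch uses a local copy of the table (`MATCH_LABELS`, a hypothetical name) so that it runs without the gem; with the gem loaded, GnCrossmap::MATCH_TYPES would be the source of truth. It assumes the distribution is a hash keyed by integer match codes.

```ruby
# Local copy of the match type table, for illustration only.
MATCH_LABELS = {
  0 => "No match",
  1 => "Exact string match",
  2 => "Canonical form exact match",
  3 => "Canonical form fuzzy match",
  4 => "Partial canonical form match",
  5 => "Partial canonical form fuzzy match",
  6 => "Genus part match",
  7 => "Error in matching"
}.freeze

# Turns { code => count } into labeled lines, ordered by match code.
def summarize_matches(matches)
  matches.sort.map do |code, count|
    "#{MATCH_LABELS.fetch(code, 'Unknown')}: #{count}"
  end
end

# Invented counts, for illustration only.
puts summarize_matches({ 1 => 40, 3 => 7, 0 => 3 })
```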
Input file format
- A CSV file with the names of fields in the first row.
- Columns can be separated by tabs, commas, or semicolons.
- At least some columns should have recognizable field names:
taxonID
kingdom
phylum
class
order
family
genus
species
subspecies
variety
form
scientificNameAuthorship
scientificName
taxonRank
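Before submitting a file, its header row can be checked against the recognizable fields listed above. This is a sketch that is independent of the gem's own parsing: `detect_separator` and `recognized_headers` are hypothetical helpers, and the case-insensitive comparison is an assumption.

```ruby
require "csv"

# Field names taken from the list above.
RECOGNIZED_FIELDS = %w[taxonID kingdom phylum class order family genus species
                       subspecies variety form scientificNameAuthorship
                       scientificName taxonRank].freeze

# Guesses the separator by counting each candidate in the header row.
def detect_separator(header_line)
  ["\t", ",", ";"].max_by { |sep| header_line.count(sep) }
end

# Returns the headers that match a recognizable field name
# (compared case-insensitively, which is an assumption).
def recognized_headers(header_line)
  sep = detect_separator(header_line)
  headers = CSV.parse_line(header_line, col_sep: sep) || []
  known = RECOGNIZED_FIELDS.map(&:downcase)
  headers.compact.select { |h| known.include?(h.strip.downcase) }
end

p recognized_headers("taxonID;scientificName;notes")
```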
Simplest Example -- only scientificName
scientificName |
---|
Animalia |
Macrobiotus echinogenitus subsp. areolatus Murray, 1907 |
taxonID and scientificName Example
taxonID;scientificName
1;Macrobiotus echinogenitus subsp. areolatus Murray, 1907
...
taxonID | scientificName |
---|---|
1 | Animalia |
2 | Macrobiotus echinogenitus subsp. areolatus Murray, 1907 |
Rank Example
taxonID;scientificName;taxonRank
1;Macrobiotus echinogenitus f. areolatus Murray, 1907;form
...
taxonID | scientificName | taxonRank |
---|---|---|
1 | Animalia | kingdom |
2 | Macrobiotus echinogenitus subsp. areolatus Murray, 1907 | subspecies |
Family and Authorship Example
taxonID;family;scientificName;scientificNameAuthorship
1;Macrobiotidae;Macrobiotus echinogenitus subsp. areolatus;Murray, 1907
...
taxonID | family | scientificName | scientificNameAuthorship |
---|---|---|---|
1 | | Animalia | |
2 | Macrobiotidae | Macrobiotus echinogenitus | Murray |
Fine-grained Example
TaxonId;kingdom;subkingdom;phylum;subphylum;superclass;class;subclass;cohort;superorder;order;suborder;infraorder;superfamily;family;subfamily;tribe;subtribe;genus;subgenus;section;species;subspecies;variety;form;ScientificNameAuthorship
1;Animalia;;Tardigrada;;;Eutardigrada;;;;Parachela;;;Macrobiotoidea;Macrobiotidae;;;;Macrobiotus;;;harmsworthi;obscurus;;;Dastych, 1985
TaxonId | kingdom | subkingdom | phylum | subphylum | superclass | class | subclass | cohort | superorder | order | suborder | infraorder | superfamily | family | subfamily | tribe | subtribe | genus | subgenus | section | species | subspecies | variety | form | ScientificNameAuthorship |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
136021 | Animalia | | Pogonophora | | | | | | | | | | | | | | | | | | | | | | |
136022 | Animalia | | Pogonophora | | | Frenulata | | | | | | | | | | | | | | | | | | | Webb, 1969 |
565443 | Animalia | | Tardigrada | | | Eutardigrada | | | | Parachela | | | Macrobiotoidea | Macrobiotidae | | | | Macrobiotus | | | harmsworthi | obscurus | | | Dastych, 1985 |
More examples can be found in the spec/files directory
Output file format
Output includes the following fields:
Field | Description |
---|---|
classification | classification path of the data source (if available) |
editDistance | for fuzzy-matching -- how many characters differ between checklist and data source name |
matchSize | number of returned matches for a name |
matchType | what kind of match it is |
matchedCanonicalForm | canonical form of the matched name |
matchedRank | corresponding rank from the data source |
matchedScientificName | name matched from the GN Resolver data source |
rank | rank from the source (if it was given/inferred) |
scientificName | name from the checklist |
score | heuristic score from 0 to 1, where 1 is a good match and a score of 0.5 requires further human investigation |
taxonID | original ID attached to a name in the checklist |
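Since a score around 0.5 calls for human investigation, the output can be filtered for rows that need review. This is a minimal sketch: `needs_review` is a hypothetical helper, the sample rows are invented, and a comma-separated output with the field names above as headers is an assumption.

```ruby
require "csv"

# Invented rows using a few of the documented output fields.
sample_output = [
  "taxonID,scientificName,matchedScientificName,score",
  "1,Animalia,Animalia,0.99",
  "2,Macrobiotus echinogenitu,Macrobiotus echinogenitus,0.5"
].join("\n")

# Returns rows whose score is at or below the threshold.
# Rows without a score are treated as 0.0 and therefore included.
def needs_review(csv_text, threshold: 0.5)
  CSV.parse(csv_text, headers: true).select do |row|
    row["score"].to_f <= threshold
  end
end

needs_review(sample_output).each { |row| puts row["scientificName"] }
```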
Types of Matches
The output format returns 7 possible types of matches:
- Exact string match - The exact name was matched (but ignoring non-ASCII characters)
- Exact match by canonical form of a name - The canonical form of the name (a version of a scientific name that contains complete versions of the Latin words, but lacks insertions of subtaxa, annotations, or authority information) was matched
- Fuzzy match by canonical form - The canonical form gave a fuzzy match, which detects lexical or spelling variations of a name using Tony Rees' TAXAMATCH algorithm
- Partial exact match by species part of canonical form - The canonical form returned a partial but exact match
- Partial fuzzy match by species part of canonical form - The canonical form returned a partial, fuzzy match
- Exact match by genus part of a canonical form - The genus part of the canonical form of the species name returned an exact match
- [Blank] - No match
Development
After checking out the repo, run bin/setup to install dependencies. Then, run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release to create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org
Contributing
- Fork it (https://github.com/[my-github-username]/gn_crossmap/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Copyright
Author -- Dmitry Mozzherin
Copyright (c) 2015-2016 Marine Biological Laboratory. See LICENSE for details.