FULL-LENGTHERNEXT is a tool adapted to NGS technologies that can run in parallel and in a distributed way to minimise computing time. It classifies unigenes as full-length, 5’-end, 3’-end or internal, and suggests which unknown genes are coding and which are not. It will also be shown that FULL-LENGTHERNEXT fixes frame shifts, one of the main mistakes found in wrong entries of full-length sequence databases, and that it is a fast tool for comparing different transcriptome assemblies.
FULL-LENGTHERNEXT uses scbi_mapreduce and can therefore exploit all the benefits of a cluster environment. It also works on multi-core machines and big shared-memory servers.
It is able to classify unigenes as full-length, 5’-end, 3’-end and internal.
FULL-LENGTHERNEXT fixes frame shifts.
It returns the translated protein sequence for complete genes, and the nucleotide sequence with frame shifts fixed and the start and stop codons highlighted, which makes the gene and the UTR regions easier to find.
FULL-LENGTHERNEXT suggests putative new genes by analysing which of the sequences classified as unknown are probably coding and which are putative non-coding RNA sequences.
It produces an HTML file with statistics useful for comparing assemblies.
FULL-LENGTHERNEXT must be fed with a multifasta file containing all the unigenes to analyse and with the group to which the organism under study belongs (fungi, human, invertebrates, mammals, plants, rodents or vertebrates), so that the most appropriate databases are used. In addition, it is possible to set the number of CPUs to be used (workers), the minimum identity percent (default = 45%) and e-value (default = 1e-25) thresholds, the maximum distance between the query and subject gene limits (default = 15 amino acids) and, if desired, a user database of complete proteins.
full_lengther_next -f input.fasta -g [fungi|human|invertebrates|mammals|plants|rodents|vertebrates] -d user_db [options]
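The identity and e-value thresholds described above can be illustrated outside the tool. The following is a minimal sketch, not FULL-LENGTHERNEXT's own code. It assumes BLAST hits saved in tabular format (-outfmt 6), where column 3 is the percent identity and column 11 is the e-value. It also assumes that a hit passes when its identity is at least the identity threshold and its e-value is at most the e-value threshold. The function name filter_hits is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: filter_hits is a hypothetical helper, not part of FULL-LENGTHERNEXT.
# Reads BLAST tabular output (-outfmt 6) on stdin and prints the hits that pass
# both thresholds. Column 3 = percent identity, column 11 = e-value.
filter_hits() {
  local min_ident="${1:-45}"      # default taken from this README
  local max_evalue="${2:-1e-25}"  # default taken from this README
  awk -F'\t' -v ident="$min_ident" -v evalue="$max_evalue" \
    '($3 + 0) >= (ident + 0) && ($11 + 0) <= (evalue + 0)'
}
```

For example, `filter_hits 45 1e-25 < hits.tsv` would print only the rows meeting both defaults, where hits.tsv is a hypothetical file name.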
Full-LengthNext results files appear at the end of program execution, grouped in a folder called fl2_results, where the following files can be found:
alignments.txt: Displays the BLASTx alignment between our query sequence translated into amino acids and the protein sequence from the Full-LengthNext database.
annotations.txt: in this file, the main information for each query sequence can be found; status, subject accession number, subject description, warning messages, protein obtained and indices provided by BLASTx alignment.
nc_rna.txt: Putative non coding RNA sequences detected using BLAST.
nt_seq.txt: It contains the nucleotide sequence, marking when possible the start codon with a hyphen, an underscore and a hyphen (-_-) and the stop codon with three underscores (___). Useful for finding UTRs and the gene sequence.
proteins.fasta: fasta format file with the complete proteins.
summary_stats.html: summary statistics of the results obtained by Full-LengthNext for the set of query unigenes. It is useful for assemblies comparison.
tcode_result.txt: It is equivalent to the annotations.txt file, but it is used for sequences with no similarity in the databases. Possible statuses are: coding, non-coding or unknown.
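After a run, a quick look at the results folder can be scripted. The sketch below is an illustration under stated assumptions: the file names are the ones listed above, the helper names missing_results and count_fasta_records are hypothetical, and a file reported as missing does not necessarily indicate an error (whether every file is written on every run is not stated here).

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and not shipped with the tool.
# Result file names as listed in this README.
FLN_RESULT_FILES="alignments.txt annotations.txt nc_rna.txt nt_seq.txt proteins.fasta summary_stats.html tcode_result.txt"

# Print the name of each listed file that is absent from the results folder.
missing_results() {
  local dir="${1:-fl2_results}" f
  for f in $FLN_RESULT_FILES; do
    [ -e "$dir/$f" ] || echo "$f"
  done
}

# Count FASTA records, i.e. header lines starting with '>'.
# grep exits non-zero when the count is 0, so the status is neutralised.
count_fasta_records() {
  grep -c '^>' "$1" || true
}
```

For example, `count_fasta_records fl2_results/proteins.fasta` gives the number of complete proteins reported, which is one simple figure to compare between two assemblies.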
To install FULL-LENGTHERNEXT on a cluster, the software must be available on all machines, either by installing it in a shared location or by installing it on each cluster node. Once installed, you need to create an init_file where your environment is correctly set up (paths, BLASTDB, etc.):
export PATH=/apps/blast+/bin:/apps/cd-hit/bin
export BLASTDB=/var/DB/formatted
export FULL_LENGTHER_NEXT_INIT=path_to_init_file
And initialize the FULL_LENGTHER_NEXT_INIT environment variable on your main node (from where FULL-LENGTHERNEXT will be initially launched):
export FULL_LENGTHER_NEXT_INIT=path_to_init_file
If you use any queue system like PBS Pro or Moab/Slurm, be sure to initialize the variables on each submission script.
NOTE: all nodes on the cluster should use ssh keys to allow FULL-LENGTHERNEXT to launch workers without asking for a password.
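The environment and ssh requirements above can be checked before submitting a job. This is a minimal sketch with hypothetical helper names (check_fln_env, check_worker_ssh). It assumes the init file has already been sourced in the current shell and that the workers file holds one host name per line. BatchMode=yes makes ssh fail instead of prompting for a password, so a failure here may also come from other connection problems.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of FULL-LENGTHERNEXT.

# Return non-zero if the variables described in this README are not usable.
check_fln_env() {
  local status=0
  if [ -z "${FULL_LENGTHER_NEXT_INIT:-}" ]; then
    echo "FULL_LENGTHER_NEXT_INIT is not set" >&2
    status=1
  elif [ ! -r "$FULL_LENGTHER_NEXT_INIT" ]; then
    echo "init file not readable: $FULL_LENGTHER_NEXT_INIT" >&2
    status=1
  fi
  if [ -z "${BLASTDB:-}" ]; then
    echo "BLASTDB is not set" >&2
    status=1
  fi
  return "$status"
}

# Try a non-interactive login on every host listed in a workers file.
# -n stops ssh from consuming the host list on stdin.
check_worker_ssh() {
  local host
  while IFS= read -r host; do
    [ -n "$host" ] || continue
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
      || echo "passwordless ssh failed for: $host" >&2
  done < "$1"
}
```

Running `check_fln_env && check_worker_ssh workers` on the main node reports problems on standard error and prints nothing when every check passes.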
SAMPLE INIT FILES FOR CLUSTERED INSTALLATION:
Init file
$> cat fln_init_env
source ~ruby19/init_env
source ~blast_plus/init_env
export BLASTDB=~full_lenghter_next/DB/formatted/
export FULL_LENGTHER_NEXT_INIT=~full_lenghter_next/fln_init_env
PBS Submission script
$> cat sample_work.sh
# 12 distributed workers and 1 GB memory per worker:
#PBS -l select=12:ncpus=1:mpiprocs=1:mem=1gb
# request 10 hours of walltime:
#PBS -l walltime=10:00:00
# cd to working directory (from where job was submitted)
cd $PBS_O_WORKDIR
# create workers file with assigned node names
cat $PBS_NODEFILE > workers
# init full-lengthernext
source ~full_lenghter_next/init_env
time full_lengther_next -f input.fasta -g group -d user_db -w workers -s 10.0.0
Once this submission script is created, you only need to launch it with:
Blast plus 2.24 or greater (prior versions have bugs that produce bad results)
*Download the latest version of Blast+ from ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/
*You can also use a precompiled version if you like
*To install from source, decompress the downloaded file, cd to the decompressed folder, and issue the following commands:
./configure
make
sudo make install
Installing Ruby 1.9
*You can use RVM to install ruby:
Download latest certificates (maybe you don’t need them):
$ curl -O curl.haxx.se/ca/cacert.pem
$ export CURL_CA_BUNDLE=`pwd`/cacert.pem # add this to your .bashrc or equivalent
$ bash < <(curl -k rvm.beginrescueend.com/install/rvm)
Set up the environment:
$ echo '[[ -s "$HOME/.rvm/scripts/rvm" ]] && . "$HOME/.rvm/scripts/rvm" # Load RVM function' >> ~/.bash_profile
Install ruby 1.9.2 (this can take a while):
$ rvm install 1.9.2
Set it as the default:
$ rvm use 1.9.2 --default
Full-LengtherNEXT is very easy to install. It is distributed as a ruby gem. The next command will install Full-LengtherNEXT and all the required gems:
gem install full_lengther_next
Install and rebuild Full-LengthNEXT databases
Full-LengthNEXT needs some databases to work. You can use the BLASTDB environment variable to change the default database location. To set the path for storing databases, execute the next line in your terminal or add it to your .bash_profile:
To install databases execute:
In addition, Full-LengthNEXT is able to use a customised database. It can be created by executing:
This script needs only two parameters: a database division (fungi, human, invertebrates, mammals, plants, rodents or vertebrates) and a ‘taxon’, which corresponds to a specific taxonomic group such as a genus, family or order. For example, if the organism under study is a pine, the database division will be ‘plants’ and the taxon may be Pinus, Pinaceae or even the order Coniferales, which includes pines and the other conifers. Therefore, the command line will be:
$ mk_user_db.rb plants Coniferales
Otherwise, this database must contain only complete proteins and be formatted with the BLAST command:
makeblastdb -in sequences.fasta -dbtype 'prot' -parse_seqids
(The MIT License)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.