egor

www-cryst.bioc.cam.ac.uk/egor

Description

‘egor’ is a program for calculating environment-specific substitution tables from user providing environmental class definitions and sequence alignments with the annotations of the environment classes.

Features

  • Environment-specific substitution table generation based on user providing environmental class definition

  • Entropy-based smoothing procedures to cope with sparse data problem

  • BLOSUM-like weighting procedures using PID threshold

  • Heat Map generation for substitution tables

Requirements

Following RubyGems will be automatically installed if you have rubygems installed on your machine

Installation

~user $ sudo gem install egor

Basic Usage

It’s pretty much the same as Kenji’s subst (www-cryst.bioc.cam.ac.uk/~kenji/subst/), so in most cases, you can swap ‘subst’ with ‘egor’.

~user $ egor -l TEMLIST-file -c classdef.dat
            or
~user $ egor -f TEM-file -c classdef.dat

Options

--tem-file (-f) FILE: a tem file
--tem-list (-l) FILE: a list for tem files
--classdef (-c) FILE: a file for the defintion of environments (default: 'classdef.dat')
--outfile (-o) FILE: output filename (default 'allmat.dat')
--weight (-w) INTEGER: clustering level (PID) for the BLOSUM-like weighting (default: 60)
--noweight: calculate substitution count with no weights
--smooth (-s) INTEGER:
    0 for partial smoothing (default)
    1 for full smoothing
--p1smooth: perform smoothing for p1 probability calculation when partial smoothing
--nosmooth: perform no smoothing operation
--cys (-y) INTEGER:
    0 for using C and J only for structure (default)
    1 for both structure and sequence
    2 for using only C for both (must be set when you have no 'disulphide' or 'disulfide' annotation in templates)
--output INTEGER:
    0 for raw count (no smoothing performed)
    1 for probabilities
    2 for log odds ratios (default)
--noroundoff: do not round off log odds ratio
--scale INTEGER: log odds ratio matrices in 1/n bit units (default 3)
--sigma DOUBLE: change the sigma value for smoothing (default 5.0)
--autosigma: automatically adjust the sigma value for smoothing
--add DOUBLE: add this value to raw count when deriving log odds ratios without smoothing (default 1/#classes)
--pidmin DOUBLE: count substitutions only for pairs with PID equal to or greater than this value (default none)
--pidmax DOUBLE: count substitutions only for pairs with PID smaller than this value (default none)
--heatmap INTEGER:
    0 create a heat map file for each substitution table
    1 create one big file containing all heat maps from substitution tables
    2 do both 0 and 1
--heatmap-format INTEGER:
    0 for Portable Network Graphics (PNG) Format (default)
    1 for Graphics Interchange Format (GIF)
    2 for Joint Photographic Experts Group (JPEG) Format
    3 for Microsoft Windows bitmap (BMP) Format
    4 for Portable Document Format (PDF)
--heatmap-columns INTEGER: number of tables to print in a row when --heatmap 1 or 2 set (default: sqrt(no. of tables))
--heatmap-stem STRING: stem for a file name when --heatmap 1 or 2 set (default: 'heatmap')
--heatmap-values: print values in the cells when generating heat maps
--verbose (-v) INTEGER
    0 for ERROR level
    1 for WARN or above level (default)
    2 for INFO or above level
    3 for DEBUG or above level
--version: print version
--help (-h): show help

Usage

  1. Prepare an environmental class definition file. For more details, please check this notes (www-cryst.bioc.cam.ac.uk/~kenji/subst/NOTES).

    ~user $ cat classdef.dat
    #
    # name of feature (string); values adopted in .tem file (string); class labels assigned for each value (string);
    # constrained or not (T or F); silent (used as masks)? (T or F)
    #
    secondary structure and phi angle;HEPC;HEPC;T;F
    solvent accessibility;TF;Aa;F;F
    
  2. Prepare structural alignments and their annotations of above environmental classes in PIR format.

    ~user $ cat sample1.tem
    >P1;1mnma
    sequence
    QKERRKIEIKFIENKTRRHVTFSKRKHGIMKKAFELSVLTGTQVLLLVVSETGLVYTFSTPKFEPIVTQQEGRNL
    IQACLNAPDD*
    >P1;1egwa
    sequence
    --GRKKIQITRIMDERNRQVTFTKRKFGLMKKAYELSVLCDCEIALIIFNSSNKLFQYASTDMDKVLLKYTEY--
    ----------*
    >P1;1mnma
    secondary structure and phi angle
    CPCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHPCCCEEEEECCCPCEEEEECCCCCHHHHCHHHHHH
    HHHHHCCCCP*
    >P1;1egwa
    secondary structure and phi angle
    --CCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHCPCCCEEEEECCCPCEEEEECCCHHHHHHHHHHC--
    ----------*
    >P1;1mnma
    solvent accessibility
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTFTTTTTTTTTTTTTTTT
    TTTTTTTTTT*
    >P1;1egwa
    solvent accessibility
    --TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTFTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT--
    ----------*
    ...
    
  3. When you have two or more alignment files, you should make a separate file containing all the paths for the alignment files.

    ~user $ ls -1 *.tem > TEMLIST
    ~user $ cat TEMLIST
    sample1.tem
    sample2.tem
    ...
    
  4. To produce substitution count matrices, type

    ~user $ egor -l TEMLIST --output 0 -o substcount.mat
    
  5. To produce substitution probability matrices, type

    ~user $ egor -l TEMLIST --output 1 -o substprob.mat
    
  6. To produce log odds ratio matrices, type

    ~user $ egor -l TEMLIST --output 2 -o substlogo.mat
    
  7. To produce substitution data only from the sequence pairs within a given PID range, type (if you don’t provide any name for output, ‘allmat.dat’ will be used.)

    ~user $ egor -l TEMLIST --pidmin 60 --pidmax 80 --output 1
    
  8. To change the clustering level (default 60), type

    ~user $ egor -l TEMLIST --weight 80 --output 2
    
  9. In case any positions are masked with the character ‘X’ in any environmental features will be excluded from the calculation of substitution counts.

  10. Then, it will produce a file containing all the matrices, which will look like the one below. For more details, please check this notes (www-cryst.bioc.cam.ac.uk/~kenji/subst/NOTES).

    # Environment-specific amino acid substitution matrices
    # Creator: egor version 0.0.5
    # Creation Date: 05/02/2009 17:29
    #
    # Definitions for structural environments:
    # 2 features used
    #
    # secondary structure and phi angle;HEPC;HEPC;F;F
    # solvent accessibility;TF;Aa;F;F
    # (read in from classdef.dat)
    #
    # Number of alignments: 1187
    # (list of .tem files read in from TEMLIST)
    #
    # Total number of environments: 8
    #
    # There are 21 amino acids considered.
    # ACDEFGHIKLMNPQRSTVWYJ
    # 
    # C: Cystine (the disulfide-bonded form)
    # J: Cysteine (the free thiol form)
    #
    # Weighting scheme: clustering at PID 60 level
    # ...
    #
    >HA 0
    #        A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y      J
    A        3     -5      0      0     -1      2      0      0      1      0      0      0      1      1      0      1      1      1     -1      0      2
    C      -16     19    -16    -18    -11    -14    -13    -13    -14    -14    -14    -11    -17    -16    -13    -16    -14    -12    -12    -10     -4
    D        1     -7      6      3     -3      1      0     -3      1     -3     -2      2      1      2      0      1      0     -2     -3     -2     -2
    E        3     -7      5      7     -1      2      2      0      3      0      0      3      2      4      3      3      2      1     -1      0     -1
    F       -4     -4     -6     -6      7     -5     -1      0     -4      1      0     -5     -5     -4     -4     -4     -3     -1      3      3      0
    G       -2     -6     -3     -4     -5      5     -4     -5     -4     -5     -4     -2     -3     -4     -4     -2     -3     -5     -6     -4     -3
    H        0     -6      0      0      1      0      8     -1      0      0      0      1     -2      1      1      0      0      0      1      3      0
    I       -3     -7     -6     -5      0     -5     -3      4     -4      1      1     -5     -4     -4     -3     -5     -2      2     -2     -1      0
    K        2     -6      2      2     -1      1      2      0      5      1      1      2      0      3      4      2      2      0     -2      0     -1
    L       -2     -6     -5     -4      1     -4     -2      2     -3      4      2     -3     -4     -3     -2     -4     -2      1      0      0      1
    M       -2     -7     -4     -3      1     -2     -1      2     -2      2      6     -3     -4     -2     -1     -2     -1      1      0      0      1
    N        0     -5      1      0     -3      1      1     -3      0     -2     -2      6     -2      0      0      1      1     -2     -3     -1     -1
    P       -1     -7     -1     -2     -4     -1     -3     -3     -2     -3     -4     -2      9     -2     -3      0     -1     -2     -4     -4     -4
    Q        2     -7      2      2     -1      1      2     -1      2      0      0      2      0      5      2      1      1      0     -2     -1      0
    R        1     -6      1      1     -1      0      2      0      3      0      1      1     -1      2      6      1      1      0     -1      0      0
    S        0     -6     -1     -1     -3      0     -2     -3     -1     -3     -3      0      0     -1     -1      3      1     -2     -4     -3      0
    T       -1     -7     -2     -2     -3     -2     -2     -2     -2     -2     -2     -1     -2     -2     -2      0      3     -1     -3     -3      0
    V       -3     -6     -6     -5     -1     -4     -3      1     -4      0      0     -5     -3     -4     -4     -4     -2      2     -2     -2      0
    W       -4     -6     -6     -5      2     -6     -2     -2     -5     -1     -2     -5     -5     -4     -4     -5     -4     -2     12      2     -3
    Y       -3     -5     -5     -5      3     -4      1     -1     -3     -1     -1     -3     -5     -3     -3     -4     -3     -2      3      7     -1
    J       -2      0     -4     -5      0     -2     -1      0     -3      0      0     -3     -6     -2     -2     -1     -1      0     -1      0      9
    U       -5     16     -7     -8     -3     -5     -4     -3     -6     -3     -3     -5     -9     -6     -5     -4     -4     -3     -4     -3      6
    ...
    
  11. To generate a heat map for each table with values in it,

    ~user $ egor -l TEMLIST --heatmap 0 --heatmap-values
    

    which will look like this,

  12. To generate one big figure, ‘myheatmaps.gif’ containing all the heat maps (4 maps in a row),

    ~user $ egor -l TEMLIST --heatmap 1 --heatmap-stem myheatmaps --heatmap-format 1 --heatmap-columns 4
    

    which will look like this,

Repository

You can download a pre-built RubyGems package from

or, You can fetch the source from

Contact

Comments are welcome, please send an email to me (seminlee at gmail dot com).

License

This work is licensed under a Creative Commons Attribution-Noncommercial 2.0 UK: England & Wales License.