Feature: Join alignment blocks with reference data
In order to produce FASTA output with one sequence per species
For use in downstream tools
We need to join adjacent MAF blocks together
And fill gaps in the reference sequence from reference data

Scenario: Non-overlapping MAF blocks in region of interest
Given MAF data:
"""
##maf version=1
a score=20.0
s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG

a score=21.0
s sp1.chr1 30 10 + 50 AGGGCGGTCC
s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
"""
And chromosome reference sequence:
"""
>sp1.chr1
CCAGGATGCT
GGGCTGAGGG
CAGTTGTGTC
AGGGCGGTCC
GGTGCAGGCA
"""
When I open it with a MAF reader
And build an index on the reference sequence
And tile sp1.chr1:0-50 with the chromosome reference
And tile with species [sp1, sp2, sp3]
And write the tiled data as FASTA
Then the FASTA data obtained should be:
"""
>sp1
CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
>sp2
**********GGGCTGACGGC--AG*******AGGGCGGTGC**********
>sp3
**********AGGTTTAGGGCAGAG***************************
"""
Scenario: Non-overlapping MAF blocks with species map
Given MAF data:
"""
##maf version=1
a score=20.0
s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG

a score=21.0
s sp1.chr1 30 10 + 50 AGGGCGGTCC
s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
"""
And chromosome reference sequence:
"""
>sp1.chr1
CCAGGATGCT
GGGCTGAGGG
CAGTTGTGTC
AGGGCGGTCC
GGTGCAGGCA
"""
When I open it with a MAF reader
And build an index on the reference sequence
And tile sp1.chr1:0-50 with the chromosome reference
And tile with species [sp1, sp2, sp3]
And map species sp1 as mouse
And map species sp2 as hippo
And map species sp3 as squid
And write the tiled data as FASTA
Then the FASTA data obtained should be:
"""
>mouse
CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
>hippo
**********GGGCTGACGGC--AG*******AGGGCGGTGC**********
>squid
**********AGGTTTAGGGCAGAG***************************
"""
Scenario: Subset of non-overlapping MAF blocks in region
Given MAF data:
"""
##maf version=1
a score=20.0
s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG

a score=21.0
s sp1.chr1 30 10 + 50 AGGGCGGTCC
s sp2.chr5 53030 10 + 65536 AGGGCGGTGC
"""
And chromosome reference sequence:
"""
>sp1.chr1
CCAGGATGCT
GGGCTGAGGG
CAGTTGTGTC
AGGGCGGTCC
GGTGCAGGCA
"""
When I open it with a MAF reader
And build an index on the reference sequence
And tile sp1.chr1:12-36 with the chromosome reference
And tile with species [sp1, sp2, sp3]
And write the tiled data as FASTA
Then the FASTA data obtained should be:
"""
>sp1
GCTGAGGGC--AGTTGTGTCAGGGCG
>sp2
GCTGACGGC--AG*******AGGGCG
>sp3
GTTTAGGGCAGAG*************
"""
Scenario: Overlapping MAF blocks in region of interest
Given MAF data:
"""
##maf version=1
a score=20.0
s sp1.chr1 10 13 + 50 GGGCTGAGGGC--AG
s sp2.chr5 53010 13 + 65536 GGGCTGACGGC--AG
s sp3.chr2 33010 15 + 65536 AGGTTTAGGGCAGAG

a score=21.0
s sp1.chr1 20 10 + 50 AGGGCGGTCC
s sp2.chr5 53020 10 + 65536 AGGGCGGTGC
"""
And chromosome reference sequence:
"""
>sp1.chr1
CCAGGATGCT
GGGCTGAGGG
CAGTTGTGTC
AGGGCGGTCC
GGTGCAGGCA
"""
When I open it with a MAF reader
And build an index on the reference sequence
And tile sp1.chr1:0-50 with the chromosome reference
And tile with species [sp1, sp2, sp3]
And write the tiled data as FASTA
Then the FASTA data obtained should be:
"""
>sp1
CCAGGATGCTGGGCTGAGGGAGGGCGGTCCAGGGCGGTCCGGTGCAGGCA
>sp2
**********GGGCTGACGGAGGGCGGTGC********************
>sp3
**********AGGTTTAGGG******************************
"""
@no_jruby
Scenario: Tile with CLI tool and reference seq
Given test files:
| gap-sp1.fa.gz |
| gap-1.maf |
| gap-1.kct |
When I run maf_tile --reference gap-sp1.fa.gz --interval 0-50 -s sp1:mouse -s sp2:nautilus -s sp3:jaguar gap-1.maf gap-1.kct
Then it should pass with:
"""
>mouse
CCAGGATGCTGGGCTGAGGGC--AGTTGTGTCAGGGCGGTCCGGTGCAGGCA
>nautilus
**********GGGCTGACGGC--AG*******AGGGCGGTGC**********
>jaguar
**********AGGTTTAGGGCAGAG***************************
"""
@no_jruby
Scenario: Tile with CLI tool and no reference seq
Given test files:
| gap-1.maf |
| gap-1.kct |
When I run maf_tile --interval 0-50 -s sp1:mouse -s sp2:nautilus -s sp3:jaguar gap-1.maf gap-1.kct
Then it should pass with:
"""
>mouse
NNNNNNNNNNGGGCTGAGGGC--AGNNNNNNNAGGGCGGTCCNNNNNNNNNN
>nautilus
**********GGGCTGACGGC--AG*******AGGGCGGTGC**********
>jaguar
**********AGGTTTAGGGCAGAG***************************
"""
@no_jruby
Scenario: Tile with CLI tool and BED intervals
Given test files:
| gap-1.maf |
| gap-1.kct |
| gap-sp1.fa.gz |
And a file named "example.bed" with:
"""
sp1.chr1 12 36
"""
When I run maf_tile -s sp1:mouse -s sp2:nautilus -s sp3:jaguar --output-base selected --bed example.bed --reference gap-sp1.fa.gz gap-1.maf gap-1.kct
Then it should pass with:
"""
"""
And the file "selected_12-36.fa" should contain exactly:
"""
>mouse
GCTGAGGGC--AGTTGTGTCAGGGCG
>nautilus
GCTGACGGC--AG*******AGGGCG
>jaguar
GTTTAGGGCAGAG*************
"""
@no_jruby
Scenario: Tile with CLI tool and implicit index
Given test files:
| mm8_chr7_tiny.maf |
| mm8_chr7_tiny.kct |
When I run maf_tile -s mm8 -s rn4 -s hg18 --interval 80082334-80082344 mm8_chr7_tiny.maf
Then it should pass with:
"""
>mm8
GGGCTGAGGG
>rn4
GGGCTGAGGG
>hg18
--------GG
"""
@no_jruby
Scenario: Tile with CLI tool and directory
Given test files:
| mm8_chr7_tiny.maf |
| mm8_chr7_tiny.kct |
When I run maf_tile -s mm8 -s rn4 -s hg18 --interval mm8.chr7:80082334-80082344 .
Then it should pass with:
"""
>mm8
GGGCTGAGGG
>rn4
GGGCTGAGGG
>hg18
--------GG
"""
@no_jruby
Scenario: Tile with CLI tool and directory, 1-based
Given test files:
| mm8_chr7_tiny.maf |
| mm8_chr7_tiny.kct |
When I run maf_tile -s mm8 -s rn4 -s hg18 --one-based --interval mm8.chr7:80082335-80082344 .
Then it should pass with:
"""
>mm8
GGGCTGAGGG
>rn4
GGGCTGAGGG
>hg18
--------GG
"""