[EMBOSS] union and splitter changes
Kim Rutherford
kmr at sanger.ac.uk
Tue Sep 2 17:18:20 UTC 2003
Hi. We've modified union and splitter to cope with features (using the
standard -feature flag). Feature coordinates are changed to the
appropriate coordinates on the joined sequence.
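
To illustrate the coordinate change, here is a minimal Python sketch (not the actual union.c code). It assumes each feature is moved by the combined length of the sequences that precede it in the join. The function names shift_location and joined_locations are hypothetical, and the regular expression only handles simple local locations like those in the example entries later in this message.

```python
import re


def shift_location(location: str, offset: int) -> str:
    """Add offset to every base position in a simple EMBL-style location.

    Assumption: the location has no remote references (e.g. "J00194.1:1..5"),
    whose accession digits this naive regex would also shift.
    """
    return re.sub(r"\d+", lambda m: str(int(m.group()) + offset), location)


def joined_locations(entries):
    """entries: list of (sequence_length, [location strings]) in join order.

    Returns the locations as they would appear on the joined sequence.
    """
    offset = 0
    shifted = []
    for length, locations in entries:
        shifted.extend(shift_location(loc, offset) for loc in locations)
        offset += length  # the next entry starts after this one
    return shifted


if __name__ == "__main__":
    example = [
        (120, ["join(5..100,102..103)"]),
        (120, ["10..20", "110..120"]),
        (120, ["complement(1..110)", "10..20"]),
    ]
    for loc in joined_locations(example):
        print(loc)
```

With the three 120 bp example entries, the offsets are 0, 120 and 240, so 10..20 on entry2 becomes 130..140 on the joined sequence.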
We've also made changes to both programs so that splitter is (somewhat)
able to reverse the action of union. union was changed to add a source
feature to the joined sequence to record the ID, length and position of
the original sequence. splitter uses the source feature to extract the
original sequence position.
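
The reverse step can be sketched the same way. This is an illustration only, not the splitter.c implementation: it assumes the source features are available as (start, end, origid) tuples with 1-based inclusive coordinates, and the names split_by_source and to_local are hypothetical.

```python
def split_by_source(sequence: str, sources):
    """Cut a joined sequence back into its original pieces.

    sources: list of (start, end, origid) with 1-based inclusive coordinates,
    as recorded by the source features on the joined entry.
    """
    return {origid: sequence[start - 1:end] for start, end, origid in sources}


def to_local(position: int, source_start: int) -> int:
    """Map a position on the joined sequence back to the original entry."""
    return position - source_start + 1


if __name__ == "__main__":
    # Toy joined sequence standing in for three 120 bp entries.
    joined = "a" * 120 + "c" * 120 + "g" * 120
    sources = [(1, 120, "entry1"), (121, 240, "entry2"), (241, 360, "entry3")]
    pieces = split_by_source(joined, sources)
    print({name: len(seq) for name, seq in pieces.items()})
    print(to_local(130, 121), to_local(140, 121))
```

A feature at 130..140 that lies inside the source feature 121..240 maps back to 10..20 on the original entry.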
We have made these changes because it's sometimes useful for us to be
able to merge an ordered and oriented stream of EMBL entries into
one entry for analysis (such as running gene finders and similarity
searches). It also allows all the sequence and features from an
unfinished genome to be viewed and edited simultaneously in Artemis and
ACT.
One further change to union is the -findoverlap option, which searches
pairs of sequences for exact base overlaps and joins the sequences
using the overlap information. Our group uses this option for
cosmid-based sequencing projects. Pairs of cosmids in the tiling path are
sequenced with an overlap, annotated and then joined together for
submission to EMBL.
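
This message does not describe how -findoverlap searches for the overlap, so the following Python sketch is only an assumption about the general idea: find the longest suffix of the first sequence that exactly matches a prefix of the second, then keep the shared bases once. The names find_overlap and join_with_overlap, and the min_len parameter, are hypothetical and are not union options.

```python
def find_overlap(left: str, right: str, min_len: int = 1) -> int:
    """Length of the longest exact overlap between the end of left and
    the start of right, or 0 if none is at least min_len bases long."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def join_with_overlap(left: str, right: str, min_len: int = 1) -> str:
    """Join two sequences, writing the overlapping bases only once."""
    size = find_overlap(left, right, min_len)
    return left + right[size:]


if __name__ == "__main__":
    cosmid_a = "acgtacgg"
    cosmid_b = "cggttt"
    print(find_overlap(cosmid_a, cosmid_b))
    print(join_with_overlap(cosmid_a, cosmid_b))
```

In a real join, features on the second sequence would then be shifted by the length of the first sequence minus the overlap length, rather than by the full length.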
I've put the modified files on the web in case they are of use to
someone else:
http://www.sanger.ac.uk/Users/kmr/emboss/union.c
http://www.sanger.ac.uk/Users/kmr/emboss/union.acd
http://www.sanger.ac.uk/Users/kmr/emboss/splitter.c
http://www.sanger.ac.uk/Users/kmr/emboss/splitter.acd
The programs should be drop-in replacements for the standard programs
if the new options aren't used. We've been using the latest CVS
version of EMBOSS for development, so you'll probably need a CVS
check-out to compile them.
Kim.
Here is an example using an EMBL format file:
ID entry1 standard; DNA; UNC; 120 BP.
FH Key Location/Qualifiers
FH
FT CDS join(5..100,102..103)
SQ Sequence 120 BP; 40 A; 27 C; 18 G; 35 T; 0 other;
gatctgcttt atttgcaaca catattgagg acttacacaa catcacaagc aatcaactgt 60
atgaaactta tcgaactgaa aagctttcaa cctcacagtt gcttttagac agtactgtcg 120
//
ID entry2 standard; DNA; UNC; 120 BP.
FH Key Location/Qualifiers
FH
FT CDS 10..20
FT CDS 110..120
SQ Sequence 120 BP; 33 A; 25 C; 11 G; 51 T; 0 other;
attgtatgtt tctttttttt aaatttcaac ttcatctgct tactctacag atcccccaat 60
ttttgtaaaa attgtcgatg tatcccttaa aattttattc aactgggacc tatccaacat 120
//
ID entry3 standard; DNA; UNC; 120 BP.
FH Key Location/Qualifiers
FH
FT CDS complement(1..110)
FT CDS 10..20
SQ Sequence 120 BP; 35 A; 22 C; 18 G; 45 T; 0 other;
tttcaatagg ctcacttgaa agttcgttat ttacgaaaga taaagcttcc tctgcttttc 60
tttgatcaat taatgagctt tctgaattta tgctgtatat gcaatcggaa ctcaaaccat 120
//
Using
union -feature -source -osf embl
gives:
ID entry1 standard; DNA; UNC; 360 BP.
FH Key Location/Qualifiers
FH
FT source 1..120
FT /origid="entry1"
FT source 121..240
FT /origid="entry2"
FT source 241..360
FT /origid="entry3"
FT CDS join(5..100,102..103)
FT CDS 130..140
FT CDS 230..240
FT CDS complement(241..350)
FT CDS 250..260
SQ Sequence 360 BP; 108 A; 74 C; 47 G; 131 T; 0 other;
gatctgcttt atttgcaaca catattgagg acttacacaa catcacaagc aatcaactgt 60
atgaaactta tcgaactgaa aagctttcaa cctcacagtt gcttttagac agtactgtcg 120
attgtatgtt tctttttttt aaatttcaac ttcatctgct tactctacag atcccccaat 180
ttttgtaaaa attgtcgatg tatcccttaa aattttattc aactgggacc tatccaacat 240
tttcaatagg ctcacttgaa agttcgttat ttacgaaaga taaagcttcc tctgcttttc 300
tttgatcaat taatgagctt tctgaattta tgctgtatat gcaatcggaa ctcaaaccat 360
//
Then
splitter -feature -source -osf embl
gives:
ID entry1 standard; DNA; UNC; 120 BP.
FH Key Location/Qualifiers
FH
FT source 1..120
FT /origid="entry1"
FT CDS join(5..100,102..103)
SQ Sequence 120 BP; 40 A; 27 C; 18 G; 35 T; 0 other;
gatctgcttt atttgcaaca catattgagg acttacacaa catcacaagc aatcaactgt 60
atgaaactta tcgaactgaa aagctttcaa cctcacagtt gcttttagac agtactgtcg 120
//
ID entry2 standard; DNA; UNC; 120 BP.
FH Key Location/Qualifiers
FH
FT source 1..120
FT /origid="entry2"
FT CDS 10..20
FT CDS 110..120
SQ Sequence 120 BP; 33 A; 25 C; 11 G; 51 T; 0 other;
attgtatgtt tctttttttt aaatttcaac ttcatctgct tactctacag atcccccaat 60
ttttgtaaaa attgtcgatg tatcccttaa aattttattc aactgggacc tatccaacat 120
//
ID entry3 standard; DNA; UNC; 120 BP.
FH Key Location/Qualifiers
FH
FT source 1..120
FT /origid="entry3"
FT CDS complement(1..110)
FT CDS 10..20
SQ Sequence 120 BP; 35 A; 22 C; 18 G; 45 T; 0 other;
tttcaatagg ctcacttgaa agttcgttat ttacgaaaga taaagcttcc tctgcttttc 60
tttgatcaat taatgagctt tctgaattta tgctgtatat gcaatcggaa ctcaaaccat 120
//