[EMBOSS] union and splitter changes

Kim Rutherford kmr at sanger.ac.uk
Tue Sep 2 17:18:20 UTC 2003


Hi.  We've modified union and splitter to cope with features (using the
standard -feature flag).  Feature coordinates are changed to the
appropriate coordinates on the joined sequence.
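
As a rough illustration of the coordinate change (a Python sketch, not
the code in union.c), each feature range is shifted by the total length
of the sequences joined before it. The function names and the
(length, features) tuples are hypothetical, chosen only for this
example; the numbers are taken from the three-entry example later in
this message.

```python
def shift_location(start, end, offset):
    """Shift a 1-based inclusive range by the bases that precede it."""
    return start + offset, end + offset


def joined_coordinates(entries):
    """Map per-entry feature ranges onto the joined sequence.

    entries: list of (sequence_length, [(start, end), ...]) in join order.
    """
    offset = 0
    shifted = []
    for length, features in entries:
        for start, end in features:
            shifted.append(shift_location(start, end, offset))
        # The next entry starts after all bases seen so far.
        offset += length
    return shifted


entries = [
    (120, [(5, 100), (102, 103)]),   # entry1: join(5..100,102..103)
    (120, [(10, 20), (110, 120)]),   # entry2
    (120, [(1, 110), (10, 20)]),     # entry3 (strand is ignored here)
]
print(joined_coordinates(entries))
```

The sketch ignores strand and join() structure; it only shows the
offset arithmetic, e.g. 10..20 in entry2 becoming 130..140.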

We've also made changes to both programs so that splitter is (somewhat)
able to reverse the action of union.  union was changed to add a source
feature to the joined sequence to record the ID, length and position of
the original sequence.  splitter uses the source feature to extract the
original sequence position.
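
The reverse step can be sketched in the same spirit (again Python, and
an assumption about the general idea rather than the logic in
splitter.c): each source range selects a slice of the joined sequence,
and the features inside it are moved back to local coordinates. The
function name and tuple layout are hypothetical, and this sketch simply
drops any feature that crosses a source boundary; how the real program
treats such features is not described here.

```python
def split_by_source(sequence, sources, features):
    """Split a joined sequence back into pieces using source ranges.

    sources:  [(start, end, origid), ...] on the joined sequence
    features: [(start, end), ...] on the joined sequence
    All coordinates are 1-based and inclusive.
    """
    pieces = []
    for s_start, s_end, origid in sources:
        sub_seq = sequence[s_start - 1:s_end]
        # Keep only features fully inside this source range and
        # convert them to coordinates relative to the piece.
        local = [
            (f_start - s_start + 1, f_end - s_start + 1)
            for f_start, f_end in features
            if s_start <= f_start and f_end <= s_end
        ]
        pieces.append((origid, sub_seq, local))
    return pieces


joined = "a" * 120 + "c" * 120 + "g" * 120   # placeholder bases
sources = [(1, 120, "entry1"), (121, 240, "entry2"), (241, 360, "entry3")]
features = [(130, 140), (230, 240), (241, 350)]
for origid, sub_seq, local in split_by_source(joined, sources, features):
    print(origid, len(sub_seq), local)
```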

We have made these changes because it's sometimes useful for us to be
able to merge an ordered and oriented stream of EMBL entries into
one entry for analysis (such as running gene finders and similarity
searches).  It also allows all the sequence and features from an
unfinished genome to be viewed and edited simultaneously in Artemis and
ACT.

One further change to union is the -findoverlap option, which searches
pairs of sequences for exact base overlaps and joins the sequences
using the overlap information.  Our group uses this option for
cosmid-based sequencing projects.  Pairs of cosmids in the tiling path
are sequenced with an overlap, annotated and then joined together for
submission to EMBL.
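
One simple way to picture an exact-overlap join is sketched below in
Python. It assumes the overlap is the longest suffix of the first
sequence that exactly matches a prefix of the second; that is an
assumption made for illustration, not a description of how
-findoverlap is implemented, and the function names and the
min_overlap parameter are hypothetical.

```python
def longest_exact_overlap(left, right, min_overlap=1):
    """Length of the longest suffix of left equal to a prefix of right."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the longest.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def join_with_overlap(left, right, min_overlap=1):
    """Join two sequences, writing the overlapping bases only once."""
    size = longest_exact_overlap(left, right, min_overlap)
    return left + right[size:]


print(longest_exact_overlap("gatctgcttt", "gctttaacc"))  # 5
print(join_with_overlap("gatctgcttt", "gctttaacc"))      # gatctgctttaacc
```

With real cosmids a larger min_overlap would be sensible, since very
short matches between unrelated sequences occur by chance.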

I've put the modified files on the web in case they are of use to
someone else: 
  http://www.sanger.ac.uk/Users/kmr/emboss/union.c
  http://www.sanger.ac.uk/Users/kmr/emboss/union.acd
  http://www.sanger.ac.uk/Users/kmr/emboss/splitter.c
  http://www.sanger.ac.uk/Users/kmr/emboss/splitter.acd

The programs should be drop-in replacements for the standard programs
if the new options aren't used.  We've been using the latest CVS
version of EMBOSS for development, so you'll probably need a CVS
check-out to compile them.

Kim.



Here is an example using an EMBL format file:

ID   entry1     standard; DNA; UNC; 120 BP.
FH   Key             Location/Qualifiers
FH
FT   CDS             join(5..100,102..103)
SQ   Sequence 120 BP; 40 A; 27 C; 18 G; 35 T; 0 other;
     gatctgcttt atttgcaaca catattgagg acttacacaa catcacaagc aatcaactgt        60
     atgaaactta tcgaactgaa aagctttcaa cctcacagtt gcttttagac agtactgtcg       120
//
ID   entry2     standard; DNA; UNC; 120 BP.
FH   Key             Location/Qualifiers
FH
FT   CDS             10..20
FT   CDS             110..120
SQ   Sequence 120 BP; 33 A; 25 C; 11 G; 51 T; 0 other;
     attgtatgtt tctttttttt aaatttcaac ttcatctgct tactctacag atcccccaat        60
     ttttgtaaaa attgtcgatg tatcccttaa aattttattc aactgggacc tatccaacat       120
//
ID   entry3     standard; DNA; UNC; 120 BP.
FH   Key             Location/Qualifiers
FH
FT   CDS             complement(1..110)
FT   CDS             10..20
SQ   Sequence 120 BP; 35 A; 22 C; 18 G; 45 T; 0 other;
     tttcaatagg ctcacttgaa agttcgttat ttacgaaaga taaagcttcc tctgcttttc        60
     tttgatcaat taatgagctt tctgaattta tgctgtatat gcaatcggaa ctcaaaccat       120
//


Using 
  union -feature -source -osf embl  
gives:


ID   entry1     standard; DNA; UNC; 360 BP.
FH   Key             Location/Qualifiers
FH
FT   source          1..120
FT                   /origid="entry1"
FT   source          121..240
FT                   /origid="entry2"
FT   source          241..360
FT                   /origid="entry3"
FT   CDS             join(5..100,102..103)
FT   CDS             130..140
FT   CDS             230..240
FT   CDS             complement(241..350)
FT   CDS             250..260
SQ   Sequence 360 BP; 108 A; 74 C; 47 G; 131 T; 0 other;
     gatctgcttt atttgcaaca catattgagg acttacacaa catcacaagc aatcaactgt        60
     atgaaactta tcgaactgaa aagctttcaa cctcacagtt gcttttagac agtactgtcg       120
     attgtatgtt tctttttttt aaatttcaac ttcatctgct tactctacag atcccccaat       180
     ttttgtaaaa attgtcgatg tatcccttaa aattttattc aactgggacc tatccaacat       240
     tttcaatagg ctcacttgaa agttcgttat ttacgaaaga taaagcttcc tctgcttttc       300
     tttgatcaat taatgagctt tctgaattta tgctgtatat gcaatcggaa ctcaaaccat       360
//


Then
  splitter -feature -source -osf embl
gives:


ID   entry1     standard; DNA; UNC; 120 BP.
FH   Key             Location/Qualifiers
FH
FT   source          1..120
FT                   /origid="entry1"
FT   CDS             join(5..100,102..103)
SQ   Sequence 120 BP; 40 A; 27 C; 18 G; 35 T; 0 other;
     gatctgcttt atttgcaaca catattgagg acttacacaa catcacaagc aatcaactgt        60
     atgaaactta tcgaactgaa aagctttcaa cctcacagtt gcttttagac agtactgtcg       120
//
ID   entry2     standard; DNA; UNC; 120 BP.
FH   Key             Location/Qualifiers
FH
FT   source          1..120
FT                   /origid="entry2"
FT   CDS             10..20
FT   CDS             110..120
SQ   Sequence 120 BP; 33 A; 25 C; 11 G; 51 T; 0 other;
     attgtatgtt tctttttttt aaatttcaac ttcatctgct tactctacag atcccccaat        60
     ttttgtaaaa attgtcgatg tatcccttaa aattttattc aactgggacc tatccaacat       120
//
ID   entry3     standard; DNA; UNC; 120 BP.
FH   Key             Location/Qualifiers
FH
FT   source          1..120
FT                   /origid="entry3"
FT   CDS             complement(1..110)
FT   CDS             10..20
SQ   Sequence 120 BP; 35 A; 22 C; 18 G; 45 T; 0 other;
     tttcaatagg ctcacttgaa agttcgttat ttacgaaaga taaagcttcc tctgcttttc        60
     tttgatcaat taatgagctt tctgaattta tgctgtatat gcaatcggaa ctcaaaccat       120
//


