[Bioperl-l] translating a GenBank file

Mon Mar 13 09:36:13 UTC 2006

Dear BioPerlers

I have a general strategy question.  I want to take GenBank files of 
viral genomes (only ~100-200 kb) and produce a translation laid out 
against the sequence, in a format like:

  TAAACCTGTCTTTCAGACCTTGTTGGACATCCCGTACAATCAAGATGTTCCTGTATGTTG
                                      S  R  C  S  C  M  L

  TTTGCAGTCTGGCGGTTTGCTTTCGAGGACTATTAAGCCTTTCTCTGCAATCGTCTCCAA
      F  A  V  W  R  F  A  F  E  D  Y  M  A  F  L  C  N  R  L  Q

  ATCTCTGCCCTGGAGTGATTTCAACGCCTTACACGTTGACCTGTCCGTCTAATACATCCT
      I  S  A  L  E  M

where the translation is above the DNA for forward-strand ORFs and 
below it for complementary-strand ORFs.  I initially attempted this 
using EMBOSS, which has two utilities, "showseq" and "prettyseq", that 
take a range of start and stop points and produce a translation of the 
type above.  However, it turns out that they are not quite up to the 
job of translating whole genomes: showseq throws an exception when 
ORFs overlap (a deliberate feature), and both showseq and prettyseq 
seem to have trouble with a combination of forward and reverse 
translations on the same sequence (not officially confirmed as a bug 
yet, but certainly not a feature).

So, before I start trying to hack EMBOSS, is there a better way to do 
it in BioPerl?  It occurs to me that the major difficulty may be that 
the above format is not a "standard", although it is seen quite 
commonly in publications.
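
To make the target layout concrete, here is a minimal sketch in plain 
Python (not BioPerl or EMBOSS code).  Everything in it is an 
assumption made for illustration: the function name annotate, the 
(start, end, strand) tuples with 1-based inclusive coordinates, the 
standard genetic code, and the choice to give each strand/frame 
combination its own row so that overlapping ORFs do not collide.  In 
practice the coordinates would come from the CDS features of the 
GenBank file.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the genomes of interest use the standard table).
_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AAS)}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def annotate(seq, orfs, width=60):
    """Lay out translations around a DNA sequence.

    orfs is a list of hypothetical (start, end, strand) tuples using
    1-based inclusive coordinates; strand is 1 or -1.  Forward ORFs are
    printed above the DNA and complementary ORFs below it, with each
    amino acid placed at the middle base of its codon.
    """
    seq = seq.upper()
    n = len(seq)
    tracks = {}  # one row of characters per (strand, frame)
    for start, end, strand in orfs:
        sub = seq[start - 1:end]
        if strand < 0:
            sub = revcomp(sub)
        frame = (start - 1) % 3 if strand > 0 else end % 3
        track = tracks.setdefault((strand, frame), [" "] * n)
        for i in range(0, len(sub) - 2, 3):
            aa = CODONS.get(sub[i:i + 3], "X")  # X for ambiguous codons
            # 0-based position of the codon's middle base on the top strand
            pos = start + i if strand > 0 else end - i - 2
            track[pos] = aa
    lines = []
    for off in range(0, n, width):
        rows = {k: "".join(t[off:off + width]).rstrip()
                for k, t in sorted(tracks.items())}
        lines += [r for (s, _), r in rows.items() if s > 0 and r]
        lines.append(seq[off:off + width])
        lines += [r for (s, _), r in rows.items() if s < 0 and r]
        lines.append("")
    return "\n".join(lines).rstrip("\n")


if __name__ == "__main__":
    print(annotate("ATGGCCTAA", [(1, 9, 1)]))
```

Because two ORFs in the same frame on the same strand read the same 
codons wherever they overlap, one row per strand/frame pair is enough 
to show overlapping ORFs without one overwriting the other.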

All suggestions gratefully received,
Derek

_________________________

Derek Gatherer Ph.D. Cert.Ed.
Computer Officer
Institute of Virology
Church Street
Glasgow G11 5JR
Tel:  0141-330-6268
Fax: 0141-337-2236