[Biojava-l] Re: Biojava-l digest, Vol 1 #334 - 2 msgs

Sarath sarath@decodon.com
Mon, 11 Jun 2001 20:17:55 +0200 (MEST)


Hi Thomas,
  Thanks for the quick response. It did work the way you said, but I
had to include a set of dummy characters to get the file past the parser.
Is this the only way I can handle such files? The files I gave as a
reference are newly sequenced ones, i.e. the sequencing of these genomes
was completed on 1st June, so what do you have to say about this? I don't
know exactly what the purpose of the GI is in the GenBank format, but do
you think this level of rigidity is necessary for reading the GenBank
format?

Sarath
  

On Mon, 11 Jun 2001, Thomas Down wrote:

> On Mon, Jun 11, 2001 at 06:32:08PM +0200, Sarath wrote:
> >   Hello everyone,
> >     I have been using the BioJava package for around a month, and today,
> > surprisingly, I ran into a strange situation: a simple program is not
> > able to compute the GC content from a file in the GenBank format. I would
> > be very glad if somebody could tell me the bug in the program to find the
> > GC content from the file (AE000783.gbk) from the URL
> >   ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Borrelia_burgdorferi/
> >     I am pasting the source code here; I hope this is not an inconvenience
> > to some people. The code is exactly the same as that given in the
> > tutorials. Probably there is a bug in BioJava... who knows, or should I
> > blame the GenBank people?
> 
> Bug?  No, never... ;)
> 
> I've taken a quick look at this.  The problem is with
> the VERSION line of the file.  Most Genbank entries look
> like:
> 
>   VERSION     AE000784.1  GI:2690041
> 
> The BioJava parser is explicitly expecting to see two tokens
> following the VERSION keyword.
> 
> The file you are trying to read has a line like:
> 
> 
>   VERSION     AE000783
> 
> The only format documentation I can find for Genbank is at:
> 
>   ftp://ncbi.nlm.nih.gov/genbank/docs/
> 
> This concentrates on the feature tables, and doesn't seem
> to give a normative description of the headers.  However,
> the example given includes the two-token VERSION string,
> and this seems to be found in the vast majority of entries.
> 
> That said, we probably ought to take the `strict in what you
> produce, tolerant in what you accept' approach.  I'll leave
> the final decision to people who use the Genbank parser more
> regularly, though (Greg, are you listening?)
> 
> Is there a normative specification for the Genbank header
> lines anywhere?  If so, maybe it is worth complaining about
> that entry...
> 
>    Thomas.
>
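
For illustration, the "tolerant in what you accept" handling of the VERSION
line discussed above might look roughly like the sketch below. This is not
the BioJava parser's actual code: the class VersionLine and its parse method
are hypothetical names made up for this example, and the sketch assumes only
what the thread shows, namely that a VERSION line carries an accession token
and, usually but not always, a second token of the form GI:<number>.

```java
import java.util.Optional;

// Hypothetical helper, NOT part of BioJava: reads a GenBank VERSION line
// and treats the GI token as optional instead of required.
final class VersionLine {
    final String accession;     // e.g. "AE000784.1" or "AE000783"
    final Optional<String> gi;  // e.g. "2690041"; empty when the line has no GI token

    private VersionLine(String accession, Optional<String> gi) {
        this.accession = accession;
        this.gi = gi;
    }

    static VersionLine parse(String line) {
        // Split on runs of whitespace after dropping the leading indentation.
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2 || !tokens[0].equals("VERSION")) {
            throw new IllegalArgumentException("Not a VERSION line: " + line);
        }
        String accession = tokens[1];

        // Accept the second token when it is present and well formed,
        // but do not fail when it is missing.
        Optional<String> gi = Optional.empty();
        if (tokens.length >= 3 && tokens[2].startsWith("GI:")) {
            gi = Optional.of(tokens[2].substring("GI:".length()));
        }
        return new VersionLine(accession, gi);
    }
}
```

With this approach both lines quoted above are accepted: the AE000784.1 line
yields an accession plus a GI, and the AE000783 line yields an accession with
an empty GI, so no dummy characters are needed in the input file.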